Tempora: cell trajectory inference using time-series single-cell RNA sequencing data

This page contains supplementary data for:

"Tempora: cell trajectory inference using time-series single-cell RNA sequencing data"

Thinh N. Tran+, Gary D. Bader*

Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada

The Donnelly Centre for Cellular and Biomolecular Research

+Current address: Gerstner Sloan Kettering Graduate School of Biomedical Sciences, New York, NY, USA

*Corresponding author

Tempora is a novel cell trajectory inference method that orders cells using time information from time-series scRNAseq data. Tempora uses biological pathway information to help identify cell type relationships and can identify important time-dependent pathways to help interpret the inferred trajectory.

Tempora source code and vignettes can be accessed at https://github.com/BaderLab/Tempora.

Sample Data for Tempora

The Tempora package was validated using three time-course scRNAseq datasets: an in vitro differentiation of human skeletal muscle myoblasts, in vivo early development of murine cerebral cortex, and in vivo embryonic and postnatal development of murine cerebellum. For all datasets, cells from all time points were filtered to remove low-quality reads, normalized with scran and corrected for batch effects using Harmony. All cells were then iteratively clustered until the number of differentially expressed genes between neighboring clusters reached 0. The datasets were exported as Seurat v2 objects, ready to be imported into Tempora.

Human skeletal muscle myoblast data (HSMM)

The HSMM dataset contains approximately 271 cells collected at 0, 24, 48 and 72 hours after human myoblast cultures were switched from growth to differentiation media. Cells were sequenced using the Fluidigm C1 platform. Raw sequencing reads can be accessed in the Gene Expression Omnibus, accession number GSE52529.

Original reference: Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381. doi:https://doi.org/10.1038/nbt.2859.

Human muscle cell development time course scRNA-seq data (99.5 MB)

Murine cerebral cortex data (MouseCortex)

The MouseCortex dataset contains approximately 6,000 neural cells collected at embryonic days 11.5 (E11.5), E13.5, E15.5 and E17.5. Cells were sequenced using DropSeq. These cells cover a wide spectrum of neuronal development, from the early precursors (apical and radial precursors) to intermediate progenitors and differentiated cortical neurons. As in the original publication, the data was filtered to remove non-cortical cells. Removed cells included those expressing Aif1 (microglia), hemoglobin genes (blood cells), collagen genes (mesenchymal cells), as well as Dlx transcription factors and/or interneuron genes (ganglionic eminence-derived cells). All retained cells were then iteratively clustered as described above. Raw sequencing reads can be accessed in the Gene Expression Omnibus, accession number GSE107122.

Original reference: Yuzwa, Scott A., et al. "Developmental emergence of adult neural stem cells as revealed by single-cell transcriptional profiling." Cell reports 21.13 (2017): 3970-3986. doi:https://doi.org/10.1016/j.celrep.2017.12.017.

Mouse brain cortex development time course scRNA-seq data (1.02 GB)

Murine cerebellum (MouseCerebellum)

The MouseCerebellum dataset contains approximately 55,000 neural cells collected at embryonic day 10 (E10), E12, E14, E16, E18 and postnatal day 0 (P0), P5, P7 and P14. Cells were sequenced using the 10X Chromium platform. The original ~60,000 sequenced cells include both neural and non-neural cells, and the neural cells span a large developmental spectrum from progenitors to fully differentiated neurons. To focus our trajectory analysis on the neural lineages, we filtered the dataset to remove non-neural cells. Eliminated cells included mesenchymal stem cells (expressing Prrx2), brainstem progenitors (expressing Olig3), endothelial cells (expressing Cldn5), blood cells (expressing hemoglobin genes), meninges (expressing Cxcl12), pericytes (expressing Rgs5) and microglia (expressing Aif1). This resulted in a subset of ~55,000 neural cells belonging to three main cerebellar lineages: GABAergic neurons, glutamatergic neurons and glia. All retained cells were then iteratively clustered as described above. Raw sequencing reads can be accessed in the Gene Expression Omnibus, accession number GSE118068.

Original reference: Vladoiu MC, El-Hamamy I, et al. "Childhood cerebellar tumours mirror conserved fetal transcriptional programs." Nature. 2019;572(7767):67-73. doi:https://doi.org/10.1038/s41586-019-1158-7.

Mouse cerebellum development time course scRNA-seq data (zip file) (8.69 GB) (note: this may need to be uncompressed on MacOS)

Mouse cerebellum development time course scRNA-seq data (original RData file) (10.86 GB)

File Format

All datasets are packaged as Seurat v2 objects (https://satijalab.org/seurat/) with the following slots:

raw.data: a sparse matrix containing raw sequencing reads of cells from all time points. Each column represents a cell and each row represents a gene.
data*: a matrix containing processed counts of cells (after filtering, normalization and batch effect correction) from all time points.
scale.data: a matrix containing scaled expression of all genes, required for downstream principal component analysis.
var.genes: a list of genes identified as variable genes across all cells.
ident: a vector containing the cluster identity of all cells at the chosen resolution (1.5 for HSMM and 0.6 for MouseCortex data)
meta.data*: a data frame containing metadata for all cells, including the number of genes expressed, the number of UMIs detected, the collection time point of each cell, and the cluster identity of cells at various clustering resolutions.
dr: a list of key results from various dimensionality reduction techniques applied to the dataset, including principal component analysis (accessible at dr$pca), tSNE (accessible at dr$tsne) and Harmony (accessible at dr$harmony).
hvg.info: a data frame containing information about the most highly variable genes across all cells.
calc.params: a list of parameters used in each analysis step to produce the final Seurat objects.

All slots can be accessed by object_name@slot_name. Slots required by Tempora are denoted with an asterisk.
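
To make the matrix orientation concrete (genes as rows, cells as columns), the following Python sketch shows how per-cell summaries such as the number of genes expressed and the total count per cell relate to a raw count matrix. It is an illustration only and is not part of Seurat or Tempora: the function name, the output keys (n_genes, total_counts) and the toy matrix are hypothetical and do not correspond to the actual column names in meta.data.

```python
def per_cell_summaries(matrix, cell_ids):
    """Summarize a genes-by-cells count matrix one column (cell) at a time.

    matrix: list of rows, one per gene; each row holds one count per cell.
    cell_ids: column labels, in the same order as the columns of matrix.
    """
    for row in matrix:
        if len(row) != len(cell_ids):
            raise ValueError("every gene row needs one count per cell")
    summaries = {}
    for j, cell_id in enumerate(cell_ids):
        column = [row[j] for row in matrix]  # all gene counts for this cell
        summaries[cell_id] = {
            "n_genes": sum(1 for c in column if c > 0),  # genes with non-zero count
            "total_counts": sum(column),                 # summed counts in the cell
        }
    return summaries


# Made-up toy matrix: 3 genes (rows) by 2 cells (columns).
cells = ["cell_1", "cell_2"]
raw = [
    [0, 3],
    [2, 0],
    [5, 1],
]
summary = per_cell_summaries(raw, cells)
print(summary)
```

Reading the matrix column by column is the key point: each cell's summary comes from one column, so transposing the matrix by mistake would produce per-gene rather than per-cell values.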