Tools For The Analysis of High-Dimensional Single-Cell RNA Sequencing Data
In a single organism, most cells have the same genome, but specific gene expression varies across different tissues and cell types. Any given tissue or cell type expresses ~11,000–13,000 genes, of which ~3,000–5,000 have a cell-type-specific expression pattern, whereas the remaining genes are ubiquitously expressed1. These unique patterns of gene expression translate to differences at the protein level between different cell types and result in the vast array of cellular phenotypes found throughout the body. Therefore, a snapshot of the gene expression profile of a cell can be indicative of its phenotype. Owing to the limited amount of RNA present in each cell, gene expression profiling was historically performed on pooled cells, but this bulk sequencing approach obscured the potential cell heterogeneity in a sample or tissue2. For example, in a pool of developing progenitor cells, different cells might be primed to make distinct fate decisions but these transcriptional programmes are indistinguishable in a bulk analysis of the average gene expression in the progenitor pool.
The development of technologies that can isolate thousands to tens of thousands of cells and assess their gene expression profiles at the single-cell level has enabled researchers to dissect this cellular heterogeneity and work towards a better understanding of physiology, biological development and disease2–6. For example, researchers generated an improved quantitative map of the cell types present in the developing human kidney, which has provided insights into renal physiology7. Another single-cell study demonstrated the similarities between fetal human kidney and human kidney organoids, reaffirming the utility of kidney organoids as a model for the study of disease and for drug screening8. However, deriving biological insights from single-cell RNA sequencing (scRNA-seq) methods demands that researchers handle the large volume of data generated by these technologies and their accompanying sources of technical noise9. Addressing the scale and complexity of these datasets thus requires a complex ecosystem of computational methods.
Beyond scRNA-seq analysis, other available technologies can profile genomes10, methylation patterns11 and chromatin accessibility patterns12,13 at the single-cell level. Each type of single-cell profiling comes with its own challenges in terms of data analysis. Additionally, the development of ‘multi-omics’ approaches, in which multiple types of biological molecules are profiled in the same cell, has advanced substantially in recent years.
Department of Bioengineering, University of California at San Diego, La Jolla, CA, USA.
✉e-mail: kzhang@bioeng.ucsd.edu
https://doi.org/10.1038/s41581-020-0262-0
www.nature.com/nrneph
Reviews
Box 1 | Dataset uploading and exploration
A crucial step in the analysis of single-cell RNA sequencing (scRNA-seq) data is the assessment of related available datasets, not only to enable a better understanding of existing work but also to assist in the formulation of novel hypotheses. Online data browsers are particularly useful for the investigation of existing scRNA-seq datasets. University of California Santa Cruz (UCSC), the Broad Institute and European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) have single-cell data browsers that enable users to interactively visualize the different cell types and marker genes present in these datasets. The convenient user interfaces of these browsers allow users to quickly explore the datasets without the need to download the data or previous programming experience. Additionally, uploading a dataset to these browsers increases the visibility of the study that generated the data and maximizes the potential for new insights from the dataset.
[…] when an artificial cluster of dead or dying cells becomes apparent in the downstream analysis, modifying these thresholds after running the entire analysis pipeline and repeating the analysis can also be helpful (Fig. 2b). Seurat29 and SCANPY30 are scRNA-seq analysis pipeline packages that include functions for computing QC metrics, such as the fraction of genes expressed per cell, mitochondrial fraction and total counts; users determine the thresholds with which to filter genes and cells in the dataset. Scater31 also offers a suite of tools for computing key QC metrics.
Data normalization
The fraction of RNA captured in each cell can vary owing to factors such as reverse transcription efficiency, primer capture efficiency and errors associated with collapsing UMIs2,32,33. Differences in the total amount of UMIs or reads in each cell might thus result from technical factors rather than biological variation (Fig. 2c). If not normalized, technical differences in total UMIs or reads can dominate the downstream analysis. For example, cells with similar amounts of total UMIs or reads cluster together instead of cells with similar gene expression patterns33. Normalization is therefore crucial to revealing the true biological heterogeneity of a dataset. Most normalization methods attempt to estimate the bias for each cell (also known as a size factor). The UMI or read counts of all cells can then be normalized by dividing those values by the size factor, enabling the comparison of gene expression levels across different cells. Total counts normalization is a simple normalization strategy, in which the size factors consist of the total number of UMIs or reads in each cell. However, total counts normalization can be dominated by highly expressed genes and results in biased size factor estimation when strong cell-type-specific gene expression exists, which can occur when very different cells or tissue types are present in the dataset33,34. Also, some cell types are larger and have more RNA molecules than others, a biological factor that is obscured when simply dividing the number of UMIs or reads by the total counts34.
The scran package pools cells with similar expression patterns before estimating size factors, therefore addressing normalization issues due to cell-type-specific gene expression or UMI counts34. However, scaling genes with high expression and low expression using the same size factor can lead to overcorrection of genes with low expression, such as transcription factors, and under-correction of genes with high expression, such as housekeeping genes35,36. SCnorm addresses this issue by pooling genes with similar dependencies on total UMI or read count and computing size factors within each pool35. sctransform (implemented in the Seurat package) uses a probabilistic model to compute the effect of total UMI or read count on each gene, which also enables it to stabilize gene variances (discussed later in more detail) and identify over-dispersed genes36.
Overall, some type of normalization is crucial for scRNA-seq analysis and, although total count normalization successfully mitigates technical bias, it can partially obscure true biological heterogeneity. Using specialized normalization methods, such as SCnorm and sctransform, can unlock additional heterogeneity in a dataset33.
Variance stabilization
Gene expression levels can vary enormously and the average expression (or magnitude) of a gene is strongly associated with its variance37, an effect known as the mean–variance relationship (Fig. 2d). Variance stabilization adjusts the data to remove the influence of gene expression magnitude on gene variance. This step ensures that downstream analyses are focused on the most biologically relevant genes (that is, the genes that are expressed in specific cell types in the dataset) rather than simply focusing on the genes with the highest expression. For example, variance stabilization might facilitate the separation of two subpopulations of a developmental progenitor, which might otherwise be merged, by enabling genes with low average levels of expression, such as transcription factors, to still contribute to the analysis. Despite having low overall expression levels within a cell, such genes might be important in uncovering the fate of that cell.
One simple variance-stabilizing approach is to log-transform normalized counts, which reduces the difference between genes with high and low expression38 (Fig. 2d). Pipelines that can be used to remove the effect of average gene expression on gene variance include Seurat, Pagoda2 (Fig. 2d) and SCANPY, which explicitly fit a mean–variance relationship and apply a scaling factor30,37,39,40. ZINB-Wave, single-cell variational inference (scVI) and deep count autoencoder (DCA) are alternative methods that use a different approach (negative binomial distribution) to model single-cell count data36,41–43.
Overall, although variance stabilization is not strictly necessary for scRNA-seq analysis, adjusting the dataset for the wide variation in average gene expression enhances the contribution of biologically relevant genes to downstream analyses. This approach removes the influence of genes, such as housekeeping genes, which are abundantly expressed but at similar levels in all cells […]
Total counts
The total number of reads or UMIs in a given cell.
Size factor
An estimate of how much variation in sequencing depth or RNA capture efficiency affects the overall quantification of gene expression in a cell.
Over-dispersed genes
Genes that show a greater than expected variance between cells given their average expression, which suggests that they are expressed in a cell-type-specific manner.
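The total counts normalization and log transformation described in the Data normalization and Variance stabilization sections can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by Seurat, SCANPY or any other package named in the text. The function names, the cells-by-genes orientation of the matrix and the choice of the median total count as the scaling target are assumptions made for the example.

```python
import numpy as np

def total_counts_normalize(counts, target_sum=None):
    """Divide each cell's counts by a size factor derived from its total counts.

    counts: array of shape (n_cells, n_genes) holding raw UMI or read counts
        (this orientation is an assumption of the sketch).
    target_sum: total that every cell is rescaled to; defaults to the median
        total count across cells (also an assumption of the sketch).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)  # total counts per cell
    if np.any(totals == 0):
        raise ValueError("filter out cells with zero total counts first")
    if target_sum is None:
        target_sum = np.median(totals)
    size_factors = totals / target_sum  # one size factor per cell
    normalized = counts / size_factors[:, None]
    return normalized, size_factors

def log_transform(normalized):
    """Simple variance-stabilizing step: log(1 + x) of the normalized counts."""
    return np.log1p(normalized)

# Toy matrix: 3 cells x 3 genes. Cell 1 has exactly twice the counts of cell 0.
counts = np.array([[10, 0, 30],
                   [20, 0, 60],
                   [5, 5, 10]])
normalized, size_factors = total_counts_normalize(counts)
log_norm = log_transform(normalized)
```

In this toy matrix, cells 0 and 1 become identical after normalization. This shows both the intended effect (differences in capture efficiency are removed) and the caveat raised in the text: a genuine biological difference in total RNA content would be removed in the same way.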
[Fig. 2 (image not recovered; panel labels only): a | Sequencing reads (aligned cDNA, cell barcode, UMI) and gene expression matrix (genes a–d by cells A–D). b | Mitochondrial fraction against number of UMIs, with dead cells marked. c | Cell A: raw read counts and normalized counts for Gene 1, Gene 2 and library size. d | log10[variance]; adjusted.]
This optional processing step involves selecting the genes with the highest residual variance after adjusting for the differences in average gene expression.
Batch effects and data integration
Joint analysis of multiple scRNA-seq datasets generated using different technologies, obtained from different patients or samples, or from different experiments, increases the total number of cells analysed. […] subtypes and the detection of rare cell phenotypes, and also enables direct comparisons of patients, samples or technologies. However, this type of analysis is often challenging owing to batch effects, in which technical differences in gene expression can mask relevant biological phenomena29,45–48. Intra-batch variation is typically due to differences between cell types and biologically relevant factors, whereas inter-batch variation might also result from technical factors. These batch effects can arise from variability in patients, samples or protocols (including operator-driven variation) that affect RNA capture efficiency or cell viability45. The strength of a batch effect depends on the type of dataset and can be difficult to predict before running the analysis.
a separate cluster after integration. Of note, if the knockout caused a uniform shift in gene expression across all podocytes rather than podocyte loss, that shift might be lost after integration as it would be indistinguishable from a batch effect.
Identifying MNNs can be difficult if the batch effects are stronger than the differences in gene expression between cell types. To overcome this challenge, canonical correlation analysis can be applied to focus the analysis on intra-batch variation and not on inter-batch variation29, even if the differences between batches are stronger than the differences between cell types40.
One caveat of these methods of data integration is that a compromise between reducing the size of the batch effect and resolving cell types might be required; this parameter can be explicitly tuned in methods such as clustering on network of samples (CONOS)47. For example, completely removing the batch effect from kidney cells collected from two different patients might reduce the ability to resolve podocyte subtypes. The extent of this compromise depends on the specific datasets being integrated and the strength of the batch effects.
Overall, the advent of MNN-based methods has enabled scRNA-seq users to analyse and compare samples across platforms, patients or samples, and even across species, improving the capacity of scRNA-seq to resolve cell types and trajectories29,40,46,51.
Regression model
A model that compares the relationship between two variables. In the context of single-cell RNA sequencing, regression can assess relationships between observed gene expression, and technical and/or biological factors.
Mutual nearest neighbours
(MNNs). Cells from different batches that belong to each other’s set of k-nearest neighbours (that is, cells with the most similar gene expression patterns).
Dimensionality reduction
Summarizing a large set of variables with a smaller set of variables, while retaining as much information as possible.
Embedding
The set of variables that remains after running some form of dimensional reduction.
Downstream analyses
Once the pre-processing steps are completed, downstream analysis steps, which include dimensionality reduction, clustering and trajectory inference, focus on identifying patterns in the data that provide biological insight. Dimensionality reduction involves transforming the dataset into a more compact, and possibly more interpretable, representation that captures the primary biological axes of variation and improves the performance of clustering and trajectory inference. Clustering refers to partitioning cells into groups based on similar patterns of gene expression; these groups (also known as clusters) usually correspond to distinct biological cell types or states52. Trajectory inference is usually applied to cells that are dynamically transitioning across a continuum of cellular states52,53.
Upstream analysis choices, such as QC filtering and normalization, can have a substantial impact on both clustering and trajectory inference. For example, data normalization is a critical step, otherwise clusters are almost entirely based on the number of UMIs or reads of the cell rather than on similarities in gene expression profiles. Dead or dying cells, as well as doublets, can also generate artificial clusters that might be difficult to distinguish from real clusters if not removed.
Dimensionality reduction and imputation
The dimensionality of a dataset refers to the number of variables being measured for each data point. In the context of scRNA-seq, each data point corresponds to a cell and the variables are the genes. scRNA-seq experiments are characterized as ‘high dimensional’ as they typically measure the expression of ~20,000 variables (genes). Even after selecting only a subset of highly variable and/or biologically relevant genes, users often still have a dataset with thousands of genes, many of which are highly correlated and provide redundant information, potentially masking more subtle biological patterns52,54,55. Additionally, the metrics used to measure similarity in gene expression patterns between cells become less reliable in a high-dimensional space, a phenomenon known as the ‘curse of dimensionality’52,54. Therefore, applying dimensionality reduction to scRNA-seq datasets can improve downstream analyses. The reduced dimensions are typically called an embedding of the dataset. Dimensionality reduction has the added benefit of
improving the speed of most downstream analyses. However, although it is extremely helpful for most datasets, dimensionality reduction is not strictly necessary for downstream analyses.
Dropout
The absence of a detectable gene or transcript in a cell.
Linear methods. A linear relationship between two variables exists when both variables change at the same rate (direct proportion). The most common dimensionality reduction method for scRNA-seq analysis is principal component analysis (PCA), which creates linear combinations of genes that best capture the variance in the data56 (Table 2). The ability of PCA to reduce the dimensionality of the data while finding the dimensions of highest variance makes it a very useful dimensionality reduction tool before clustering.
Only a relatively small fraction of the total RNA of a cell is captured and reverse transcribed in an scRNA-seq experiment. Consequently, no molecules are detected for many genes in most cells, resulting in a large number of zeros in the single-cell counts matrix, which is known as zero inflation3,4. Zero-inflated factor analysis (ZIFA) is a variation of PCA that is designed to explicitly model the expected high number of zero values in scRNA-seq count data57 (Table 2).
One downside of PCA is that the principal components themselves can be difficult to interpret biologically. Ideally, each dimension obtained after dimensionality reduction would correspond to a biological process. For example, for a developmental kidney dataset, each dimension would correspond to a developing kidney compartment (for example, the collecting duct or the tubule). The factorial single-cell latent variable model (f-scLVM) addresses this interpretability issue by explicitly modelling annotated gene sets as the reduced dimensions58 (Table 2). Therefore, after running f-scLVM, each reduced dimension corresponds to a pre-annotated gene set. Pagoda and Pagoda2 also create highly interpretable dimensions by running PCA within pre-annotated gene sets and selecting the dimensions that show significant variance in the dataset37,39 (Table 2). Non-negative matrix factorization (NMF) is another linear matrix factorization method that generates more interpretable dimensions by attempting to find discrete components (such as a collecting duct or tubule) that underlie the dataset59,60.
Non-linear methods. The relationship between genes can be highly non-linear, which affects the ability of linear models such as PCA to analyse scRNA-seq data42. Methods that can generate a non-linear transformation of the dataset can thus outperform linear methods in certain cases. Specifically, locally linear embedding (LLE) and diffusion maps (Dmaps) were shown to be effective when the dataset follows a continuous trajectory, such as with datasets from developmental time series61–63 (Table 2).
Another approach to non-linear dimensionality reduction is the use of deep neural networks, which are models that apply iterative, non-linear transformations to the dataset64. By layering these iterative transformations, deep neural networks can learn complex features of a dataset, which enables them to represent the data using fewer dimensions64. scScope and DCA use neural networks that can outperform linear dimensional reduction methods such as PCA43,65 (Table 2). scVI also uses neural networks to create a framework for modelling gene expression in a way that enables the quantification of uncertainty for each gene expression estimate, while accounting for technical effects such as batch effects and zero inflation42 (Table 2).
For users who are interested in simply reducing the dimensionality of the data and proceeding to clustering and visualization, PCA is a good default approach, but more specialized methods such as f-scLVM or scVI can generate low-dimensional embeddings that are either more interpretable or capture the non-linear structure of the data more faithfully43.
Zero inflation and imputation. Zero inflation is a technical limitation of more recent high-throughput scRNA-seq methods and is driven by several factors, including incomplete reverse transcription or RNA capture. Total efficiency calculations estimate that only 10–15% of the total RNA in a cell is captured and transcribed3–5. Of note, some researchers argue that the zero inflation for droplet-based methods is mostly due to biological variance and not due to technical noise66. However, the newer generation of combinatorial indexing methods tends to capture even fewer molecules per cell than droplet-based methods and technical zero inflation might thus be present in those datasets5,6. Several methods have been developed to impute these missing values (that is, to replace the zeros in the counts matrix with estimated values). One class of methods, including MAGIC and kNN-smoothing, uses information from neighbouring cells to impute missing values for any given cell67,68. Another class of methods, such as single-cell analysis via expression recovery, clustering through imputation and dimensionality reduction (CIDR) and scImpute, uses probabilistic models and relationships between genes to distinguish technical from biological dropout69–71. However, these imputation methods should be used with care as they can introduce false-positive results when analysing differential gene expression72.
Fig. 3 | Integration of single-cell RNA sequencing data. a | The first step of data integration based on mutual nearest-neighbour data involves the identification of matching cell types across datasets (Dataset 1 and Dataset 2). b | These matching cell types can then be grouped together and integrated into one dataset (Integrated dataset), while preserving any biologically relevant cell types that are unique to different datasets.
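The matching step summarized in Fig. 3a can be illustrated with a small sketch that finds mutual nearest neighbours (MNNs) between two batches, following the glossary definition: cells from different batches that belong to each other's set of k-nearest neighbours. This is a simplified illustration, not the procedure of any published integration package. The function names, the use of Euclidean distance and the toy two-dimensional coordinates (standing in for gene expression profiles) are assumptions made for the example.

```python
import numpy as np

def k_nearest(query, reference, k):
    """For each query cell, return the indices of its k nearest reference cells.

    Euclidean distance is an assumption of this sketch.
    """
    # Squared distances, shape (n_query, n_reference).
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]

def mutual_nearest_neighbours(batch1, batch2, k=1):
    """Return (i, j) pairs where cell i of batch1 and cell j of batch2
    are in each other's k-nearest-neighbour sets."""
    nn_1_to_2 = k_nearest(batch1, batch2, k)  # neighbours in batch2 of each batch1 cell
    nn_2_to_1 = k_nearest(batch2, batch1, k)  # neighbours in batch1 of each batch2 cell
    pairs = []
    for i, neighbours in enumerate(nn_1_to_2):
        for j in neighbours:
            if i in nn_2_to_1[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy data: two shared "cell types" near (0, 0) and (5, 5),
# plus one cell near (10, 10) that exists only in batch2.
batch1 = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
batch2 = np.array([[0.2, 0.1], [5.1, 5.2], [10.0, 10.0]])
pairs = mutual_nearest_neighbours(batch1, batch2, k=1)
```

In this toy example, the batch2 cell near (10, 10) has a nearest neighbour in batch1, but that relationship is not mutual, so it forms no MNN pair. This mirrors the idea in Fig. 3b that cell types unique to one dataset are preserved rather than forced to match.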
Table 2 (excerpt)
[…] counts matrix
f-scLVM | Uses latent variable modelling and gene sets to generate lower dimensions | https://github.com/bioFAM/slalom | 58
scScope | Uses a recurrent neural network to remove technical noise and then encode the dataset into lower dimensions | https://github.com/AltschulerWu-Lab/scScope | 65
scVI | Uses probabilistic modelling with deep neural networks to […] | https://github.com/YosefLab/scVI | 42
Therefore, users should be cautious when analysing differences in genes with low levels of expression and high levels of dropout.
Clustering
Generally, most scRNA-seq datasets either comprise discrete cell types or reflect a continuous trajectory of development or differentiation. For datasets in which individual cells can be grouped into discrete cell types, clustering needs to be applied to resolve those cell types. Each cluster generally expresses a set of genes (marker genes) that are not expressed in cells from other clusters (Fig. 4). k-means clustering is a simple and popular clustering method that iteratively assigns cells to clusters73. However, k-means clustering requires users to pre-specify the number of cell clusters present in the dataset and determining the number of biologically relevant clusters in an scRNA-seq dataset remains a challenge52,73. One strategy for dealing with this problem is to generate more clusters than those expected to be found in the dataset and then iteratively either merge neighbouring clusters or divide larger clusters based on a similarity threshold, such as the number of differentially expressed genes between clusters. CIDR, BackSPIN and pcaReduce use this hierarchical clustering approach70,74,75. Users can then select the groupings that best match the required level of cluster granularity. Hierarchical analysis with multiple stages of clustering might be necessary for extremely large datasets (>100,000 cells) with many different cell types. This approach, used for example in a study of the mouse nervous system76, requires an initial broad identification of well-defined cell types, which are then sub-clustered to further resolve their heterogeneity.
Both k-means and hierarchical clustering methods are slow to run for large datasets and are limited in the types of clusters they can detect77. Seurat, Pagoda2, SCANPY and CellRanger use graph-based clustering algorithms, which tend to run quickly and generate biologically relevant clusters for larger datasets4,29,30,39. Graph clustering requires building a graph by connecting each cell to its nearest neighbours. The Louvain clustering algorithm, for example, can be applied to cells that have been connected in a graph. Starting with each single cell as its own cluster, the algorithm iteratively merges clusters as long as the merging increases the modularity of the graph (the higher the modularity, the lower the likelihood that cells were connected in the network by random chance)78. However, the Louvain method can sometimes generate erroneous clusters composed of cells that are not well connected79. Leiden clustering improves on Louvain clustering by guaranteeing well-connected clusters and improving runtime79.
Other approaches include SC3 Consensus Clustering, which uses the consensus of multiple clustering methods to improve clustering accuracy80. Reference component analysis projects single cells onto a low-dimensional space defined by existing bulk RNA-seq datasets, which can be very useful for cell populations that are highly heterogeneous and difficult to interpret such as those found in cancer81. Overall, graph clustering methods such as Leiden or Louvain have a strong clustering performance with fairly fast running times.
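The modularity score that Louvain-style algorithms try to increase can be made concrete with a short sketch. The text does not give a formula, so the example assumes the standard definition for an undirected, unweighted graph, Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j), where A is the adjacency matrix, k_i is the degree of cell i, m is the number of edges and c_i is the cluster label of cell i. The function name and the toy graph are hypothetical. Implementations in the packages named above may add edge weights or a resolution parameter.

```python
import numpy as np

def modularity(adjacency, labels):
    """Modularity of a partition of an undirected graph (standard definition,
    assumed here because the source text does not state a formula)."""
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)                       # k_i for every cell
    two_m = degrees.sum()                         # twice the number of edges
    expected = np.outer(degrees, degrees) / two_m # edges expected by chance
    same_cluster = labels[:, None] == labels[None, :]
    return ((A - expected) * same_cluster).sum() / two_m

# Toy nearest-neighbour graph: two triangles of cells with no edges between them.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

q_two_clusters = modularity(A, [0, 0, 0, 1, 1, 1])  # one cluster per triangle
q_one_cluster = modularity(A, [0, 0, 0, 0, 0, 0])   # everything merged
```

Here the partition that follows the two triangles scores higher than the partition that merges all cells. A greedy procedure of the kind described above would therefore keep the two groups separate rather than merging them.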
Fig. 4 | Cell clustering in datasets with discrete cell types. An important objective of single-cell RNA sequencing analysis is to resolve the cellular heterogeneity of a dataset by identifying the different subpopulations present. Cell clustering identifies and groups cells from a heterogeneous dataset into clusters, according to similarities in their patterns of gene expression, as illustrated in the heatmap (genes by cells, with clusters a, b and c). These cell clusters usually correspond to different cell types present in a dataset.
Progenitor Gene 1
cells local structure of datasets (Fig. 6a), including separating
closely related cell types, while performing vastly better
Expression
in terms of visualizing the global properties of the data
Branching (Fig. 6b). Thus, UMAP is a very useful default visuali
point
zation option for most users89,90. Additional testing of
Gene 2 UMAP and t-SNE has suggested that the way in which
these methods are initialized is very important to their
overall performance93,94. In fact, t-SNE and UMAP seem
to perform equally well in terms of preserving global
Branch A Branch B Developmental pseudotime
structure when initialized using PCA93,94.
Fig. 5 | Modelling continuous cellular states. In datasets that are characterized by the Similarity-weighted non-negative embedding
presence of a continuum of cell states, the objective of the single-cell RNA sequencing (SWNE) uses NMF (Table 2) to reduce the dimen
analysis is to model the cellular trajectory. This process includes computing the develop- sionality of the data and then uses the dimensions as a
mental pseudotime (that is, the approximation of how far a cell has progressed along framework with which to project the cells in two dimen
a developmental or differentiation pathway), the branch identities (for example, distal sions, adjusting the relative positions of the cells using a
and proximal tubule branches for cells in these distinct developmental pathways) and weighted nearest-neighbours graph87. This framework
the location of the branching point (for example, the point where nephron progenitors also enables genes to be visualized alongside the cells,
split into cells that will either develop as distal or proximal tubule cells). a | Illustration of adding biological context and interpretability to the
a dataset with a continuous developmental trajectory that branches off into two lineages. visualizations87 (Table 3). SWNE performs better than
The arrow represents the direction of developmental pseudotime. b | Example of a typical
t-SNE and is similar to UMAP in terms of capturing
pattern of gene expression in a dataset with a continuous trajectory.
global structure, although its representation of local
structure is inferior to both t-SNE and UMAP87.
Although this step is conceptually identical to dimen Potential of heat diffusion for affinity-based transi
sionality reduction, visually separating closely related tion embedding (PHATE) uses a diffusion-based dis
cell types (maintaining the local structure of the data) tance metric that is accurate for both local and global
in just two or three dimensions, while also ensuring that structure 95. PHATE first computes local distances
the relative distances between cell types and trajectories between neighbouring cells and then propagates those
reflects the magnitude of the gene expression differences distances (in a manner similar to that of Dmaps) to com
between those cell types (maintaining the global struc pute global distances between all cells. PHATE seems
ture of the data), is a complex task. Many linear dimen to perform very well for datasets with developmental
sional reduction methods, such as PCA, are unable to trajectories, outperforming both t-SNE and UMAP in
generate accurate visual representations of the data in capturing global and local structure.
two or three dimensions87,88. Thus, visualization methods Deep learning methods can also capture the struc
tend to transform the data using a non-linear method, ture of high-dimensional data in a 2D embedding owing
which can distort the structure of the data if not used to their ability to capture non-linearity in the data96.
correctly87,89–91. scvis uses a deep neural network to condense high-
t-stochastic neighbour embedding (t-SNE) is one dimensional data into a low-dimensional embedding,
of the most popular visualization methods and uses which results in better cell-type separation (the ability
pairwise similarities of cells to embed them in a low-dimensional space, ensuring that cells with similar gene expression profiles are close in the embedding88,92. t-SNE thus prioritizes the local structure of the data, essentially ensuring that neighbouring cells remain together in the 2D visualization88,92 (Table 3). This feature enables t-SNE to visually separate complex datasets with closely related cell types. However, t-SNE, as traditionally implemented, does not visualize global properties of the datasets, such as relative distances between cell types, as effectively88,92 (Fig. 6). t-SNE is currently implemented in Seurat, Pagoda2, SCANPY and CellRanger–Loupe Cell Browser.

In the past few years, uniform manifold approximation and projection (UMAP) has overtaken t-SNE as the default visualization method for scRNA-seq data89,90. Similar to graph clustering, UMAP generates a nearest-neighbours graph of the cells, weighting each cell–cell connection by the strength of similarity; the graph is then embedded in two dimensions89.

Classification
A machine learning task in which an algorithm learns the relevant features that distinguish the different classes of a training dataset to predict the classes of an unknown test dataset.

to capture local structure) than t-SNE (as measured by classification accuracy), as well as faster runtimes96 (Table 3). Other deep learning-based methods such as scScope, DCA and scVI can also be used to encode high-dimensional data in two dimensions42,43,65,96.

Overall, visualization is critical for understanding and communicating the properties of a dataset. One common misconception is that clustering and visualization are identical analyses. Although clusters can be created based on UMAP or t-SNE coordinates, using more dimensions with a generalized method such as PCA to create cell clusters is typically more useful, because all the structure and nuances of a dataset cannot be accurately compressed into two or three dimensions. In fact, a benchmarking study found that dimensionality reduction methods that work well for clustering often do not work as well for visualization55. However, for trajectory inference, methods that are used for visualization, such as UMAP, Dmaps and LLE, generally work well as a basis for building trajectory graphs63,84.
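The graph-construction step described for UMAP above (a nearest-neighbours graph whose cell–cell connections are weighted by similarity) can be illustrated with a minimal sketch. This is not UMAP's actual weighting scheme: the Gaussian kernel, the scaling by the distance to the k-th neighbour, the function names `weighted_knn_graph` and `symmetrize`, and the toy coordinates are all assumptions chosen for illustration, and the embedding step is omitted.

```python
import math


def weighted_knn_graph(cells, k):
    """Link each cell to its k nearest cells; weights shrink with distance.

    `cells` is a list of coordinate tuples (for example, a few principal
    components per cell). The kernel is an illustrative assumption.
    """
    graph = {}
    for i, ci in enumerate(cells):
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(cells) if j != i
        )
        nearest = dists[:k]
        # Scale by the distance to the k-th neighbour (fall back to 1.0
        # if all k neighbours are identical to the cell itself).
        scale = nearest[-1][0] or 1.0
        graph[i] = {j: math.exp(-((d / scale) ** 2)) for d, j in nearest}
    return graph


def symmetrize(graph):
    """Make the graph undirected, keeping the larger weight for each pair."""
    sym = {i: dict(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j, w in nbrs.items():
            sym[j][i] = max(w, sym[j].get(i, 0.0))
            sym[i][j] = sym[j][i]
    return sym


# Toy data: two well-separated groups of three cells each (hypothetical).
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = symmetrize(weighted_knn_graph(cells, k=2))
for i, nbrs in graph.items():
    print(i, {j: round(w, 3) for j, w in sorted(nbrs.items())})
```

In this toy example no edge connects the two groups, which is the property that graph-based clustering and graph-based embeddings both exploit. Real implementations typically use approximate nearest-neighbour search to scale to large datasets.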
www.nature.com/nrneph
Reviews
As a starting point, UMAP is a very useful default method that faithfully visualizes most datasets and requires less parameter adjusting than t-SNE or SWNE to work well. However, users still need to take care not to over-interpret visualizations, as all methods result in a certain degree of data distortion. Additionally, more research is needed on how different initializations of these non-linear methods can affect their overall performance.

Cell-type annotation
Often, the most time-consuming step of scRNA-seq analysis is the identification of the biological cell types present in the dataset. The standard protocol for this cell-type annotation is to find the genes that are uniquely expressed in each cluster and match those genes to lists of canonical cell-type markers52,97. Tools for marker gene discovery and visualization are included in Seurat29, Pagoda2 (ref. 39), SCANPY30 and the Loupe Cell Browser4. An evaluation of marker gene discovery methods found that most methods developed for bulk RNA-seq, such as edgeR and limma, perform just as well as scRNA-seq-specific methods98. Nonetheless, the Wilcoxon method, which is the default method for both Seurat and Pagoda2, performed relatively well98.

Interpreting the output of these marker gene discovery methods can be challenging for new users. For single-cell methods such as the Wilcoxon test, the P values are often extremely low as the test treats each cell as an independent replicate. In these cases, the log fold change of gene expression can be a helpful metric, as it is indicative of the magnitude of the difference in gene expression. When an experiment contains multiple biological or technical replicates, one useful approach is to create a pseudo-bulk counts matrix after clustering by summing or averaging the counts from cells in a single replicate and a single cluster99. Bulk approaches such as edgeR or limma can then be used to assess differential gene expression.

Manually examining lists of marker genes can be extremely time-consuming and requires knowledge of the biological system being studied. Close collaboration between biologists and bioinformaticians can thus be extremely helpful at this stage. One class of methods that can help to accelerate this process uses enrichment of marker genes in functional pathways and gene ontology terms, which can greatly enhance interpretability100. For example, a cell-type cluster with marker genes that are highly enriched for the gene ontology term ‘nephron epithelium development’ is likely to contain cells related to the nephron epithelium.

A second class of methods matches individual cells or clusters to either a single-cell or bulk reference RNA-seq dataset for automated cell-type classification. A benchmarking analysis of these automated classification methods97 found that the best-performing method was the support-vector machine, a common type of machine learning classifier101. The analysis also found that methods that use previously known sets of canonical marker genes, such as Garnett102, do not outperform unbiased methods97. Other automated cell-type annotation methods include scmap, which classifies scRNA-seq clusters using correlations with reference datasets and a feature selection approach based on machine learning103, and scPred, which uses a combination of dimensionality reduction and classification104.

Data integration methods such as Seurat, CONOS and Scanorama also offer automated cell-type classification methods40,47,51. These methods find the MNNs across datasets, which enables them to classify cell types in a dataset without pre-set cell-type labels, based on the labels of a reference dataset. For example, if a cell of unknown type has ten MNNs in the reference dataset, and nine of them are podocytes, the unknown cell is most likely to be a podocyte too.

Although automated cell-type annotation methods can be convenient, they require existing reference scRNA-seq datasets. If a dataset contains novel cell types or cell states, manual annotation with marker genes is still necessary. Of note, even with a reference dataset, manual inspection of marker genes is critical to validating the identified cell types. Nonetheless, as single-cell atlases such as the human cell atlas and other reference catalogues of single-cell gene expression become more widely available, the use of automated cell-type classification will become more widespread105.
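The MNN-based label transfer described in this section (an unknown cell takes the label held by most of its MNNs in the reference) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in Seurat, CONOS or Scanorama: Euclidean distance on toy two-dimensional coordinates, a simple majority vote, the function names, and the 'mesangial cell' and 'proximal tubule' labels are all hypothetical.

```python
import math
from collections import Counter


def k_nearest(points, targets, k):
    """For each point, return the set of indices of its k nearest targets."""
    return [
        set(sorted(range(len(targets)),
                   key=lambda j: math.dist(p, targets[j]))[:k])
        for p in points
    ]


def mutual_nearest_neighbours(query, reference, k):
    """For each query cell, list the reference cells that are its MNNs.

    A pair is mutual when each cell is among the other's k nearest
    neighbours in the opposite dataset.
    """
    q_to_r = k_nearest(query, reference, k)
    r_to_q = k_nearest(reference, query, k)
    return [
        sorted(j for j in q_to_r[i] if i in r_to_q[j])
        for i in range(len(query))
    ]


def majority_vote(labels):
    """Return (most common label, fraction of votes), or (None, 0.0)."""
    if not labels:
        return None, 0.0
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def transfer_labels(query, reference, ref_labels, k=2):
    """Assign each query cell the majority label among its MNNs."""
    return [
        majority_vote([ref_labels[j] for j in mnn])
        for mnn in mutual_nearest_neighbours(query, reference, k)
    ]


# The worked example from the text: ten MNNs, nine of them podocytes.
print(majority_vote(["podocyte"] * 9 + ["mesangial cell"]))  # ('podocyte', 0.9)

# Toy reference and query datasets (hypothetical coordinates).
reference = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
             (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
ref_labels = ["podocyte"] * 3 + ["proximal tubule"] * 3
query = [(0.1, 0.1), (0.0, 0.2), (5.1, 5.0), (5.0, 5.1)]
print(transfer_labels(query, reference, ref_labels, k=2))
```

Returning the vote fraction alongside the label makes low-confidence calls easy to flag for the manual inspection of marker genes that the text recommends. A query cell with no MNNs is left unlabelled, which is one simple way to leave room for novel cell types.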
NMF, non-negative matrix factorization; PHATE, potential of heat diffusion for affinity-based transition embedding; SWNE,
similarity weighted non-negative embedding; t-SNE, t-stochastic neighbour embedding; UMAP, uniform manifold approximation
and projection.
[Figure 6 panels: a, Local structure (neighbourhood distances); b, Global structure (cell type and cluster distances). In-figure labels: Cell neighbourhood; Cluster distances.]

Fig. 6 | Local and global structure in a dataset. a | Preserving the local structure of a dataset ensures that the neighbouring cells of each cell remain together in the visualization, rather than preserving the original gene expression space. The distance between cells in the final visualization is therefore representative of the degree of similarity between their gene expression patterns. b | Preserving the global structure of a dataset ensures that large-scale distances, such as the distances between cell types, are maintained.

Experimental design considerations
Experimental design can have a substantial impact on analysis. For example, if multiple biological samples are to be collected and analysed, cells from each sample should ideally be tagged to allow multiplexing, using methods such as cell hashing, and then analysed on the same scRNA-seq run106. For example, in an analysis of kidney samples from five different patients across three scRNA-seq runs, each run would ideally contain tagged cells from each patient. This approach enables the distinction between sample-specific effects and experimental batch effects, which is especially crucial if the samples are from a case–control study32. For example, when comparing gene knockout mice with wild-type controls, cells from both types of mice are ideally run on the same experiment. Combinatorial indexing methods facilitate this approach as cells from different samples can be positioned in different wells during the first round of barcoding5,6. For droplet-based methods, some form of sample-specific cell tag is necessary to identify the sample source of a cell106. However, from a logistics standpoint, gathering all samples for processing in the same experimental batch is not always possible, especially for animal experiments across various conditions and/or timepoints, or for patient samples that are collected during clinical procedures.

The choice of scRNA-seq methodology also has an effect on the number of molecules captured per cell and the total number of cells analysed. In general, the combinatorial indexing methods capture fewer UMIs per cell than the droplet-based methods, which can affect their ability to resolve some closely related cell subtypes4–6. However, combinatorial indexing methods can capture far more cells per experiment, potentially enabling the identification of rare cell populations4–6. For all of these methods, the user can generally control the number of cells that are loaded onto the scRNA-seq platform. Loading more cells enables greater throughput at the cost of a potential increase in cell doublets4.

The choice of tissue dissociation methodology can also have a substantial impact on the types of cells available for analysis107. One key choice is whether to dissociate the sample into single cells or single nuclei. Dissociation of single cells has been widely applied to fresh tissue samples107. For frozen tissues, single-nucleus isolation and sequencing is a more viable option107–109. Both types of protocol seem to have their own specific biases, although for some sample types, such as human neurons, only single-nuclei dissociation has been shown to work well107–109. One limitation of single-nuclei methods is that they generally result in fewer molecules captured per cell, as most RNA is in the cytoplasm. However, information captured from nuclei alone can often be sufficient for accurate classification of cell types and subtypes7.

Cell hashing
A technique that attaches unique molecular barcodes to multiple batches of samples for pooling and processing in one batch, which not only improves the experimental throughput but also reduces technical batch differences.

Conclusions
Technical advances in scRNA-seq technologies have led to the generation of datasets of increasing scale and complexity. In response, an ecosystem of computational methods has been developed to deal with the challenges involved in analysing these datasets. Methods based on the identification of MNNs successfully integrate datasets across patients, conditions and technologies, addressing the crucial issue of batch effects in scRNA-seq data. Additionally, a number of methods have been developed to model cellular trajectories and identify cell clusters. However, one remaining limitation is that most clustering methods require users to specify the number of clusters and finding the optimal number of clusters for a given dataset is challenging. A second limitation is that manually annotating cell types using marker genes can be extremely time-consuming. Fortunately, new automated and semi-automated cell-type classification methods are being developed to address this issue, although novel cell types and states will still need to be manually annotated.

The ability to integrate datasets across samples, along with the increased throughput of the latest scRNA-seq methods, will increase our ability to resolve cell subtypes and discover rare cell types. Additionally, many newer methods, especially those used for low-level data pre-processing, take into account memory and central processing unit usage, which is critical, as the size of single-cell datasets continues to increase. Further development of these computational methods will help researchers to unlock additional biological insights. Despite these advances in computational methodology, validation of any computational findings by testing multiple biological replicates or conducting additional experiments, such as immunostaining or RNA-FISH, is still required.

The advent of multi-omics approaches will require a new set of tools that can link data on different cellular parameters, such as protein expression or epigenetic data, to provide additional biological insight. For example, analysing the relationship between gene expression and enhancer and/or promoter accessibility might delineate cell-type-specific maps of gene regulation, maximizing the utility of scRNA-seq datasets.

Published online xx xx xxxx
1. Ramsköld, D., Wang, E. T., Burge, C. B. & Sandberg, R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598 (2009).
2. Potter, S. S. Single-cell RNA sequencing for the study of development, physiology and disease. Nat. Rev. Nephrol. 14, 479–492 (2018).
3. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
4. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
5. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 12, eaam8999 (2018).
6. Cao, J. et al. Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing. Science 357, 661–667 (2017).
7. Lake, B. B. et al. A single-nucleus RNA-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys. Nat. Commun. 10, 2832 (2019).
8. Combes, A. N., Zappia, L., Er, P. X., Oshlack, A. & Little, M. H. Single-cell analysis reveals congruence between kidney organoids and human fetal kidney. Genome Med. 11, 3 (2019).
9. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).
10. Chen, C. et al. Single-cell whole-genome analyses by linear amplification via transposon insertion (LIANTI). Science 356, 189–194 (2017).
11. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).
12. Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
13. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
14. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
15. Linker, S. M. et al. Combined single-cell profiling of expression and DNA methylation reveals splicing regulation and heterogeneity. Genome Biol. 20, 30 (2019).
16. Gu, C., Liu, S., Wu, Q., Zhang, L. & Guo, F. Integrative single-cell analysis of transcriptome, DNA methylome and chromatin accessibility in mouse oocytes. Cell Res. 29, 110–123 (2019).
17. Amezquita, R. A. et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods 17, 137–145 (2020).
    A useful stepwise practical tutorial on how to perform scRNA-seq analysis in the R programming language using the Bioconductor suite of tools.
18. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
19. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
    This tutorial discusses scRNA-seq analysis steps using the latest methods developed for each step.
20. Petukhov, V. et al. Accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biol. 19, 78 (2018).
21. Melsted, P. et al. Modular and efficient pre-processing of single-cell RNA-seq. Preprint at https://doi.org/10.1101/673285 (2019).
22. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
23. Smith, T. & Sudbery, I. UMI-tools: modelling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res. 27, 491–499 (2017).
24. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
25. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
26. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
27. van den Brink, S. et al. Single-cell sequencing reveals dissociation-induced gene expression in tissue subpopulations. Nat. Methods 14, 935–936 (2017).
28. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 8, 329–337.e4 (2019).
29. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
30. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
31. McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
32. Wagner, A., Regev, A. & Yosef, N. Uncovering the vectors of cellular states with single cell genomics. Nat. Publ. Gr. 34, 1–53 (2016).
33. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
34. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
35. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
36. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
37. Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).
38. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
39. Barkas, N. et al. pagoda2: a package for analyzing and interactively exploring large single-cell RNA-seq datasets. GitHub https://github.com/hms-dbmi/pagoda2 (2018).
40. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
41. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
42. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
43. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. DCA: single cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
44. Yip, S. H., Sham, P. C. & Wang, J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief. Bioinform. 20, 1583–1589 (2018).
    A benchmark analysis of methods available for selecting over-dispersed genes.
45. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
46. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
47. Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
48. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
    A benchmark study of methods available for batch correction during analysis of scRNA-seq data.
49. Leek, J. T. Svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
50. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
51. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
52. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
53. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. Nat. Biotechnol. 37, 547–554 (2019).
    A benchmark analysis of methods for single-cell trajectory inference.
54. Bellman, R. On the theory of dynamic programming. Proc. Natl Acad. Sci. USA 38, 716–719 (1952).
55. Sun, S., Zhu, J., Ma, Y. & Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20, 269 (2019).
    A benchmark study of methods used for dimensionality reduction of scRNA-seq data.
56. Abdi, H. & Williams, L. J. Principal component analysis. Chemom. Intell. Lab. Syst. 2, 433–459 (2010).
57. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
58. Buettner, F., Pratanwanich, N., McCarthy, D. J., Marioni, J. C. & Stegle, O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 18, 212 (2017).
59. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
60. Lin, X. & Boutros, P. C. Optimization and expansion of non-negative matrix factorization. BMC Bioinformatics 21, 7 (2020).
61. Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
62. Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2015).
63. Welch, J. D., Hartemink, A. J. & Prins, J. F. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 17, 106 (2016).
64. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
65. Deng, Y., Bao, F., Dai, Q., Wu, L. F. & Altschuler, S. J. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat. Methods 16, 311–314 (2019).
66. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
67. Wagner, F., Yan, Y. & Yanai, I. K-nearest neighbor smoothing for single-cell RNA-seq data. Preprint at https://doi.org/10.1101/217737 (2017).
68. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729.e27 (2018).
69. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
70. Lin, P., Troup, M. & Ho, J. W. K. CIDR: ultrafast and accurate clustering through imputation for single cell RNA-seq data. Genome Biol. 18, 59 (2017).
71. Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
72. Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Res. 7, 1740 (2019).
73. Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982).
74. Žurauskiene, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
75. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
76. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014.e22 (2018).
77. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).
    A benchmark analysis of methods available for clustering in scRNA-seq data analysis.
78. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
79. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
80. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
81. Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49, 708–718 (2017).
82. Combes, A. N. et al. Single cell analysis of the developing mouse kidney provides deeper insight into marker gene expression and ligand-receptor crosstalk. Development 146, dev178673 (2019).
83. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
84. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
85. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
86. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
87. Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666.e4 (2018).
88. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
89. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
90. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).
91. Wattenberg, M., Viegas, F. & Johnson, I. How to use t-SNE effectively. Distill https://doi.org/10.23915/distill.00002 (2016).
92. van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
93. Kobak, D. & Linderman, G. C. UMAP does not preserve global structure any better than t-SNE when using the same initialization. Preprint at https://doi.org/10.1101/2019.12.19.877522 (2019).
94. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
95. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
96. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
97. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA-sequencing data. Genome Biol. 20, 194 (2019).
    A benchmark study of methods available for automated cell-type classification in scRNA-seq data.
98. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
99. Lun, A. T. L. & Marioni, J. C. Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics 18, 451–464 (2017).
100. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
101. Suykens, J. A. K. & Vandewalle, J. Indefinite kernels in least squares support vector machines and principal component analysis. Neural Process. Lett. 43, 162–172 (2017).
102. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
103. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2017).
104. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. ScPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).
105. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
106. Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018).
107. Denisenko, E. et al. Systematic bias assessment in solid tissue 10x scRNA-seq workflows. Preprint at https://doi.org/10.1101/832444 (2019).
108. Lake, B. et al. Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain. Science 352, 1586–1590 (2016).
109. Krishnaswami, S. R. et al. Using single nuclei for RNA-seq to capture the transcriptome of postmortem neurons. Nat. Protoc. 11, 499–524 (2016).

Acknowledgements
The authors were supported by NIH grants U01MH098977, R01HL123755, U54HL145608, UH3DK114933 and R01HG009285.

Author contributions
All authors researched data for the article, wrote the manuscript, made substantial contributions to discussions of the content and reviewed or edited the manuscript before submission.

Competing interests
Y.W. declares no competing interests. K.Z. is a co-founder, equity holder, scientific advisory board member and paid consultant of Singlera Genomics, which has no commercial interests related to this article.

Peer review information
Nature Reviews Nephrology thanks B. J. Aronow and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links
Broad Institute online single-cell data browser: https://portals.broadinstitute.org/single_cell
EMBL-EBI online single-cell data browser: https://www.ebi.ac.uk/gxa/sc/home
UCSC online single-cell data browser: https://cells.ucsc.edu/

© Springer Nature Limited 2020