REVIEWS

Tools for the analysis of high-dimensional single-cell RNA sequencing data
Yan Wu and Kun Zhang ✉

Abstract | Breakthroughs in the development of high-throughput technologies for profiling transcriptomes at the single-cell level have helped biologists to understand the heterogeneity of cell populations, disease states and developmental lineages. However, these single-cell RNA sequencing (scRNA-seq) technologies generate an extraordinary amount of data, which creates analysis and interpretation challenges. Additionally, scRNA-seq datasets often contain technical sources of noise owing to incomplete RNA capture, PCR amplification biases and/or batch effects specific to the patient or sample. If not addressed, this technical noise can bias the analysis and interpretation of the data. In response to these challenges, a suite of computational tools has been developed to process, analyse and visualize scRNA-seq datasets. Although the specific steps of any given scRNA-seq analysis might differ depending on the biological questions being asked, a core workflow is used in most analyses. Typically, raw sequencing reads are processed into a gene expression matrix that is then normalized and scaled to remove technical noise. Next, cells are grouped according to similarities in their patterns of gene expression, which can be summarized in two or three dimensions for visualization on a scatterplot. These data can then be further analysed to provide an in-depth view of the cell types or developmental trajectories in the sample of interest.
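
The core workflow outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from any of the tools reviewed here: the toy counts matrix, the scaling target of 10,000 counts per cell and the function name `core_workflow_sketch` are assumptions made purely for illustration, and the clustering step is omitted for brevity.

```python
import numpy as np

def core_workflow_sketch(counts, target_sum=1e4, n_dims=2):
    """Normalize, log-transform and project a genes x cells counts matrix.

    counts: integer array, rows = genes, columns = cells.
    Returns an array of shape (n_cells, n_dims) suitable for a scatterplot.
    """
    counts = np.asarray(counts, dtype=float)
    # Total counts normalization: divide each cell by its total, then rescale.
    totals = counts.sum(axis=0, keepdims=True)
    normalized = counts / totals * target_sum
    # Log-transform to dampen the influence of highly expressed genes.
    logged = np.log1p(normalized)
    # Centre each gene, then use SVD (PCA) to summarize cells in n_dims dimensions.
    centred = logged - logged.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    return (vt[:n_dims] * s[:n_dims, None]).T

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 genes x 30 cells, two groups of 15 cells that
# each express a different set of 10 genes at a high level.
rates = np.ones((50, 30))
rates[:10, :15] = 20.0
rates[10:20, 15:] = 20.0
toy_counts = rng.poisson(rates)
embedding = core_workflow_sketch(toy_counts)
print(embedding.shape)  # (30, 2)
```

In this toy example the two groups of cells are expected to separate along the first dimension; real datasets would typically pass through the QC, variance stabilization and batch correction steps discussed in the Review before any such projection.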

Department of Bioengineering, University of California at San Diego, La Jolla, CA, USA.
✉e-mail: kzhang@bioeng.ucsd.edu
https://doi.org/10.1038/s41581-020-0262-0
Nature Reviews | Nephrology

In a single organism, most cells have the same genome, but specific gene expression varies across different tissues and cell types. Any given tissue or cell type expresses ~11,000–13,000 genes, of which ~3,000–5,000 have a cell-type-specific expression pattern, whereas the remaining genes are ubiquitously expressed1. These unique patterns of gene expression translate to differences at the protein level between different cell types and result in the vast array of cellular phenotypes found throughout the body. Therefore, a snapshot of the gene expression profile of a cell can be indicative of its phenotype. Owing to the limited amount of RNA present in each cell, gene expression profiling was historically performed on pooled cells, but this bulk sequencing approach obscured the potential cell heterogeneity in a sample or tissue2. For example, in a pool of developing progenitor cells, different cells might be primed to make distinct fate decisions but these transcriptional programmes are indistinguishable in a bulk analysis of the average gene expression in the progenitor pool.

The development of technologies that can isolate thousands to tens of thousands of cells and assess their gene expression profiles at the single-cell level has enabled researchers to dissect this cellular heterogeneity and work towards a better understanding of physiology, biological development and disease2–6. For example, researchers generated an improved quantitative map of the cell types present in the developing human kidney, which has provided insights into renal physiology7. Another single-cell study demonstrated the similarities between fetal human kidney and human kidney organoids, reaffirming the utility of kidney organoids as a model for the study of disease and for drug screening8. However, deriving biological insights from single-cell RNA sequencing (scRNA-seq) methods demands that researchers handle the large volume of data generated by these technologies and their accompanying sources of technical noise9. Addressing the scale and complexity of these datasets thus requires a complex ecosystem of computational methods.

Beyond scRNA-seq analysis, other available technologies can profile genomes10, methylation patterns11 and chromatin accessibility patterns12,13 at the single-cell level. Each type of single-cell profiling comes with its own challenges in terms of data analysis. Additionally, the development of ‘multi-omics’ approaches, in which multiple types of biological molecules are profiled in the same cell, has advanced substantially in recent years.
For example, some methods simultaneously profile RNA and chromatin accessibility14, RNA and methylation15, or even a combination of chromatin accessibility, RNA and methylation, albeit at a lower throughput16.

In this Review, we provide the non-expert reader with a broad overview of the different steps required for scRNA-seq analysis, including pre-processing of data and downstream analysis (Fig. 1). We discuss challenges that are typically encountered in every step of scRNA-seq data analysis and examine the different computational tools and approaches developed to address these issues, including their strengths and their limitations. We also explore how experimental design choices can affect downstream data analyses. In-depth, technical explanations of specific scRNA-seq analysis steps are available elsewhere17–19.

Key points
• As single-cell RNA sequencing datasets increase in scale and complexity, faster and more efficient computational tools for processing and analysis are required.
• New computational tools that correct technical and batch effects can unlock additional heterogeneity and enable higher-resolution clustering and trajectory inference.
• Graph-based methods for clustering and trajectory inference allow for the scalable analysis of large single-cell RNA sequencing datasets.
• Visualization methods can distort the structure of the data and batch correction methods can reduce cell-type resolution; both methods should therefore be used with care and might require specific parameter tuning for each dataset.
• High-level biological interpretation, such as cell-type annotation, remains challenging and time-consuming; new automated methods, alongside the creation of single-cell reference atlases, promise to address these issues.

Data pre-processing
The raw data obtained from scRNA-seq platforms must first go through several pre-processing steps before they can be used to assess biologically relevant changes in gene expression. These pre-processing steps transform the raw data into a more usable format and address issues related to sample quality, the wide range of gene expression levels and variance. Additionally, these steps can reduce the impact of technical batch effects if multiple datasets are to be analysed simultaneously.

Generating a gene expression matrix
The initial output FASTQ file (or files) generated in an scRNA-seq experiment consists of complementary DNA (cDNA) reads. Each read contains an RNA sequence, a cell barcode that identifies the cell from which the read was generated and a unique molecular index (UMI) that identifies the exact mRNA molecule3–6. The first step of scRNA-seq analysis is to process these reads into a counts matrix that summarizes the number of molecules of each gene detected in each cell in the dataset4,20,21. The counts matrix serves as the input for the remaining analysis steps and is also an efficient way of storing and sharing information on gene expression (Box 1). The creation of a counts matrix typically involves aligning the cDNA sequence in each read to a reference genome to identify the specific gene that the read originated from and then assigning each read to its cell of origin through its cell barcode4,20,21 (Fig. 2a).

FASTQ file
A text file that stores DNA sequences and their associated quality metrics and metadata; a single sequence in a FASTQ file is called a ‘read’.

Counts matrix
An integer matrix (that is, numerical data arranged in a set of columns and rows) in which the columns typically correspond to cells, whereas the rows correspond to genes; each entry represents the number of molecules of that gene expressed in that cell.

scRNA-seq technologies use PCR to exponentially amplify cDNA molecules, and UMIs enable users to identify and collapse duplicate reads that might be generated during this amplification step, thus reducing technical noise22. Of note, sequencing errors in the UMI can artificially inflate gene expression, as duplicate reads that should be collapsed are treated as distinct molecules20,23. Conversely, distinct molecules might be incorrectly labelled with the same UMI sequence and thus be treated as one molecule20.

For most sequencing technologies, background RNA contamination and sequencing errors result in a large number of cell barcodes that have a low number of reads but do not correspond to real cells. These empty barcodes can be detected and removed by setting a minimum number of reads or a UMI threshold for cell barcodes. More sophisticated methods such as dropEst are also available4,20.

Several tools can be used for read processing (Table 1), including CellRanger, which accompanies the 10X genomics Chromium scRNA-seq platform. CellRanger handles cDNA reads, runs sequence alignment, collapses duplicate reads by their UMIs and outputs a counts matrix along with quality control (QC) statistics4. CellRanger can also perform secondary analyses such as clustering (that is, grouping cells according to similarities in their patterns of gene expression) and visualization (discussed in more detail later), albeit using a rather basic pipeline4. However, CellRanger can be fairly slow and memory intensive, using a maximum of 30 GB of RAM and taking ~22 h to process 784 million reads (equivalent to ~50,000 cells at a depth of 15,000 reads per cell)21. Nevertheless, the integration of CellRanger with the Loupe Cell Browser, another piece of 10X genomics software, offers non-expert users an interactive browser that can be used to visualize the results of clustering and the expression of marker genes4.

In the past few years, researchers have developed scRNA-seq methods that can profile hundreds of thousands to millions of cells in a single experiment by using combinatorial indexing. Such methods include split pool ligation-based transcriptome sequencing and single-cell combinatorial indexing RNA-seq5,6. Given these technological advances and considering the amount of memory and processing time required by CellRanger21, alternative computational pipelines for processing cDNA reads into single-cell gene expression counts have also been developed5,6. The dropEst pipeline, for example, has faster runtimes and lower memory usage than CellRanger, and provides more accurate gene expression estimates by correcting sequencing errors in the cell barcodes and UMIs20 (Table 1). DropEst also improves data recovery by using a machine learning model to identify empty barcodes, enabling the recovery of cell types that are smaller than average in size, and cell types with low RNA content that might otherwise be excluded from the analysis20. UMI-Tools is another pipeline that corrects sequencing errors in the cell barcodes and UMIs to provide more accurate quantification of gene expression23.

One of the slowest steps in the CellRanger pipeline is the alignment of cDNA reads to the reference genome24. The Kallisto pseudo-aligner, used alongside the BUStools suite of methods for storing and manipulating scRNA-seq data, is a highly efficient alternative
to CellRanger because it creates a list of compatible transcripts for each read (pseudo-alignment) instead of aligning individual reads to an exact position in the transcriptome (alignment)21,24. The combined Kallisto–BUStools method is up to 51 times faster than CellRanger and uses a maximum of ~12 GB of RAM when processing 50,000 cells21 (Table 1). However, Kallisto–BUStools does not remove empty cell barcodes. STARSolo and Alevin are extensions of two alignment and pseudo-alignment methods, respectively, that can also be used for processing of scRNA-seq data25,26 (Table 1). Both STARSolo and Alevin have significantly faster runtimes than CellRanger, but STARSolo has a higher maximum RAM usage21.

In summary, the first step of scRNA-seq analysis is to process raw reads into a matrix of single-cell gene expression counts. For users of the 10X genomics scRNA-seq platform, CellRanger offers a convenient, albeit slow and memory-intensive method for this processing. CellRanger also runs basic clustering and marker gene analysis that can be visualized with the Loupe Cell Browser. DropEst, Kallisto–BUStools, UMI-Tools, STARSolo and Alevin are alternative read processing methods that offer substantial runtime and memory improvements, enabling users to process their scRNA-seq runs without having to invest as much in computational infrastructure. Additionally, the enhanced correction of UMI and cell barcode errors available with DropEst, UMI-Tools and Kallisto–BUStools can improve gene expression estimates compared with CellRanger.

[Figure 1 flowchart: Raw scRNA-seq data → Generate single-cell counts matrix → Run QC checks and normalize counts → Variance stabilization and feature selection → Batch effect correction and data integration → Dimensionality reduction → Clustering and trajectory inference → Visualization → Cell-type annotation]
Fig. 1 | Overview of the single-cell RNA sequencing analysis pipeline. The raw data generated by single-cell RNA sequencing (scRNA-seq) contain all sequenced complementary DNA reads and the first analysis step consists of assigning individual reads to their cell of origin to generate a single-cell counts matrix. The next step involves filtering cells and genes according to quality control (QC) metrics. Data normalization, scaling and variance stabilization are used to address technical biases and facilitate the selection of the most biologically relevant genes, ensuring that the downstream analysis is driven by relevant biological phenomena and not technical noise. The comparison of datasets acquired from different experiments also requires the correction of batch effects to enable appropriate data integration. Dimensionality reduction summarizes the expression patterns of thousands of genes in fewer dimensions, which are used to create clusters of cells with similar patterns of gene expression. In developmental datasets, cells often do not group into discrete clusters but instead follow continuous trajectories, requiring a continuous model of cell states; trajectory inference aims to identify the location of cells along the developmental continuum. Finally, the dataset can then be visualized in two dimensions and analysed to identify key marker genes in each cluster. Any unknown cell types or states are then annotated using these key marker genes or through comparisons with existing reference datasets.

Quality control and doublet detection
All scRNA-seq methods generate technical biases and noise; some basic QC addresses these issues before downstream analysis. Protocols used for single-cell dissociation and sequencing, for example, can induce cellular stress and result in cell death, which biases gene expression and can result in artificial clusters of dead cells in downstream analyses27. Filtering out cells with either a low cDNA read or UMI count, as well as cells with a large number of mitochondrial reads per total number of UMIs (also known as mitochondrial fraction) can help to remove dead cells2. Unlike cytoplasmic RNA, the presence of mitochondrial RNA is indicative of cell death. The appropriate threshold for the number of reads or UMIs, and mitochondrial read fraction depends on the cell types present in the dataset and the scRNA-seq method being used. Setting a threshold for the minimum number of cells in which a gene is detected can also help to exclude genes that are only expressed in a small number of cells and are unlikely to be informative. However, users should ensure that this threshold is not too high, as rare cell types might be otherwise missed in the downstream analysis.

For most scRNA-seq methods, the presence of doublets, generated when two or more cells are assigned to the same cell barcode, can create artificial clusters in the downstream analysis, as merging the gene expression patterns of two distinct cell types might create a unique expression signature that is not found in any real cell type. However, manually differentiating doublet clusters from true clusters can be challenging, especially for large datasets with many cell types28. One common strategy for identifying doublets involves generating simulated doublets by combining cells from different clusters in the dataset and assessing which cells have similar expression profiles to the simulated doublet cells28. However, this strategy is only feasible when the dataset contains discrete cell types, rather than continuous cellular trajectories28.

QC thresholds might differ between datasets and some exploratory data analysis, such as histograms of the distribution of UMIs per cell or gene, can help to set thresholds for each dataset. In some cases, such as
when an artificial cluster of dead or dying cells becomes apparent in the downstream analysis, modifying these thresholds after running the entire analysis pipeline and repeating the analysis can also be helpful (Fig. 2b). Seurat29 and SCANPY30 are scRNA-seq analysis pipeline packages that include functions for computing QC metrics, such as the fraction of genes expressed per cell, mitochondrial fraction and total counts; users determine the thresholds with which to filter genes and cells in the dataset. Scater31 also offers a suite of tools for computing key QC metrics.

Box 1 | Dataset uploading and exploration
A crucial step in the analysis of single-cell RNA sequencing (scRNA-seq) data is the assessment of related available datasets, not only to enable a better understanding of existing work but also to assist in the formulation of novel hypotheses. Online data browsers are particularly useful for the investigation of existing scRNA-seq datasets. University of California Santa Cruz (UCSC), the Broad Institute and European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) have single-cell data browsers that enable users to interactively visualize the different cell types and marker genes present in these datasets. The convenient user interfaces of these browsers allow users to quickly explore the datasets without the need to download the data or previous programming experience. Additionally, uploading a dataset to these browsers increases the visibility of the study that generated the data and maximizes the potential for new insights from the dataset.

Data normalization
The fraction of RNA captured in each cell can vary owing to factors such as reverse transcription efficiency, primer capture efficiency and errors associated with collapsing UMIs2,32,33. Differences in the total amount of UMIs or reads in each cell might thus result from technical factors rather than biological variation (Fig. 2c). If not normalized, technical differences in total UMIs or reads can dominate the downstream analysis. For example, cells with similar amounts of total UMIs or reads cluster together instead of cells with similar gene expression patterns33. Normalization is therefore crucial to revealing the true biological heterogeneity of a dataset. Most normalization methods attempt to estimate the bias for each cell (also known as a size factor). The UMI or read counts of all cells can then be normalized by dividing those values by the size factor, enabling the comparison of gene expression levels across different cells. Total counts normalization is a simple normalization strategy, in which the size factors consist of the total number of UMIs or reads in each cell. However, total counts normalization can be dominated by highly expressed genes and results in biased size factor estimation when strong cell-type-specific gene expression exists, which can occur when very different cells or tissue types are present in the dataset33,34. Also, some cell types are larger and have more RNA molecules than others, a biological factor that is obscured when simply dividing the number of UMIs or reads by the total counts34.

The scran package pools cells with similar expression patterns before estimating size factors, therefore addressing normalization issues due to cell-type-specific gene expression or UMI counts34. However, scaling genes with high expression and low expression using the same size factor can lead to overcorrection of genes with low expression, such as transcription factors, and under-correction of genes with high expression, such as housekeeping genes35,36. SCnorm addresses this issue by pooling genes with similar dependencies on total UMI or read count and computing size factors within each pool35. sctransform (implemented in the Seurat package) uses a probabilistic model to compute the effect of total UMI or read count on each gene, which also enables it to stabilize gene variances (discussed later in more detail) and identify over-dispersed genes36.

Overall, some type of normalization is crucial for scRNA-seq analysis and, although total count normalization successfully mitigates technical bias, it can partially obscure true biological heterogeneity. Using specialized normalization methods, such as SCnorm and sctransform, can unlock additional heterogeneity in a dataset33.

Total counts
The total number of reads or UMIs in a given cell.

Size factor
An estimate of how much variation in sequencing depth or RNA capture efficiency affects the overall quantification of gene expression in a cell.

Over-dispersed genes
Genes that show a greater than expected variance between cells given their average expression, which suggests that they are expressed in a cell-type-specific manner.

Variance stabilization
Gene expression levels can vary enormously and the average expression (or magnitude) of a gene is strongly associated with its variance37, an effect known as the mean–variance relationship (Fig. 2d). Variance stabilization adjusts the data to remove the influence of gene expression magnitude on gene variance. This step ensures that downstream analyses are focused on the most biologically relevant genes (that is, the genes that are expressed in specific cell types in the dataset) rather than simply focusing on the genes with the highest expression. For example, variance stabilization might facilitate the separation of two subpopulations of a developmental progenitor, which might otherwise be merged, by enabling genes with low average levels of expression, such as transcription factors, to still contribute to the analysis. Despite having low overall expression levels within a cell, such genes might be important in uncovering the fate of that cell.

One simple variance-stabilizing approach is to log-transform normalized counts, which reduces the difference between genes with high and low expression38 (Fig. 2d). Pipelines that can be used to remove the effect of average gene expression on gene variance include Seurat, Pagoda2 (Fig. 2d) and SCANPY, which explicitly fit a mean–variance relationship and apply a scaling factor30,37,39,40. ZINB-Wave, single-cell variational inference (scVI) and deep count autoencoder (DCA) are alternative methods that use a different approach (negative binomial distribution) to model single-cell count data36,41–43.

Overall, although variance stabilization is not strictly necessary for scRNA-seq analysis, adjusting the dataset for the wide variation in average gene expression enhances the contribution of biologically relevant genes to downstream analyses. This approach removes the influence of genes, such as housekeeping genes, which are abundantly expressed but at similar levels in all cells
and are therefore not useful to investigations of cellular heterogeneity. After variance stabilization, identifying and selecting highly variable genes can improve the resolution of cell types in downstream analyses, especially if the cell types being assayed are fairly similar44. This optional processing step involves selecting the genes with the highest residual variance after adjusting for the differences in average gene expression.

[Figure 2 panels: a, sequencing reads (aligned cDNA, cell barcode, UMI) and gene expression matrix (genes × cells); b, mitochondrial fraction versus number of UMIs, with dead cells indicated; c, raw read counts and normalized counts for gene 1, gene 2 and library size in cells A and B; d, log10[variance] versus log10[magnitude], before and after adjustment]
Fig. 2 | Pre-processing of single-cell RNA sequencing data. a | The first step of single-cell RNA sequencing (scRNA-seq) data pre-processing involves the generation of a gene expression counts matrix from raw sequencing reads. These reads contain a cDNA sequence, a cell barcode that identifies the cell from which the cDNA was amplified, and a unique molecular identifier (UMI) that identifies the RNA molecule. The matrix comprises the gene expression values for the complete dataset, organized by gene (rows) and single cell (columns). b | Common quality control metrics include mitochondrial fraction and UMI count. These metrics can be used, for example, to identify and exclude cells with a high mitochondrial fraction and low UMI count, which might correspond to dead cells. An example plot was generated using the Seurat scRNA-seq analysis pipeline. c | Normalization adjusts data for cell-specific differences in total UMI count and reveals true gene expression differences between cells. In this example, discrepancies in library size (that is, the total number of reads in a cell) masked the variation in the expression of gene 2 between cells A and B. d | Variance stabilization facilitates the identification of the genes with the highest variance in a dataset by transforming the data to ensure that the analysis is not dominated by genes that, despite being expressed at high levels, do not vary greatly across the dataset. The left panel shows a mean–variance fit from Pagoda2, which demonstrates the relationship between average gene expression (x axis) and gene variance (y axis). The right panel shows residual variances after adjusting for the mean–variance relationship (that is, the correlation between the magnitude of expression of a gene and its variance). Data depicted in parts b and d were obtained from a dataset of peripheral blood mononuclear cells sequenced using the 10X genomics Chromium scRNA-seq platform.

Batch effects and data integration
Joint analysis of multiple scRNA-seq datasets generated using different technologies, obtained from different patients or samples, or from different experiments, increases the total number of cells analysed. This approach can improve the resolution of cellular subtypes and the detection of rare cell phenotypes, and also enables direct comparisons of patients, samples or technologies. However, this type of analysis is often challenging owing to batch effects; technical differences in gene expression can mask relevant biological phenomena29,45–48. Intra-batch variation is typically due to differences between cell types and biologically relevant factors, whereas inter-batch variation might also result from technical factors. These batch effects can arise from variability in patients, samples or protocols (including operator-driven variation) that affect RNA capture efficiency or cell viability45. The strength of a batch effect depends on the type of dataset and can be difficult to predict before running the analysis.

A simple approach for eliminating technical batch effects is to essentially assume that each batch must have the same average gene expression across all cells and remove any differences across batches using a regression model49. Although this batch correction approach works for bulk RNA-seq data, it can over-correct when the different batches are not identical29,46,47,50. Specifically, if the cell-type proportions differ between batches, this type of crude batch correction might have an impact on the ability to resolve cell types40,46. For example, if one kidney sample contains more collecting ducts than another sample that is enriched for proximal tubular cells, then the average gene expression across the cells from each sample is different owing to the differences in cell-type composition. Applying a batch correction that forces the average gene expression across cells from both samples to be the same reduces the magnitude of the gene expression differences between collecting ducts and tubules, reducing the ability to resolve those cell types.

Methods that are tailored specifically for the integration of scRNA-seq data enable the preservation of differences in cell-type proportions between batches while eliminating batch effects50. Most of these methods rely on the concept of finding pairs of cells that correspond to the same cell type or state across different batches40,46,47,51 (Fig. 3a). Once these pairs, also known as mutual nearest neighbours (MNNs), are identified, any remaining gene expression differences between MNNs are assumed to be due to batch effects and can be corrected40,46,47. An advantage of the MNN approach is that if a cell type or state is unique to a specific batch, it is not identified as an MNN, thus preserving the unique biological properties of each batch40,46,47 (Fig. 3b). For example, when using an MNN approach to integrate kidney scRNA-seq data from two mice, one wild-type control and one genetic knockout that lacks podocytes, the podocytes from the control mouse would remain in

Table 1 | FASTQ processing tools

| Method | Description | Documentation | Detects empty barcodes | Ref. |
|---|---|---|---|---|
| CellRanger | Default 10X Genomics software package for processing data generated on the 10X platform | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger | Yes | 4 |
| DropEst | Improves on quantification accuracy compared with CellRanger. Supports 10X, Split-seq, Drop-seq, inDrop, iCLIP and Seq-Well | https://github.com/hms-dbmi/dropEst | Yes | 20 |
| Kallisto–BUStools | Extremely efficient memory and CPU usage through the use of the BUStools file formats. Supports any platform that uses cell barcodes | https://www.kallistobus.tools/getting_started | No | 21 |
| Alevin | Extension of the Salmon pseudo-aligner for scRNA-seq data. Supports 10X and Drop-seq platforms | https://salmon.readthedocs.io/en/latest/alevin.html | Yes | 26 |
| STARSolo | Extension of the STAR read aligner for processing single-cell data. Supports the 10X platform | https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf | Yes | 25 |
| UMI-Tools | Models potential errors in UMIs and corrects them to improve gene expression accuracy | https://github.com/CGATOxford/UMI-tools | Yes | 23 |

CPU, central processing unit; scRNA-seq, single-cell RNA sequencing; UMI, unique molecular index.
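The mutual nearest neighbour (MNN) idea used for data integration in this section can be illustrated with a minimal sketch. It is not the implementation used by any of the cited tools: the function names, the toy two-gene profiles and the use of Euclidean distance are all assumptions made for illustration.

```python
import math

def k_nearest(query, reference, k):
    """Indices of the k cells in `reference` closest to `query` (Euclidean)."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])

def mutual_nearest_neighbours(batch_a, batch_b, k=1):
    """Pairs (i, j) where cell i of batch A and cell j of batch B are
    in each other's k-nearest-neighbour sets."""
    nn_a_to_b = [k_nearest(cell, batch_b, k) for cell in batch_a]
    nn_b_to_a = [k_nearest(cell, batch_a, k) for cell in batch_b]
    return [(i, j)
            for i in range(len(batch_a))
            for j in nn_a_to_b[i]
            if i in nn_b_to_a[j]]

# Toy expression profiles (two genes per cell); the values are made up.
batch_a = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0)]  # last cell: type found only in batch A
batch_b = [(1.5, 1.4), (1.6, 1.5)]              # shared type, shifted by a batch effect

pairs = mutual_nearest_neighbours(batch_a, batch_b, k=1)
print(pairs)
```

In this toy example the cell that exists only in batch A never appears in an MNN pair, which mirrors the point that batch-specific cell types are left unmatched rather than forced onto a cell type from the other batch.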

a separate cluster after integration. Of note, if the knockout caused a uniform shift in gene expression across all podocytes rather than podocyte loss, that shift might be lost after integration as it would be indistinguishable from a batch effect.

Identifying MNNs can be difficult if the batch effects are stronger than the differences in gene expression between cell types. To overcome this challenge, canonical correlation analysis can be applied to focus the analysis on intra-batch variation and not on inter-batch variation29, even if the differences between batches are stronger than the differences between cell types40.

One caveat of these methods of data integration is that a compromise between reducing the size of the batch effect and resolving cell types might be required; this parameter can be explicitly tuned in methods such as clustering on network of samples (CONOS)47. For example, completely removing the batch effect from kidney cells collected from two different patients might reduce the ability to resolve podocyte subtypes. The extent of this compromise depends on the specific datasets being integrated and the strength of the batch effects.

Overall, the advent of MNN-based methods has enabled scRNA-seq users to analyse and compare samples across platforms, patients or samples, and even across species, improving the capacity of scRNA-seq to resolve cell types and trajectories29,40,46,51.

Regression model
A model that compares the relationship between two variables. In the context of single-cell RNA sequencing, regression can assess relationships between observed gene expression, and technical and/or biological factors.

Mutual nearest neighbours
(MNNs). Cells from different batches that belong to each other's set of k-nearest neighbours (that is, cells with the most similar gene expression patterns).

Dimensionality reduction
Summarizing a large set of variables with a smaller set of variables, while retaining as much information as possible.

Embedding
The set of variables that remains after running some form of dimensional reduction.

Downstream analyses
Once the pre-processing steps are completed, downstream analysis steps, which include dimensionality reduction, clustering and trajectory inference, focus on identifying patterns in the data that provide biological insight. Dimensionality reduction involves transforming the dataset into a more compact, and possibly more interpretable, representation that captures the primary biological axes of variation and improves the performance of clustering and trajectory inference. Clustering refers to partitioning cells into groups based on similar patterns of gene expression; these groups (also known as clusters) usually correspond to distinct biological cell types or states52. Trajectory inference is usually applied to cells that are dynamically transitioning across a continuum of cellular states52,53.

Upstream analysis choices, such as QC filtering and normalization, can have a substantial impact on both clustering and trajectory inference. For example, data normalization is a critical step, otherwise clusters are almost entirely based on the number of UMIs or reads of the cell rather than on similarities in gene expression profiles. Dead or dying cells, as well as doublets, can also generate artificial clusters that might be difficult to distinguish from real clusters if not removed.

Dimensionality reduction and imputation
The dimensionality of a dataset refers to the number of variables being measured for each data point. In the context of scRNA-seq, each data point corresponds to a cell and the variables are the genes. scRNA-seq experiments are characterized as 'high dimensional' as they typically measure the expression of ~20,000 variables (genes). Even after selecting only a subset of highly variable and/or biologically relevant genes, users often still have a dataset with thousands of genes, many of which are highly correlated and provide redundant information, potentially masking more subtle biological patterns52,54,55. Additionally, the metrics used to measure similarity in gene expression patterns between cells become less reliable in a high-dimensional space, a phenomenon known as the 'curse of dimensionality'52,54. Therefore, applying dimensionality reduction to scRNA-seq datasets can improve downstream analyses. The reduced dimensions are typically called an embedding of the dataset. Dimensionality reduction has the added benefit of

www.nature.com/nrneph

improving the speed of most downstream analyses. However, although it is extremely helpful for most datasets, dimensionality reduction is not strictly necessary for downstream analyses.

Linear methods. A linear relationship between two variables exists when both variables change at the same rate (direct proportion). The most common dimensionality reduction method for scRNA-seq analysis is principal component analysis (PCA), which creates a linear combination of genes that best captures the variance in the data56 (Table 2). The ability of PCA to reduce the dimensionality of the data while finding the dimensions of highest variance makes it a very useful dimensionality reduction tool before clustering.

Only a relatively small fraction of the total RNA of a cell is captured and reverse transcribed in an scRNA-seq experiment. Consequently, no molecules are detected for many genes in most cells, resulting in a large number of zeros in the single-cell counts matrix, which is known as zero inflation3,4. Zero-inflated factor analysis (ZIFA) is a variation of PCA that is designed to explicitly model the expected high number of zero values in scRNA-seq count data57 (Table 2).

One downside of PCA is that the principal components themselves can be difficult to interpret biologically. Ideally, each dimension obtained after dimensionality reduction would correspond to a biological process. For example, for a developmental kidney dataset, each dimension would correspond to a developing kidney compartment (for example, the collecting duct or the tubule). The factorial single-cell latent variable model (f-scLVM) addresses this interpretability issue by explicitly modelling annotated gene sets as the reduced dimensions58 (Table 2). Therefore, after running f-scLVM, each reduced dimension corresponds to a pre-annotated gene set. Pagoda and Pagoda2 also create highly interpretable dimensions by running PCA within pre-annotated gene sets and selecting the dimensions that show significant variance in the dataset37,39 (Table 2). Non-negative matrix factorization (NMF) is another linear matrix factorization method that generates more interpretable dimensions by attempting to find discrete components (such as a collecting duct or tubule) that underlie the dataset59,60.

Non-linear methods. The relationship between genes can be highly non-linear, which affects the ability of linear models such as PCA to analyse scRNA-seq data42. Methods that can generate a non-linear transformation of the dataset can thus outperform linear methods in certain cases. Specifically, locally linear embedding (LLE) and diffusion maps (Dmaps) were shown to be effective when the dataset follows a continuous trajectory, such as with datasets from developmental time series61–63 (Table 2).

Another approach to non-linear dimensionality reduction is the use of deep neural networks, which are models that apply iterative, non-linear transformations to the dataset64. By layering these iterative transformations, deep neural networks can learn complex features of a dataset, which enables them to represent the data using fewer dimensions64. scScope and DCA use neural networks that can outperform linear dimensional reduction methods such as PCA43,65 (Table 2). scVI also uses neural networks to create a framework for modelling gene expression in a way that enables the quantification of uncertainty for each gene expression estimate, while accounting for technical effects such as batch effects and zero inflation42 (Table 2).

For users who are interested in simply reducing the dimensionality of the data and proceeding to clustering and visualization, PCA is a good default approach, but more specialized methods such as f-scLVM or scVI can generate low-dimensional embeddings that are either more interpretable or capture the non-linear structure of the data more faithfully43.

Dropout
The absence of a detectable gene or transcript in a cell.

Zero inflation and imputation. Zero inflation is a technical limitation of more recent high-throughput scRNA-seq methods and is driven by several factors, including incomplete reverse transcription or RNA capture. Total efficiency calculations estimate that only 10–15% of the total RNA in a cell is captured and transcribed3–5. Of note, some researchers argue that the zero inflation for droplet-based methods is mostly due to biological variance and not due to technical noise66. However, the newer generation of combinatorial indexing methods tends to capture even fewer molecules per cell than droplet-based methods and technical zero inflation might thus be present in those datasets5,6. Several methods have been developed to impute these missing values (that is, to replace the zeros in the counts matrix with estimated values). One class of methods, including MAGIC and kNN-smoothing, uses information from neighbouring cells to impute missing values for any given cell67,68. Another class of methods, such as single-cell analysis via expression recovery, clustering through imputation and dimensionality reduction (CIDR) and scImpute, uses probabilistic models and relationships between genes to distinguish technical from biological dropout69–71. However, these imputation methods should be used with care as they can introduce false-positive results when analysing differential gene expression72.

[Figure 3 panel labels: a, Dataset 1 and Dataset 2; b, Integrated dataset]
Fig. 3 | Integration of single-cell RNA sequencing data. a | The first step of data integration based on mutual nearest-neighbour data involves the identification of matching cell types across datasets. b | These matching cell types can then be grouped together and integrated into one dataset, while preserving any biologically relevant cell types that are unique to different datasets.
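The neighbour-based class of imputation methods described in this section can be illustrated with a deliberately simplified sketch: each cell's counts are averaged with those of its nearest cells. This is not the MAGIC or kNN-smoothing algorithm itself; the function name `knn_smooth`, the Euclidean distance on raw counts and the toy matrix are assumptions for illustration only.

```python
import math

def knn_smooth(counts, k=2):
    """Average each cell's counts with those of its k nearest cells.

    `counts` is a cells x genes list of lists. Simplified illustration,
    not the published kNN-smoothing or MAGIC procedure.
    """
    smoothed = []
    for i, cell in enumerate(counts):
        others = sorted((j for j in range(len(counts)) if j != i),
                        key=lambda j: math.dist(cell, counts[j]))
        group = [cell] + [counts[j] for j in others[:k]]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed

# Toy counts matrix (rows: cells, columns: genes); values are made up.
counts = [
    [5, 0, 0],  # second gene not detected in this cell
    [4, 3, 0],
    [6, 2, 0],
    [0, 0, 9],  # a different cell type
    [0, 1, 8],
    [0, 0, 7],
]
smoothed = knn_smooth(counts, k=2)
print(smoothed[0], smoothed[3])
```

The zero for the second gene in the first cell is replaced by a value borrowed from similar cells, while the third gene stays at zero because none of its neighbours express it. The same averaging also raises the second gene in the fourth cell above zero on the strength of a single neighbouring count, which is the kind of behaviour behind the caution about false-positive results.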


Table 2 | Methods of dimensionality reduction

| Method | Description | Documentation | Ref. |
|---|---|---|---|
| PCA | Default dimensionality reduction method for most single-cell pipelines | Implemented in Seurat, SCANPY and Pagoda2; https://github.com/ujjwalkarn/DataScienceR/blob/master/PCA.R | 56 |
| ZIFA | Variation of PCA that accounts for zero inflation in the counts matrix | https://github.com/epierson9/ZIFA | 57 |
| f-scLVM | Uses latent variable modelling and gene sets to generate interpretable lower dimensional factors | https://github.com/bioFAM/slalom | 58 |
| Pagoda2 | Runs PCA on gene sets to identify interpretable components and find the ones with the highest variability for the given dataset | https://github.com/hms-dbmi/pagoda2 | 39 |
| NMF | Generates a more interpretable dimensional reduction in which each dimension typically corresponds to a group of genes expressed in a group of cells | https://github.com/linxihui/NNLM | 60 |
| LLE | Generates a piecewise locally linear dimensional reduction that can capture non-linearity in the data. Works well for capturing trajectories | https://github.com/jw156605/SLICER | 63 |
| Dmaps | Generates a smooth dimensional reduction under the assumption that the cells follow a continuous path | https://github.com/theislab/destiny | 62 |
| DCA | Uses a deep neural network to encode the dataset into lower dimensions | https://github.com/theislab/dca | 43 |
| scScope | Uses a recurrent neural network to remove technical noise and then encode the dataset into lower dimensions | https://github.com/AltschulerWu-Lab/scScope | 65 |
| scVI | Uses probabilistic modelling with deep neural networks to generate a lower dimensional embedding of the dataset | https://github.com/YosefLab/scVI | 42 |

DCA, deep count autoencoder; Dmaps, diffusion maps; f-scLVM, factorial single-cell latent variable model; LLE, locally linear embedding; NMF, non-negative matrix factorization; PCA, principal component analysis; scVI, single-cell variational inference; ZIFA, zero-inflated factor analysis.
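As a rough illustration of what PCA, the first method in Table 2, computes, the sketch below finds the leading principal component of a small cells x genes matrix by power iteration on the gene covariance matrix. Pipelines such as Seurat or SCANPY use optimized routines instead; the function name, the fixed iteration count and the toy data are assumptions made for this example.

```python
import math

def first_principal_component(data, n_iter=200):
    """Leading principal component of `data` (cells x genes) by power iteration.

    Returns (loadings, scores): one loading per gene and one score per cell.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[g] for row in data) / n for g in range(p)]
    centred = [[row[g] - means[g] for g in range(p)] for row in data]
    # Gene-by-gene sample covariance matrix.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0] * p
    for _ in range(n_iter):
        vec = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    scores = [sum(r[g] * vec[g] for g in range(p)) for r in centred]
    return vec, scores

# Toy data: the second gene is always twice the first; the third barely varies.
data = [
    [1, 2, 5],
    [2, 4, 5],
    [3, 6, 4],
    [4, 8, 5],
    [5, 10, 6],
]
loadings, scores = first_principal_component(data)
print(loadings, scores)
```

The two correlated genes are summarized by a single dimension whose loadings keep their 1:2 relationship, and the per-cell scores order the cells along that shared axis of variation.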

Therefore, users should be cautious when analysing differences in genes with low levels of expression and high levels of dropout.

Clustering
Generally, most scRNA-seq datasets either comprise discrete cell types or reflect a continuous trajectory of development or differentiation. For datasets in which individual cells can be grouped into discrete cell types, clustering needs to be applied to resolve those cell types. Each cluster generally expresses a set of genes (marker genes) that are not expressed in cells from other clusters (Fig. 4). k-means clustering is a simple and popular clustering method that iteratively assigns cells to clusters73. However, k-means clustering requires users to pre-specify the number of cell clusters present in the dataset, and determining the number of biologically relevant clusters in an scRNA-seq dataset remains a challenge52,73. One strategy for dealing with this problem is to generate more clusters than those expected to be found in the dataset and then iteratively either merge neighbouring clusters or divide larger clusters based on a similarity threshold, such as the number of differentially expressed genes between clusters. CIDR, BackSPIN and pcaReduce use this hierarchical clustering approach70,74,75. Users can then select the groupings that best match the required level of cluster granularity. Hierarchical analysis with multiple stages of clustering might be necessary for extremely large datasets (>100,000 cells) with many different cell types. This approach, used for example in a study of the mouse nervous system76, requires an initial broad identification of well-defined cell types, which are then sub-clustered to further resolve their heterogeneity.

Both k-means and hierarchical clustering methods are slow to run for large datasets and are limited in the types of clusters they can detect77. Seurat, Pagoda2, SCANPY and CellRanger use graph-based clustering algorithms, which tend to run quickly and generate biologically relevant clusters for larger datasets4,29,30,39. Graph clustering requires building a graph by connecting each cell to its nearest neighbours. The Louvain clustering algorithm, for example, can be applied to cells that have been connected in a graph. Starting with each single cell as its own cluster, the algorithm iteratively merges clusters as long as the merging increases the modularity of the graph (the higher the modularity, the lower the likelihood that cells were connected in the network by random chance)78. However, the Louvain method can sometimes generate erroneous clusters composed of cells that are not well connected79. Leiden clustering improves on Louvain clustering by guaranteeing well-connected clusters and improving runtime79.

Other approaches include SC3 Consensus Clustering, which uses the consensus of multiple clustering methods to improve clustering accuracy80. Reference component analysis projects single cells onto a low-dimensional space defined by existing bulk RNA-seq datasets, which can be very useful for cell populations that are highly heterogeneous and difficult to interpret such as those found in cancer81. Overall, graph clustering methods such as Leiden or Louvain have a strong clustering performance with fairly fast running times.
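The modularity score that Louvain-style graph clustering tries to increase can be made concrete with a short sketch. It computes the standard Newman modularity of a given partition of an unweighted graph; it does not implement the Louvain or Leiden search itself, and the function name and the toy six-cell graph are assumptions for illustration.

```python
from collections import defaultdict

def modularity(edges, labels):
    """Newman modularity of an undirected, unweighted graph partition.

    edges:  list of (u, v) pairs, one per cell-cell connection
    labels: dict mapping each node to its cluster id
    """
    m = len(edges)
    degree = defaultdict(int)
    inside = defaultdict(int)  # edges with both ends in the same cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            inside[labels[u]] += 1
    total_degree = defaultdict(int)
    for node, d in degree.items():
        total_degree[labels[node]] += d
    return sum(inside[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)

# Toy neighbour graph: two tightly connected triplets joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

q_singletons = modularity(edges, {n: n for n in range(6)})
q_two_clusters = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_one_cluster = modularity(edges, {n: "all" for n in range(6)})
print(q_singletons, q_two_clusters, q_one_cluster)
```

Starting from single-cell clusters, merging cells within each triplet raises the modularity, whereas merging the two triplets into one cluster lowers it again, so a greedy merge procedure would stop at the two-cluster solution on this graph.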


Trajectory inference
Although clustering is useful for grouping cells into discrete cell types, in many cases the gene expression patterns of cells form a continuum as they transition between cell states52,53 (Fig. 5). For these datasets, clustering is generally performed first to identify the cell states that the trajectory runs through, as well as any cell states that are not part of the trajectory. Compared with a dataset that mainly comprises discrete cell types, a continuum of cell states is generally characterized by the presence of fewer discrete marker genes and more genes that are expressed along a continuous gradient (Fig. 5b). For example, during kidney development in mice, cells differentiate from nephron progenitor cells to proximal and distal tubules in a continuous manner82. In these types of studies, assigning cells to a specific point in development along this continuum is an important analysis objective; this approach is known as pseudotime estimation52,53. Identifying the point where a continuum splits into different branches is also important, as these branch points represent key fate decisions53,83,84 (Fig. 5a). In one study, identifying the branch point where progenitor cells separate and either become proximal or distal tubular cells enabled researchers to identify key regulators of tubule development in mice82. Analysing this type of continuous cell state data is generally known as trajectory inference and is a highly active area of research. A study that compared 45 different trajectory inference methods concluded that, owing to large differences in the structure of these continuous datasets, no single method performs well in all cases53, suggesting that multiple methods should be tested for any given dataset. Additionally, many datasets include both discrete and continuous components; thus, it might be necessary to use both clustering and trajectory inference during analysis.

One common issue with trajectory inference is that biologically dissimilar cells might be placed close to each other on this continuum owing to technical or biological noise, a phenomenon known as 'short circuiting'53,85. One interesting approach to dealing with this issue is partition-based graph abstraction (PAGA), which was one of the few methods that performed well on most datasets in the aforementioned comparison study, while maintaining a reasonable computational runtime53,85. In an approach that resembles graph clustering, PAGA generates a nearest-neighbours graph of the data and then generates a grouping of the cells, connecting groups that have more connections between cells than one would expect by random chance, to construct a summary graph of the data85. As 'short circuits' are more easily identified in connections between groups of cells than in connections between individual cells, PAGA prunes spurious connections between groups85. Monocle3 builds on this approach by constructing a cell-level graph with connections between individual cells, where any connections between groups of cells that are not connected in the summary graph are pruned away84.

Many methods have been developed to identify the position of a cell along a developmental trajectory, but they do not provide information on the direction of the trajectory. One approach to predicting the transcriptional direction of a cell is to estimate RNA velocity86. This method is based on an assessment of whether the RNA molecule is spliced or unspliced (that is, nascent RNA that still contains intronic sequences)86. A high ratio of unspliced to spliced RNA for a given gene indicates that the expression of the gene is increasing, as the higher amount of unspliced RNA suggests that more of the RNA is being transcribed than degraded. Conversely, a high ratio of spliced to unspliced RNA for a given gene is indicative of decreasing gene expression86. RNA velocity is thus able to predict the future gene expression state of a given cell86 and can help to determine, for example, whether a nephron progenitor is primed to become a proximal or a distal tubular cell.

Visualization
After clustering and/or trajectory inference, the next step is to generate a 2D or 3D scatterplot of the cells to visualize major trends and trajectories in the data.

[Figure 4 labels: heatmap with axes Cells and Genes, grouped into Cluster a, Cluster b and Cluster c]

Fig. 4 | Cell clustering in datasets with discrete cell types. An important objective of single-​cell RNA sequencing analysis
is to resolve the cellular heterogeneity of a dataset by identifying the different subpopulations present. Cell clustering
identifies and groups cells from a heterogeneous dataset into clusters, according to similarities in their patterns of gene
expression, as illustrated in the heatmap. These cell clusters usually correspond to different cell types present in a dataset.
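The RNA velocity reasoning in the trajectory inference section (a high unspliced-to-spliced ratio suggests rising expression, a low ratio suggests falling expression) can be written as a small decision rule. This is only a sketch of that verbal rule, not the published velocity model: the function name `expression_trend`, the `steady_state_ratio` reference value and the `tolerance` band are hypothetical parameters introduced here for illustration.

```python
def expression_trend(unspliced, spliced, steady_state_ratio, tolerance=0.1):
    """Classify one gene in one cell from its unspliced and spliced counts.

    steady_state_ratio: assumed unspliced/spliced ratio when expression is
    neither rising nor falling (hypothetical input, would need estimating).
    """
    if spliced == 0:
        return "increasing" if unspliced > 0 else "undetermined"
    ratio = unspliced / spliced
    if ratio > steady_state_ratio * (1 + tolerance):
        return "increasing"   # more nascent RNA than expected at steady state
    if ratio < steady_state_ratio * (1 - tolerance):
        return "decreasing"   # mostly mature RNA, little new transcription
    return "steady"

# Made-up counts for a single gene in three cells.
print(expression_trend(30, 40, steady_state_ratio=0.25))
print(expression_trend(2, 40, steady_state_ratio=0.25))
print(expression_trend(10, 40, steady_state_ratio=0.25))
```

Applying such a rule across many genes is what allows the likely future expression state of a cell to be inferred, for example to ask which branch a progenitor cell is heading towards.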


Although this step is conceptually identical to dimensionality reduction, visually separating closely related cell types (maintaining the local structure of the data) in just two or three dimensions, while also ensuring that the relative distances between cell types and trajectories reflect the magnitude of the gene expression differences between those cell types (maintaining the global structure of the data), is a complex task. Many linear dimensional reduction methods, such as PCA, are unable to generate accurate visual representations of the data in two or three dimensions87,88. Thus, visualization methods tend to transform the data using a non-linear method, which can distort the structure of the data if not used correctly87,89–91.

t-stochastic neighbour embedding (t-SNE) is one of the most popular visualization methods and uses pairwise similarities of cells to embed them in a low-dimensional space, ensuring that cells with similar gene expression profiles are close in the embedding88,92. t-SNE thus prioritizes the local structure of the data, essentially ensuring that neighbouring cells remain together in the 2D visualization88,92 (Table 3). This feature enables t-SNE to visually separate complex datasets with closely related cell types. However, t-SNE, as traditionally implemented, does not visualize global properties of the datasets, such as relative distances between cell types, as effectively88,92 (Fig. 6). t-SNE is currently implemented in Seurat, Pagoda2, SCANPY and CellRanger–Loupe Cell Browser.

In the past few years, uniform manifold approximation and projection (UMAP) has overtaken t-SNE as the default visualization method for scRNA-seq data89,90. Similar to graph clustering, UMAP generates a nearest-neighbours graph of the cells, weighting each cell–cell connection by the strength of similarity; the graph is then embedded in two dimensions89. UMAP can also be initialized with the PAGA graph to generate highly accurate visualizations of continuous developmental datasets85. In practice, UMAP has been found to perform as well as t-SNE in visualizing the local structure of datasets (Fig. 6a), including separating closely related cell types, while performing vastly better in terms of visualizing the global properties of the data (Fig. 6b). Thus, UMAP is a very useful default visualization option for most users89,90. Additional testing of UMAP and t-SNE has suggested that the way in which these methods are initialized is very important to their overall performance93,94. In fact, t-SNE and UMAP seem to perform equally well in terms of preserving global structure when initialized using PCA93,94.

Similarity-weighted non-negative embedding (SWNE) uses NMF (Table 2) to reduce the dimensionality of the data and then uses the dimensions as a framework with which to project the cells in two dimensions, adjusting the relative positions of the cells using a weighted nearest-neighbours graph87. This framework also enables genes to be visualized alongside the cells, adding biological context and interpretability to the visualizations87 (Table 3). SWNE performs better than t-SNE and is similar to UMAP in terms of capturing global structure, although its representation of local structure is inferior to both t-SNE and UMAP87.

Potential of heat diffusion for affinity-based transition embedding (PHATE) uses a diffusion-based distance metric that is accurate for both local and global structure95. PHATE first computes local distances between neighbouring cells and then propagates those distances (in a manner similar to that of Dmaps) to compute global distances between all cells. PHATE seems to perform very well for datasets with developmental trajectories, outperforming both t-SNE and UMAP in capturing global and local structure.

Deep learning methods can also capture the structure of high-dimensional data in a 2D embedding owing to their ability to capture non-linearity in the data96. scvis uses a deep neural network to condense high-dimensional data into a low-dimensional embedding, which results in better cell-type separation (the ability to capture local structure) than t-SNE (as measured by classification accuracy), as well as faster runtimes96 (Table 3). Other deep learning-based methods such as scScope, DCA and scVI can also be used to encode high-dimensional data in two dimensions42,43,65,96.

Overall, visualization is critical for understanding and communicating the properties of a dataset. One common misconception is that clustering and visualization are identical analyses. Although clusters can be created based on UMAP or t-SNE coordinates, using more dimensions with a generalized method such as PCA to create cell clusters is typically more useful, because all the structure and nuances of a dataset cannot be accurately compressed into two or three dimensions. In fact, a benchmarking study found that dimensionality reduction methods that work well for clustering often do not work as well for visualization55. However, for trajectory inference, methods that are used for visualization, such as UMAP, Dmaps and LLE, generally work well as a basis for building trajectory graphs63,84.

Classification
A machine learning task in which an algorithm learns the relevant features that distinguish the different classes of a training dataset to predict the classes of an unknown test dataset.

[Figure 5 labels. a: Progenitor cells, Branching point, Branch A, Branch B, Developmental pseudotime. b: Expression against Developmental pseudotime for Gene 1 and Gene 2]
Fig. 5 | Modelling continuous cellular states. In datasets that are characterized by the presence of a continuum of cell states, the objective of the single-cell RNA sequencing analysis is to model the cellular trajectory. This process includes computing the developmental pseudotime (that is, the approximation of how far a cell has progressed along a developmental or differentiation pathway), the branch identities (for example, distal and proximal tubule branches for cells in these distinct developmental pathways) and the location of the branching point (for example, the point where nephron progenitors split into cells that will either develop as distal or proximal tubule cells). a | Illustration of a dataset with a continuous developmental trajectory that branches off into two lineages. The arrow represents the direction of developmental pseudotime. b | Example of a typical pattern of gene expression in a dataset with a continuous trajectory.
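One way to make the notion of 'local structure' in a visualization concrete is to check how many of each cell's nearest neighbours in the original expression space are still its nearest neighbours in the 2D embedding. The sketch below is an illustrative metric chosen for this example, not one taken from the benchmarking studies cited above; the function names, the Euclidean distance and the toy coordinates are assumptions.

```python
import math

def knn_sets(points, k):
    """For each point, the set of indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        result.append(set(order[:k]))
    return result

def neighbourhood_preservation(high_dim, embedding, k=2):
    """Mean fraction of each cell's k nearest neighbours that is shared
    between the original space and the embedding (1.0 = all preserved)."""
    hi, lo = knn_sets(high_dim, k), knn_sets(embedding, k)
    return sum(len(a & b) / k for a, b in zip(hi, lo)) / len(high_dim)

# Toy data: two groups of three cells measured on three genes (made-up values).
high_dim = [(0, 0, 0), (1, 0, 0), (0, 1.5, 0),
            (10, 10, 10), (11, 10, 10), (10, 11.5, 10)]

# A 2D layout that keeps the two groups apart, and one that interleaves them.
faithful = [(0, 0), (1, 0), (0, 1.5), (5, 5), (6, 5), (5, 6.5)]
scrambled = [(0, 0), (10, 0), (20, 0), (1, 0), (11, 0), (21, 0)]

score_faithful = neighbourhood_preservation(high_dim, faithful)
score_scrambled = neighbourhood_preservation(high_dim, scrambled)
print(score_faithful, score_scrambled)
```

A score of this kind only captures local structure; it says nothing about whether distances between clusters are meaningful, which is the separate global-structure question discussed for t-SNE and UMAP.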


As a starting point, UMAP is a very useful default extremely helpful at this stage. One class of methods
method that faithfully visualizes most datasets and that can help to accelerate this process uses enrichment
requires less parameter adjusting than t-​SNE or SWNE of marker genes in functional pathways and gene onto­
to work well. However, users still need to take care not to logy terms, which can greatly enhance interpretability100.
over-​interpret visualizations, as all methods result in For example, a cell-​type cluster with marker genes that
a certain degree of data distortion. Additionally, more are highly enriched for the gene ontology term ‘nephron
research is needed on how different initializations epithelium development’ is likely to contain cells related
of these non-​linear methods can affect their overall to the nephron epithelium.
performance. A second class of methods matches individual cells or
clusters to either a single-​cell or bulk reference RNA-​seq
Cell-​type annotation dataset for automated cell-​type classification. A bench­
Often, the most time-​consuming step of scRNA-​seq marking analysis of these automated classification meth­
analysis is the identification of the biological cell types ods97 found that the best-​performing method was the
present in the dataset. The standard protocol for this support-​vector machine, a common type of machine
cell-​type annotation is to find the genes that are uniquely learning classifier101. The analysis also found that meth­
expressed in each cluster and match those genes to lists ods that use previously known sets of canonical marker
of canonical cell-​type markers52,97. Tools for marker genes, such as Garnett102, do not outperform unbiased
gene discovery and visualization are included in methods97. Other automated cell-​type annotation meth­
Seurat29, Pagoda2 (ref.39), SCANPY30 and the Loupe Cell ods include scmap, which classifies scRNA-​seq clusters
Browser4. An evaluation of marker gene discovery using correlations with reference datasets and a feature
methods found that most methods developed for bulk selection approach based on machine learning103, and
RNA-​seq, such as edgeR and limma, perform just as scPred, which uses a combination of dimensionality
well as scRNA-​seq-​specific methods98. Nonetheless, reduction and classification104.
the Wilcoxon method, which is the default method for both Seurat and Pagoda2, performed relatively well98. Interpreting the output of these marker gene discovery methods can be challenging for new users. For single-cell methods such as the Wilcoxon test, the P values are often extremely low as the test treats each cell as an independent replicate. In these cases, the log fold change of gene expression can be a helpful metric, as it is indicative of the magnitude of the difference in gene expression. When an experiment contains multiple biological or technical replicates, one useful approach is to create a pseudo-bulk counts matrix after clustering by summing or averaging the counts from cells in a single replicate and a single cluster99. Bulk approaches such as edgeR or limma can then be used to assess differential gene expression.

Manually examining lists of marker genes can be extremely time-consuming and requires knowledge of the biological system being studied. Close collaboration between biologists and bioinformaticians can thus be

Data integration methods such as Seurat, CONOS and Scanorama also offer automated cell-type classification methods40,47,51. These methods find the MNNs across datasets, which enables them to classify cell types in a dataset without pre-set cell-type labels, based on the labels of a reference dataset. For example, if a cell of unknown type has ten MNNs in the reference dataset, and nine of them are podocytes, the unknown cell is most likely to be a podocyte too.

Although automated cell-type annotation methods can be convenient, they require existing reference scRNA-seq datasets. If a dataset contains novel cell types or cell states, manual annotation with marker genes is still necessary. Of note, even with a reference dataset, manual inspection of marker genes is critical to validating the identified cell types. Nonetheless, as single-cell atlases such as the human cell atlas and other reference catalogues of single-cell gene expression become more widely available, the use of automated cell-type classification will become more widespread105.
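The pseudo-bulk approach mentioned in this section can be illustrated with a minimal Python sketch. It is not the implementation used by any particular package: the function name `pseudobulk`, the toy counts and the replicate and cluster labels are all assumptions made for the example. The sketch sums the counts of all cells that share a replicate and a cluster; the resulting profiles are the kind of input that a bulk method such as edgeR or limma would then analyse (not shown).

```python
from collections import defaultdict


def pseudobulk(counts, replicates, clusters):
    """Sum per-cell counts into one profile per (replicate, cluster) pair.

    counts     -- list of per-cell count vectors (one list of ints per cell)
    replicates -- per-cell replicate labels, same length as counts
    clusters   -- per-cell cluster labels, same length as counts
    """
    if not (len(counts) == len(replicates) == len(clusters)):
        raise ValueError("counts, replicates and clusters must have equal length")
    n_genes = len(counts[0]) if counts else 0
    totals = defaultdict(lambda: [0] * n_genes)
    for cell, rep, clu in zip(counts, replicates, clusters):
        profile = totals[(rep, clu)]
        for gene_index, count in enumerate(cell):
            profile[gene_index] += count
    return dict(totals)


# Hypothetical toy data: 5 cells x 3 genes.
counts = [[1, 0, 2], [0, 3, 1], [4, 0, 0], [2, 2, 2], [0, 1, 5]]
replicates = ["rep1", "rep1", "rep2", "rep2", "rep1"]
clusters = ["podocyte", "podocyte", "podocyte", "tubule", "tubule"]

bulk = pseudobulk(counts, replicates, clusters)
for key in sorted(bulk):
    print(key, bulk[key])
```

Summing rather than averaging keeps the values as integer counts, which is what count-based bulk models generally expect; averaging is the alternative named in the text.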

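The podocyte example for reference-based classification (nine of ten neighbours in the reference share one label) amounts to a majority vote. The sketch below is a deliberate simplification and not the algorithm of Seurat, CONOS or Scanorama: it assumes the neighbours have already been found, and the function name `vote_label` and the `min_fraction` threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def vote_label(neighbour_labels, min_fraction=0.5):
    """Return (label, support) for the most common neighbour label.

    The label is None when the winning label is carried by less than
    min_fraction of the neighbours, i.e. when the cell is left unassigned.
    """
    if not neighbour_labels:
        return None, 0.0
    label, n_votes = Counter(neighbour_labels).most_common(1)[0]
    support = n_votes / len(neighbour_labels)
    return (label if support >= min_fraction else None), support


# Ten reference neighbours of an unlabelled cell, nine of them podocytes.
neighbours = ["podocyte"] * 9 + ["endothelial"]
label, support = vote_label(neighbours)
print(label, support)
```

Leaving weakly supported cells unassigned is one way to flag candidates for the manual marker-gene inspection that the text still recommends, for instance when a dataset contains cell types absent from the reference.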
Table 3 | Visualization methods for single-cell RNA sequencing data

Method | Description | Documentation | Refs
t-SNE | Generates 2D visualizations that maintain cell-type separation but sacrifice global structure | Implemented in most single-cell pipelines; https://github.com/jkrijthe/Rtsne | 88,92
UMAP | Creates a low-dimensional embedding of a nearest-neighbours graph that maintains cell-type separation and global structure | Implemented in Seurat and SCANPY; https://github.com/lmcinnes/umap | 90
SWNE | Uses a combination of NMF and nearest-neighbours smoothing to generate a highly interpretable 2D embedding | https://github.com/yanwu2014/swne | 87
PHATE | Specifically encodes local and global information by first learning local relationships and then uses data diffusion to learn global structure | https://github.com/KrishnaswamyLab/PHATE | 95
scvis | Uses a deep neural network to encode data in two dimensions | https://bitbucket.org/jerry00/scvis-dev | 96

NMF, non-negative matrix factorization; PHATE, potential of heat diffusion for affinity-based transition embedding; SWNE, similarity weighted non-negative embedding; t-SNE, t-stochastic neighbour embedding; UMAP, uniform manifold approximation and projection.

Nature Reviews | Nephrology



[Figure 6: panel a, local structure (neighbourhood distances); panel b, global structure (cell type and cluster distances).]

Fig. 6 | Local and global structure in a dataset. a | Preserving the local structure of a dataset ensures that the neighbouring cells of each cell remain together in the visualization, rather than preserving the original gene expression space. The distance between cells in the final visualization is therefore representative of the degree of similarity between their gene expression patterns. b | Preserving the global structure of a dataset ensures that large-scale distances, such as the distances between cell types, are maintained.

Experimental design considerations
Experimental design can have a substantial impact on analysis. For example, if multiple biological samples are to be collected and analysed, cells from each sample should ideally be tagged to allow multiplexing, using methods such as cell hashing, and then analysed on the same scRNA-seq run106. For example, in an analysis of kidney samples from five different patients across three scRNA-seq runs, each run would ideally contain tagged cells from each patient. This approach enables the distinction between sample-specific effects and experimental batch effects, which is especially crucial if the samples are from a case–control study32. For example, when comparing gene knockout mice with wild-type controls, cells from both types of mice are ideally run on the same experiment. Combinatorial indexing methods facilitate this approach as cells from different samples can be positioned in different wells during the first round of barcoding5,6. For droplet-based methods, some form of sample-specific cell tag is necessary to identify the sample source of a cell106. However, from a logistics standpoint, gathering all samples for processing in the same experimental batch is not always possible, especially for animal experiments across various conditions and/or timepoints, or for patient samples that are collected during clinical procedures.

The choice of scRNA-seq methodology also has an effect on the number of molecules captured per cell and the total number of cells analysed. In general, the combinatorial indexing methods capture fewer UMIs per cell than the droplet-based methods, which can affect their ability to resolve some closely related cell subtypes4–6. However, combinatorial indexing methods can capture far more cells per experiment, potentially enabling the identification of rare cell populations4–6. For all of these methods, the user can generally control the number of cells that are loaded onto the scRNA-seq platform. Loading more cells enables greater throughput at the cost of a potential increase in cell doublets4.

The choice of tissue dissociation methodology can also have a substantial impact on the types of cells available for analysis107. One key choice is whether to dissociate the sample into single cells or single nuclei. Dissociation of single cells has been widely applied to fresh tissue samples107. For frozen tissues, single-nucleus isolation and sequencing is a more viable option107–109. Both types of protocol seem to have their own specific biases, although for some sample types, such as human neurons, only single-nuclei dissociation has been shown to work well107–109. One limitation of single-nuclei methods is that they generally result in fewer molecules captured per cell, as most RNA is in the cytoplasm. However, information captured from nuclei alone can often be sufficient for accurate classification of cell types and subtypes7.

Cell hashing
A technique that attaches unique molecular barcodes to multiple batches of samples for pooling and processing in one batch, which not only improves the experimental throughput but also reduces technical batch differences.

Conclusions
Technical advances in scRNA-seq technologies have led to the generation of datasets of increasing scale and complexity. In response, an ecosystem of computational methods has been developed to deal with the challenges involved in analysing these datasets. Methods based on the identification of MNNs successfully integrate datasets across patients, conditions and technologies, addressing the crucial issue of batch effects in scRNA-seq data. Additionally, a number of methods have been developed to model cellular trajectories and identify cell clusters. However, one remaining limitation is that most clustering methods require users to specify the number of clusters, and finding the optimal number of clusters for a given dataset is challenging. A second limitation is that manually annotating cell types using marker genes can be extremely time-consuming. Fortunately, new automated and semi-automated cell-type classification methods are being developed to address this issue, although novel cell types and states will still need to be manually annotated. The ability to integrate datasets across samples, along with the increased throughput of the latest scRNA-seq methods, will increase our ability to resolve cell subtypes and discover rare cell types. Additionally, many newer methods, especially those used for low-level data pre-processing, take into account memory and central processing unit usage, which is critical, as the size of single-cell datasets continues to increase. Further development of these computational methods will help researchers to unlock additional biological insights. Despite these advances in computational methodology, validation of any computational findings by testing multiple biological replicates or conducting additional experiments, such as immunostaining or RNA-FISH, is still required.

The advent of multi-omics approaches will require a new set of tools that can link data on different cellular parameters, such as protein expression or epigenetic data, to provide additional biological insight. For example, analysing the relationship between gene expression and enhancer and/or promoter accessibility might delineate cell-type-specific maps of gene regulation, maximizing the utility of scRNA-seq datasets.

Published online xx xx xxxx
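The design recommendation under "Experimental design considerations", that each scRNA-seq run should contain tagged cells from every patient, can be checked before sequencing with a short sketch like the one below. The function name `runs_missing_samples` and the run and patient identifiers are hypothetical; the check only reports which samples are absent from which runs and says nothing about how strongly batch effects would bias a given analysis.

```python
def runs_missing_samples(cells):
    """Report samples that are absent from each run.

    cells -- iterable of (run, sample) pairs, one per planned or tagged cell
    Returns a dict {run: set of samples not present in that run}; an empty
    dict means every run contains cells from every sample.
    """
    cells = list(cells)
    all_samples = {sample for _, sample in cells}
    present = {}
    for run, sample in cells:
        present.setdefault(run, set()).add(sample)
    return {
        run: all_samples - seen
        for run, seen in present.items()
        if all_samples - seen
    }


patients = ["P1", "P2", "P3", "P4", "P5"]
runs = ["run1", "run2", "run3"]

# Design from the text: five patients, three runs, every patient in every run.
balanced = [(run, patient) for run in runs for patient in patients]

# Confounded design: each run holds a different patient, so run and
# patient effects cannot be told apart.
confounded = [("run1", "P1"), ("run2", "P2")]

print(runs_missing_samples(balanced))
print(runs_missing_samples(confounded))
```

An empty result for a design means that sample-specific effects and run-specific batch effects can, in principle, be separated, which is the property the text describes as especially important for case–control studies.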

www.nature.com/nrneph

1. Ramsköld, D., Wang, E. T., Burge, C. B. & Sandberg, R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598 (2009).
2. Potter, S. S. Single-cell RNA sequencing for the study of development, physiology and disease. Nat. Rev. Nephrol. 14, 479–492 (2018).
3. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
4. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
5. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 12, eaam8999 (2018).
6. Cao, J. et al. Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing. Science 357, 661–667 (2017).
7. Lake, B. B. et al. A single-nucleus RNA-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys. Nat. Commun. 10, 2832 (2019).
8. Combes, A. N., Zappia, L., Er, P. X., Oshlack, A. & Little, M. H. Single-cell analysis reveals congruence between kidney organoids and human fetal kidney. Genome Med. 11, 3 (2019).
9. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).
10. Chen, C. et al. Single-cell whole-genome analyses by linear amplification via transposon insertion (LIANTI). Science 356, 189–194 (2017).
11. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).
12. Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
13. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
14. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
15. Linker, S. M. et al. Combined single-cell profiling of expression and DNA methylation reveals splicing regulation and heterogeneity. Genome Biol. 20, 30 (2019).
16. Gu, C., Liu, S., Wu, Q., Zhang, L. & Guo, F. Integrative single-cell analysis of transcriptome, DNA methylome and chromatin accessibility in mouse oocytes. Cell Res. 29, 110–123 (2019).
17. Amezquita, R. A. et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods 17, 137–145 (2020).
    A useful stepwise practical tutorial on how to perform scRNA-seq analysis in the R programming language using the Bioconductor suite of tools.
18. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
19. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
    This tutorial discusses scRNA-seq analysis steps using the latest methods developed for each step.
20. Petukhov, V. et al. Accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biol. 19, 78 (2018).
21. Melsted, P. et al. Modular and efficient pre-processing of single-cell RNA-seq. Preprint at https://doi.org/10.1101/673285 (2019).
22. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
23. Smith, T. & Sudbery, I. UMI-tools: modelling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res. 27, 491–499 (2017).
24. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
25. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
26. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
27. van den Brink, S. et al. Single-cell sequencing reveals dissociation-induced gene expression in tissue subpopulations. Nat. Methods 14, 935–936 (2017).
28. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 8, 329–337.e4 (2019).
29. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
30. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
31. McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
32. Wagner, A., Regev, A. & Yosef, N. Uncovering the vectors of cellular states with single cell genomics. Nat. Publ. Gr. 34, 1–53 (2016).
33. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
34. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
35. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
36. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
37. Fan, J. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat. Methods 13, 241–244 (2016).
38. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
39. Barkas, N. et al. pagoda2: a package for analyzing and interactively exploring large single-cell RNA-seq datasets. GitHub https://github.com/hms-dbmi/pagoda2 (2018).
40. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
41. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
42. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
43. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. DCA: single cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
44. Yip, S. H., Sham, P. C. & Wang, J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief. Bioinform. 20, 1583–1589 (2018).
    A benchmark analysis of methods available for selecting over-dispersed genes.
45. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
46. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
47. Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).
48. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
    A benchmark study of methods available for batch correction during analysis of scRNA-seq data.
49. Leek, J. T. Svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
50. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
51. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
52. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
53. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. Nat. Biotechnol. 37, 547–554 (2019).
    A benchmark analysis of methods for single-cell trajectory inference.
54. Bellman, R. On the theory of dynamic programming. Proc. Natl Acad. Sci. USA 38, 716–719 (1952).
55. Sun, S., Zhu, J., Ma, Y. & Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20, 269 (2019).
    A benchmark study of methods used for dimensionality reduction of scRNA-seq data.
56. Abdi, H. & Williams, L. J. Principal component analysis. Chemom. Intell. Lab. Syst. 2, 433–459 (2010).
57. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
58. Buettner, F., Pratanwanich, N., McCarthy, D. J., Marioni, J. C. & Stegle, O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 18, 212 (2017).
59. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
60. Lin, X. & Boutros, P. C. Optimization and expansion of non-negative matrix factorization. BMC Bioinformatics 21, 7 (2020).
61. Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
62. Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2015).
63. Welch, J. D., Hartemink, A. J. & Prins, J. F. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 17, 106 (2016).
64. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
65. Deng, Y., Bao, F., Dai, Q., Wu, L. F. & Altschuler, S. J. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat. Methods 16, 311–314 (2019).
66. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol. 38, 147–150 (2020).
67. Wagner, F., Yan, Y. & Yanai, I. K-nearest neighbor smoothing for single-cell RNA-seq data. Preprint at https://doi.org/10.1101/217737 (2017).
68. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729.e27 (2018).
69. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
70. Lin, P., Troup, M. & Ho, J. W. K. CIDR: ultrafast and accurate clustering through imputation for single cell RNA-seq data. Genome Biol. 18, 59 (2017).
71. Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
72. Andrews, T. S. & Hemberg, M. False signals induced by single-cell imputation. F1000Res. 7, 1740 (2019).
73. Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982).
74. Žurauskiene, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
75. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
76. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014.e22 (2018).
77. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).
    A benchmark analysis of methods available for clustering in scRNA-seq data analysis.
78. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
79. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
80. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).


81. Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49, 708–718 (2017).
82. Combes, A. N. et al. Single cell analysis of the developing mouse kidney provides deeper insight into marker gene expression and ligand-receptor crosstalk. Development 146, dev178673 (2019).
83. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
84. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
85. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
86. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
87. Wu, Y., Tamayo, P. & Zhang, K. Visualizing and interpreting single-cell gene expression datasets with similarity weighted nonnegative embedding. Cell Syst. 7, 656–666.e4 (2018).
88. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
89. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
90. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2018).
91. Wattenberg, M., Viegas, F. & Johnson, I. How to use t-SNE effectively. Distill https://doi.org/10.23915/distill.00002 (2016).
92. van der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
93. Kobak, D. & Linderman, G. C. UMAP does not preserve global structure any better than t-SNE when using the same initialization. Preprint at https://doi.org/10.1101/2019.12.19.877522 (2019).
94. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
95. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
96. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
97. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA-sequencing data. Genome Biol. 20, 194 (2019).
    A benchmark study of methods available for automated cell-type classification in scRNA-seq data.
98. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
99. Lun, A. T. L. & Marioni, J. C. Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics 18, 451–464 (2017).
100. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
101. Suykens, J. A. K. & Vandewalle, J. Indefinite kernels in least squares support vector machines and principal component analysis. Neural Process. Lett. 43, 162–172 (2017).
102. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
103. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2017).
104. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. ScPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).
105. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
106. Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018).
107. Denisenko, E. et al. Systematic bias assessment in solid tissue 10x scRNA-seq workflows. Preprint at https://doi.org/10.1101/832444 (2019).
108. Lake, B. et al. Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain. Science 352, 1586–1590 (2016).
109. Krishnaswami, S. R. et al. Using single nuclei for RNA-seq to capture the transcriptome of postmortem neurons. Nat. Protoc. 11, 499–524 (2016).

Acknowledgements
The authors were supported by NIH grants U01MH098977, R01HL123755, U54HL145608, UH3DK114933 and R01HG009285.

Author contributions
All authors researched data for the article, wrote the manuscript, made substantial contributions to discussions of the content and reviewed or edited the manuscript before submission.

Competing interests
Y.W. declares no competing interests. K.Z. is a co-founder, equity holder, scientific advisory board member and paid consultant of Singlera Genomics, which has no commercial interests related to this article.

Peer review information
Nature Reviews Nephrology thanks B. J. Aronow and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links
Broad Institute online single-cell data browser: https://portals.broadinstitute.org/single_cell
EMBL-EBI online single-cell data browser: https://www.ebi.ac.uk/gxa/sc/home
UCSC online single-cell data browser: https://cells.ucsc.edu/

© Springer Nature Limited 2020
