2021-Single-Cell Transcriptomics Current Methods and Challenges in Data Acquisition and Analysis
heterogeneous, often show a drastic variation at the individual level (Wang and Bodovitz, 2010; Xin et al., 2016). SC experiments were found to be much more conclusive than bulk cell sequencing, which sequences cells in bulk (assuming cells of a particular type are identical) and estimates an average of their expression. SC transcriptomics was named Method of the Year by Nature Methods in 2013 (Xue et al., 2015). With the advent of next-generation sequencing, it became possible to develop sequencing methods that probe the dynamics of the genome and variations thereof. Of these, RNA sequencing (RNA-seq)-mediated transcriptomic profiling revealed novel RNA species that deepened our understanding of transcriptome dynamics (Tang et al., 2009; Wang et al., 2009; Ozsolak and Milos, 2011). Lately, these sequencing approaches have been extended to study intra-population heterogeneity of SCs (Wills et al., 2013), which enabled the study of cell fates, their transition to different subtypes, and the dynamics of gene expression masked in bulk population studies (Altschuler and Wu, 2010; Trapnell et al., 2014). Compared with bulk sequencing, where libraries are prepared from thousands of cells, libraries for single-cell RNA sequencing (SC-RNA-seq) are cell-specific and aimed at investigating cellular functionalities of DNA and RNA in different cellular subsets (Gross et al., 2015; Xue et al., 2015). Though SC-RNA-seq has revealed novel findings in different cellular backgrounds, it poses specific challenges: pre-processing of SC-RNA-seq data differs substantially from that of bulk RNA-seq, library preparation requires stricter protocols, and the starting material is scarce. Another challenge is the lack of analytical approaches able to accommodate the large datasets generated during SC-RNA-seq experiments. Keeping this in view, we investigated the methods adopted in SC experiments, sequencing approaches, and challenges thereof, as part of realizing the goal of precision medicine.

SINGLE-CELL RNA SEQUENCE PROFILING TECHNIQUES

Since the first report in 2009, there has been a surge in SC transcriptomics methods capable of sequencing millions of cells with great accuracy and viability in a short span of time (Tang et al., 2009). These methods generally differ from each other in cell isolation method, cell lysis procedure, amplification process, cDNA generation, transcript coverage, and Unique Molecular Identifier (UMI) tagging (at either the 3′ end or the 5′ end). The most critical distinction among SC-RNA profiling techniques is that some provide full-length transcript coverage, while others sequence only part of the transcript from either the 3′ end or the 5′ end (Chen et al., 2019). Table 1 highlights widely used SC-RNA profiling methods in terms of different properties.

TABLE 1 | Current SC-RNA-seq profiling techniques, based on transcript coverage and UMI insertion possibility.
Method | Length of transcript | UMI insertion possibility | References
ScNaUmi-seq | Full length | Yes | Lebrigand et al., 2020
MATQ-seq | Full length | Yes | Sheng and Zong, 2019
10× Chromium | 3′ end | Yes | Zheng et al., 2017
CEL-seq2 | 3′ end | Yes | Hashimshony et al., 2016
Drop-seq | 3′ end | Yes | Macosko et al., 2015
InDrop | 3′ end | Yes | Klein et al., 2015
Smart-seq2 | Full length | No | Picelli et al., 2014
STRT-seq | 5′ end | Yes | Islam et al., 2014
MARS-seq | 3′ end | Yes | Jaitin et al., 2014
Smart-seq | Full length | No | Ramskold et al., 2013
SC-RNA-seq, single-cell RNA sequencing; UMI, Unique Molecular Identifier.

OPTIMAL METHODOLOGY OF SINGLE-CELL TRANSCRIPTOMICS

Of the various sequencing platforms, Drop-seq, InDrop, and 10× Chromium are well-known platforms for sequencing hundreds and thousands of cells in an unbiased manner (Kulkarni et al., 2019). In SC transcriptomics, each cell needs to be isolated from its originating tissue. Droplet-based techniques, which at their core use microfluidics to pair cells with beads carrying a unique barcode, are widely used to separate cells. The performance of isolation methods is judged on three parameters: throughput, purity, and recovery (Tomlinson et al., 2013; Gross et al., 2015). Throughput indicates the number of cells that can be isolated per unit time, purity refers to the number of cells collected after separation from tissue, and recovery is the final amount of target cells in hand after separation. The morphological complexity of cells such as those of the central nervous system (CNS) makes the separation process challenging. The segregation process exposes them to specific environmental, chemical, and harsh dissociation steps that often bias data analysis (Kulkarni et al., 2019). The dissociation of intact cells from frozen postmortem tissue is also challenging, as cell membranes are prone to damage from mechanical and physical stresses during the freeze–thaw process (McGann et al., 1988). Because each cell separation method currently in use offers a different advantage across these three parameters, it is imperative to select a well-suited method for isolating a cell. Current cell separation methods are broadly categorized into two groups based on (1) cellular properties such as cell density, cell shape, and cell size, and (2) biological characteristics of a cell, which comprise affinity methods (Tomlinson et al., 2013). Tables 2, 3 show some of the widely used methods in terms of operational mode, throughput, advantages, and disadvantages. Though high-throughput SC-RNA approaches such as 10× Chromium allow analysis of cells in an unbiased manner, they do not provide in-depth information on sequence diversity, splicing, and chimeric transcripts generated in the process (Lebrigand et al., 2020). The problem is overcome by performing Nanopore long-read sequencing [using a cell barcode (cellBC) assignment to long reads] to obtain a full-length sequence corresponding to the 10× Chromium system’s data. As SC library preparation requires robust amplification, chimeric cDNA generation and amplification bias issues are
TABLE 2 | Commonly used methods for cell isolation based on biological characteristics.
Method | Operational mode | Throughput | Advantages | Disadvantages | References
Fluorescence-activated cell sorting | Automatic | High | High rate of rare cell sorting, high purity | Cost-intensive, high skills required | Herzenberg et al., 2002; Gross et al., 2015
Magnetic-activated cell separation | Automatic | High | High purity, cost-efficient | Cell capture is non-specific | Schmitz et al., 1994; Welzel et al., 2015

TABLE 3 | Commonly used methods for cell isolation on the basis of physical characteristics.
Method | Operational mode | Throughput | Advantages | Disadvantages | References
Microfluidic cell separation | Automatic | High | Works with low starting materials, amplification integration | High skills required, dissociated cells | Wyatt Shields et al., 2015
Micromanipulation manual cell picking | Manual | Low | More control over cell, live and intact cell separation | Laborious, high skills needed | Citri et al., 2012
Laser-capture microdissection | Manual | Low | Undamaged live cell capture, highly advanced | Too complex to operate, threat of contamination by neighboring cells | Espina et al., 2006
Density gradient centrifugation | Manual | Low | Cost-efficient | Too slow and laborious, low yield | Beakke, 1951
currently addressed by employing a 3′ or 5′ end tag-based approach (Trombetta et al., 2015; Natarajan et al., 2019). The sequence length method determines the quality of alignment across the total length of a gene, while tag-based methods integrate UMIs at either the 3′ end or the 5′ end of the transcript (Kivioja et al., 2012; Smith et al., 2017; Sena et al., 2018). UMI addition makes it easier to identify and quantify individual transcripts by eliminating PCR artifacts, and it minimizes false annotation of PCR-generated chimeric cDNAs as novel transcripts. Full-length methods provide all-inclusive coverage of the reads, yet they introduce a bias toward long genes, as shorter genes are often missed (Phipson et al., 2017). Additionally, the higher sequencing error rate of long-read sequencers and UMI problems are a serious issue for these platforms (Gupta et al., 2018; Lebrigand et al., 2020; Volden and Vollmers, 2020). Despite this, tag-based methods have shown a fair dominance in SC-RNA library preparation for quantifying transcripts in SC analysis when the cell number is large (Figure 1).

QUANTIFICATION OF EXPRESSION AND QUALITY CONTROL

As in bulk RNA-seq, the transcripts in SC-RNA-seq are sequenced into reads that make up the raw fastq data. The quality of the sequence reads is considered an important quality indicator of SC-RNA-seq data. As the alignment of transcript reads for SC-RNA-seq is the same as for bulk RNA-seq, the methods and tools used for gene or transcript quantification in bulk RNA-seq can also be used for quantifying transcripts generated by SC-RNA-seq (Li and Homer, 2010; Fonseca et al., 2012). HISAT2 (Kim et al., 2019), TopHat2 (Kim et al., 2013), and STAR (Dobin et al., 2013) are currently the most popular alignment tools, which can map billions of reads to a reference transcriptome with high accuracy and speed. Transcriptome reconstruction can be either de novo (for samples lacking a reference genome) or reference based, also called genome-guided assembly (Chen et al., 2011). However, the former technique is sometimes less accurate than the reference-based assembly approach (Garber et al., 2011). Among SC-RNA-seq methods that generate data on a whole-transcriptome basis, Smart-seq2 (Picelli et al., 2014) and MATQ-seq (Sheng and Zong, 2019) use Cufflinks, RSEM, StringTie, etc., for the quantification of transcripts, while methods that incorporate 3′ end UMI tagging [like Drop-seq (Macosko et al., 2015), InDrop (Klein et al., 2015), MARS-seq (Jaitin et al., 2014), etc.] require specific algorithms to generate the expression count for the transcript. Another efficient tool for UMI-based methods was developed by Huang and Sanguinetti (2017) for calculating the expression count of SCs accurately. Table 4 provides information about the current tools for read alignment and expression quantification. SC-RNA-seq has certain limitations, which result in higher technical noise (Kolodziejczyk et al., 2015). In SC-RNA-seq data, many transcripts appear to be lost during reverse transcription due to the small number and low capture efficiency of RNA molecules in SCs (Saliba et al., 2014). Consequently, some transcripts are highly expressed in one cell but missing in another. This pattern is described as a “dropout” event. It has been reported that even the most sensitive SC-RNA-seq protocol fails to detect some transcripts because of dropout events (Haque et al., 2017). When cells are dissociated or isolated, a certain number of them die or are destroyed, and SC-RNA-seq methods generate low-quality data from these cells (Ilicic et al., 2016). After alignment and quantification of the transcripts, a quality control check is therefore necessary to remove low-quality cells for an accurate downstream analysis.
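The quality control step described above can be illustrated with a minimal sketch. It assumes a toy count matrix (one row per cell, one column per gene) and flags cells by two simple metrics, total counts and number of detected genes. The function names and the thresholds are hypothetical choices for illustration and are not taken from any of the cited tools; suitable cutoffs depend on the protocol and tissue.

```python
def cell_qc_metrics(cell_counts):
    """Return (total counts, number of detected genes) for one cell."""
    total_counts = sum(cell_counts)
    genes_detected = sum(1 for count in cell_counts if count > 0)
    return total_counts, genes_detected


def filter_low_quality_cells(count_matrix, min_counts=500, min_genes=200):
    """Return the row indices of cells that pass both QC thresholds.

    The default thresholds are illustrative assumptions only.
    """
    kept = []
    for index, cell_counts in enumerate(count_matrix):
        total_counts, genes_detected = cell_qc_metrics(cell_counts)
        if total_counts >= min_counts and genes_detected >= min_genes:
            kept.append(index)
    return kept


# Toy example: rows are cells, columns are genes.
toy_matrix = [
    [5, 0, 3, 2],  # 10 counts across 3 genes
    [0, 0, 1, 0],  # 1 count in 1 gene, e.g., a damaged cell
    [4, 4, 0, 4],  # 12 counts across 3 genes
]
kept_cells = filter_low_quality_cells(toy_matrix, min_counts=5, min_genes=2)
print(kept_cells)  # prints [0, 2]
```

In practice, such filters are usually combined with further indicators of cell damage, and thresholds are chosen after inspecting the distribution of each metric.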
FIGURE 1 | Single-cell analysis in disease and health. The workflow starts with the dissociation of target cells from the target tissue/organ and their isolation by fluorescence-activated cell sorting (FACS) or microfluidic techniques, followed by RNA extraction. cDNA is then synthesized by reverse transcriptase, amplified, and sequenced. The sequencing reads are aligned and quantified, which results in a quantification matrix or Gene Expression Matrix.
TABLE 4 | Widely used tools for read alignment and expression quantification.
Tool | Function | Approach | URL | References
Salmon | Expression quantification | k-mer-based read quantification | https://combine-lab.github.io/salmon/ | Patro et al., 2017
Kallisto | Expression quantification | Pseudoalignment-based rapid read determination | https://pachterlab.github.io/kallisto/ | Bray et al., 2016
StringTie | Expression quantification | Alignment based, splice aware | https://ccb.jhu.edu/software/stringtie/ | Pertea et al., 2015
HISAT2 | Read alignment | Alignment based, splice aware | https://daehwankimlab.github.io/hisat2/ | Sirén et al., 2014
Sailfish | Expression quantification | k-mer-based read quantification | http://www.cs.cmu.edu/~ckingsf/software/sailfish/ | Patro et al., 2014
RNA-Skim | Expression quantification | Sig-mer (a type of k-mer)-based read quantification of transcripts | http://www.csbio.unc.edu/rs/ | Zhang and Wang, 2014
TopHat2 | Read alignment | Alignment based, splice aware | https://ccb.jhu.edu/software/tophat/index.shtml | Kim et al., 2013
STAR | Read alignment | Alignment based, splice aware | https://github.com/alexdobin/STAR | Dobin et al., 2013
Bowtie | Read alignment | Maintains quality threshold, hence fewer mismatches | http://bowtie-bio.sourceforge.net/index.shtml | Langmead et al., 2009
Cufflinks | Expression quantification | Alignment based, splice aware | https://github.com/cole-trapnell-lab/cufflinks | Trapnell et al., 2010
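To illustrate how UMI-based protocols turn aligned reads into expression counts, the sketch below collapses reads that share the same cell barcode, gene, and UMI, so that PCR duplicates are counted once. The input format (tuples of cell barcode, UMI, and gene) and the function name are assumptions made for this example. It does not reproduce the algorithm of any specific tool listed in Table 4, and it ignores sequencing errors within UMIs, which dedicated tools typically correct.

```python
from collections import defaultdict


def count_unique_umis(aligned_reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    `aligned_reads` is an iterable of (cell_barcode, umi, gene) tuples,
    a simplified stand-in for the output of a read aligner.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in aligned_reads:
        # Reads with an identical UMI are treated as PCR duplicates
        # of the same original molecule and are counted once.
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


reads = [
    ("CELL_A", "AACG", "GeneX"),
    ("CELL_A", "AACG", "GeneX"),  # PCR duplicate of the read above
    ("CELL_A", "TTGC", "GeneX"),
    ("CELL_B", "AACG", "GeneX"),
    ("CELL_B", "GGTA", "GeneY"),
]
umi_counts = count_unique_umis(reads)
print(umi_counts[("CELL_A", "GeneX")])  # prints 2
```

The resulting dictionary is one possible sparse representation of the cell-by-gene expression matrix used in downstream analysis.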
designed for bulk RNA-seq. However, data processing tasks such as normalization, differential gene expression (DGE) analysis, cell imputation, and dimensionality reduction call for the development of novel computational techniques, algorithms, and tools for smooth execution of SC-RNA-seq data analysis. The nature of the challenges that SC-RNA-seq data pose, including the big data problem (Costa, 2012; Yu and Lin, 2016; Angerer et al., 2017; He et al., 2017), is highlighted in the following subsections:

Normalization

In SC-RNA-seq, sequence coverage between libraries exhibits systematic differences arising from experimental procedures, dropout events, sequencing depth, and other technical effects (Stegle et al., 2015). These differences must be corrected by normalizing the data so that they do not interfere with the comparison of gene expression between cells. Normalization of SC-RNA-seq datasets is crucial for a clear downstream analysis, including identifying different cell subsets and revealing differential expression of genes. In bulk RNA-seq, expression counts from various libraries are usually normalized by computing the fragments per kilobase of transcript per million mapped fragments (FPKM) (Mortazavi et al., 2008), transcripts per million (TPM) (Li and Dewey, 2011), reads per kilobase of transcript per million mapped reads (RPKM), upper quartile (UQ) (Bullard et al., 2010), DESeq (Love et al., 2014), remove unwanted variation (RUV) (Risso et al., 2014), and Gamma regression model (Ding et al., 2015). Generally, there are two types of normalization: (1) normalization of data within the sample, and (2) normalization of data between samples (Vallejos et al., 2015, 2017). In the former, FPKM/RPKM or TPM are used to exclude gene-specific biases (Vallejos et al., 2017) such as guanine–cytosine (GC) content and gene length, while in the latter, the normalization method adjusts for sample-specific differences such as sequencing depth and capture efficiency. While ignoring the underlying stochasticity, normalization generates a relative expression estimate (Stegle et al., 2015), assuming the overall processed RNA per sample is equal (AlJanahi et al., 2018; Olsen and Baryawno, 2018). Bulk-based normalization strategies have been reported to be unsuitable for SC-RNA-seq datasets because these datasets are highly zero-inflated and have higher technical noise. Multiple methods have been developed for normalizing SC-RNA-seq data (Vallejos et al., 2015; Lun et al., 2016; Sengupta et al., 2016; Bacher et al., 2017; Yip et al., 2017). However, O(nlogn) is considered more efficient than others in performing normalization of SC-RNA-seq data (Yip et al., 2017).

Dimensionality Reduction

High dimensionality is yet another challenge that SC-RNA-seq data present. Because the data from cells are high-dimensional, i.e., they cover a large number of genes, it is necessary to reduce the set of random variables (while optimally preserving the critical properties) and work with the principal variables that best describe the data (Andrews and Hemberg, 2019). The two most frequently used methods for dimensionality reduction are principal component analysis (PCA) (Van Der Maaten et al., 2009) and t-distributed stochastic neighbor embedding (t-SNE) (Van Der Maaten and Hinton, 2008; Kobak and Berens, 2019). PCA uses a linear process to transform a set of (possibly correlated) variables into uncorrelated variables known as principal components, while t-SNE is a non-linear probability distribution-based approach. Both PCA and t-SNE have certain limitations (Chen et al., 2019): because it assumes that approximately all the data are normally distributed, PCA does not effectively account for the underlying complexities in the structure of SC-RNA-seq data, and t-SNE has a larger time complexity, reaching O(n²) (Pezzotti et al., 2017). The most recent algorithm employed for dimensionality reduction, “UMAP” (Uniform Manifold Approximation and Projection) (McInnes et al., 2018; Becht et al., 2019), outperforms PCA and t-SNE for SC-RNA-seq in terms of high reproducibility and meaningful organization of cells (Becht et al., 2018). UMAP is a non-linear graph-based algorithm that identifies the closest neighbors of a data point and assigns them a larger weight, thereby preserving the topological structure of the data. The idea is to project a low-dimensional representation of the data while preserving the nearest neighbors of an individual data point (i.e., a cell). This helps to group closely related neighbors and partly conserves the “long-range” relation of points through the intermediate data points. Although distances in the reduced space become difficult to interpret, UMAP has largely been able to uncover the elusive features of the data. UMAP is computationally faster than t-SNE, preserves the global structure, and maintains the continuity of cell subsets (Becht et al., 2018). At its core, UMAP assumes the existence of a “manifold structure” in the data. This assumption can lead it to find manifolds in the noise of the data. Since SC-RNA-seq suffers from a significant amount of noise, it is necessary to consider this before applying UMAP (McInnes et al., 2018).

Another method to perform dimensionality reduction is linear discriminant analysis (LDA). LDA is a supervised dimensionality reduction method that maximizes the separability between predetermined classes, using the “between-class” and “within-class” covariance. It first calculates the mean of the distances between the classes and then the mean of the distances within the classes. The goal is to find a projection that maximizes the ratio of between-class variability to within-class variability (Tharwat et al., 2017; Qiao and Meister, 2020).

SC-RNA-seq poses challenges similar to those of text mining, such as polysemy and synonymy, noise, and sparsity. Recently, a popular text mining technique, latent semantic analysis (LSA), has been used in SC-RNA-seq dimensionality reduction (Cheng et al., 2019). At its core, LSA uses a linear algebra-based method called singular value decomposition (SVD) to cluster semantically similar terms. SVD approximates the given cell-gene matrix with a low-rank matrix, such that the dimensions of the new matrix are much smaller than those of the original. This approximation is formed as the product of the matrix of left-singular vectors, the diagonal matrix of singular values, and the matrix of right-singular vectors.
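The two ideas discussed in this section, between-sample normalization and low-rank approximation by SVD, can be sketched on a toy cell-by-gene matrix. The example below is a simplified illustration under stated assumptions, not the procedure of any cited method: it uses plain library-size scaling followed by a log transform (a common but basic choice), and it recovers only the leading singular triplet by power iteration, which yields a rank-one approximation. All function names are hypothetical, and the code is pure Python for self-containment; real analyses would rely on optimized numerical libraries.

```python
import math
import random


def cpm_log_normalize(counts, scale=1_000_000):
    """Scale each cell (row) to `scale` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(x / total * scale) for x in cell])
    return normalized


def top_singular_triplet(matrix, n_iter=200, seed=0):
    """Leading singular value and vectors of `matrix` via power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n_cols)]
    for _ in range(n_iter):
        # u = A v, then w = A^T u, i.e., one multiplication by A^T A.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0:
        u = [x / sigma for x in u]
    return sigma, u, v


def rank_one_approximation(matrix):
    """Best rank-one approximation: sigma * u * v^T."""
    sigma, u, v = top_singular_triplet(matrix)
    return [[sigma * u_i * v_j for v_j in v] for u_i in u]


# Toy counts: rows are cells, columns are genes.
toy_counts = [[10, 0, 5], [20, 0, 10], [0, 8, 0]]
normalized = cpm_log_normalize(toy_counts)
sigma, cell_scores, gene_loadings = top_singular_triplet(normalized)
```

Here `cell_scores` gives one coordinate per cell along the leading component. The first two toy cells differ only in sequencing depth, so they receive identical normalized profiles and therefore the same score. Keeping more than one component would require deflation or a full SVD routine.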
Differential Gene Expression Analysis

The expression of genes in a cell is stochastic; the observed expression values are therefore quite heterogeneous at the individual level, even among seemingly similar cells. DGE analysis helps to understand the innate cellular processes and the stochasticity of gene expression (McDavid et al., 2013). The problem faced in DGE analysis is identifying genes that are highly expressed in a group of cells without any preliminary information on the primary cell subtypes (Stegle et al., 2015). Additionally, gene expression in individual cells shows multimodality (Kippner et al., 2014). As expression variability of genes between cells of the same type indicates transcriptional heterogeneity (Johnson et al., 2015; Angermueller et al., 2016), robust computational approaches are needed to detect the true heterogeneity. In addition to multimodality, the sparsity due to (but not limited to) dropout events brings irregularities into the data, as a consequence of which differential genes are difficult to detect. Various parametric as well as non-parametric approaches like Single-cell Differential Expression, Model-based Analysis of Single-cell Transcriptome (MAST), D3E, scDD, SigEMD, and DEsingle (Kharchenko et al., 2014; Finak et al., 2015; Delmans and Hemberg, 2016; Korthauer et al., 2016; Miao et al., 2018; Wang and Nabavi, 2018) have been developed or proposed for DGE analysis of SC-RNA-seq data. However, these tools try to manage either gene dropouts or multimodality (Wang et al., 2019). For a sensitive DGE analysis, these two crucial challenges need to be addressed together.

Cluster Analysis

Cluster analysis of SC-RNA-seq data is required to identify both known and unknown rare cell types (Menon, 2018). Along with technical dropout events, cells show a huge variation in gene expression levels even within the same set. As mentioned above, SC-RNA-seq suffers from massive inflation of zeros. There are three reasons for observing zeros in the data: (1) the transcript was genuinely absent, hence a “true zero”; (2) the depth of sequencing was very low, and the transcript was present but not accounted for; and (3) at the time of library preparation, the transcript could not be captured or failed to amplify. The measurements from the latter two are considered “false zeros.” The concentration of too many zeros in the data brings in irregularities. These technical and biological factors lead to significant noise, which makes cluster analysis challenging. For this, methods like Seurat, DropClust, and SCANPY (Satija et al., 2015; Ntranos et al., 2016; Yip et al., 2017; Sinha et al., 2018) have been proposed for clustering of SCs. These have certain limitations as well. Seurat and SCANPY work well with large datasets but underperform when the dataset is smaller (Kiselev et al., 2019). The anticipated complexity of the data and the rate at which SC data are generated will be a challenge for all these tools. UMAP is yet another method for cluster identification in SC-RNA-seq data; however, as UMAP tends to preserve the local topological structure, it is rather difficult to establish a relationship between clusters when the underlying cell subtypes are unknown.

In addition to the sparsity in the data, SC-RNA-seq data suffer from a high level of noise from faulty experimental designs, usually referred to as “batch effects.” The noise in the data may contribute to overfitting, which can be avoided using regularization. Regularization is a process of restricting or reducing the features at the time of modeling. So far, clustering methods group cells by transcriptional similarity, but the biological annotation of cell clusters remains a challenge. A possible solution could come from the generation of the data itself: the more data are accumulated, the more unknown clusters can be matched with previously known clusters. Another popular approach for cluster annotation is to use Gene Ontology (GO) analysis of the marker genes (Ashburner et al., 2000).

Single-Cell Spatial Transcriptomics and RNA Velocity

Spatial transcriptomics (ST) measures gene expression changes with reference to the spatial coordinates of cells in tissues. It allows transcripts to be measured while conserving the spatial information, providing an additional analytical edge (Burgess, 2019). ST comprises in situ methods like seqFISH (Shah et al., 2016), seqFISH+ (Eng et al., 2019), FISSEQ (fluorescent in situ sequencing) (Lee et al., 2015), and MERFISH (Chen et al., 2015), and SC-RNA-seq-based methods like slide-seq (Rodriques et al., 2019) and Niche-seq (Medaglia et al., 2017). In situ labeling of transcripts in tissues is advantageous for visualizing their location; however, molecular overcrowding can cause fluorescence signals to overlap. This overcrowding can be overcome by using SC spatial RNA-seq; however, the dissociation of cells prior to sequencing makes it difficult to link the transcriptomes back to their original locations (Burgess, 2019). These complementary strengths and limitations make it necessary to integrate the datasets generated by each technology.

In ST, a pair of images is generated, one containing the whole tissue with fairly visible spots and the other showing clearly visible fluorescence array spots (Wong et al., 2018). To leverage ST, the image data need to be integrated with the SC-RNA-seq data. As the principal challenges in both ST and SC-RNA-seq are the sparsity of the data and noise from technical and biological sources, accurate data normalization and transformation are necessary before any downstream analysis (Wagner et al., 2016). A few tools have been developed to determine cell types with respect to their spatial identities (Edsgärd et al., 2018; Svensson et al., 2018; Dries et al., 2019; Queen et al., 2019). These tools lack interactive processing of images and fail to provide a comprehensive three-dimensional view of the tissue. Recently, STUtility (Bergenstråhle et al., 2020b), an R package using non-negative matrix factorization (NMF) for reducing the dimensions, spatial correlation (based on Pearson correlation), and K-means clustering, was found capable of providing a holistic view of expression in tissues. SpatialCPie (Bergenstråhle et al., 2020a) is another easy-to-use R package that uses clustering at various resolutions to interactively uncover gene expression patterns. Elosua-Bayes et al. (2021)
FIGURE 2 | (A) There is a steep rise every year for the publications of studies addressing the big data and SC-RNA-seq. For big data papers on PubMed, we used
the query “[big data (All Fields) AND MapReduce (All Fields) AND Hadoop (All fields)].” For SC-RNA-seq and big data papers on PubMed, we used “[(scRNA-seq OR
Big Data) OR (Single-cell AND big data)].” (B,C) Numbers were collected from the Human Cell Atlas Data portal of some exemplary projects.
developed SPOTlight, which uses NMF along with non-negative least squares (NNLS). NMF is used for dimensionality reduction, followed by selection of marker genes using the Seurat package and then NNLS to deconvolute each captured location (Elosua-Bayes et al., 2021).

SC-RNA measurements have advanced our understanding of intrinsic cellular functionalities; however, the destruction of cells in the process rules out resampling them for an additional analysis of their transcriptional state. A new methodology, RNA velocity, is capable of deducing the future transcriptional state of a cell (La Manno et al., 2018). The idea behind the study is that the transcriptional upregulation of a gene at a particular stage leads to a short-lived abundance of unspliced transcripts. Similarly, the downregulation of the gene at a point in time results in a decrease of spliced transcripts. The ratio of this variation between unspliced and spliced transcripts is used to estimate the future state of a cell.

Single-Cell Multi-omics and Data Integration

Biological activities in cells are complex, and the measurements of these processes show contrasting variation at temporal and histological levels. To comprehensively understand the intricate biological processes of cells and organisms, it is necessary to investigate them at a multi-omics scale. Depending on the research question, SC experiments have extended their reach to a variety of layers, the majority of which include the following: (1) SCI-seq for single-cell genome sequencing (Vitak et al., 2017), (2) scBS-seq for single-cell DNA methylation (Smallwood et al., 2014), (3) scATAC-seq for single-cell chromatin accessibility (Buenrostro et al., 2015), (4) CITE-seq for cell surface proteins (Stoeckius et al., 2017), (5) scCHIP-seq for histone modifications (Gomez et al., 2013), and (6) scGESTALT (Frieda et al., 2017) and MEMOIR (Raj et al., 2018) for chromosomal conformation.

A universal challenge for all SC technologies is that measurements from a very low amount of starting material lead to highly sparse and extremely noisy data. Hence, the integration of these data requires a statistically sound and robust computational framework. A primary challenge remains to find an empirical strategy for normalizing, correcting batch effects, and linking the data from different sources so that the biological meaning and inference remain uncompromised.

For the integration and analysis of SC multi-omics data, several methods developed for the variety of SC mono-omics data have been fused or extended further to fulfill the requirement. However, each tool follows a different strategy for the analysis, which can be categorized as follows: (1) correlation and unsupervised cluster analysis; (2) data integration of different samples from a single measurement type and a single experiment type, e.g., SC-RNA-seq; (3) analysis and integration of data from different experiments and a single measurement type across different samples, e.g., sc-Spatial Transcriptomics; (4) integration of data from an SC population, with more than one measurement type, different samples, and a single experiment; and (5) integration of data across multiple cells, multiple experiments, and multiple measurement types, e.g., a combination of SC-RNA-seq, scATAC, scCHIP-seq, CITE-seq, etc., of different cells collected at different time points (Stuart et al., 2019; Lähnemann et al., 2020; Lee et al., 2020).

Computational methods and tools for the integration of biological data are evolving gradually. A number of techniques have been developed, which are discussed in section “Cluster Analysis.” Seurat (Butler et al., 2018) is currently at the top of integrative analysis of SC multi-omics data, integrating the datasets based on the second principle. Along with Seurat, the mutual nearest neighbor (MNN)-based method (Haghverdi et al., 2018) has been exploited to analyze data combined on the basis of the second category. For the fourth category, analytical methods developed for bulk cellular analysis like MOFA (Argelaguet et al., 2018), MINT (Rohart et al., 2017a), mixOmics (Rohart et al., 2017b), and DIABLO (Singh et al., 2019) are being utilized. Cardelino (McCarthy et al., 2018), MATCHER (Welch et al., 2017), and clonealign (Campbell et al., 2019) are currently the tools used for integrative analysis under the fourth category. To our knowledge, there are no tools available for the last category.

Big Data Pertaining to Single-Cell RNA Sequencing

Data-intensive scientific discoveries rely on three paradigms: theory, experimentation, and simulation modeling (Tolle et al., 2011). As big data is described by three characteristics (volume, velocity, and variety) (Stephens et al., 2015; Adil et al., 2016), data generated by SC-RNA-seq are tantamount to these three quantitative characteristics
(Ivanov et al., 2013). With the introduction of new methods in microfluidics (Zare and Kim, 2010) and combinatorial indexing procedures (Fan et al., 2015), and with the rapid drop in sequencing cost, SC assay profiling has become a routine practice among biologists for analyzing millions of cells in hours, paving the way for the accumulation of a large amount of data. The most popular next-generation sequencing platform, Illumina HiSeq, produces around 100 gigabytes of raw RNA-seq data per study, and it usually takes hours to align these raw data to the reference genome. SC experiments generating petabytes of data on a variety of layers contribute to the big data paradigm. A human genome has 20,000–25,000 genes and comprises about 3 billion base pairs, totaling roughly 100 gigabytes of data, equivalent to 102,400 photos¹; it is expected that roughly 25 petabytes of genomic data will be generated annually around the globe by the year 2030 (Khoury et al., 2020). It is anticipated that human genomic data could overtake the data produced by online social networks (Check Hayden, 2015). The Human Cell Atlas (HCA), a project to prepare a reference map of each cell in the human body at various stages, will have accumulated a massive amount of data by its completion (Regev et al., 2017). There is a need for comprehensive integration of big data and SC-RNA-seq technologies. A large number of publications on SC-RNA-seq and big data have emerged lately (Figure 2A). Datasets comprising 4.5 million cells are already published in the HCA data portal², the largest of which contain more than 1.5 million CD34+ hematopoietic cells of human bone marrow (Setty et al., 2019) and 1.3 million transcriptomes of mouse brain cells (Figures 2B,C).

Consequently, the data acquired from these experiments constitute a data revolution in the field of SC biology (Lähnemann et al., 2019). Because SC-RNA-seq data have great potential for uncovering hidden patterns at the molecular level, handling them requires a highly parallel, scalable, and statistically sound computational framework. Big data technologies like Apache Hadoop (Taylor, 2010; O’Driscoll et al., 2013) and Spark (Zaharia et al., 2016; Guo et al., 2018) embody the required computational parallelism and data distribution mechanisms. Hadoop uses the MapReduce model for parallel and scalable processing (Dean and Ghemawat, 2008), breaking larger problems into smaller subproblems on a distributed file system called the Hadoop Distributed File System (HDFS). Incorporating big data technologies into the analysis of rapidly growing SC genomics data will help in transforming and processing these data with high scalability and fault tolerance at low cost.

¹ https://www.experfy.com/blog/intersection-genomics-big-data
² https://data.humancellatlas.org/

CONCLUSION AND FUTURE PERSPECTIVE

As a consequence of the meager RNA capture rate, low starting material, and challenging experimental protocols, SC-RNA-seq faces computational and analytical challenges. The noise and sparsity arising from technical factors (dropout events) and biological factors make the downstream analysis of SC-RNA-seq data a complicated task. Additionally, the rapid development of new experimental methods for SC-RNA-seq is paving the way for a large accumulation of data. This accumulation is the genomic face of “big data.” These two challenges together give rise to a new paradigm of Big Single-Cell Data Science. Although a plethora of algorithms and computational tools have already been developed, it is essential to address these challenges collectively and produce a robust, accurate, parallel, and scalable framework.

AUTHOR CONTRIBUTIONS

MA and ATJ conceived the idea, edited the manuscript, and contributed to the compilation of data for the design of the figures. AA, VK, and ATJ contributed to the writing of the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING

ATJ is grateful to DST-SERB for financial support (CRG/2019/004106) that helped in establishing the infrastructural facilities.

ACKNOWLEDGMENTS

The authors would like to thank their colleagues for their help in improving the contents of the manuscript.
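As a supplementary illustration of the MapReduce model described in the Big Data section (breaking a larger problem into smaller subproblems whose partial results are then combined), the following minimal sketch sums UMI counts per gene across chunks of records. It is a toy, single-machine illustration in plain Python and does not use Hadoop, HDFS, or Spark. The record layout, the gene and cell names, and all function names are hypothetical and chosen only for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy data: each record is (cell_barcode, gene, umi_count).
RECORDS = [
    ("cell_1", "GATA1", 3),
    ("cell_1", "CD34", 1),
    ("cell_2", "CD34", 4),
    ("cell_3", "GATA1", 2),
    ("cell_3", "SPI1", 5),
    ("cell_4", "CD34", 2),
]


def split_into_chunks(records, n_chunks):
    """Partition records round-robin, loosely mimicking blocks of a distributed file."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: compute partial per-gene totals for one chunk, independently of the others."""
    partial = Counter()
    for _cell, gene, count in chunk:
        partial[gene] += count
    return partial


def reduce_partials(left, right):
    """Reduce step: merge two partial results by summing the counts per gene."""
    return left + right


def total_counts_per_gene(records, n_chunks=3):
    """Run the map step on every chunk, then fold the partial results together."""
    chunks = split_into_chunks(records, n_chunks)
    partials = map(map_chunk, chunks)  # in a real cluster, each chunk would go to a separate worker
    return dict(reduce(reduce_partials, partials, Counter()))


if __name__ == "__main__":
    print(total_counts_per_gene(RECORDS))
```

Because the map step touches each chunk independently and the reduce step is associative, the result does not depend on how the records are partitioned, which is the property that lets frameworks of this kind scale across many machines.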
REFERENCES

Adil, A., Kar, H. A., Jangir, R., and Sofi, S. A. (2016). “Analysis of multi-diseases using big data for improvement in healthcare,” in Proceedings of the 2015 IEEE UP Section Conference on Electrical Computer and Electronics, UPCON 2015, Allahabad. doi: 10.1109/UPCON.2015.7456696
AlJanahi, A. A., Danielsen, M., and Dunbar, C. E. (2018). An introduction to the analysis of single-cell RNA-sequencing data. Mol. Ther. Methods Clin. Dev. 10, 189–196. doi: 10.1016/j.omtm.2018.07.003
Altschuler, S. J., and Wu, L. F. (2010). Cellular heterogeneity: do differences make a difference? Cell 141, 559–563. doi: 10.1016/j.cell.2010.04.033
Andrews, T. S., and Hemberg, M. (2019). M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics (Oxford, England) 35, 2865–2867. doi: 10.1093/bioinformatics/bty1044
Angerer, P., Simon, L., Tritschler, S., Wolf, F. A., Fischer, D., and Theis, F. J. (2017). Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91. doi: 10.1016/j.coisb.2017.07.004
Angermueller, C., Clark, S. J., Lee, H. J., Macaulay, I. C., Teng, M. J., Hu, T. X., et al. (2016). Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232. doi: 10.1038/nmeth.3728
Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., et al. (2018). Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14:8124. doi: 10.15252/msb.20178124
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29. doi: 10.1038/75556
Bacher, R., Chu, L. F., Leng, N., Gasch, A. P., Thomson, J. A., Stewart, R. M., et al. (2017). SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586. doi: 10.1038/nmeth.4263
Beakke, M. K. (1951). Density gradient centrifugation: a new separation technique. J. Am. Chem. Soc. 73, 1847–1848. doi: 10.1021/ja01148a508
Becht, E., Dutertre, C.-A., Kwok, I., Ng, L. G., Ginhoux, F., and Newell, E. (2018). Evaluation of UMAP as an alternative to t-SNE for single-cell data. bioRxiv [Preprint]. doi: 10.1101/298430
Becht, E., McInnes, L., Healy, J., Dutertre, C. A., Kwok, I. W. H., Ng, L. G., et al. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44. doi: 10.1038/nbt.4314
Bergenstråhle, J., Bergenstråhle, L., and Lundeberg, J. (2020a). SpatialCPie: an R/Bioconductor package for spatial transcriptomics cluster evaluation. BMC Bioinform. 21:161. doi: 10.1186/s12859-020-3489-7
Bergenstråhle, J., Larsson, L., and Lundeberg, J. (2020b). Seamless integration of image and molecular analysis for spatial transcriptomics workflows. BMC Genomics 21:482. doi: 10.1186/s12864-020-06832-3
Bray, N. L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527. doi: 10.1038/nbt.3519
Brennecke, P., Anders, S., Kim, J. K., Kołodziejczyk, A. A., Zhang, X., Proserpio, V., et al. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1098. doi: 10.1038/nmeth.2645
Buenrostro, J. D., Wu, B., Litzenburger, U. M., Ruff, D., Gonzales, M. L., Snyder, M. P., et al. (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490. doi: 10.1038/nature14590
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94. doi: 10.1186/1471-2105-11-94
Burgess, D. J. (2019). Spatial transcriptomics coming of age. Nat. Rev. Genet. 20:317. doi: 10.1038/s41576-019-0129-z
Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420. doi: 10.1038/nbt.4096
Campbell, K. R., Steif, A., Laks, E., Zahn, H., Lai, D., McPherson, A., et al. (2019). Clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biol. 20:54. doi: 10.1186/s13059-019-1645-z
Check Hayden, E. (2015). Genome researchers raise alarm over big data. Nature 312–314. doi: 10.1038/nature.2015.17912
Chen, G., Ning, B., and Shi, T. (2019). Single-cell RNA-seq technologies and related computational data analysis. Front. Genet. 10:317. doi: 10.3389/fgene.2019.00317
Chen, G., Wang, C., and Shi, T. L. (2011). Overview of available methods for diverse RNA-Seq data analyses. Sci. China Life Sci. 54, 1121–1128. doi: 10.1007/s11427-011-4255-x
Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S., and Zhuang, X. (2015). Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348:6090. doi: 10.1126/science.aaa6090
Cheng, C., Easton, J., Rosencrance, C., Li, Y., Ju, B., Williams, J., et al. (2019). Latent cellular analysis robustly reveals subtle diversity in large-scale single-cell RNA-seq data. Nucleic Acids Res. 47:e143. doi: 10.1093/nar/gkz826
Citri, A., Pang, Z. P., Südhof, T. C., Wernig, M., and Malenka, R. C. (2012). Comprehensive qPCR profiling of gene expression in single neuronal cells. Nat. Protoc. 7, 118–127. doi: 10.1038/nprot.2011.430
Costa, F. F. (2012). Big data in genomics: challenges and solutions. G.I.T. Lab. J. 1–4.
Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113. doi: 10.1145/1327452.1327492
Delmans, M., and Hemberg, M. (2016). Discrete distributional differential expression (D3E) - a tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinform. 17:110. doi: 10.1186/s12859-016-0944-6
Ding, B., Zheng, L., Zhu, Y., Li, N., Jia, H., Ai, R., et al. (2015). Normalization and noise reduction for single cell RNA-seq experiments. Bioinformatics 31, 2225–2227. doi: 10.1093/bioinformatics/btv122
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., et al. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. doi: 10.1093/bioinformatics/bts635
Dries, R., Zhu, Q., Eng, C. H. L., Sarkar, A., Bao, F., George, R. E., et al. (2019). Giotto, a pipeline for integrative analysis and visualization of single-cell spatial transcriptomic data. bioRxiv [Preprint]. doi: 10.1101/701680
Edsgärd, D., Johnsson, P., and Sandberg, R. (2018). Identification of spatial expression trends in single-cell gene expression data. Nat. Methods 15, 339–342. doi: 10.1038/nmeth.4634
Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I., and Heyn, H. (2021). SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Res. gkab043. doi: 10.1093/nar/gkab043
Eng, C. H. L., Lawson, M., Zhu, Q., Dries, R., Koulena, N., Takei, Y., et al. (2019). Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature 568:235. doi: 10.1038/s41586-019-1049-y
Espina, V., Wulfkuhle, J. D., Calvert, V. S., VanMeter, A., Zhou, W., Coukos, G., et al. (2006). Laser-capture microdissection. Nat. Protoc. 1, 586–603. doi: 10.1038/nprot.2006.85
Fan, H. C., Fu, G. K., and Fodor, S. P. A. (2015). Combinatorial labeling of single cells for gene expression cytometry. Science 347:1258367. doi: 10.1126/science.1258367
Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., et al. (2015). MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16:278. doi: 10.1186/s13059-015-0844-5
Fonseca, N. A., Rung, J., Brazma, A., and Marioni, J. C. (2012). Tools for mapping high-throughput sequencing data. Bioinformatics 28, 3169–3177. doi: 10.1093/bioinformatics/bts605
Frieda, K. L., Linton, J. M., Hormoz, S., Choi, J., Chow, K. H. K., Singer, Z. S., et al. (2017). Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 59–64. doi: 10.1038/nature20777
Garber, M., Grabherr, M. G., Guttman, M., and Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods 8, 469–477. doi: 10.1038/nmeth.1613
Gomez, D., Shankman, L. S., Nguyen, A. T., and Owens, G. K. (2013). Detection of histone modifications at specific gene loci in single cells in histological sections. Nat. Methods 10, 171–177. doi: 10.1038/nmeth.2332
Gross, A., Schoendube, J., Zimmermann, S., Steeb, M., Zengerle, R., and Koltay, P. (2015). Technologies for single-cell isolation. Int. J. Mol. Sci. 16, 16897–16919. doi: 10.3390/ijms160816897
Guo, R., Zhao, Y., Zou, Q., Fang, X., and Peng, S. (2018). Bioinformatics applications on apache spark. GigaScience 7:giy098. doi: 10.1093/gigascience/giy098
Gupta, I., Collier, P. G., Haase, B., Mahfouz, A., Joglekar, A., Floyd, T., et al. (2018). Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nat. Biotechnol. 36, 1197–1202. doi: 10.1038/nbt.4259
Haghverdi, L., Lun, A. T. L., Morgan, M. D., and Marioni, J. C. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427. doi: 10.1038/nbt.4091
Haque, A., Engel, J., Teichmann, S. A., and Lönnberg, T. (2017). A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 1–12. doi: 10.1186/s13073-017-0467-4
Hashimshony, T., Senderovich, N., Avital, G., Klochendler, A., de Leeuw, Y., Anavy, L., et al. (2016). CEL-Seq2: Sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol. 17:77. doi: 10.1186/s13059-016-0938-8
He, K. Y., Ge, D., and He, M. M. (2017). Big data analytics for genomic medicine. Int. J. Mol. Sci. 18, 1–18. doi: 10.3390/ijms18020412
Herzenberg, L. A., Parks, D., Sahaf, B., Perez, O., Roederer, M., and Herzenberg, L. A. (2002). The history and future of the fluorescence activated cell sorter and flow cytometry: a view from Stanford. Clin. Chem. 48, 1819–1827.
Hu, P., Zhang, W., Xin, H., and Deng, G. (2016). Single cell isolation and analysis. Front. Cell Dev. Biol. 4:116. doi: 10.3389/fcell.2016.00116
Huang, Y., and Sanguinetti, G. (2017). BRIE: transcriptome-wide splicing quantification in single cells. Genome Biol. 18:123. doi: 10.1186/s13059-017-1248-5
Hwang, B., Lee, J. H., and Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 1–14. doi: 10.1038/s12276-018-0071-8
Ilicic, T., Kim, J. K., Kolodziejczyk, A. A., Bagger, F. O., McCarthy, D. J., Marioni, J. C., et al. (2016). Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17:29. doi: 10.1186/s13059-016-0888-1
Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., et al. (2014). Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166. doi: 10.1038/nmeth.2772
Ivanov, T., Korfiatis, N., and Zicari, R. V. (2013). On the Inequality of the 3V’s of Big Data Architectural Paradigms: A Case For Heterogeneity. Available online at: https://arxiv.org/abs/1311.0805
Jaitin, D. A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., et al. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779. doi: 10.1126/science.1247651
Johnson, M. B., Wang, P. P., Atabay, K. D., Murphy, E. A., Doan, R. N., Hecht, J. L., et al. (2015). Single-cell analysis reveals transcriptional heterogeneity of neural progenitors in human cortex. Nat. Neurosci. 18, 637–646. doi: 10.1038/nn.3980
Kharchenko, P. V., Silberstein, L., and Scadden, D. T. (2014). Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742. doi: 10.1038/nmeth.2967
Khoury, M. J., Armstrong, G. L., Bunnell, R. E., Cyril, J., and Iademarco, M. F. (2020). The intersection of genomics and big data with public health: opportunities for precision public health. PLoS Med. 17:e1003373. doi: 10.1371/journal.pmed.1003373
Kim, D., Paggi, J. M., Park, C., Bennett, C., and Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915. doi: 10.1038/s41587-019-0201-4
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14:R36. doi: 10.1186/gb-2013-14-4-r36
Kippner, L. E., Kim, J., Gibson, G., and Kemp, M. L. (2014). Single cell transcriptional analysis reveals novel innate immune cell types. PeerJ 2:e452. doi: 10.7717/peerj.452
Kiselev, V. Y., Andrews, T. S., and Hemberg, M. (2019). Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282. doi: 10.1038/s41576-018-0088-9
Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S., et al. (2012). Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods 9, 72–74. doi: 10.1038/nmeth.1778
Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., et al. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201. doi: 10.1016/j.cell.2015.04.044
Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10:5416. doi: 10.1038/s41467-019-13056-x
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C., and Teichmann, S. A. (2015). The technology and biology of single-cell RNA sequencing. Mol. Cell 58, 610–620. doi: 10.1016/j.molcel.2015.04.005
Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., et al. (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17:222. doi: 10.1186/s13059-016-1077-y
Kulkarni, A., Anderson, A. G., Merullo, D. P., and Konopka, G. (2019). Beyond bulk: a review of single cell transcriptomics methodologies and applications. Curr. Opin. Biotechnol. 58, 129–136. doi: 10.1016/j.copbio.2019.03.001
La Manno, G., Soldatov, R., Zeisel, A., Braun, E., Hochgerner, H., Petukhov, V., et al. (2018). RNA velocity of single cells. Nature 560, 494–498. doi: 10.1038/s41586-018-0414-6
Lähnemann, D., Köster, J., Szczurek, E., Mccarthy, D. J., Hicks, S. C., Mark, D., et al. (2019). 12 grand challenges in single-cell data science. PeerJ 7:e27885v3. doi: 10.7287/peerj.preprints.27885v2
Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., et al. (2020). Eleven grand challenges in single-cell data science. Genome Biol. 21:31. doi: 10.1186/s13059-020-1926-6
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25. doi: 10.1186/gb-2009-10-3-r25
Lebrigand, K., Magnone, V., Barbry, P., and Waldmann, R. (2020). High throughput error corrected Nanopore single cell transcriptome sequencing. Nat. Commun. 11, 1–8. doi: 10.1038/s41467-020-17800-6
Lee, J., Hyeon, D. Y., and Hwang, D. (2020). Single-cell multiomics: technologies and data analysis methods. Exp. Mol. Med. 52, 1428–1442. doi: 10.1038/s12276-020-0420-2
Lee, J. H., Daugharthy, E. R., Scheiman, J., Kalhor, R., Ferrante, T. C., Terry, R., et al. (2015). Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nat. Protoc. 10, 442–458. doi: 10.1038/nprot.2014.191
Li, B., and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 12:323. doi: 10.1186/1471-2105-12-323
Li, H., and Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11, 473–483. doi: 10.1093/bib/bbq015
Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15:550. doi: 10.1186/s13059-014-0550-8
Lun, A. T. L., Bach, K., and Marioni, J. C. (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75. doi: 10.1186/s13059-016-0947-7
Macaulay, I. C., Ponting, C. P., and Voet, T. (2017). Single-cell multiomics: multiple measurements from single cells. Trends Genet. 33, 155–168. doi: 10.1016/j.tig.2016.12.003
Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214. doi: 10.1016/j.cell.2015.05.002
McCarthy, D. J., Rostom, R., Huang, Y., Kunz, D. J., Danecek, P., Bonder, M. J., et al. (2018). Cardelino: integrating whole exomes and single-cell transcriptomes to reveal phenotypic impact of somatic variants. bioRxiv [Preprint]. doi: 10.1101/413047
McDavid, A., Finak, G., Chattopadyay, P. K., Dominguez, M., Lamoreaux, L., Ma, S. S., et al. (2013). Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinformatics 29, 461–467. doi: 10.1093/bioinformatics/bts714
McGann, L. E., Yang, H. Y., and Walterson, M. (1988). Manifestations of cell damage after freezing and thawing. Cryobiology 25, 178–185. doi: 10.1016/0011-2240(88)90024-7
McInnes, L., Healy, J., Saul, N., and Großberger, L. (2018). UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3:861. doi: 10.21105/joss.00861
Medaglia, C., Giladi, A., Stoler-Barak, L., De Giovanni, M., Salame, T. M., Biram, A., et al. (2017). Spatial reconstruction of immune niches by combining photoactivatable reporters and scRNA-seq. Science 358, 1622–1626. doi: 10.1126/science.aao4277
Menon, V. (2018). Clustering single cells: a review of approaches on high-and low-depth single-cell RNA-seq data. Brief. Funct. Genomics 18:434. doi: 10.1093/bfgp/ely001
Miao, Z., Deng, K., Wang, X., and Zhang, X. (2018). DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics (Oxford, England) 34, 3223–3224. doi: 10.1093/bioinformatics/bty332
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628. doi: 10.1038/nmeth.1226
Natarajan, K. N., Miao, Z., Jiang, M., Huang, X., Zhou, H., Xie, J., et al. (2019). Comparative analysis of sequencing technologies for single-cell transcriptomics. Genome Biol. 20:70. doi: 10.1186/s13059-019-1676-5
Ntranos, V., Kamath, G. M., Zhang, J. M., Pachter, L., and Tse, D. N. (2016). Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol. 17, 1–14. doi: 10.1186/s13059-016-0970-8
O’Driscoll, A., Daugelaite, J., and Sleator, R. D. (2013). “Big data”, Hadoop and cloud computing in genomics. J. Biomed. Inform. 46, 774–781. doi: 10.1016/j.jbi.2013.07.001
Olsen, T. K., and Baryawno, N. (2018). Introduction to single-cell RNA sequencing. Curr. Protoc. Mol. Biol. 122:57. doi: 10.1002/cpmb.57
Ozsolak, F., and Milos, P. M. (2011). RNA sequencing: Advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98. doi: 10.1038/nrg2934
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419. doi: 10.1038/nmeth.4197
Patro, R., Mount, S. M., and Kingsford, C. (2014). Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464. doi: 10.1038/nbt.2862
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T. C., Mendell, J. T., and Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295. doi: 10.1038/nbt.3122
Pezzotti, N., Lelieveldt, B. P. F., Van Der Maaten, L., Höllt, T., Eisemann, E., and Vilanova, A. (2017). Approximated and user steerable tSNE for progressive visual analytics. IEEE Trans. Visualization Comp. Graphics 23, 1739–1752. doi: 10.1109/TVCG.2016.2570755
Phipson, B., Zappia, L., and Oshlack, A. (2017). Gene length and detection bias in single cell RNA sequencing protocols. F1000Research 6:595. doi: 10.12688/f1000research.11290.1
Picelli, S., Faridani, O. R., Björklund, Å. K., Winberg, G., Sagasser, S., and Sandberg, R. (2014). Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181. doi: 10.1038/nprot.2014.006
Qiao, M., and Meister, M. (2020). Factorized Linear Discriminant Analysis for Phenotype-Guided Representation Learning of Neuronal Gene Expression Data. Available online at: https://arxiv.org/abs/2010.02171v4
Queen, R., Cheung, K., Lisgo, S., Coxhead, J., and Cockell, S. (2019). Spaniel: analysis and interactive sharing of spatial transcriptomics data. bioRxiv [Preprint]. doi: 10.1101/619197
Raj, B., Wagner, D. E., McKenna, A., Pandey, S., Klein, A. M., Shendure, J., et al. (2018). Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 36, 442–450. doi: 10.1038/nbt.4103
Ramskold, D., Luo, S., Wang, Y., Li, R., Deng, Q., Omid, R., et al. (2013). Full-Length mRNA-Seq from single Cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30, 777–782. doi: 10.1038/nbt.2282
Regev, A., Teichmann, S., Lander, E., Amit, I., Benoist, C., Birney, E., et al. (2017). Science forum: the human cell atlas. eLife 6:e27041.
Risso, D., Ngai, J., Speed, T. P., and Dudoit, S. (2014). Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902. doi: 10.1038/nbt.2931
Rodriques, S. G., Stickels, R. R., Goeva, A., Martin, C. A., Murray, E., Vanderburg, C. R., et al. (2019). Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467. doi: 10.1126/science.aaw1219
Rohart, F., Eslami, A., Matigian, N., Bougeard, S., and Lê Cao, K. A. (2017a). MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. BMC Bioinform. 18:128. doi: 10.1186/s12859-017-1553-8
Rohart, F., Gautier, B., Singh, A., and Lê Cao, K. A. (2017b). mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13:1005752. doi: 10.1371/journal.pcbi.1005752
Saliba, A. E., Westermann, A. J., Gorski, S. A., and Vogel, J. (2014). Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res. 42, 8845–8860. doi: 10.1093/nar/gku555
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502. doi: 10.1038/nbt.3192
Schmitz, B., Radbruch, A., Kümmel, T., Wickenhauser, C., Korb, H., Hansmann, M. L., et al. (1994). Magnetic activated cell sorting (MACS) - a new immunomagnetic method for megakaryocytic cell isolation. Eur. J. Haematol. 52, 267–275.
Sena, J. A., Galotto, G., Devitt, N. P., Connick, M. C., Jacobi, J. L., Umale, P. E., et al. (2018). Unique Molecular Identifiers reveal a novel sequencing artefact with implications for RNA-Seq based gene expression analysis. Sci. Rep. 8:13121. doi: 10.1038/s41598-018-31064-7
Sengupta, D., Rayan, N. A., Lim, M., Lim, B., and Prabhakar, S. (2016). Fast, scalable and accurate differential expression analysis for single cells. bioRxiv [Preprint]. doi: 10.1101/049734
Setty, M., Kiseliovas, V., Levine, J., Gayoso, A., Mazutis, L., and Pe’er, D. (2019). Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460. doi: 10.1038/s41587-019-0068-4
Shah, S., Lubeck, E., Zhou, W., and Cai, L. (2016). In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 92, 342–357. doi: 10.1016/j.neuron.2016.10.001
Sheng, K., and Zong, C. (2019). Single-cell RNA-Seq by multiple annealing and tailing-based quantitative single-cell RNA-Seq (MATQ-Seq). Methods Mol. Biol. 1979, 57–71. doi: 10.1007/978-1-4939-9240-9_5
Singh, A., Shannon, C. P., Gautier, B., Rohart, F., Vacher, M., Tebbutt, S. J., et al. (2019). DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35, 3055–3062. doi: 10.1093/bioinformatics/bty1054
Sinha, D., Kumar, A., Kumar, H., Bandyopadhyay, S., and Sengupta, D. (2018). Dropclust: Efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 46:e36. doi: 10.1093/nar/gky007
Sirén, J., Välimäki, N., and Mäkinen, V. (2014). HISAT2 - fast and sensitive alignment against general human population. IEEE/ACM Trans. Comput. Biol. Bioinform. 11, 375–388. doi: 10.1109/TCBB.2013.2297101
Smallwood, S. A., Lee, H. J., Angermueller, C., Krueger, F., Saadeh, H., Peat, J., et al. (2014). Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820. doi: 10.1038/nmeth.3035
Smith, T., Heger, A., and Sudbery, I. (2017). UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 27, 491–499. doi: 10.1101/gr.209601.116
Song, Y., Xu, X., Wang, W., Tian, T., Zhu, Z., and Yang, C. (2019). Single cell transcriptomics: Moving towards multi-omics. Analyst 144, 3172–3189. doi: 10.1039/c8an01852a
Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145. doi: 10.1038/nrg3833
Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., et al. (2015). Big data: astronomical or genomical? PLoS Biol. 13:e1002195. doi: 10.1371/journal.pbio.1002195
Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P. K., Swerdlow, H., et al. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 9:2579.
Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W. M., et al. (2019). Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21. doi: 10.1016/j.cell.2019.05.031
Svensson, V., Teichmann, S. A., and Stegle, O. (2018). SpatialDE: Identification of spatially variable genes. Nat. Methods 15, 343–346. doi: 10.1038/nmeth.4636
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., et al. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382. doi: 10.1038/nmeth.1315
Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11:S1. doi: 10.1186/1471-2105-11-S12-S1
Tharwat, A., Gaber, T., Ibrahim, A., and Hassanien, A. E. (2017). Linear discriminant analysis: a detailed tutorial. AI Commun. 30, 169–190. doi: 10.3233/AIC-170729
Tolle, K. M., Tansley, D. S. W., and Hey, A. J. G. (2011). The fourth Paradigm: Data-intensive scientific discovery. Proc. IEEE 99, 1334–1337. doi: 10.1109/JPROC.2011.2155130
Tomlinson, M. J., Tomlinson, S., Yang, X. B., and Kirkham, J. (2013). Cell separation: Terminology and practical considerations. J. Tissue Eng. 4, 1–14. doi: 10.1177/2041731412472690
Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., et al. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386. doi: 10.1038/nbt.2859
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M. J., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515. doi: 10.1038/nbt.1621
Trombetta, J., Gennert, D., Lu, D., and Satija, R. (2015). Preparation of single-cell RNA-seq libraries for NGS. Curr. Protoc. Mol. Biol. 19, 161–169. doi: 10.3851/IMP2701
Vallejos, C. A., Marioni, J. C., and Richardson, S. (2015). BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11:e1004333. doi: 10.1371/journal.pcbi.1004333
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S., and Marioni, J. C. (2017). Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571. doi: 10.1038/nmeth.4292
Van Der Maaten, L. J. P., and Hinton, G. E. (2008). Visualizing high-dimensional data using t-SNE. J. Machine Learn. Res. 9, 2579–2605.
Van Der Maaten, L. J. P., Postma, E. O., and Van Den Herik, H. J. (2009). “Dimensionality reduction: a comparative review,” in Technical Report TiCC-TR 2009-005 (Tilburg: Tilburg University).
Vitak, S. A., Torkenczy, K. A., Rosenkrantz, J. L., Fields, A. J., Christiansen, L., Wong, M. H., et al. (2017). Sequencing thousands of single-cell genomes with combinatorial indexing. Nat. Methods 14, 302–308. doi: 10.1038/nmeth.4154
Volden, R., and Vollmers, C. (2020). Highly multiplexed single-cell full-length cDNA Sequencing of human immune cells with 10X genomics and R2C2. bioRxiv [Preprint]. doi: 10.1101/2020.01.10.902361
Wagner, A., Regev, A., and Yosef, N. (2016). Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160. doi: 10.1038/nbt.3711
Wang, D., and Bodovitz, S. (2010). Single cell analysis: the new frontier in “omics.” Trends Biotechnol. 28, 281–290. doi: 10.1016/j.tibtech.2010.03.002
Wang, T., Li, B., Nelson, C. E., and Nabavi, S. (2019). Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinform. 20:40. doi: 10.1186/s12859-019-2599-6
Wang, T., and Nabavi, S. (2018). SigEMD: a powerful method for differential gene expression analysis in single-cell RNA sequencing data. Methods 145, 25–32. doi: 10.1016/j.ymeth.2018.04.017
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63. doi: 10.1038/nrg2484
Welch, J. D., Hartemink, A. J., and Prins, J. F. (2017). MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol. 18:138. doi: 10.1186/s13059-017-1269-0
Welzel, G., Seitz, D., and Schuster, S. (2015). Magnetic-activated cell sorting (MACS) can be used as a large-scale method for establishing zebrafish neuronal cell cultures. Sci. Rep. 5:7959. doi: 10.1038/srep07959
Wills, Q. F., Livak, K. J., Tipping, A. J., Enver, T., Goldson, A. J., Sexton, D. W., et al. (2013). Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nat. Biotechnol. 31, 748–752. doi: 10.1038/nbt.2642
Wong, K., Navarro, J. F., Bergenstråhle, L., Ståhl, P. L., and Lundeberg, J. (2018). ST Spot Detector: a web-based application for automatic spot and tissue detection for spatial transcriptomics image datasets. Bioinformatics 34, 1966–1968. doi: 10.1093/bioinformatics/bty030
Wyatt Shields, C. IV, Reyes, C. D., and López, G. P. (2015). Microfluidic cell sorting: a review of the advances in the separation of cells from debulking to rare cell isolation. Lab Chip 15, 1230–1249. doi: 10.1039/c4lc01246a
Xin, Y., Kim, J., Ni, M., Wei, Y., Okamoto, H., Lee, J., et al. (2016). Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. Proc. Natl. Acad. Sci. U.S.A. 113, 3293–3298. doi: 10.1073/pnas.1602306113
Xue, R., Li, R., and Bai, F. (2015). Single cell sequencing: technique, application, and future development. Sci. Bull. 60, 33–42. doi: 10.1007/s11434-014-0634-6
Yip, S. H., Wang, P., Kocher, J. P. A., Sham, P. C., and Wang, J. (2017). Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Res. 45:e179. doi: 10.1093/nar/gkx828
Yu, P., and Lin, W. (2016). Single-cell transcriptome study as big data. Genomics Proteomics Bioinform. 14, 21–30. doi: 10.1016/j.gpb.2016.01.005
Zaharia, M., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I., et al. (2016). Apache Spark. Commun. ACM 59, 56–65. doi: 10.1145/2934664
Zare, R. N., and Kim, S. (2010). Microfluidic platforms for single-cell analysis. Annu. Rev. Biomed. Eng. 12, 187–201. doi: 10.1146/annurev-bioeng-070909-105238
Zhang, Z., and Wang, W. (2014). RNA-skim: a rapid method for RNA-Seq quantification at transcript level. Bioinformatics 30, i283–i292. doi: 10.1093/bioinformatics/btu288
Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8:14049. doi: 10.1038/ncomms14049

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2021 Adil, Kumar, Jan and Asger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.