
A Systematic Overview of Single-Cell Transcriptomics Databases, Their Use Cases, and Limitations

Mahnoor N. Gondal1,2, Saad Ur Rehman Shah3, Arul M. Chinnaiyan1,2,4,5,6,7,✝,*, Marcin Cieslik1,2,4,7,✝,*
1 Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI USA
2 Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, MI USA
3 Gies College of Business, University of Illinois Business College, Champaign, IL USA
4 Department of Pathology, University of Michigan, Ann Arbor, MI USA
5 Department of Urology, University of Michigan, Ann Arbor, MI USA
6 Howard Hughes Medical Institute, Ann Arbor, MI USA
7 University of Michigan Rogel Cancer Center, Ann Arbor, MI USA

* Correspondence:

Marcin Cieslik
[email protected]

Arul M. Chinnaiyan
[email protected]

Keywords: Single-cell RNA-seq, Single-cell Databases, Single-cell Atlases, Single-cell data
analysis, Web-based platforms, Cell heterogeneity, Single-cell data integration, Computational
methods

Abstract

Rapid advancements in high-throughput single-cell RNA-seq (scRNA-seq) technologies and
experimental protocols have led to the generation of vast amounts of genomic data that populate
several online databases and repositories. Here, we systematically examine large-scale scRNA-seq
databases, categorizing them by scope and purpose into general, tissue-specific,
disease-specific, cancer-focused, and cell type-focused databases.
Next, we discuss the technical and methodological challenges associated with curating large-scale
scRNA-seq databases, along with current computational solutions. We argue that understanding
scRNA-seq databases, including their limitations and assumptions, is crucial for effectively utilizing
this data to make robust discoveries and identify novel biological insights. Furthermore, we propose
that bridging the gap between computational and wet lab scientists through user-friendly web-based
platforms is needed for democratizing access to single-cell data. These platforms would facilitate
interdisciplinary research, enabling researchers from various disciplines to collaborate effectively.
This review underscores the importance of leveraging computational approaches to unravel the
complexities of single-cell data and offers a promising direction for future research in the field.
1 Introduction

The first commercially available single-cell platform emerged in 2014(1). Over the past decade,
single-cell sequencing technologies have rapidly advanced, becoming faster and more cost-effective.
Today, there are over 10 different commercially available platforms for high-throughput single-cell
data collection(2,3). This advancement has fueled remarkable growth in the field of single-cell RNA
sequencing (scRNA-seq) research, with nearly 2000 studies published to date(4), populating
numerous databases and repositories(5–9). These studies have provided valuable insights into various
biological processes, including development(10), disease initiation and progression(11), immune
response(12), and identification of rare cell types(13,14). Alongside the generation of large-scale
single-cell data, we also observe a sharp rise in scRNA-seq analysis tools, expected to reach 3000 by
the end of 2025(15).

Previous reviews and benchmarking analyses have extensively covered various aspects of
scRNA-seq analysis such as quality control(16), normalization(17,18), integration(19,20), and cell
type annotation(21). However, the complexity of large-scale data necessitates a comprehensive
evaluation of available scRNA-seq databases and repositories, which is needed to
understand concepts like integration in the context of large-scale databasing. Understanding the scope
and limitations of these databases is also essential for storing, analyzing, and interpreting single-cell data
directly from these repositories. In this review, we systematically address the limitations and common
assumptions of existing scRNA-seq databases. We discuss the utility of these databases to meet the
specific needs of researchers studying different biological systems and processes.

2 Landscape of Single-cell Transcriptomics Databases

The rapid expansion of single-cell RNA-seq (scRNA-seq) studies has led to the development of
numerous databases and repositories for storing, retrieving, and interpreting single-cell data(5–9).
These databases provide a resource for single-cell transcriptomic data that can be used to build
computational models to investigate various biological processes. The data in scRNA-seq databases
or atlases can come from either a "primary" source, where the data was generated by the study itself,
or an "aggregated" source, where data was collected and curated from multiple studies (Figure 1A-D,
Table 1). Single-cell databases can be further categorized into general (non-specific or broad
category of databases), tissue-specific databases, disease-specific databases, cancer-focused
databases, and cell type-focused databases (Figure 1E).
Figure 1. Overview of single-cell databases, their technical/methodological issues, current
solutions, and common assumptions. (A) Overview of citations gathered from single-cell data
repositories from primary or aggregated studies (data collected on March 31st). (B) Pie chart
showing the total number of cells in the primary and aggregated data sources. (C) Number of
datasets or studies published per database in the primary data source category. (D) Number
of donors/samples in the aggregated data source category. (E) Pie chart showing the
percentage of databases that are general, tissue-specific, disease-specific, cancer-focused, or cell
type-focused. (F-G) Percentage of studies for tissue-specific and
disease-focused databases. (H) Number of cancer types per cancer-focused database. (I)
Technical and methodological issues with databases, current computational and methods-based
solutions, and their common assumptions and limitations.
2.1 General and Tissue-Specific Single-cell Databases

The establishment of the Human Cell Atlas (HCA)(5) in 2017 marked a significant effort to collect
and integrate large-scale single-cell data into a comprehensive reference atlas for all human cells.
HCA’s open-access resource forms one of the largest public databases for integrated single-cell data
from large-scale sequencing projects comprising over 437 projects and 58.5 million cells across 18
tissues. Single-cell atlases offer high-resolution views of cellular composition in organs, leading to
groundbreaking discoveries of rare cell types, developmental processes, and cell states associated
with various disease processes(22–24). Despite the accessibility and extensive research interest in
this data, disagreements may arise when selecting a particular study as a reference for hypothesis
building(25).

To address this issue, one of the HCA's sub-projects aims to develop tissue-specific reference atlases
that serve as consensus representations of specific organs across multiple projects (Figure 1F). These
atlases provide a standardized reference for specific tissues for comparing different datasets,
facilitating cross-study comparisons and meta-analyses. An example of a tissue-specific single-cell
database established by the HCA is the Human Lung Cell Atlas (HLCA)(26) which integrates 49
lung datasets encompassing 2.4 million cells from 486 individuals. The HLCA database development
involved four main steps: data curation, integration method selection, cell type annotation, and data
usage. Despite the benefits associated with large-scale tissue-specific single-cell databases, it is
important to note that integrating data from diverse datasets, labs, and technologies presents
challenges due to differences in data quality, sample handling, and experimental protocols. Moreover,
ensuring consistency and standardization across datasets is crucial for meaningful comparisons, but
achieving this in practice can be complex, particularly with the wide variety of cell types and states.
For the HLCA project, the team primarily relied on harmonized manual annotation, integration
benchmarking(20), and expert views, which can be subjective and may lack reproducibility.

Other general databases include Single Cell Portal (SCP)(27) and CZ CellxGene Discover(28), which
are more flexible and are well suited to retrieving a particular dataset of interest and examining
the unique features and variations within it. SCP and CZ CellxGene, developed by the Broad
Institute and the Chan Zuckerberg Initiative (CZI), respectively, provide
web-based interfaces for data exploration and analysis. CZ CellxGene hosts more than 1284 datasets,
while SCP hosts 654 datasets. They offer interactive visualization tools for exploring gene
expression patterns, cell clusters, and cell type annotations in scRNA-seq data. Both platforms
support the sharing of scRNA-seq datasets, allowing researchers to collaborate and access public
datasets. Using the CZ CellxGene platform, users can also download the raw count data as an RDS
file containing a Seurat object or an h5ad file with an AnnData object to perform their own analysis.
Similarly, SCP data can be downloaded directly from the website as individual metadata files, raw
count expressions, and normalized expression data; however, the availability of raw or normalized
data varies from study to study in SCP. Another interesting example of a comprehensive database is
Tabula Sapiens(29) which houses primary data from 15 individuals across 24 tissues. This database
enables the evaluation of gene expression in normal or baseline cell states, providing a valuable
resource for developing gene regulation networks and trajectories(30). It offers a unique opportunity
to study cell type-specific expression changes. The data is easily accessible through web platforms
and can also be explored using tools like CZ CellxGene.
2.2 Disease-focused Single-cell Databases

Numerous other scRNA-seq databases have also emerged, including PanglaoDB(9), UCSC Cell
Browser(31), and Human Protein Atlas(32). However, these general databases are not designed to
systematically gather data on gene expression specificity in different diseases. Given the diverse and
heterogeneous nature of human diseases, which manifest unique gene expression profiles, there is a
critical need for databases focused on disease-specific exploration (Figure 1G).

SC2disease(33) is a manually curated scRNA-seq database that addresses this need, cataloging cell
type-specific genes associated with 25 diverse diseases, including Huntington's disease, multiple
sclerosis, and Alzheimer's disease. While SC2disease represents a pioneering effort in
disease-specific gene expression profiling, there is also a growing need for databases
dedicated to a single disease of interest. To address this, databases like SC2sepsis(34), ssREAD(35), and
SCovid(36) have emerged, focusing on sepsis, Alzheimer's disease, and
COVID-19, respectively. These databases aim to provide a more granular and disease-specific view of
gene expression patterns, enhancing our understanding of disease mechanisms and potential
therapeutic targets.
2.3 Cancer-focused Single-cell Databases

Cancer is a complex disease characterized by its highly heterogeneous and multifactorial nature(37).
Traditional approaches to studying cancer, such as bulk RNA sequencing, measure an average over the
mixture of cell types in a tumor and often fail to accurately capture cancer cell-specific gene
expression(38,39). scRNA-seq technologies offer unprecedented insights into tumor heterogeneity,
evolution, and responses to therapy(40–42). As a result, numerous databases hosting cancer-focused
scRNA-seq data have emerged (Figure 1H).

One such example is CancerSEA(43), which was launched in 2019 as a resource that uses single-cell
data from cancer datasets to decode the functional states of cancer cells. These states include
stemness, invasion, metastasis, proliferation, epithelial-to-mesenchymal transition (EMT),
angiogenesis, apoptosis, cell cycle, differentiation, DNA damage, DNA repair, hypoxia,
inflammation, and quiescence. In a salient study, Dohmen et al.(44) utilized CancerSEA's functional
states to validate gene sets derived from their machine-learning model, while Zhao et al.(45)
demonstrated the necessity of NF-κB for initiating oncogenesis using CancerSEA's functional states.
Several other studies(46–49) have leveraged CancerSEA to correlate their gene or gene set findings
with cancer single-cell data, showcasing the utility of this resource.

However, CancerSEA has limitations, including hosting only 93,475 malignant cells and an inability
to study interactions between stromal or immune cells and cancer cells. It also lacks a user-friendly
web interface to support data exploration and visualization. In an attempt to overcome these
challenges, TISCH was originally developed in 2021(50), with version 2 released in 2023(51).
TISCH2 curates cancer datasets with both malignant and non-malignant cell types, currently hosting
190 datasets, encompassing 50 cancer types, and spanning 6 million cells. To illustrate the utility of
TISCH in computational models, Xu et al.(52) employed TISCH’s data to analyze the correlation
between FOXM1 and immune cells. Similarly, Zhang et al.(53) employed TISCH to evaluate m7G
regulators expression in osteosarcoma scRNA-seq data. As such, numerous studies have utilized
TISCH to evaluate the expression of genes of interest across cancer datasets(53–58). Although
TISCH provides a valuable resource to the cancer research community, it is important to be aware of
the assumptions and limitations of TISCH data. While many studies aim to understand gene
expression in malignant cells, TISCH also contains treatment data from immune checkpoint blockade
(ICB), chemotherapy, and targeted therapy. Therefore, it is important to ensure that the results are not
confounded, as gene expression varies after treatments and can yield diverse results(59).
Additionally, TISCH includes data from multiple stages of cancer, such as primary tumors or
metastatic sites. Therefore, users need to carefully extract only relevant information when employing
TISCH data. TISCH employs an automatic cell-type annotation method, which may disagree with the
original dataset's manual annotation. Importantly, all datasets downloaded from
TISCH are provided as fixed expression matrices(41), and users cannot download the raw count data.
Therefore, any attempt to further integrate or normalize the data might introduce technical variation
rather than reveal biological signal. These limitations could potentially introduce bias into analyses or
hinder comparability across different databases.
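
One way to act on the treatment and tumor-stage caveats is to filter sample-level metadata before any expression analysis. The sketch below is illustrative only: the field names (`treatment`, `site`) and the records are hypothetical and do not reflect the metadata schema of TISCH or any other database.

```python
# Hypothetical sample metadata; field names and values are assumptions,
# not the schema used by TISCH or any other database.
samples = [
    {"sample_id": "S1", "treatment": "none", "site": "primary"},
    {"sample_id": "S2", "treatment": "ICB", "site": "primary"},
    {"sample_id": "S3", "treatment": "none", "site": "metastatic"},
    {"sample_id": "S4", "treatment": "chemotherapy", "site": "metastatic"},
]


def select_samples(records, treatment="none", site="primary"):
    """Return the IDs of samples matching the requested treatment and site."""
    return [
        r["sample_id"]
        for r in records
        if r["treatment"] == treatment and r["site"] == site
    ]


# Keep only treatment-naive primary tumors for a malignant-cell analysis.
naive_primary = select_samples(samples)
print(naive_primary)
```

Recording the filter explicitly, rather than analyzing every sample a database returns, makes it easier to report which treatment states and tumor sites a result actually applies to.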
2.4 Cell-type-focused Single-cell Databases

To better understand the intricacies of cell biology, dedicated resources focused on cell-type profiling
of single cells have emerged. JingleBells(60), introduced in 2017, represented an advancement in this
direction by providing a comprehensive immune cell resource. JingleBells facilitates the study of
immune cell involvement in various diseases, including cancer and infectious diseases, providing
valuable insights into disease mechanisms and potential therapeutic targets. However, JingleBells
lacks an interactive web interface and only allows BAM files to be downloaded. Analyzing
and interpreting single-cell data from JingleBells therefore requires specialized computational tools and
expertise, limiting accessibility to researchers with specific skills. In comparison, the human Antigen
Receptor database (huARdb)(61), published originally in 2022, is a comprehensive human single-cell
immune profiling database, housing 444,794 high-confidence T or B cells (hcT/B cells) with
complete TCR/BCR sequences and transcriptomes sourced from 215 datasets. To enhance user
experience, the authors have created a user-friendly web interface that offers interactive visualization
modules, enabling biologists to analyze transcriptome and TCR/BCR features at the single-cell level
with ease. Fan et al.(62) used huARdb to analyze immune cells from ulcerative colitis (UC)
patients. Similarly, they employed huARdb to investigate the composition of peripheral blood
immune cells and colonic cells in healthy individuals and UC patients(63). Additional cell-type-focused
single-cell databases include EndoDB(64), which hosts endothelial cell transcriptomics data from
360 datasets, and ABC portal(65), a database of blood cells across 198 datasets that allows
blood cell-type-specific exploration.
Table 1. Detailed list of the single-cell databases. The list encompasses essential information such
as database name, year of establishment, PubMed ID (PMID), number of citations, URL, web
interface availability, number of datasets/studies per database, cell count, primary vs aggregated
distinction, specific groups, tissue type, and download availability.
3 Challenges Associated with the Utilization of Large-scale Single-cell Databases and their
Examples from Current Literature

While the scRNA-seq field is progressing towards the improvement and development of large-scale
single-cell databases, their application in research comes with caveats, and despite their
size they must be used judiciously (Figure 1I). Some of the key considerations and limitations
include:
3.1 Data quality

Ensuring data quality in scRNA-seq is critical for accurate interpretation and analysis. A fundamental
assumption of droplet-based scRNA-seq is that each droplet, where molecular tagging and reverse
transcription occur, contains messenger RNA (mRNA) from a single cell. However, in practice, this
assumption is often violated, leading to potential distortions in the interpretation of scRNA-seq data.
Common examples include droplets containing multiple cells (doublets) or no cells at all (empty
droplets). A related problem is dropout, where a transcript present in a cell goes undetected and is
recorded as a zero count. These issues matter because large-scale databases such as the Human
Cell Atlas (HCA) rely heavily on the accuracy and cellular specificity of transcriptional readouts
generated by scRNA-seq.
3.1.1 Examples from current literature and benchmarking studies

Dropouts: To overcome the issue of dropouts, numerous single-cell imputation methods have been
developed. However, imputation affects downstream results, and some of these methods may
introduce false correlations. For example, Breda et al.(66) compared MAGIC(67) with
Sanity (SAmpling-Noise-corrected Inference of Transcription activitY) and found that MAGIC
introduced strong positive correlations where no or low correlation was expected. A comparative
study by Zhang et al.(68) highlighted that the number of cells and method parameters also affected
imputation results, and that some methods favored similar cells while imputing. Imputation
results can therefore be variable and will propagate into downstream analysis; in our
opinion, imputation should ideally be avoided or performed with caution.
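
The way smoothing can manufacture gene-gene correlation can be seen in a toy calculation. The sketch below is a deliberately simplified stand-in for smoothing-based imputation (each cell's value is replaced by the mean of a predefined neighborhood); it is not MAGIC or Sanity, and the counts and neighborhood labels are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_mean_smooth(values, groups):
    """Replace each cell's value by the mean of its group (toy imputation)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [totals[g] / counts[g] for g in groups]


# Invented counts for two genes across six cells in two neighborhoods.
gene_x = [4, 0, 2, 1, 0, 1]
gene_y = [0, 4, 2, 1, 1, 0]
neighborhood = ["a", "a", "a", "b", "b", "b"]

raw_r = pearson(gene_x, gene_y)
smoothed_r = pearson(
    group_mean_smooth(gene_x, neighborhood),
    group_mean_smooth(gene_y, neighborhood),
)
print(f"raw r = {raw_r:.2f}, smoothed r = {smoothed_r:.2f}")
```

In this toy case the two genes are negatively correlated in the raw counts, yet after smoothing both genes carry only the neighborhood means and their correlation becomes essentially perfect. Real imputation methods are far more sophisticated, but the example shows why correlations computed on imputed values deserve scrutiny.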

Doublets: Doublets do not represent real cells and are major confounders in scRNA-seq data analysis.
Computational methods exist to detect doublets in single-cell data. A
benchmarking study(69) compared nine doublet detection methods, revealing that there is still room
for improvement in detection accuracy. Generally, these methods performed better on datasets with
higher doublet rates, larger sequencing depths, more cell types, or greater heterogeneity between cell
types. Nevertheless, the removal of doublets by these methods led to improvements in various
downstream analyses: it enhanced the identification of differentially expressed (DE) genes, reduced
the presence of spurious cell clusters, and improved the inference of cell trajectories. The
extent of improvement varied across methods, highlighting the need for further refinement
and development in this area.
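
As a quick sanity check on downloaded data, candidate doublets can be flagged by looking for cells that co-express markers of lineages that should be mutually exclusive. The sketch below is a simple heuristic and not one of the benchmarked detection methods; the marker pair, the count threshold, and the toy counts are assumptions chosen for illustration.

```python
# Lineage markers that are rarely co-expressed in a single cell.
# The marker pair and the count threshold are illustrative assumptions.
EXCLUSIVE_MARKERS = ("CD3E", "EPCAM")  # T-cell marker vs epithelial marker
MIN_COUNT = 2

cells = {
    "cell_1": {"CD3E": 5, "EPCAM": 0},
    "cell_2": {"CD3E": 0, "EPCAM": 7},
    "cell_3": {"CD3E": 4, "EPCAM": 6},  # expresses both lineages
    "cell_4": {"CD3E": 1, "EPCAM": 3},  # CD3E is below the threshold
}


def flag_candidate_doublets(counts_by_cell, markers=EXCLUSIVE_MARKERS,
                            min_count=MIN_COUNT):
    """Flag cells in which every marker of the exclusive set passes the threshold."""
    return [
        cell_id
        for cell_id, counts in counts_by_cell.items()
        if all(counts.get(gene, 0) >= min_count for gene in markers)
    ]


candidates = flag_candidate_doublets(cells)
print(candidates)
```

A heuristic like this only catches doublets formed from two different lineages; doublets of the same cell type, which are harder to detect, would pass unnoticed.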
3.2 Normalization

Normalization is another critical aspect of scRNA-seq data analysis and can be a complex problem
when dealing with multiple datasets. Specifically, variability in experimental protocols and data
processing methods can pose challenges in data normalization, affecting the comparability of results
across datasets in a database. Differences in normalization approaches can lead to discrepancies in
gene expression profiles, making it difficult to draw meaningful conclusions from the downstream
analyses.
3.2.1 Examples from current literature and benchmarking studies

There are several methods to perform single-cell data normalization, such as SCTransform and
log1p normalization(17). The choice of method, however, depends on various features of the
data, including sequencing depth, as both lowly and highly abundant genes are confounded by
sequencing depth(17). Booeshaghi et al.(18) demonstrated that the assumptions implied in the choice
of normalization method affect downstream analysis when determining whether variation is
technical or biological. In a salient example, the TISCH2(51) database hosts single-cell gene expression
matrices for each dataset. In our analysis of TISCH2 data, these matrices are already normalized and
integrated. Users incorporating this data in their research need to be aware of this normalization to
assess the data accurately, and should not re-normalize it or merge it directly with other datasets,
which might produce substandard results. Therefore, in our opinion, when using datasets directly
from single-cell databases it is necessary to be aware of the pre-processing steps and how they affect
downstream results to ensure accurate analysis and interpretation.
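
A minimal guard against accidental re-normalization is to check whether a downloaded matrix still looks like raw counts before applying any transformation. The sketch below assumes a small cells-by-genes matrix held as nested lists; the helper names, the scaling target of 10,000, and the toy values are illustrative assumptions, and the whole-number test is only a heuristic that some processed matrices could pass.

```python
from math import log1p


def looks_like_raw_counts(matrix):
    """True if every entry is a non-negative whole number (heuristic only)."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def log1p_cp10k(matrix, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # A cell with no counts stays all-zero instead of dividing by zero.
            normalized.append([0.0 for _ in row])
            continue
        normalized.append([log1p(v / total * target_sum) for v in row])
    return normalized


raw = [[0, 3, 7], [10, 0, 0]]                      # toy raw counts
downloaded = [[0.0, 1.39, 2.08], [2.4, 0.0, 0.0]]  # toy pre-processed matrix

for name, matrix in (("raw", raw), ("downloaded", downloaded)):
    if looks_like_raw_counts(matrix):
        first_cell = [round(v, 2) for v in log1p_cp10k(matrix)[0]]
        print(name, "-> normalized, first cell:", first_cell)
    else:
        print(name, "-> appears pre-normalized; skipping re-normalization")
```

The check does not replace reading a database's documentation on pre-processing, but it can catch the common mistake of log-transforming values that were already log-transformed.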
3.3 Integration and batch effects removal

The integration and batch effect removal of scRNA-seq data from diverse datasets, labs, and
technologies can be complex(70). Variations in data formats, processing pipelines, and batch effects
can affect the robustness and reliability of integrated analyses, potentially masking true biological
signals. Methods for integrating heterogeneous datasets are continually evolving, with efforts focused
on minimizing batch effects and preserving biological variability. There are more than 50 integration
methods published to date(20,71).
3.3.1 Examples from current literature and benchmarking studies

Large databases host numerous datasets from multiple studies; however, it is also important to be
aware of the properties associated with each study during integration. For example, Salcher et al.(13)
established a large non-small cell lung cancer (NSCLC) atlas comprising 29 datasets spanning
1,283,972 single cells from 556 samples. Although this effort resulted in the in-depth characterization
of a neutrophil subpopulation, according to our re-analysis of this data, the 29
datasets include NSCLC samples from Maynard et al.(72) that were not
treatment-naive. This can be a potential confounder in downstream analysis. Therefore, it is the user's
responsibility to be aware of this data-specific property and to use atlases and databases with care to
derive robust biological insights. Similarly, several attempts have been made to benchmark
integration methods for single-cell data(19,20). While Tran et al.(19) showed that LIGER(73), Seurat
3(74), and Harmony(75) performed better than the 11 other methods evaluated, Luecken et al.(20) revealed
that LIGER and Seurat v3 favor the removal of batch effects over the conservation of biological
variation. This highlights the importance of considering the dataset and the specific research question
when selecting an integration method. In our view, similar to the Human Cell Atlas pipeline,
integration methods need to be benchmarked on each study to first evaluate which
method best suits the data. Selecting the right method is crucial as it directly impacts the
biological insights that can be generated from the integrated data.
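
To make the idea of benchmarking an integration concrete, the sketch below computes a simple batch-mixing score on a low-dimensional embedding: for each cell, the fraction of its k nearest neighbors that come from a different batch. This is a simplified illustration and not the metric of any specific benchmark; the function name, the choice of k, and the toy coordinates are assumptions.

```python
from math import dist


def batch_mixing_score(embedding, batches, k=2):
    """Mean fraction of each cell's k nearest neighbors from another batch."""
    fractions = []
    for i, point in enumerate(embedding):
        # Rank all other cells by Euclidean distance to the current cell.
        neighbors = sorted(
            (j for j in range(len(embedding)) if j != i),
            key=lambda j: dist(point, embedding[j]),
        )[:k]
        other = sum(batches[j] != batches[i] for j in neighbors)
        fractions.append(other / k)
    return sum(fractions) / len(fractions)


batches = ["A", "A", "A", "B", "B", "B"]

# Before integration: the two batches sit in separate regions.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# After a hypothetical integration: cells from both batches interleave.
mixed = [(0, 0), (2, 0), (4, 0), (1, 0), (3, 0), (5, 0)]

print("separated:", batch_mixing_score(separated, batches))
print("mixed:", round(batch_mixing_score(mixed, batches), 3))
```

A high mixing score on its own is not evidence of a good integration, because a method can mix batches by erasing real biological differences. Any batch-removal score should be read alongside a measure of how well known cell-type structure is preserved.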
3.4 Cell-Type Annotation

Accurate annotation of cell types in scRNA-seq databases is crucial for interpreting results
correctly. While automatic cell-type annotation methods are convenient, they may disagree
with manual annotations from original datasets. This can introduce ambiguity in cell-type
assignments and lead to misinterpretation of results. Although harmonizing cell type annotations
across different datasets is essential for facilitating cross-study comparisons and meta-analysis, in our
opinion the results should be manually validated to make sure the automatic annotations are
biologically plausible.
3.4.1 Examples from current literature and benchmarking studies

In our recent re-analysis of Tabula Sapiens data(29), we observed that 10% of the heart cells were
labelled as hepatocytes in the study’s original metadata. This is biologically implausible because
hepatocytes are liver epithelial cells and are not expected in the heart(76). One potential reason for this
mislabelling is that Tabula Sapiens data was annotated using an automatic cell-type annotation
tool; another could be sample mishandling. Therefore, diligent manual intervention in cell
type annotation needs to be practiced to ensure accurate and robust results. Additionally, Abdelaal et
al.(21) compared the performance of 22 automatic cell-type identification
methods on single-cell data. Although the authors did not state a preference, they noted that the results
can vary depending on input features and the number of cells. Automatic methods therefore cannot be solely
relied on, and some manual intervention will be needed for accurate cell type annotation.
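
A lightweight plausibility check on annotation metadata can surface cases like this before any downstream analysis. The sketch below uses a hypothetical lookup of labels considered implausible for a tissue; the lookup contents, field names, and records are invented for illustration and are not taken from Tabula Sapiens metadata.

```python
# Hypothetical lookup of cell-type labels considered implausible per tissue.
IMPLAUSIBLE = {
    "heart": {"hepatocyte"},
    "liver": {"cardiac muscle cell"},
}

# Invented cell-level metadata; field names are assumptions.
cell_metadata = [
    {"cell_id": "c1", "tissue": "heart", "cell_type": "cardiac muscle cell"},
    {"cell_id": "c2", "tissue": "heart", "cell_type": "hepatocyte"},
    {"cell_id": "c3", "tissue": "liver", "cell_type": "hepatocyte"},
]


def flag_implausible(records, implausible=IMPLAUSIBLE):
    """Return IDs of cells whose label is listed as implausible for their tissue."""
    return [
        r["cell_id"]
        for r in records
        if r["cell_type"] in implausible.get(r["tissue"], set())
    ]


flagged = flag_implausible(cell_metadata)
print(flagged)
```

Flagged cells are candidates for manual review rather than automatic removal, since the cause could be an annotation error, sample mishandling, or an incomplete lookup table.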
3.5 “Zero-code” Single-cell Analysis Platforms

Single-cell data plays a crucial role in validating and enhancing the accuracy of wet lab results and
hypothesis-driven publications(77). To facilitate easy access and analysis of this data, many
databases provide built-in tools that allow researchers without computational expertise to explore
existing datasets and assess their hypotheses using basic operations like exploring gene expressions,
isolating cell subsets for individual analysis, and identifying clusters within the data. However, for
more complex analyses that require significant computational resources, these tools are often not
available directly on the database platforms.

To address this challenge, wet lab scientists can employ platforms like ICARUS_v3(78) for
zero-code single-cell analysis. ICARUS_v3 employs a geometric cell sketching method to subsample
representative cells from the dataset to store in memory. This enables advanced scRNA-seq analysis
through a user-friendly web interface. ICARUS_v3 can seamlessly integrate with output files from
databases like Single Cell Portal (SCP)(27) and CZ CellxGene Discover(28), eliminating the need for
coding expertise. Users can leverage this platform to conduct a wide range of analyses, including
differential expression analysis, gene regulatory network construction, trajectory analysis, and
cell-cell communication inference. While ICARUS_v3 has its assumptions and limitations, it offers
users the flexibility to set parameters at each stage and provides a diverse range of operations. This
capability not only bridges the gap between experimental and bioinformatics researchers but also
simplifies the utilization of scRNA-seq databases.
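
The intuition behind sketching-based subsampling can be shown with a much-simplified grid version: the embedding is divided into equal-sized boxes and cells are drawn by cycling over occupied boxes, so sparse regions are represented alongside dense ones. This is an illustration of the general idea only and is not the implementation used by ICARUS_v3; the function name, box size, and coordinates are assumptions.

```python
from math import floor


def grid_sketch(points, box_size, n_keep):
    """Pick up to n_keep point indices by cycling over occupied grid boxes."""
    boxes = {}
    for idx, point in enumerate(points):
        key = tuple(floor(coord / box_size) for coord in point)
        boxes.setdefault(key, []).append(idx)

    selected = []
    queues = list(boxes.values())  # insertion order keeps the result deterministic
    while len(selected) < n_keep and any(queues):
        for members in queues:
            if members and len(selected) < n_keep:
                selected.append(members.pop(0))
    return sorted(selected)


# A dense cluster of five cells near the origin and one rare cell far away.
embedding = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2), (5.0, 5.0)]
sketch = grid_sketch(embedding, box_size=1.0, n_keep=2)
print(sketch)
```

With uniform random sampling of two cells, the rare cell would usually be missed; sampling per box keeps it in the subsample, which is why sketching is attractive for holding a representative subset of a large dataset in memory.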

4 Platforms for hosting and visualizing large-scale single-cell data

As the volume of single-cell data continues to grow, scalability becomes a significant concern.
Developing methods and infrastructure that can handle the increasing complexity and size of
single-cell datasets is crucial for future research. Towards this aim, for easy, fast, and customizable
exploration of single-cell data for public use, numerous user-driven platforms have emerged(79,80).

One such platform, the Interactive SummarizedExperiment Explorer (iSEE)(79), launched in 2018,
enables users to host their SummarizedExperiment data. Researchers such as Graf et al.(81) and
Newton et al.(82) have employed iSEE to visualize their single-cell data, demonstrating its utility in
data exploration. Similarly, the Single Cell Explorer(80) allows users to input loom and Seurat
objects, making the data more accessible.

ShinyCell(83) is another example of a platform offering web-based interfaces for exploring and
analyzing data. These interfaces can be customized for maximum usability and can be uploaded to
online platforms to broaden access to published data. ShinyCell supports various common single-cell
data formats, including SingleCellExperiment, h5ad, loom, and Seurat objects, as inputs. In a salient
example, Ma et al.(84) used ShinyCell to host their pan-cancer single-cell data, showcasing its
versatility and effectiveness in data dissemination. Likewise, Curras-Alonso et al.(85) developed
their web application using ShinyCell, highlighting its widespread adoption in the research
community. By providing easy-to-use tools for data analysis, these platforms help democratize access
to single-cell data and facilitate collaboration between researchers from different disciplines.

5 Discussion

The rapid expansion of single-cell RNA-seq (scRNA-seq) studies has ushered in a plethora of
databases and repositories dedicated to storing, retrieving, and interpreting single-cell data. These
databases provide a wealth of single-cell transcriptomic data that can be used to build computational
models to understand various biological processes. However, challenges such as data quality,
normalization, integration, and annotation can affect the reliability and comparability of results
across different datasets and studies.

While the existing databases are valuable for basic scRNA-seq analysis, they often cannot perform
advanced analyses such as regulon activity assessment, pseudobulking, and differential gene
expression analysis. Users still need to possess programming skills and be familiar with using a
command-line interface to conduct customized analysis. Furthermore, many wet labs may not have
the necessary resources to manage high-performance computing clusters. To address this gap and
enable wet-lab researchers to conduct advanced scRNA-seq analysis, platforms like ICARUS_v3(78)
offer web-based analysis tools. These platforms provide an accessible way for researchers to explore
and analyze single-cell data, bridging the gap between wet lab experimentation and bioinformatics
analysis.
Taken together, this mini-review addresses the utility and applicability of large-scale scRNA-seq
databases. We discuss some of the challenges and common assumptions that need to be considered
when using these databases for hypothesis-driven studies, and highlight platforms for hosting
customized scRNA-seq data for community usage. While challenges remain, the development of
user-friendly platforms is narrowing the gap between wet-lab experimentation and bioinformatics
analysis, ultimately advancing our understanding of cellular processes at a single-cell level.

6 Acknowledgments

The study was supported by the National Cancer Institute (NCI) Outstanding Investigator Award
R35CA231996 (A.M.C.), NCI Prostate SPORE grant P50CA186786 (A.M.C.), and NCI
Michigan-VUMC Biomarker Characterization Center grant U2CCA271854 (A.M.C.). A.M.C. is also
a Howard Hughes Medical Institute Investigator, A. Alfred Taubman Scholar, and American Cancer
Society Professor. This manuscript was also supported in part by funding from the Innovation in
Cancer Informatics (ICI398672) and the V Foundation for Cancer Research (T2019-006) to M.C. We
would like to acknowledge the use of ChatGPT, an AI language model developed by OpenAI, for
assisting with language editing and suggestions during the preparation of this manuscript.

7 Author contributions

MC, AMC, SURS, and MNG carried out the study design and drafted the manuscript. All authors
contributed to the article and approved the submitted version.

8 Competing interests

A.M.C. is a co-founder of and serves as a Scientific Advisory Board member for LynxDx, Esanik
Therapeutics, Medsyn, and Flamingo Therapeutics. A.M.C. is a scientific advisor or consultant for
EdenRoc, Aurigene Oncology, Ascentage Pharma, Proteovant, Belharra, Rappta Therapeutics, and
Tempus.

9 References

1. Wu X, Yang B, Udo-Inyang I, Ji S, Ozog D, Zhou L, et al. Research Techniques Made Simple:
Single-Cell RNA Sequencing and its Applications in Dermatology. J Invest Dermatol. 2018
May;138(5):1004–9.

2. Valihrach L, Androvic P, Kubista M. Platforms for Single-Cell Collection and Analysis. Int J Mol Sci
[Internet]. 2018 Mar 11;19(3). Available from: http://dx.doi.org/10.3390/ijms19030807

3. Mereu E, Lafzi A, Moutinho C, Ziegenhain C, McCarthy DJ, Álvarez-Varela A, et al. Benchmarking
single-cell RNA-sequencing protocols for cell atlas projects. Nat Biotechnol. 2020 Jun;38(6):747–55.

4. Svensson V, da Veiga Beltrame E, Pachter L. A curated database reveals trends in single-cell
transcriptomics. Database [Internet]. 2020 Nov 28;2020. Available from:
http://dx.doi.org/10.1093/database/baaa073

5. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. The Human Cell Atlas. Elife
[Internet]. 2017 Dec 5;6. Available from: http://dx.doi.org/10.7554/eLife.27041

6. Abugessaisa I, Noguchi S, Böttcher M, Hasegawa A, Kouno T, Kato S, et al. SCPortalen: human and
mouse single-cell centric database. Nucleic Acids Res. 2018 Jan 4;46(D1):D781–7.

7. Papatheodorou I, Moreno P, Manning J, Fuentes AMP, George N, Fexova S, et al. Expression Atlas
update: from tissues to single cells. Nucleic Acids Res. 2020 Jan 8;48(D1):D77–83.

8. Wang Z, Feng X, Li SC. SCDevDB: A Database for Insights Into Single-Cell Gene Expression Profiles
During Human Developmental Processes. Front Genet. 2019 Sep 26;10:903.

9. Franzén O, Gan LM, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human
single-cell RNA sequencing data. Database [Internet]. 2019 Jan 1;2019. Available from:
http://dx.doi.org/10.1093/database/baz046

10. Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, et al. Construction of a human cell landscape at
single-cell level. Nature. 2020 May;581(7808):303–9.

11. Strzelecka PM, Ranzoni AM, Cvejic A. Dissecting human disease with single-cell omics: application in
model systems and in the clinic. Dis Model Mech [Internet]. 2018 Nov 5;11(11). Available from:
http://dx.doi.org/10.1242/dmm.036525

12. Chen D, Luo Y, Cheng G. Single cell and immunity: Better understanding immune cell heterogeneities
with single cell sequencing. Clin Transl Med. 2023 Jan;13(1):e1159.

13. Salcher S, Sturm G, Horvath L, Untergasser G, Kuempers C, Fotakis G, et al. High-resolution single-cell
atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer
Cell. 2022 Dec 12;40(12):1503–20.e8.

14. Lee H, Yu H, Welch J. A beginner’s guide to single-cell transcriptomics. Biochem. 2019 Oct
18;41(5):34–8.

15. Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome
Biol. 2021 Oct 29;22(1):301.

16. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand
challenges in single-cell data science. Genome Biol. 2020 Feb 7;21(1):31.

17. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using
regularized negative binomial regression. Genome Biol. 2019 Dec 23;20(1):296.

18. Sina Booeshaghi A, Hallgrímsdóttir IB, Gálvez-Merchán Á, Pachter L. Depth normalization for
single-cell genomics count data [Internet]. bioRxiv. 2022 [cited 2022 May 9]. p. 2022.05.06.490859.
Available from: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1

19. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect
correction methods for single-cell RNA sequencing data. Genome Biol. 2020 Jan 16;21(1):12.

20. Luecken MD, Büttner M, Chaichoompu K, Danese A, Interlandi M, Mueller MF, et al. Benchmarking
atlas-level data integration in single-cell genomics. Nat Methods. 2022 Jan;19(1):41–50.

21. Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, et al. A comparison of automatic
cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019 Sep 9;20(1):194.

22. Argelaguet R, Cuomo ASE, Stegle O, Marioni JC. Computational principles and challenges in single-cell
data integration. Nat Biotechnol. 2021 Oct;39(10):1202–15.

23. Badia-I-Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, et al. Gene
regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet. 2023
Nov;24(11):739–54.

24. Rood JE, Maartens A, Hupalowska A, Teichmann SA, Regev A. Impact of the Human Cell Atlas on
medicine. Nat Med. 2022 Dec;28(12):2486–96.

25. Elmentaite R, Domínguez Conde C, Yang L, Teichmann SA. Single-cell atlases: shared and
tissue-specific cell types across human organs. Nat Rev Genet. 2022 Jul;23(7):395–410.

26. Sikkema L, Ramírez-Suástegui C, Strobl DC, Gillett TE, Zappia L, Madissoon E, et al. An integrated cell
atlas of the lung in health and disease. Nat Med. 2023 Jun;29(6):1563–77.

27. Tarhan L, Bistline J, Chang J, Galloway B, Hanna E, Weitz E. Single Cell Portal: an interactive home for
single-cell genomics data. bioRxiv [Internet]. 2023 Jul 17; Available from:
http://dx.doi.org/10.1101/2023.07.13.548886

28. CZI Single-Cell Biology Program, Abdulla S, Aevermann B, Assis P, Badajoz S, Bell SM, et al. CZ
CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of
aggregated data [Internet]. bioRxiv. 2023 [cited 2024 Mar 16]. p. 2023.10.30.563174. Available from:
https://www.biorxiv.org/content/10.1101/2023.10.30.563174v1

29. Tabula Sapiens Consortium*, Jones RC, Karkanias J, Krasnow MA, Pisco AO, Quake SR, et al. The
Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022 May
13;376(6594):eabl4896.

30. Gondal MN, Mannan R, Bao Y, Hu J, Cieslik M, Chinnaiyan AM. Abstract 860: Pan-tissue master
regulator inference reveals mechanisms of MHC alterations in cancers. Cancer Res. 2024 Mar
22;84(6_Supplement):860–860.

31. Speir ML, Bhaduri A, Markov NS, Moreno P, Nowakowski TJ, Papatheodorou I, et al. UCSC Cell
Browser: visualize your single-cell data. Bioinformatics. 2021 Dec 7;37(23):4578–80.

32. Karlsson M, Zhang C, Méar L, Zhong W, Digre A, Katona B, et al. A single-cell type transcriptomics
map of human tissues. Sci Adv [Internet]. 2021 Jul;7(31). Available from:
http://dx.doi.org/10.1126/sciadv.abh2169

33. Zhao T, Lyu S, Lu G, Juan L, Zeng X, Wei Z, et al. SC2disease: a manually curated database of
single-cell transcriptome for human diseases. Nucleic Acids Res. 2021 Jan 8;49(D1):D1413–9.

34. Li Y, Tan R, Chen Y, Liu Z, Chen E, Pan T, et al. SC2sepsis: sepsis single-cell whole gene expression
database. Database [Internet]. 2022 Aug 18;2022. Available from:
http://dx.doi.org/10.1093/database/baac061

35. Wang C, McNutt M, Ma A, Fu H, Ma Q. ssREAD: A Single-cell and Spatial RNA-seq Database for
Alzheimer’s Disease. bioRxiv [Internet]. 2023 Sep 12; Available from:
http://dx.doi.org/10.1101/2023.09.08.556944

36. Qi C, Wang C, Zhao L, Zhu Z, Wang P, Zhang S, et al. SCovid: single-cell atlases for exposing molecular
characteristics of COVID-19 across 10 human tissues. Nucleic Acids Res. 2022 Jan 7;50(D1):D867–74.

37. Gondal MN, Chaudhary SU. Navigating Multi-Scale Cancer Systems Biology Towards Model-Driven
Clinical Oncology and Its Applications in Personalized Therapeutics. Front Oncol. 2021 Nov
24;11:712505.

38. Ding S, Chen X, Shen K. Single-cell RNA sequencing in breast cancer: Understanding tumor
heterogeneity and paving roads to individualized therapy. Cancer Commun. 2020 Aug;40(8):329–44.

39. Huang D, Ma N, Li X, Gou Y, Duan Y, Liu B, et al. Advances in single-cell RNA sequencing and its
applications in cancer research. J Hematol Oncol. 2023 Aug 24;16(1):98.

40. Wang Y, Mashock M, Tong Z, Mu X, Chen H, Zhou X, et al. Changing Technologies of RNA Sequencing
and Their Applications in Clinical Oncology. Front Oncol. 2020 Apr 9;10:447.

41. Zeng J, Zhang Y, Shang Y, Mai J, Shi S, Lu M, et al. CancerSCEM: a database of single-cell expression
map across various human cancers. Nucleic Acids Res. 2022 Jan 7;50(D1):D1147–55.

42. Gondal MN, Cieslik M, Chinnaiyan AM. Integrated cancer cell-specific single-cell RNA-seq datasets of
immune checkpoint blockade-treated patients. bioRxiv [Internet]. 2024 Jan 22; Available from:
http://dx.doi.org/10.1101/2024.01.17.576110

43. Yuan H, Yan M, Zhang G, Liu W, Deng C, Liao G, et al. CancerSEA: a cancer single-cell state atlas.
Nucleic Acids Res. 2019 Jan 8;47(D1):D900–8.

44. Dohmen J, Baranovskii A, Ronen J, Uyar B, Franke V, Akalin A. Identifying tumor cells at the single-cell
level using machine learning. Genome Biol. 2022 May 30;23(1):123.

45. Zhao M, Chauhan P, Sherman CA, Singh A, Kaileh M, Mazan-Mamczarz K, et al. NF-κB subunits direct
kinetically distinct transcriptional cascades in antigen receptor-activated B cells. Nat Immunol. 2023
Sep;24(9):1552–64.

46. Tang Y, Kwiatkowski DJ, Henske EP. Midkine expression by stem-like tumor cells drives persistence to
mTOR inhibition and an immune-suppressive microenvironment. Nat Commun. 2022 Aug
26;13(1):5018.

47. Wang L, Cao Y, Guo W, Xu J. High expression of cuproptosis-related gene FDX1 in relation to good
prognosis and immune cells infiltration in colon adenocarcinoma (COAD). J Cancer Res Clin Oncol.
2023 Jan;149(1):15–24.

48. Lan Z, Yao X, Sun K, Li A, Liu S, Wang X. The Interaction Between lncRNA SNHG6 and hnRNPA1
Contributes to the Growth of Colorectal Cancer by Enhancing Aerobic Glycolysis Through the
Regulation of Alternative Splicing of PKM. Front Oncol. 2020 Mar 31;10:363.

49. Deng L, Jiang A, Zeng H, Peng X, Song L. Comprehensive analyses of PDHA1 that serves as a
predictive biomarker for immunotherapy response in cancer. Front Pharmacol. 2022 Aug 8;13:947372.

50. Sun D, Wang J, Han Y, Dong X, Ge J, Zheng R, et al. TISCH: a comprehensive web resource enabling
interactive single-cell transcriptome visualization of tumor microenvironment. Nucleic Acids Res. 2021
Jan 8;49(D1):D1420–30.

51. Han Y, Wang Y, Dong X, Sun D, Liu Z, Yue J, et al. TISCH2: expanded datasets and new tools for
single-cell transcriptome analyses of the tumor microenvironment. Nucleic Acids Res. 2023 Jan
6;51(D1):D1425–31.

52. Xu Z, Pei C, Cheng H, Song K, Yang J, Li Y, et al. Comprehensive analysis of FOXM1 immune
infiltrates, m6a, glycolysis and ceRNA network in human hepatocellular carcinoma. Front Immunol. 2023
May 10;14:1138524.

53. Zhang Y, Gan W, Ru N, Xue Z, Chen W, Chen Z, et al. Comprehensive multi-omics analysis reveals
m7G-related signature for evaluating prognosis and immunotherapy efficacy in osteosarcoma. J Bone
Oncol. 2023 Jun;40:100481.

54. Benedetti E, Liu EM, Tang C, Kuo F, Buyukozkan M, Park T, et al. A multimodal atlas of tumour
metabolism reveals the architecture of gene-metabolite covariation. Nat Metab. 2023 Jun;5(6):1029–44.

55. Liu Y’e, Lu S, Sun Y, Wang F, Yu S, Chen X, et al. Deciphering the role of QPCTL in glioma
progression and cancer immunotherapy. Front Immunol. 2023 Mar 29;14:1166377.

56. Liu Y’e, Zhao S, Chen Y, Ma W, Lu S, He L, et al. Vimentin promotes glioma progression and maintains
glioma cell resistance to oxidative phosphorylation inhibition. Cell Oncol. 2023 Dec;46(6):1791–806.

57. Liu Y, Yang J, Wang T, Luo M, Chen Y, Chen C, et al. Expanding PROTACtable genome universe of E3
ligases. Nat Commun. 2023 Oct 16;14(1):6509.

58. Zhao S, Chi H, Ji W, He Q, Lai G, Peng G, et al. A Bioinformatics-Based Analysis of an Anoikis-Related
Gene Signature Predicts the Prognosis of Patients with Low-Grade Gliomas. Brain Sci [Internet]. 2022
Oct 5;12(10). Available from: http://dx.doi.org/10.3390/brainsci12101349

59. Liu Y, Altreuter J, Bodapati S, Cristea S, Wong CJ, Wu CJ, et al. Predicting patient outcomes after
treatment with immune checkpoint blockade: A review of biomarkers derived from diverse data
modalities. Cell Genom. 2024 Jan 10;4(1):100444.

60. Ner-Gaon H, Melchior A, Golan N, Ben-Haim Y, Shay T. JingleBells: A Repository of Immune-Related
Single-Cell RNA-Sequencing Datasets. J Immunol. 2017 May 1;198(9):3375–9.

61. Wu L, Xue Z, Jin S, Zhang J, Guo Y, Bai Y, et al. huARdb: human Antigen Receptor database for
interactive clonotype-transcriptome analysis at the single-cell level. Nucleic Acids Res. 2022 Jan
7;50(D1):D1244–54.

62. Fan Q, Li M, Zhao W, Zhang K, Li M, Li W. Hyper α2,6-Sialylation Promotes CD4+ T-Cell Activation
and Induces the Occurrence of Ulcerative Colitis. Adv Sci. 2023 Sep;10(26):e2302607.

63. Fan Q, Dai W, Li M, Wang T, Li X, Deng Z, et al. Inhibition of α2,6-sialyltransferase relieves symptoms
of ulcerative colitis by regulating Th17 cells polarization. Int Immunopharmacol. 2023 Dec;125(Pt
A):111130.

64. Khan S, Taverna F, Rohlenova K, Treps L, Geldhof V, de Rooij L, et al. EndoDB: a database of
endothelial cell transcriptomics data. Nucleic Acids Res. 2019 Jan 8;47(D1):D736–44.

65. Gao X, Hong F, Hu Z, Zhang Z, Lei Y, Li X, et al. ABC portal: a single-cell database and web server for
blood cells. Nucleic Acids Res. 2023 Jan 6;51(D1):D792–804.

66. Breda J, Zavolan M, van Nimwegen E. Bayesian inference of gene expression states from single-cell
RNA-seq data. Nat Biotechnol. 2021 Aug;39(8):1008–16.

67. van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, et al. Recovering Gene Interactions from
Single-Cell Data Using Data Diffusion. Cell. 2018 Jul 26;174(3):716–29.e27.

68. Zhang L, Zhang S. Comparison of Computational Methods for Imputing Single-Cell RNA-Sequencing
Data. IEEE/ACM Trans Comput Biol Bioinform. 2020 Mar-Apr;17(2):376–89.

69. Xi NM, Li JJ. Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA
Sequencing Data. Cell Syst. 2021 Feb 17;12(2):176–94.e6.

70. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, et al. Comprehensive
Integration of Single-Cell Data. Cell. 2019 Jun 13;177(7):1888–902.e21.

71. Zappia L, Phipson B, Oshlack A. Exploring the single-cell RNA-seq analysis landscape with the
scRNA-tools database. PLoS Comput Biol. 2018 Jun;14(6):e1006245.

72. Maynard A, McCoach CE, Rotow JK, Harris L, Haderk F, Kerr DL, et al. Therapy-Induced Evolution of
Human Lung Cancer Revealed by Single-Cell RNA Sequencing. Cell. 2020 Sep 3;182(5):1232–51.e22.

73. Liu J, Gao C, Sodicoff J, Kozareva V, Macosko EZ, Welch JD. Jointly defining cell types from multiple
single-cell datasets using LIGER. Nat Protoc. 2020 Nov;15(11):3632–62.

74. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression
data. Nat Biotechnol. 2015 May;33(5):495–502.

75. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate
integration of single-cell data with Harmony. Nat Methods. 2019 Dec;16(12):1289–96.

76. Gong J, Tu W, Liu J, Tian D. Hepatocytes: A key role in liver inflammation. Front Immunol.
2022;13:1083780.

77. Bao Y, Qiao Y, Choi JE, Zhang Y, Mannan R, Cheng C, et al. Targeting the lipid kinase PIKfyve
upregulates surface expression of MHC class I to augment cancer immunotherapy. Proc Natl Acad Sci U
S A. 2023 Dec 5;120(49):e2314416120.

78. Jiang A, Snell RG, Lehnert K. ICARUS v3, a massively scalable web server for single cell RNA-seq
analysis of millions of cells [Internet]. bioRxiv. 2023 [cited 2024 Mar 20]. p. 2023.11.20.567692.
Available from: https://www.biorxiv.org/content/biorxiv/early/2023/11/21/2023.11.20.567692.1

79. Rue-Albrecht K, Marini F, Soneson C, Lun ATL. iSEE: Interactive SummarizedExperiment Explorer.
F1000Res. 2018 Jun 14;7:741.

80. Feng D, Whitehurst CE, Shan D, Hill JD, Yue YG. Single Cell Explorer, collaboration-driven tools to
leverage large-scale single cell RNA-seq data. BMC Genomics. 2019 Aug 27;20(1):676.

81. Graf C, Wilgenbus P, Pagel S, Pott J, Marini F, Reyda S, et al. Myeloid cell-synthesized coagulation
factor X dampens antitumor immunity. Sci Immunol [Internet]. 2019 Sep 20;4(39). Available from:
http://dx.doi.org/10.1126/sciimmunol.aaw8405

82. Newton AH, Williams SM, Major AT, Smith CA. Cell lineage specification and signalling pathway use
during development of the lateral plate mesoderm and forelimb mesenchyme. Development [Internet].
2022 Sep 15;149(18). Available from: http://dx.doi.org/10.1242/dev.200702

83. Ouyang JF, Kamaraj US, Cao EY, Rackham OJL. ShinyCell: simple and sharable visualization of
single-cell gene expression data. Bioinformatics. 2021 Oct 11;37(19):3374–6.

84. Ma C, Yang C, Peng A, Sun T, Ji X, Mi J, et al. Pan-cancer spatially resolved single-cell analysis reveals
the crosstalk between cancer-associated fibroblasts and tumor microenvironment. Mol Cancer. 2023 Oct
13;22(1):170.

85. Curras-Alonso S, Soulier J, Defard T, Weber C, Heinrich S, Laporte H, et al. An interactive murine
single-cell atlas of the lung responses to radiation injury. Nat Commun. 2023 Apr 28;14(1):2445.
