
nature machine intelligence

Article https://doi.org/10.1038/s42256-022-00534-z

scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data

Fan Yang1,7, Wenchuan Wang1,2,7, Fang Wang1,7, Yuan Fang1,3,4, Duyu Tang1, Junzhou Huang5, Hui Lu2,6 and Jianhua Yao1

Received: 3 February 2022
Accepted: 19 August 2022
Published online: 26 September 2022

Annotating cell types on the basis of single-cell RNA-seq data is a prerequisite for research on disease progress and tumour microenvironments. Here we show that existing annotation methods typically suffer from a lack of curated marker gene lists, improper handling of batch effects and difficulty in leveraging the latent gene–gene interaction information, impairing their generalization and robustness. We developed a pretrained deep neural network-based model, single-cell bidirectional encoder representations from transformers (scBERT), to overcome the challenges. Following BERT's approach to pretraining and fine-tuning, scBERT attains a general understanding of gene–gene interactions by being pretrained on huge amounts of unlabelled scRNA-seq data; it is then transferred to the cell type annotation task of unseen and user-specific scRNA-seq data for supervised fine-tuning. Extensive and rigorous benchmark studies validated the superior performance of scBERT on cell type annotation, novel cell type discovery, robustness to batch effects and model interpretability.

Single-cell RNA-sequencing (scRNA-seq) has been extensively used for the characterization of complex tissues and organisms at the single-cell level1–3, which has revolutionized transcriptomic studies. Accurate cell type annotation on scRNA-seq is critical for biological and medical research4. Cell type annotation methods can be categorized into three types: (1) annotation using marker genes, (2) annotation using correlation-based methods and (3) annotation by supervised classification5.

Cluster-then-annotate is the commonly used method6, where manually curated marker genes identified from the literature are employed to assign cell types for clusters derived from unsupervised learning5. However, selecting the marker genes depends on the prior knowledge of researchers and is therefore prone to biases and errors7. Furthermore, marker genes for cell types of interest are not always available, and novel cell types do not have marker gene sets yet. Besides, most cell types are determined by a set of genes instead of a single marker gene8. Without a proper method to integrate the expression information of multiple marker genes, it is difficult to guarantee a unified and accurate cell type assignment for each cluster9,10. For example, some automatic annotation methods are built on the hypothesis that marker genes should have high expression in cells. However, even some well-documented marker genes do not have high expression in all of the cells in the corresponding cell types11. The absence or fluctuation of the expression of these marker

1AI Lab, Tencent, Shenzhen, China. 2SJTU-Yale Joint Center for Biostatistics and Data Science, School of Life Sciences and Biotechnology, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China. 3Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA. 4Department of Immunology, Harvard Medical School, Boston, MA, USA. 5Department of Computer Science and Engineering, the University of Texas at Arlington, Arlington, TX, USA. 6Center for Biomedical Informatics, Shanghai Engineering Research Center for Big Data in Pediatric Precision Medicine, Shanghai Children's Hospital, Shanghai, China. 7These authors contributed equally: Fan Yang, Wenchuan Wang, Fang Wang.
e-mail: [email protected]; [email protected]

Nature Machine Intelligence | Volume 4 | October 2022 | 852–866 852



genes might therefore considerably affect the preciseness of marker-gene-based methods.

Instead of relying on a small set of marker genes, correlation-based methods measure the correlation of gene expression profiles between the query samples and reference dataset5. These methods are potentially affected by the batch effect across platforms and experiments12. Although batch-effect correction methods exist, it is still challenging to distinguish true biological diversity from technical differences and thus preserve important biological variations13. Meanwhile, the commonly used similarity measures (that is, cosine similarity, Spearman's correlation and Pearson correlation) may not be robust or efficient at measuring the distance between two sets of high-dimensional, sparse scRNA-seq data14.

Annotation by supervised/semi-supervised classification methods follows the classic paradigm in machine learning that recognizes patterns in gene expression profiles and then transfers the labels from labelled to unlabelled datasets5. Such methods have been widely used recently due to their robustness to noise and variability of data, as well as their independence from artificially selected marker genes. Nevertheless, due to their limited model capacity, most of these methods need to perform highly variable gene (HVG) selection and dimensionality reduction before inputting the data into the classifier15–19. However, HVGs are variable across different batches and datasets, hindering their generalization ability across cohorts16. Dimensionality reduction techniques such as principal component analysis (PCA) may lose high-dimensional information as well as gene-level independent interpretability. Furthermore, the parameter settings for HVG selection and PCA in these methods are far from reaching a consensus and inevitably introduce artificial bias for performance evaluation15–19. Given that the HVGs are selected on the basis of the expression variance across the whole dataset, in which the dominant cell types account for the most variance, there is a risk of overlooking the key genes of rare cell types. Selecting HVGs ignores co-occurrence and the biological interactions of genes (especially between HVGs and non-HVGs), which are useful for cell type annotation20. Besides, simple classifiers such as fully connected networks are not able to efficiently capture gene–gene interactions. A new method with improved pattern recognition ability is therefore required to overcome the above issues of under-fitting to large-scale datasets.

A growing number of deep learning-based methods have recently been applied to scRNA-seq data analyses and achieved superior performance21–23. The bidirectional encoder representations from transformers (BERT) is a state-of-the-art (SOTA) Transformer-based language representation learning model. It has made breakthrough progress in the fields of natural language processing (NLP) due to the powerful self-attention mechanism and long-range information integration capability introduced by transformer layers24,25. BERT's paradigm of pretraining and fine-tuning enables the use of large-scale unlabelled data to improve the generalizability of the AI model. Inspired by such exciting progress, we developed the single-cell BERT (scBERT) model for the cell annotation of scRNA-seq data. Following the pretraining and fine-tuning paradigm, we validated the power of applying self-supervised learning on large-scale unlabelled scRNA-seq data to improve the model's generalizability and overcome the batch effect. Extensive benchmarking indicated that scBERT can provide robust and accurate cell type annotations with gene-level interpretability. To the best of our knowledge, scBERT pioneered the application of Transformer architectures in scRNA-seq data analysis with innovatively designed embeddings for genes.

Results
The scBERT algorithm
The original BERT25 proposed a revolutionary technique that generates generic knowledge of language by pretraining and then transfers the knowledge to downstream tasks of different configurations using fine-tuning. Following BERT's philosophy and paradigm, we developed a novel and unified architecture named scBERT (Fig. 1), which learns general scRNA-seq knowledge by being pretrained on millions of unlabelled scRNA-seq data with a variety of cell types from different sources, and assigns cell types by simply plugging in a classifier and fine-tuning the parameters supervised by reference datasets. Pretraining enables the model to learn the general syntax of gene–gene interactions, which helps to remove the batch effects across datasets and improve the generalizability (Extended Data Fig. 1a). Fine-tuning ensures that the output embedding for each gene encodes context information that is more relevant to the transcriptional profiles of the reference dataset. To annotate a query cell, scBERT computes the probability for the cell to be any of the cell types labelled in the reference dataset by mining the high-level implicit patterns (Extended Data Fig. 1b). Note that if there is no cell type to assign with high confidence, the query cell would be labelled as unassigned to prevent incorrect assignment and to allow novel cell type discovery. Compared with the original BERT model, scBERT has some innovative designs to unleash its power in the cell type annotation task.

First, the embedding of BERT includes token and position embeddings25. Our design of embeddings is similar to BERT in some aspects while having unique features to leverage gene knowledge. The token embedding of the original BERT is a discrete variable (standing for a word), whereas the raw expression input to our model is a continuous variable (standing for the expression of a gene in a single cell) with biological and technical noise. We draw on the bag-of-words technique from the NLP26 field to bin the expressions of genes (which could be considered as the gene transcript frequency in each cell), thus converting them to discrete values with the additional benefit of reducing the data noise to some extent. As shuffling the columns of our input does not change its meaning (like the extension of BERT to understand tabular data with TaBERT27), absolute positions are meaningless for genes. Instead, gene embeddings were obtained from gene2vec28 to represent the gene identity (each gene has a unique gene2vec embedding), which could be viewed as relative embeddings26 that capture the semantic similarity from the aspect of general co-expression. Co-expressed genes retain closer representations, and distributed representations of genes have proven useful for capturing gene–gene interactions28. In this way, scBERT efficiently formalizes information on the gene expressions for the Transformer and generates a single-cell-specific embedding (scBERT embedding) that represents the cell-specific expression (Extended Data Fig. 1c) after pretraining.

Second, existing single-cell methods have to pre-process the raw data with selection or manipulation of genes (that is, HVG selection, manually selecting marker genes and PCA) due to their limited capability to efficiently model high-dimension data9,10,19,29–31; these steps unavoidably bring artificial bias and overfitting problems, which in turn may severely impair their generalizability. Conversely, a Transformer with a large receptive field could effectively leverage the global information in scRNA-seq data and learn a comprehensive global representation for each cell by unbiasedly capturing long-range gene–gene interactions. Due to the computational complexity, the input sequence of Transformer is limited to a length of 512, whereas most scRNA-seq data contain over 10,000 genes. We therefore replaced the Transformer encoder used in BERT with Performer32 to improve the scalability of the model to tolerate over 16,000 gene inputs. With Performer, scBERT keeps the full gene-level interpretation, abandons the use of HVGs and dimensionality reduction, and lets discriminative genes and useful interactions come to the surface by themselves (Extended Data Fig. 1d). scBERT therefore allows for the discovery of gene expression patterns and longer-range dependency for cell type annotation in an unbiased data-driven manner. scBERT is stable and robust, rather than relying heavily on hyperparameter selection (Extended Data Fig. 1e).
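The binning-and-embedding scheme described above can be sketched as follows. The bin boundaries, embedding dimension and the random stand-ins for the gene2vec vectors and the learned bin embeddings are illustrative assumptions, not scBERT's actual values:

```python
import numpy as np

def bin_expression(expr, edges=(1, 3, 7, 15, 31, 63)):
    """Discretize continuous expression values into integer bins, a
    bag-of-words-style tokenization; the bin edges here are illustrative."""
    return np.digitize(expr, np.asarray(edges, dtype=float))

def build_input_embeddings(expr, gene2vec_matrix, bin_embeddings):
    """Sum the gene-identity embedding (gene2vec) with the embedding of the
    cell's binned expression value for that gene (element-wise addition)."""
    bins = bin_expression(expr)                    # (n_genes,)
    return gene2vec_matrix + bin_embeddings[bins]  # (n_genes, dim)

rng = np.random.default_rng(0)
n_genes, dim, n_bins = 16000, 200, 8
gene2vec_matrix = rng.normal(size=(n_genes, dim))    # stand-in for gene2vec
bin_embeddings = rng.normal(size=(n_bins, dim))      # learned in practice
expr = rng.poisson(2.0, size=n_genes).astype(float)  # toy count vector
emb = build_input_embeddings(expr, gene2vec_matrix, bin_embeddings)
print(emb.shape)  # -> (16000, 200)
```

Because the gene identity is carried by the gene2vec vector rather than by position, shuffling the gene order of the input leaves each summed embedding unchanged, matching the order-invariance argument above.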


[Figure 1: schematic of scBERT's self-supervised pretraining and supervised fine-tuning pipelines (a) and of its gene and expression embeddings (b).]

Fig. 1 | Overview of the scBERT model. a, Self-supervised learning on unlabelled data and fine-tuning on task-specific data. At the self-supervised pretraining stage, unlabelled data were collected from PanglaoDB. Masked expression embedding and gene embedding were added as input and then fed into the Performer blocks. The reconstructor was used to generate outputs. Outputs for masked genes were used to calculate the reconstruction loss. At the supervised fine-tuning stage, the task-specific scRNA-seq data were input into the pretrained encoder. The output representation then passed a one-dimensional convolution layer and a classifier to generate the cell type prediction. ⊕ represents element-wise addition. The Performer encoder is the component that is shared between the models used in the pretraining and fine-tuning stages. The reconstructor and the classifier are independently and separately employed for the models during the pretraining and fine-tuning processes. b, Illustration of the embeddings of scBERT. The preprocessed scRNA-seq data are first converted into discretized expression, and then the non-zero expressions are randomly masked. Taking the first gene as an example, the gene embedding EG1 (the gene identity from gene2vec falling into the first bin) and the expression embedding EB2 (the gene expression falling into the second bin and being transformed to the same dimension as EG1) are summed and fed into scBERT to generate representations for genes. The representations are then used for pretraining or fine-tuning.
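The masked-reconstruction pretraining objective described in the caption can be sketched as below. The 15% mask rate, the mask-token value and the random toy logits are assumptions made for illustration, not scBERT's exact settings:

```python
import numpy as np

def mask_nonzero_bins(binned, mask_rate=0.15, mask_token=-1, seed=0):
    """Randomly replace a fraction of the NON-zero expression bins with a
    mask token; zero bins are left untouched, as in Fig. 1b."""
    rng = np.random.default_rng(seed)
    masked = binned.copy()
    nonzero = np.flatnonzero(binned > 0)
    picked = rng.choice(nonzero, size=max(1, int(mask_rate * nonzero.size)),
                        replace=False)
    masked[picked] = mask_token
    return masked, picked

def reconstruction_loss(logits, true_bins, masked_idx):
    """Cross-entropy over the predicted bin distribution, evaluated only at
    the masked positions (the pretraining reconstruction loss)."""
    z = logits[masked_idx]
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(masked_idx)), true_bins[masked_idx]].mean()

rng = np.random.default_rng(1)
binned = rng.integers(0, 7, size=1000)          # toy binned expression profile
masked, idx = mask_nonzero_bins(binned)
loss = reconstruction_loss(rng.normal(size=(1000, 7)), binned, idx)
print(masked[idx].max(), float(loss) > 0)       # all masked entries are -1
```

In scBERT itself the logits come from the reconstructor on top of the Performer encoder; here a random matrix stands in so the loss computation is self-contained.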


Evaluating cell type annotation robustness on intra-dataset
We first benchmarked the performance of scBERT against other methods on nine scRNA-seq datasets covering 17 major organs/tissues, more than 50 cell types, over 500,000 cells and mainstream single-cell omics technologies (Drop-seq, 10X, SMART-seq and Sanger-Nuclei), comprehensively considering the diversity in data size as well as the data complexity33 (Supplementary Table 1). Marker-gene-based methods (SCINA, Garnett, scSorter), correlation-based methods (Seurat v4, SingleR, scmap_cell, scmap_cluster, Cell_ID(c), Cell_ID(g)) and machine learning-based methods (SciBet, scNym) were used for comparison (Supplementary Table 2). For each of the datasets, we applied the fivefold cross-validation strategy to avoid the influence of random results on the conclusion. scBERT surpassed the comparison methods in both accuracy and macro F1-score on most of the datasets (Fig. 2a and Extended Data Fig. 2).

Among the intra-dataset benchmarks, the Zheng68K dataset from human peripheral blood mononuclear cells (PBMCs) is the most representative dataset for benchmarking cell type annotation methods. Due to the severe cell type imbalance and the extremely high similarities between subtypes, even the SOTA method could not achieve an accuracy above 0.71. The performance of scBERT, with complete deletion of reported marker genes, is already on par with the best performance of existing methods (Extended Data Fig. 1b), demonstrating the superiority of scBERT's pattern recognition ability on gene expressions compared with those methods that heavily depend on known marker genes. With the addition of marker genes, scBERT could capture more comprehensive gene expression patterns constructed by them. With all genes as inputs, scBERT surpassed SOTA methods by a large margin on overall cells (Fig. 2b,c and Extended Data Figs. 3 and 4a; scBERT F1-score = 0.691, accuracy = 0.759; best F1-score by other methods = 0.659, accuracy = 0.704) and achieved the highest performance for CD8+ cytotoxic T cells and CD8+/CD45RA+ T cells (F1-score = 0.788 versus 0.617, P-value = 9.025 × 10−5; accuracy = 0.801 versus 0.724, P-value = 2.265 × 10−5), which are highly similar and were difficult to distinguish in previous studies34. The results indicated that scBERT could recognize the underlying gene expression patterns and long-range gene–gene dependency after pretraining, capture diverse feature subspaces by multi-head attention and enjoy a comprehensive high-level representation of cell-type-specific global information.

Notably, the list of best-performing methods changes across different tasks and datasets, whereas scBERT is always among them. For instance, the top-tier methods for the inter-dataset task (that is, scNym and Seurat) performed badly on the Xin dataset in Fig. 2. These uncertainties in performance reflect the limitations of the comparison methods in their generalizability, as well as the generalization of our method across all of the benchmarking datasets.

To explore whether the number of cells of a reference dataset affects the performance of scBERT, we constructed a series of reference datasets from the Zheng68K dataset by uniformly subsampling it proportionally from 10% to 90% (Fig. 2d). With only 30% of the cells, scBERT outperformed all of the other methods, and its performance improved rapidly as the reference cell number increased.

We next tested the robustness of scBERT when the distributions of cell types were severely biased. Four cell types from the Zheng68K dataset (CD8+ cytotoxic T cells, CD19+ B cells, CD34+ cells and CD8+/CD45RA+ naive cytotoxic cells), with transcriptomic similarity between each pair, were selected for class-imbalanced tests. scBERT surpassed all of the other methods (accuracy = 0.840 and F1-score = 0.826). Seurat misidentified CD8+ cytotoxic T cells as CD8+/CD45RA+ naive cytotoxic cells, whereas SingleR misclassified all of the CD19+ B cells due to their rarity. scBERT, however, exhibited the lowest misclassification rate even though the two cell populations are highly similar (Fig. 2e and Extended Data Fig. 4b). Overall, the results indicate that scBERT is robust to class-imbalanced datasets.

Cell type annotation across cohorts and organs
In real-world circumstances, the reference and query datasets are always sourced from multiple studies, and even different sequencing platforms, where the batch effects can lead to poor performance on cell type annotation (Fig. 3a). Here we benchmarked scBERT and comparison methods by employing a leave-one-dataset-out strategy with human pancreas datasets generated by distinct sequencing techniques (Baron35, Muraro36, Segerstolpe37 and Xin38; Fig. 3 and Extended Data Fig. 5). Machine-learning-based methods (scBERT, scNym and SciBet) achieved the best results, indicating that cell-type-specific patterns could be discovered by pattern recognition without being affected by batch effects; Seurat, however, relies on compulsory batch correction before the annotation. For cross-cohort data, scBERT achieved a superior performance by a large margin, with an accuracy of 0.992 compared with scNym (accuracy of 0.904), and outperformed other popular methods (accuracies: SciBet = 0.985, Seurat = 0.984, SingleR = 0.987; Fig. 3b). scBERT correctly annotated most cells (>97%) in the Muraro dataset, and over 99% of the cells in the other three datasets, demonstrating the superb and stable performance of our method in cross-cohort tasks. By contrast, scNym misclassified the alpha cells as the beta cell type and was confused by the beta and delta cells (Fig. 3e,f). We then used cells from different organs to benchmark the performance of scBERT and comparison methods on a cross-organ dataset. The experimental results demonstrated that scBERT is on par with comparison methods on the cross-organ task (Extended Data Fig. 5b). scBERT showed its robustness in identifying cells from different sequencing technologies, experiments, different disease states (type-2 diabetes and health) and even different organs.

Discovery of novel cell types
In most tasks, the reference dataset may not cover all of the cell types present in the query dataset. The marker-based methods are hindered by the manually selected markers of known cell types and therefore may face difficulty distinguishing unseen cell types; the correlation-based methods, however, usually force the model to assign a novel class to the closest known class. The machine learning-based methods could automatically and actively detect the novel cell types by checking the predicted probability. Besides, scBERT enjoys some potential advantages. First, the multi-head attention mechanism allows scBERT to extract information from different representation subspaces, which

Fig. 2 | Benchmarking and robustness evaluation by intra-dataset cross-validation. a, Performance of cell type annotation methods measured by accuracy and F1-score on n = 9 datasets using fivefold cross-validation. Box plots show the median (centre lines), interquartile range (hinges) and 1.5-times the interquartile range (whiskers). The F1-scores of these datasets are shown in Extended Data Fig. 2a. The performance of SCINA, Garnett and scSorter is shown in Extended Data Fig. 2b. The results of the Tucker dataset, Lung dataset and Human Cell Atlas dataset are shown in Extended Data Fig. 2c,d. b, t-SNE plot of the whole Zheng68K dataset (n = 68,450 cells). Left panel is coloured by expert-annotated cell types from the original research; right panel is coloured by scBERT prediction results. The t-SNE plots of the annotation of comparison methods are shown in Extended Data Fig. 3. c, Heatmaps for the confusion matrices of the cross-validation results on the Zheng68K dataset for scBERT, Seurat and CellID_cell. The confusion matrices of other methods are included in Extended Data Fig. 4a. d, The influence on the cell type annotation performance by splitting different proportions of the Zheng68K dataset as the reference set for fine-tuning. The standard deviations are shown as the error bar. e, Heatmap for the confusion matrices of scBERT of cross-validation on the imbalanced dataset reconstructed from the Zheng68K dataset. The confusion matrices of other methods are included in Extended Data Fig. 4b. The detailed reconstruction process is introduced in the Methods.
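The macro F1-score used throughout these benchmarks is the unweighted mean of per-class F1 scores, so rare cell types count as much as dominant ones; a minimal implementation (mirroring scikit-learn's `f1_score` with `average='macro'`) is:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores; classes with no predicted or
    true members contribute an F1 of 0, as is conventional."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 2]
print(round(macro_f1(y_true, y_pred, 3), 3))  # -> 0.822
```

Unlike accuracy, this metric drops sharply when a rare class (for example, CD19+ B cells in the imbalanced test) is systematically misclassified, which is why both metrics are reported.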


might be a benefit for capturing the subtle differences between novel and known cell types. Second, scBERT may have possibly seen the novel cells and learnt their unique patterns during pretraining on a large-scale, diverse dataset. Third, a Transformer with a large receptive field could effectively learn comprehensive global representation by capturing long-range gene–gene interactions, which may better

[Figure 2 panels: a, accuracy box plots for scBERT, scNym, SciBet, Seurat, SingleR, CellID_cell, CellID_group, scmap_cell and scmap_cluster on the Zheng68K, Baron, Muraro, Xin, Segerstolpe and MacParland datasets; b, t-SNE plots of the Zheng68K ground truth and the scBERT prediction; c, confusion-matrix heatmaps for scBERT, Seurat and CellID_cell; d, accuracy versus reference data size (10–90%); e, confusion matrix of scBERT on the imbalanced four-cell-type dataset.]


[Figure 3 panels: a, t-SNE of cells coloured by source dataset (Baron, Muraro, Xin, Segerstolpe); b, t-SNE of alpha, beta, delta and gamma cells coloured by ground-truth annotation; c, accuracy and F1-score box plots across methods; d, zoomed-in accuracy and F1-score of the top-tier methods; e,f, t-SNE plots coloured by scBERT and scNym predictions.]

Fig. 3 | Performance of scBERT across independent datasets generated by different single-cell sequencing technologies. a, A t-SNE representation of 10,220 cells from four independent datasets (Baron, Muraro, Segerstolpe and Xin) generated by different sequencing platforms (inDrop, CEL-Seq2, SMART-Seq2 and SMARTer). Cells are coloured by the source of datasets. b, t-SNE representation of alpha, beta, delta and gamma cells from four pancreas datasets coloured by the annotated cell types provided by the atlas from the original paper. c, Comparison of accuracy and F1-score of inter-dataset cross-validation among different methods. The lower and upper hinges denote the first and third quartiles, with the whiskers in the range of 1.5-times the interquartile. d, Zoomed-in plot of accuracy and F1-score of the top-tier methods. e, t-SNE representation of alpha, beta, delta and gamma cells from four pancreas datasets (left), beta cells from the Muraro dataset (middle) and alpha cells from the Segerstolpe dataset (right) coloured by scBERT prediction. f, t-SNE representation of alpha, beta, delta and gamma cells from four pancreas datasets (left), beta cells from the Muraro dataset (middle) and alpha cells from the Segerstolpe dataset (right) coloured by scNym prediction. t-SNE plots of other comparison methods are shown in Extended Data Fig. 5a.


[Figure 4 panels: a, accuracy and F1-score box plots for novel versus known cell types across methods; b, confidence scores per cell type in the MacParland dataset (plasma cells unseen) and a Sankey plot comparing scBERT predictions with the ground-truth annotations.]

Fig. 4 | Identification of novel cell types. a, Performance of scBERT on the MacParland dataset from human liver tissue by removing alpha–beta T cell, gamma–delta T cell, mature B cell and plasma cell populations during the scBERT training process. The accuracy and F1-score of both novel cell types and known cell types are shown in the box plots, where the median (centre lines), interquartile range (hinges) and 1.5-times the interquartile range (whiskers) are shown. b, Left: the confidence scores provided by scBERT for the cell types of MacParland; the cells with low probability of model prediction (probability < 0.5) for all known cell types are assigned as potential novel cell types. Right: Sankey plot comparing scBERT predictions on known and novel cell types with original cell-type annotations for the MacParland dataset, where plasma cells are labelled as novel cell type as they are unseen by the scBERT training process. EC: endothelial cell.
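The probability < 0.5 rule in the caption amounts to thresholding the model's predicted class probabilities; a minimal sketch, in which the cell type names and probability values are illustrative:

```python
import numpy as np

def assign_with_novel_detection(probs, cell_types, threshold=0.5):
    """Assign each query cell its most probable known type, or label it
    'unassigned' (a potential novel type) when no known type reaches the
    confidence threshold."""
    labels = np.asarray(cell_types, dtype=object)[probs.argmax(axis=1)]
    labels[probs.max(axis=1) < threshold] = "unassigned"
    return labels

probs = np.array([[0.90, 0.05, 0.05],   # confident match to a known type
                  [0.40, 0.35, 0.25]])  # no confident match -> potential novel
print(assign_with_novel_detection(probs, ["alpha", "beta", "delta"]))
# -> ['alpha' 'unassigned']
```

This is the mechanism by which the unseen plasma cells in the MacParland experiment end up in the unassigned pool rather than being forced onto the closest known class.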

characterize and distinguish novel cells41. scBERT performed the best on novel cell types and achieved the top-ranked performance on the known cell types (Fig. 4). CellID_cell performed well on known cell types but failed to discover any novel cells. SciBet and scmap_cluster are prone to assigning unknown labels to those cells from known types, which greatly reduces their known cell type classification accuracy.


[Figure 5, panels a–d (graphical content): a, heatmap of attention weights across the top-attention genes for alpha, beta, delta and gamma cells; b, enrichment scores of the top-attention genes against reference cell type signatures; c, dot plot of z-scores for the ten genes with the highest attention per cell type; d, UMAP plots of the Muraro dataset based on the scBERT embedding (ARI = 0.95) and raw gene expression (ARI = 0.87).]

Fig. 5 | Model interpretability. a, Heatmap for the attention weights provided by scBERT on the Pancreas cell type annotation task. The detailed attention estimation process is described in Methods. The top 10 genes with the highest attention weights are listed for each cell type. The complete top gene list can be found in Supplementary Table 3. b, The results of enrichment analysis of the top attention genes from scBERT, with the complete information provided in Supplementary Tables 4–15. c, Dot plot showing z-scores among the ten genes receiving the highest attention, and the cell types. The size and colour of each dot reflect the z-score. d, UMAP representation of alpha, beta, delta and gamma cells from the Muraro dataset coloured by cell types, based on the scBERT embedding (left) and the raw expression (right) of each cell. The adjusted Rand index (ARI) score is calculated and shown in the plot.
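The ARI quoted in the caption measures agreement between a clustering and the ground-truth cell type labels, corrected for chance agreement. A self-contained sketch of the standard ARI formula on hypothetical toy labels (in practice one would call sklearn.metrics.adjusted_rand_score):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same cells."""
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(n, 2) for n in pairs.values())
    sum_a = sum(comb(n, 2) for n in rows.values())
    sum_b = sum(comb(n, 2) for n in cols.values())
    total = comb(len(labels_true), 2)
    expected = sum_a * sum_b / total      # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

truth = ["alpha", "alpha", "beta", "beta", "delta", "delta"]
perfect = [0, 0, 1, 1, 2, 2]            # matches the labels exactly
merged = [0, 0, 1, 1, 1, 1]             # delta merged into beta
print(adjusted_rand_index(truth, perfect))  # → 1.0
print(adjusted_rand_index(truth, merged))   # well below 1.0
```

An ARI of 1.0 means identical partitions up to relabelling; merging or splitting cell populations, as a weaker embedding tends to cause, drives the score down, which is what the 0.95 versus 0.87 comparison in panel d captures.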


Compared with SciBet and scmap_cluster, our method achieves superior accuracy on both the novel (scBERT = 0.329 versus SciBet = 0.174 and scmap_cluster = 0.174) and known (scBERT = 0.942 versus SciBet = 0.784 and scmap_cluster = 0.666) classes. Taken together, these results suggest that scBERT can correctly discover novel cell types that are not present in original reference datasets while remaining accurate in predicting the known cell types.

Investigating scBERT model interpretability
Existing machine learning methods have to select HVGs or reduce dimensionality due to their simplified network architecture and low model capacity, hence destroying gene-level interpretability. By contrast, the attention mechanism employed in scBERT naturally provides hints for the decision-making of the model using every individual gene.

Here we took the Muraro dataset as an illustration, and top-attention-gene lists were produced for the four kinds of pancreas islet cells, with well-studied biological functions (Fig. 5a). The top-attention genes included reported markers of specific cell types (LOXL4 for alpha cells and ADCYAP1 for beta cells39; Extended Data Fig. 6a). Almost all of the top-attention genes, except markers, were identified as differentially expressed genes using DESeq40, as potential novel markers (Fig. 5c and Extended Data Fig. 6b). For instance, SCD5 has not been reported as a cell-type-specific marker for beta cells, but in a GWAS study, a novel locus for type-2 diabetes susceptibility was fine-mapped to a coding variant of SCD41. The results demonstrated that scBERT could facilitate understanding of the annotated cell types and provide some support for further biological findings.

Enrichment analysis was performed for the top-50 attention-gene lists using various gene-set libraries; the results revealed that there were some interesting relationships between the top enriched terms and the corresponding cell types (Fig. 5b and Supplementary Tables 3–15). In particular, with the cell-type-associated gene-set library from PanglaoDB, the top-one-enriched term for each type always hits the true cell population. As another example, insulin secretion and the AMPK signalling pathway, the two top-enriched KEGG pathways in beta cells, are vital to beta cell function. Furthermore, based on the clustering performance, the scBERT embedding is more distinguishable for cell type annotation than raw gene expression (ARI: 0.95 versus 0.87), indicating the efficiency of scBERT in learning single-cell-specific representations, which can be used for downstream analysis (Fig. 5d).

Discussion
To improve the generalization ability of the cell type annotation algorithm and the interpretability of individual gene importance, we developed scBERT (a deep learning model with a multi-head attention mechanism and a self-supervised strategy) to learn domain-irrelevant gene expression patterns and interactions from the whole-genome expression of large-scale, unlabelled scRNA-seq data; transfer the general knowledge to the cell type annotation task by fine-tuning; and trace back to the importance of each individual gene for model interpretability. By systematically analysing the components of scBERT, we gain several insights into the application of Transformers in single-cell data analysis (that is, the benefits of pretraining, recognition of non-marker patterns, detection of subtle gene–gene interactions, single-cell-specific embeddings and hyperparameter sensitivity). See the Methods and Extended Data Fig. 1 for a systematic analysis.

scBERT surpasses the existing advanced methods on diverse benchmarks, collectively involving 9 single-cell datasets, 17 major organs/tissues, more than 50 cell types, over 500,000 cells and the mainstream single-cell omics technologies (that is, Drop-seq, 10X, SMART-seq and Sanger-Nuclei), indicating its generalization and robustness. Notably, we employed the accuracy, macro F1-score and confusion matrix as evaluation metrics to benchmark the performance of cell type annotation methods on their classification ability for a fair comparison in this study.

To the best of our knowledge, there is currently no research on applying Transformer architectures to gene expression data analysis. The originally designed end-to-end scBERT framework, with gene expression embedding and a self-learning strategy, has superior performance, interpretability and generalization potential on cell type annotation tasks. Beyond that, scBERT can also be applied to other tasks by simply modifying the output and supervision signals. scBERT, as an effective cell type annotation tool, has been released on the platform for public usage. We hope that scBERT could improve the understanding of cell-type-associated gene–gene interactions and nurture the revolution of the AI paradigm in single-cell RNA-seq analysis.

Despite the above advantages, scBERT may face potential limitations, including the gene expression embedding, the modelling of gene interactions and the masking strategy during the pretraining stage. First, the token embedding of the original BERT is for discrete variables (standing for a word), whereas the expression input is a continuous variable (standing for the expression of a gene in a single cell), which may have biological and technical noise. scBERT converts them to discrete values and could thus reduce some data noise compared with existing methods, which utilize the expression values directly; however, it sacrifices some data resolution, and there is still room to optimize the embedding of gene expression for model input. Our approach of binning the expression may cause some resolution loss. Second, gene interactions usually exist in the form of networks (that is, gene regulatory networks and biological signalling pathways)42, and this kind of prior knowledge has not been explicitly incorporated in scBERT. Aggregating information from neighbours within a graph neural network based on biological networks may better mimic gene–gene interactions. The idea could be applied to single-cell analysis by building a cell-level graph using the scRNA-seq data. From this point of view, it can be foreseen that Transformers for graphs43 may be the future development direction of scBERT44. Third, the efficiency of masking during pretraining is another point worth optimizing. The current masking strategy in scBERT is simplified with non-zero masking. With the zero-inflated input45, the model might be inclined to output all zeroes for the reconstruction task during pretraining. We therefore masked the non-zero values and calculated the loss based on the non-zero values during pretraining; however, masking only the non-zero values may lower the utilization of the single-cell data for pretraining, due to their minority. An advanced masking strategy tailored for single-cell data could be introduced to improve the computational efficiency of the masking process.

For future work, we would like to explore the versatility and flexibility of scBERT in a variety of downstream tasks (that is, gene–gene interaction, batch correction, clustering, differential analysis in disease conditions)46.

Methods
The scBERT model
The scBERT model adopts the advanced paradigm of BERT and tailors the architecture to solve single-cell data analysis. The connections of our model with BERT are given as follows. First, scBERT follows BERT's revolutionary method to conduct self-supervised pretraining25 and uses the Transformer as the model backbone32. Second, our design of embeddings is similar to BERT in some aspects while having unique features to leverage gene knowledge. From this perspective, our expression embedding could be viewed as the token embedding of BERT. As shuffling the columns of our input does not change its meaning (like the extension of BERT to understand tabular data with TaBERT27), absolute positions are meaningless for genes. We instead use gene2vec to produce gene embeddings, which could be viewed as relative embeddings26 that capture the semantic similarities between any two genes. Third, a Transformer with a global receptive field could


effectively learn global representations and long-range dependency without absolute position information, achieving excellent performance on non-sequential data (such as images and tables)24,27.

Gene embedding. In NLP, the inputs of the BERT model are word embeddings, a set of real-valued vectors in a pre-defined vector space that represent individual words. The word embedding technology helps to better represent the text by assuring that words with similar meanings have a similar representation46. However, from the aspect of scRNA-seq, the inputs are constituted by individual genes, and a pre-defined vector space is needed to represent the similarity between them. Hence we employed gene2vec28 to specifically encode gene embeddings. In this way, the difficulty of model training is reduced, with the help of the inter-gene relationships provided by past knowledge.

Expression embedding. In spite of the gene embedding, there is also a challenge in how to utilize the transcription level of each gene, which is actually a single continuous variable. It is worth noting that the frequency of a word's occurrence in a text is valuable information for text analysis and is often transformed into a bag-of-words by term-frequency statistical analysis for downstream tasks in the area of NLP47. The gene expression could also be considered as the occurrence of each gene that has already been well-documented in a biological system. From this insight, we applied the conventionally used term-frequency-analysis method that discretizes the continuous expression variables by binning, and converts them into 200-dimensional vectors, which are then used as token embeddings for the scBERT model.

Model building. The quadratic computational complexity of the BERT model with the Transformer as its basic unit does not scale very well to long sequences, whereas the gene number of scRNA-seq can be up to more than 20,000. To this end, a matrix decomposition version of the Transformer (that is, Performer) was employed to enlarge the sequence length. The regular dot-product attention in the Transformer is a mapping of Q, K, V, which are encoded representations of the input queries, keys and values created for each unit, respectively. The bidirectional attention matrix is formulated as:

Att(Q, K, V) = D^{-1}(QK^{T})V,  D = diag(QK^{T}1_{L})  (1)

where Q = W_{Q}X, K = W_{K}X, V = W_{V}X are linear transformations of the input X; W_{Q}, W_{K}, W_{V} are the weight matrices as parameters; 1_{L} is the all-ones vector of length L; and diag(⋅) is a diagonal matrix with the input vector as the diagonal.

The attention matrix in Performer is described as follows:

Âtt(Q, K, V) = D̂^{-1}(Q′((K′)^{T}V)),  D̂ = diag(Q′((K′)^{T}1_{L}))  (2)

where Q′ = ϕ(Q), K′ = ϕ(K), and the function ϕ(X) is defined as:

ϕ(X) = (c/√m) f(ω^{T}X)  (3)

where c is a positive constant, ω is a random feature matrix and m is the dimensionality of the matrix. Here we constructed our model with six Performer encoder layers and ten heads for each layer.

The model training process contains two stages: self-supervised learning on unlabelled data to get a pretrained model, and supervised learning on the specific cell type annotation tasks to get the fine-tuned model.

Self-supervised learning on unlabelled data. In this study, we followed the conventional self-learning strategy of the BERT model in NLP tasks by randomly masking the input data values and making a prediction on the basis of the remaining inputs. Considering the dropout zeroes phenomenon48, we randomly masked the non-zero gene expression and then reconstructed the original inputs by model predictions using the remaining genes. We utilized cross-entropy loss as the reconstruction loss, formulated as:

L_{Rec} = −Σ_{i=1}^{M} Σ_{j=1}^{N} y_{i,j} log(p_{i,j})  (4)

where M is the number of cells and N is the number of masked gene expression values; y_{i,j} and p_{i,j} are the true and predicted expressions, respectively, of gene j in cell i. With this self-supervised strategy, the model can learn general deep representations of gene expression patterns on the large amount of unlabelled data, which might alleviate the efforts of the downstream fine-tuning process.

Supervised learning on specific tasks. The output of scBERT was a 200-dimensional feature corresponding to each gene, and a one-dimensional convolution was applied for abstract information extraction from each gene feature. A three-layer neural network was then applied as the classification head and transformed the gene features into the probability for each cell type. Cross-entropy loss was also employed as the cell type label prediction loss, calculated as:

L_{Pred} = −Σ_{i=1}^{M} z_{i} log(q_{i})  (5)

where z_{i} and q_{i} indicate the ground-truth cell type label and predicted label of cell i, respectively.

Datasets
As the model training consists of two stages, self-supervised learning on unlabelled data and fine-tuning on task-specific data, the datasets used in the two stages were collected from different sources to avoid data leakage. In the first stage, large amounts of data without annotations were used for general pattern learning, whereas, in the second, task-specific data with well-annotated cell labels were required for the subsequent systematic benchmarking of scBERT and the SOTA methods. To this end, we only included scRNA-seq datasets that provided highly credible cell type annotations and had been cited by the majority of the cell type annotation methods for performance evaluation.

The Panglao dataset. The Panglao dataset49 was downloaded from the PanglaoDB website (https://panglaodb.se/). In brief, PanglaoDB integrated 209 human single-cell datasets comprising 74 tissues with 1,126,580 cells originating from different experimental sources via various platforms. In this study, we used scRNA-seq data from PanglaoDB for first-stage pretraining. No annotations or cell labels were used at the first stage as the self-learning strategy was employed, and only the genes and their expression levels were needed as inputs for the scBERT model.

Zheng68k dataset. The Zheng68k is a classic PBMC dataset by 10X CHROMIUM that is widely used for cell type annotation performance assessment34. It contains about 68,450 cells within eleven subtypes of cells: CD8+ cytotoxic T cells (30.3%), CD8+/CD45RA+ naive cytotoxic cells (24.3%), CD56+ NK cells (12.8%), CD4+/CD25 T Reg cells (9.0%), CD19+ B cells (8.6%), CD4+/CD45RO+ memory cells (4.5%), CD14+ monocyte cells (4.2%), dendritic cells (3.1%), CD4+/CD45RA+/CD25− naive T cells (2.7%), CD34+ cells (0.4%) and CD4+ T Helper2 cells (0.1%). The Zheng68k dataset contains rare cell types, and the distribution of cell types in this dataset is imbalanced. Strong correlations between cell types make it difficult to differentiate them.

Pancreas datasets. The pancreas datasets comprise Baron, Muraro, Segerstolpe and Xin. The cell type labels were aligned and four cell


types were included. The Baron dataset was downloaded from the Gene Expression Omnibus (GEO) (accession no. GSE84133) and the protocol was inDrop35. The Muraro dataset was downloaded from GEO (accession no. GSE85241) and the protocol was CEL-Seq236. The Segerstolpe dataset was accessed from ArrayExpress (accession no. E-MTAB-5061) and the protocol was Smart-Seq237. The Xin dataset was downloaded from GEO (accession no. GSE81608) and the protocol was SMARTer38. The above pancreas datasets were generated from different experimental platforms (Supplementary Table 1).

MacParland dataset. The MacParland dataset50 from human liver tissue contains 20 hepatic cell populations from the transcriptional profiling of 8,444 cells by 10X CHROMIUM. We downloaded the data from GEO (accession no. GSE115469) and generated the cell type annotation following the authors' reported procedure.

Heart datasets. The heart datasets contain one large dataset51 for pretraining, and the Tucker dataset52 for benchmarking and evaluation in the hyperparameter sensitivity analysis. The large heart dataset for pretraining contains 451,513 cells from 11 cell types by four different sequencing platforms (Harvard-Nuclei, Sanger-Nuclei, Sanger-Cells and Sanger-CD45) and was downloaded from https://data.humancellatlas.org/explore/projects/ad98d3cd-26fb-4ee3-99c9-8a2ab085e737. The Tucker dataset contains 287,269 cells from 11 cell types via single nuclear RNA-sequencing and was downloaded from https://singlecell.broadinstitute.org/single_cell/study/SCP498/transcriptional-and-cellular-diversity-of-the-human-heart.

Lung dataset. The lung dataset was from human lung tissue and analysed for COVID-19-related disease mechanisms53. The dataset contains samples from 12 donors by 10X Genomics sequencing, and 39,778 cells from nine cell types. The data were downloaded from https://doi.org/10.6084/m9.figshare.11981034.v1.

Human Cell Atlas dataset. The Human Cell Atlas dataset54 contains 84,363 cells from 27 cell types among 15 major organs (skin, oesophagus, trachea, heart, spleen, common bile duct, stomach, liver, blood, lymph node, small intestine, bladder, rectum, marrow and muscle) by HiSeq X Ten sequencing. The dataset was downloaded from GEO (accession no. GSE159929).

Data pre-processing
For the data provided in gene expression matrix format, log-normalization was performed on the data using a size factor of 10,000, and quality control was performed by filtering cell outliers with fewer than 200 genes expressed. As for the input of scBERT, no dimension reduction or HVG selection was processed, as scBERT has a capacity of more than 20,000 genes as input and retains full gene-level interpretability.

Comparison methods
For benchmarking, we implemented SOTA methods from the three annotation categories: marker-based, correlation-based and supervised classification. Among them, SCINA, Garnett and scSorter represent annotation using marker gene databases; Seurat, SingleR, CellID and scmap are correlation-based methods; and scNym and SciBet are the SOTA methods that conduct annotation by supervised/semi-supervised classification. Notably, this categorization depends on how the most important process is conducted. As for marker gene-based annotation, the CellMarker database, with manually curated cell-type markers from a literature search of over 100,000 papers, was applied as the marker database55. No manual selection of the marker genes was included, for an unbiased and fair comparison of all of the methods.

scNym. scNym is a recently proposed semi-supervised learning annotation method that leverages the unlabelled target data through training a domain adversary56. It requires no prior manual specification of marker genes. It makes use of the target data by domain adaptation and achieves the best performance on several tasks; however, users have to endure the inconvenience that they must re-train the model on each batch of new-coming data.

SciBet. SciBet is a supervised classification method that selects genes using an E-test for multinomial model building and annotates cell types for a new cell in the test set19. We adopted the SciBet R package for benchmarking.

Seurat. As a popular single-cell data analysis pipeline, Seurat is widely used by biologists and clinical experts. Seurat maps the query samples to the reference dataset in a reference-based annotation manner57. In this study, we adopted the implementation of the cell type annotation of Seurat v.4.0 and followed the cell type annotation tutorial provided by Seurat for benchmarking.

SingleR. SingleR is a reference-based analysis method that calculates the Spearman coefficient on variable genes and aggregates the coefficients to score the cell for each cell type58. It iterates on the above process by subsampling top genes until the most closely related cell types are distinguished. The SingleR package was applied for benchmarking.

CellID. CellID is a clustering-free multivariate statistical method for cell type annotation that performs dimensionality reduction, evaluates the gene-to-cell distance and extracts gene signatures for cells (cell-to-cell strategy) and groups (group-to-cell strategy)29. In this study, both strategies from the R package were used for benchmarking.

scmap. scmap is a reference-based annotation method including two strategies: scmap_cluster and scmap_cell. scmap_cluster maps individual cells from query samples to certain cell types in the reference dataset, whereas scmap_cell maps individual cells from query samples to individual cells in a reference dataset30. Both scmap_cluster and scmap_cell perform feature selection and calculate distances (the cosine and Euclidean distances). The reference is searched for the nearest neighbours to a query cell. We used the R package of scmap for the scmap_cluster and scmap_cell tools.

SCINA. SCINA is a typical marker gene-based annotation method that requires a list of marker genes for different cell types and identifies the cell types based on the assumption that there exists a bimodal distribution for each marker gene and that the higher mode belongs to the relevant cell type9. We used the Scina package for benchmarking.

Garnett. Garnett requires a user-defined cell hierarchy of cell types and marker genes as input. Garnett aggregates marker gene scores using term frequency-inverse document frequency transformation and uses an elastic-net regression-based model for annotation10. We adopted the original R package to use the Garnett model for benchmarking.

scSorter. scSorter employs marker genes and the HVGs for clustering and cell type annotation, based on the observation that most marker genes do not consistently preserve high expression levels in all of the cells belonging to the related cell types31. Here we adopted the R implementation of scSorter.

Benchmarking
To assess the performance of the annotation methods under different scenarios, nine pairs of reference and test datasets were generated, and the performance was evaluated using scBERT and all of the above methods. The details are listed below.
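Of the metrics used throughout these benchmarks, accuracy and macro F1-score behave very differently on imbalanced data such as Zheng68k: accuracy is dominated by the abundant cell types, whereas macro F1 averages per-type F1 scores with equal weight, so a rare type counts as much as a common one. A pure-Python sketch of the distinction (the toy labels are hypothetical; in practice sklearn.metrics.f1_score with average='macro' would be used):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy labels: 8 T cells, 2 B cells; the classifier
# calls every cell a T cell, missing the rare type entirely.
y_true = ["T"] * 8 + ["B"] * 2
y_pred = ["T"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # → 0.8 (looks good)
print(macro_f1(y_true, y_pred))  # far lower: the missed rare type counts fully
```

Reporting both metrics, as is done in the benchmarks above, therefore exposes classifiers that score well only by ignoring rare cell populations.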


Performance on intra-dataset data using cross-validation. The PBMC data from Zheng68k with high inter-class similarity, the Pancreas datasets (Baron, Muraro, Segerstolpe and Xin), the MacParland dataset, the Tucker dataset, the Lung dataset and the Human Cell Atlas dataset were employed to test the intra-dataset performance in a fivefold cross-validation manner. Notably, the reference dataset in this section also refers to the training dataset for the supervised methods, including scBERT.

Performance on the inter-dataset data. To evaluate the robustness of the methods on cross-cohort data with batch effects from different single-cell sequencing platforms, we tested the methods on four pancreas datasets (Baron, Muraro, Segerstolpe and Xin), taking three datasets as the training set and the remaining one as the test set each time. Considering the difference in cell populations among these datasets, all datasets were aligned, retaining only four kinds of pancreas islet cells (alpha, beta, delta and gamma cells) that are common in these datasets. To evaluate the robustness of the methods on cross-organ data, we tested the methods on three major organs (the oesophagus, rectum and stomach) from the Human Cell Atlas dataset.

The influence of reference cell amount on the performance. The number of reference cells is prone to influence the model performance. In this study, 10%, 30%, 50%, 70% and 90% of the PBMC cells from the Zheng68K dataset were randomly selected as the reference for fine-tuning, while the remaining cells served as the query samples for testing.

Class-imbalanced data tests. Following the construction method for class-imbalanced data4, we collected four PBMC cell types (CD19+ B, CD8+ cytotoxic T, CD34+ and CD8+/CD45RA naive cytotoxic cells) that contain various levels of similarity across cell types from the Zheng68K data. The cells of the four types were randomly selected with the cell numbers 10,000, 100, 10,000 and 100, respectively, as reference data for fine-tuning. As for model testing, 100 cells were randomly selected per cell type as query data.

Novel cell type detection. Human liver tissue was used to assess unknown cell type identification. Here we adopted the MacParland dataset50 from human liver tissues with 8,434 cells belonging to 14 cell types. In this experiment, we took four immune cells for novel cell type simulation, which were absent from other liver datasets. Following the schema proposed in a previous study7, we performed leave-one-cell-type-out evaluation by removing one cell type from the reference dataset while keeping the cell type groups in the query dataset. The evaluation process was iterated on each cell type. At present, there are no unified quantitative evaluation metrics for the detection of novel cell types. Some approaches compute the accuracy by putting the novel class together with known classes, which unavoidably overwhelms the models' accuracy for rare and novel cell types. Besides accurately detecting novel cell types, a good cell type annotation method should maintain the ability to accurately discriminate known cell types. In this regard, we evaluate the accuracy of novel cell types and known cell types separately. Notably, we employed a strict evaluation method for novel cell types, with the accuracy calculated on the union set of cells with the novel cell type label and the cells that are predicted as novel cell types.

Assessment on the necessity of self-learning. To illustrate the necessity of the self-learning process of scBERT, the performance gain was evaluated on the model after self-learning and fine-tuning, compared to the model trained from scratch.

Evaluation metrics. Cell type annotation performance of each method at the cell level and the cell-type level was evaluated using the metrics of accuracy and macro F1-score, respectively. Since the cell type annotation task and the cell clustering task are not equivalent, those metrics assessing the quality and distance of clusters are excluded from this study.

Sensitivity analysis on the hyperparameters. The influence of the hyperparameters (size of the embedding vector, the binning setting, the number of encoder layers and the number of heads for each layer) was systematically estimated on the heart datasets, with the large-scale heart dataset (451,513 cells) as the pretraining dataset and the Tucker dataset as the evaluation dataset.

Scalability. When evaluating on the large Tucker dataset with 287,269 cells, the comparison methods implemented in R faced severe scalability problems due to their poor memory management. For instance, CellID met the memory bottleneck when calculating a matrix of 50,000 × 230,000, and we made efforts to split the matrix into pieces to avoid memory overflow. Conversely, benefiting from mini-batch sampling and the efficient Performer encoder, scBERT could easily deal with large-scale datasets at both the pretraining and the fine-tuning stages.

Marker genes for the marker-based comparison methods. To avoid bias introduced by marker selection, well-documented marker lists associated with well-defined cell types from CellMarker55 were used.

Systematic analysis of scBERT
Pretraining versus not pretraining. Following BERT's pretraining and fine-tuning paradigm, our method is prone to generate an efficient encoder and provide a general embedding that better represents the gene expression of each cell by revealing critical patterns with lower data noise. The results of the ablation study on model performance with and without pretraining (Extended Data Fig. 1a) demonstrated the essentiality of pretraining for the model's downstream task (that is, cell type annotation), with a relatively large and important difference in the bioinformatics field. The scBERT model extracts the useful attention patterns on gene expressions and interactions from a large scale of various scRNA-seq data, alleviating the efforts of the fine-tuning process on the specific downstream tasks.

Feasibility of classifying with gene expression patterns. It is well known that marker genes play a key role in cell type annotation for marker gene-based annotation and most of the reference-based annotation. Even some of the supervised-based methods are heavily dependent on prior marker gene knowledge. Among the current mainstream methods that use marker genes for classification, some methods use the gene expression pattern for cell type annotation. Both types of method were reported to achieve good performance on variable cell type annotation tasks, indicating that both types of data imply discriminative information for different cell types. To investigate the effect of marker genes and the discriminant ability of the remaining expression patterns that comprise only the non-marker genes, we conducted experiments in which marker genes were eliminated gradually, leaving the remaining expression profiles for cell type annotation (Extended Data Fig. 1b and Supplementary Table 16). The results prove that the marker genes are important for cell type annotation; however, in addition to the marker genes, there are still informative gene patterns that have good distinguishing power for cell type classification. With deletion of 100% of marker genes, scBERT can still efficiently learn the informative gene patterns and achieve a performance that is on par with the best performance achieved by comparison methods with all of the marker genes on the representative Zheng68K dataset (Extended Data Fig. 1b). We also explored detected gene lists from scBERT, and other machine learning (scNym) and non-machine learning (Seurat) methods on MacParland and Baron, respectively (Supplementary Tables 17 and 18). Consistent with the above experiment on the deletion of markers, we observe that machine learning-based methods


tend to learn high-level implicit cell-type-specific patterns (that is, discovering some genes with a high rank across cell types), whereas non-machine-learning-based methods usually simply find differentially expressed genes using statistical analysis. The results indicated that the attention mechanism, saliency mechanism and statistical analysis could gain complementary information from different perspectives on pattern mining in single-cell data.

General gene embedding versus single-cell-specific embedding. Gene2vec is based on bulk data (ref. 28), which measures the average expression of genes from tissues and is the sum of cell-type-specific gene expression weighted by cell type proportions (ref. 59). In this regard, gene2vec maintains the general co-expression patterns of genes but stays away from the strong noise and high sparsity of single-cell sequencing. We therefore utilized gene2vec as our gene embedding to represent the gene identity (each gene has a unique gene2vec embedding) and the semantic similarity from the aspect of general co-expression pattern. The encoder of scBERT could also learn a single-cell-specific embedding (we briefly call it the scBERT embedding) that represents the cell-specific expression. To illustrate the evolution of the embedding (or representation) during model learning, we visualized examples of the gene2vec and scBERT embeddings in Extended Data Fig. 1c. Our model could generate different representations of the same gene for different cell inputs, whereas gene2vec generates the same representation of a gene for all cell inputs. We observed that the scBERT embedding exhibits a cell-type-specific representation (that is, the example representation of the gene is substantially enriched in alpha cells), which is suitable for the downstream cell type annotation task. Furthermore, the cell-type-specific representation learns some correlations beyond gene2vec. Benefiting from the attention mechanism of the Performer, the model could detect the subtle gene interaction patterns that can only be seen in single-cell data after model training on scRNA-seq data (Extended Data Fig. 1d). It could be observed that some genes have strong attention weights to all other genes, indicating that they play a critical role in identifying the implicit patterns, which is consistent with the conclusion of the detected gene lists in Supplementary Tables 17 and 18.

Model interpretability
We conducted a comprehensive interpretability analysis to explore the key genes for decision-making, as scBERT models were built on the self-attention mechanism and all of the genes' representations remained at the end of our workflow. The attention weights reflect the contribution of each gene and the interaction of gene pairs. The attention weights can be obtained from equation (1), modified by replacing V with V0, where V0 contains one-hot indicators for each position index. We integrated all the attention matrices into one matrix by taking an element-wise average across all attention matrices in multi-head multi-layer Performers. In this average attention matrix, each value A(i, j) represented how much attention from gene i was paid to gene j. To focus on the importance of genes to each cell, we summed the attention matrix along its columns into an attention-sum vector, whose length is equal to the number of genes. In this way, we could obtain the top attention genes corresponding to a specific cell type compared to other cell types. The attention weights were visualized and the top genes were sent to Enrichr (ref. 32) for enrichment analysis.

Enrichment analysis was performed for the top-50-attention-gene lists using various gene-set libraries, and the results revealed some interesting relationships between top-enriched terms and the corresponding cell types.

Statistical analysis
The Wilcoxon test was applied for significance testing. Cross-validation was employed in all the benchmarking experiments, and standard deviations were drawn in the figures. A normalized confusion matrix was used for displaying the predictions. Significance was calculated by the Wilcoxon test on paired groups. The Jaccard index was used as a similarity measure for the gene lists detected by different methods. The ARI was applied as a similarity measure for clusters.
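The aggregation described under 'Model interpretability' (an element-wise average over all head- and layer-wise attention matrices, followed by a column-wise sum) can be sketched in a few lines of plain Python. The function name and the toy matrices below are illustrative, not taken from the scBERT implementation:

```python
def attention_sum(attn_matrices):
    """Average a list of n_genes x n_genes attention matrices
    (one per head of each encoder layer) element-wise, then sum
    the average matrix along its columns; sums[j] scores how much
    attention gene j receives from all genes in this cell."""
    k, n = len(attn_matrices), len(attn_matrices[0])
    # element-wise average across all heads and layers
    avg = [[sum(m[i][j] for m in attn_matrices) / k for j in range(n)]
           for i in range(n)]
    # column sums: total attention received by each gene
    sums = [sum(avg[i][j] for i in range(n)) for j in range(n)]
    return avg, sums

# toy example: two heads over three genes; each row is normalized
head1 = [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]
head2 = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.6, 0.3]]
avg, sums = attention_sum([head1, head2])
top = max(range(len(sums)), key=sums.__getitem__)
print(top, round(sums[top], 2))  # gene 1 receives the most attention: 1 1.6
```

Ranking genes by this attention-sum vector within each cell type is what yields the top-attention-gene lists that were sent for enrichment analysis.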
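For the two overlap measures named in the Statistical analysis section, the Jaccard index is simple enough to state inline; the gene lists below are hypothetical, and the ARI would come from, for example, scikit-learn's `adjusted_rand_score`:

```python
def jaccard(a, b):
    # |intersection| / |union| of two detected gene lists
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# hypothetical top-gene lists detected by two methods
genes_m1 = ["INS", "GCG", "SST", "LOXL4"]
genes_m2 = ["INS", "GCG", "PPY"]
print(jaccard(genes_m1, genes_m2))  # 2 shared of 5 distinct genes -> 0.4
```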
Reporting summary
Influence of hyperparameters. A systematic investigation into the sensitivity of hyperparameters, including the number of bins, the size of the scBERT embedding vector, the number of attention heads and the number of Performer encoder layers, was performed on scBERT (Extended Data Fig. 1e). First, the expression embedding obtained by ranking raw expression into seven bins is suitable for scBERT. Increasing the number of bins to nine hinders the model performance, indicating that ranking the gene expression denoises the raw data and improves scBERT's efficiency in learning meaningful patterns. By contrast, reducing the number of bins also affects the model performance, owing to the loss of gene expression information (that is, blurring relatively large gene expression differences). The above experimental results proved that a proper choice of the number of bins, balancing denoising against preserving expression information, benefits the model performance. Second, gene2vec provided an embedding of 200 dimensions and achieved the best performance compared with other dimensions. Reducing the dimension of the scBERT embedding vector in the latent space impairs the model's representation ability and performance (especially when the dimension is 50). Third, the Performer with ten attention heads is suitable for our method. Decreasing the number of attention heads might reduce the model's representation ability owing to fewer representative subspaces. Increasing the number of attention heads seems to have limited influence on the performance; however, the over-parameterized model (with 20 attention heads) faces a risk of overfitting, especially when applied to small datasets. Similarly, the model performs stably with four and six Performer encoder layers but might suffer from an under- or overfitting problem when decreasing or increasing the number of layers. Overall, the small fluctuations of the above parameters had little effect on the performance of the model, which also verified the robustness of scBERT.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data used in this study are publicly available and the usages are fully illustrated in the Methods. The published Panglao dataset was downloaded from https://panglaodb.se/. The published Zheng68K dataset was downloaded from the 'Fresh 68K PBMCs' section at https://support.10xgenomics.com/single-cell-gene-expression/datasets (SRP073767; ref. 34). The published pancreatic datasets were downloaded from GitHub at https://hemberg-lab.github.io/scRNA.seq.datasets/ (Baron: GSE84133, Muraro: GSE85241, Segerstolpe: E-MTAB-5061, Xin: GSE81608; refs. 35–38). The MacParland dataset was downloaded from https://www.ncbi.nlm.nih.gov/geo/ (GSE115469; ref. 50). The heart datasets were downloaded from https://data.humancellatlas.org/explore/projects/ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 and https://singlecell.broadinstitute.org/single_cell/study/SCP498/transcriptional-and-cellular-diversity-of-the-human-heart (refs. 51,52). The lung dataset for the COVID-19 study was downloaded from https://doi.org/10.6084/m9.figshare.11981034.v1 (ref. 53). The adult Human Cell Atlas of 15 major organs dataset was downloaded from https://www.ncbi.nlm.nih.gov/geo/ (GSE159929; ref. 54). Source Data are provided with this paper.

Code availability
The source code of the pre-processing, scBERT modelling and fine-tuning processes is freely available on GitHub (https://github.

com/TencentAILabHealthcare/scBERT) and Zenodo (https://doi.org/10.5281/zenodo.6572672) (ref. 60) with detailed instructions. The source code for the other comparison methods is publicly available (see Supplementary Table 2).

References
1. Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360, aaq1723 (2018).
2. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
3. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
4. Zhao, X., Wu, S., Fang, N., Sun, X. & Fan, J. Evaluation of single-cell classifiers for single-cell RNA sequencing data sets. Briefings Bioinform. 21, 1581–1595 (2020).
5. Pasquini, G., Rojo Arias, J. E., Schäfer, P. & Busskamp, V. Automated methods for cell type annotation on scRNA-seq data. Comput. Struct. Biotechnol. J. 19, 961–969 (2021).
6. Cao, Y., Wang, X. & Peng, G. SCSA: a cell type annotation tool for single-cell RNA-seq data. Front. Genet. 11, 490 (2020).
7. Huang, Q., Liu, Y., Du, Y. & Garmire, L. X. Evaluation of cell type annotation R packages on single-cell RNA-seq data. Genomics Proteomics Bioinform. 19, 267–281 (2020).
8. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, aau5324 (2018).
9. Zhang, Z. et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes 10, 531 (2019).
10. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
11. Grabski, I. N. & Irizarry, R. A. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics https://doi.org/10.1093/biostatistics/kxac021 (2022).
12. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
13. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020).
14. Serra, A., Coretto, P., Fratello, M. & Tagliaferri, R. Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics 34, 625–634 (2018).
15. Ma, F. & Pellegrini, M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics 36, 533–538 (2020).
16. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 1–17 (2019).
17. Cao, Z.-J., Wei, L., Lu, S., Yang, D.-C. & Gao, G. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat. Commun. 11, 1–13 (2020).
18. Xie, P. et al. SuperCT: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles. Nucleic Acids Res. 47, e48 (2019).
19. Li, C. et al. SciBet as a portable and fast single cell type identifier. Nat. Commun. 11, 1–8 (2020).
20. Qiu, P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 11, 1–9 (2020).
21. Wang, T. et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat. Commun. 12, 1–13 (2021).
22. Wang, T. et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 20, 1–15 (2019).
23. Menden, K. et al. Deep learning–based cell composition analysis from tissue expression profiles. Sci. Adv. 6, aba2619 (2020).
24. Parmar, N. et al. Image transformer. In Proc. 35th International Conference on Machine Learning Vol. 80, 4055–4064 (PMLR, 2018); https://proceedings.mlr.press/v80/parmar18a.html
25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 4171–4186 (Association for Computational Linguistics, 2019).
26. Le, Q. V. et al. XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019); https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html
27. Yin, P., Neubig, G., Yih, W. & Riedel, S. TaBERT: pretraining for joint understanding of textual and tabular data. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 8413–8426 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/V1/2020.ACL-MAIN.745
28. Du, J. et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20, 7–15 (2019).
29. Cortal, A., Martignetti, L., Six, E. & Rausell, A. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat. Biotechnol. 39, 1095–1102 (2021).
30. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
31. Guo, H. & Li, J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 22, 1–18 (2021).
32. Choromanski, K. et al. Rethinking attention with performers. In International Conference on Learning Representations (ICLR, 2021).
33. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 1–19 (2019).
34. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 1–12 (2017).
35. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
36. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).
37. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabol. 24, 593–607 (2016).
38. Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metabol. 24, 608–615 (2016).
39. Nica, A. C. et al. Cell-type, allelic, and genetic signatures in the human pancreatic beta cell transcriptome. Genome Res. 23, 1554–1562 (2013).
40. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Nat. Precedings https://doi.org/10.1038/npre.2010.4282.1 (2010).
41. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).

42. Hwang, S. et al. HumanNet v2: human gene networks for disease research. Nucl. Acids Res. 47, D573–D580 (2019).
43. Liu, T.-Y. et al. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems Vol. 34 (NeurIPS, 2021).
44. Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. Graph transformer networks. In 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).
45. McDavid, A. et al. Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinformatics 29, 461–467 (2013).
46. Goldberg, Y. Neural Network Methods for Natural Language Processing Vol. 10, 1–311 (Springer, 2017); https://doi.org/10.2200/S00762ED1V01Y201703HLT037
47. Zhang, Y., Jin, R. & Zhou, Z.-H. Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybernetics 1, 43–52 (2010).
48. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
49. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, 46 (2019).
50. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 1–21 (2018).
51. Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
52. Tucker, N. R. et al. Transcriptional and cellular diversity of the human heart. Circulation 142, 466–482 (2020).
53. Lukassen, S. et al. SARS-CoV-2 receptor ACE2 and TMPRSS2 are primarily expressed in bronchial transient secretory cells. EMBO J. 39, e105114 (2020).
54. He, S. et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs. Genome Biol. 21, 1–34 (2020).
55. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucl. Acids Res. 47, D721–D728 (2019).
56. Kimmel, J. C. & Kelley, D. R. Semi-supervised adversarial neural networks for single-cell classification. Genome Res. 31, gr.268581.120 (2021).
57. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
58. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
59. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 10, 1–9 (2019).
60. Yang, F. et al. scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq (Zenodo, 2022); https://doi.org/10.5281/zenodo.6572672

Acknowledgements
We thank B. Jiang and Y. Ji for their valuable suggestions on model building and experimental design. We thank T. Shen for advice on the large-scale model pretraining. H.L. was supported by the National Key R&D Program of China (grant no. 2018YFC0910500), a SJTU-Yale Collaborative Research Seed Fund, and Neil Shen's SJTU Medical Research. F.Y. was supported by the Key-Area Research and Development Program of Guangdong Province (grant no. 2021B0101420005).

Author contributions
F.Y. and J.Y. conceived and designed the project. W.W. developed and implemented the algorithms under the guidance of F.Y. and J.Y. W.W. and F.W. collected the datasets. W.W., F.Y. and F.W. conducted the experiments, data analysis and method comparisons. F.Y. and W.W. drew the figures and wrote the manuscript, with the guidance of J.Y. and H.L. Y.F. and F.W. finalized the manuscript and figures. D.T. gave suggestions for the design of the Transformer architecture and the application of the NLP technology. J.H. gave suggestions on improving the manuscript. F.Y. and F.W. revised the figures and manuscript. All of the authors reviewed and approved the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s42256-022-00534-z.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-022-00534-z.

Correspondence and requests for materials should be addressed to Hui Lu or Jianhua Yao.

Peer review information Nature Machine Intelligence thanks Jesper Tegner and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

© The Author(s), under exclusive licence to Springer Nature Limited 2022


Extended Data Fig. 1 | See next page for caption.


Extended Data Fig. 1 | The system analysis of the architecture design of scBERT. a, Performance of scBERT (with/without pre-training) measured by accuracy and F1-score on Zheng68K dataset using 5-fold cross-validation. scBERT with pre-training is trained on over 1,000,000 cells from public scRNA-seq data from PanglaoDB. In contrast, the model weights of scBERT without pre-training are initiated randomly. Box plot shows the median (centre lines), interquartile range (hinges) and 1.5 times the interquartile range (whiskers). b, Performance evaluation on the effect of gradually removing marker genes (no deletion, deletion of 10%, deletion of 50% and deletion of 100% markers) on accuracy. Box plot shows the median (centre lines), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). The green dashed line represents the best performance achieved by other cell type annotation methods with all marker genes. c, UMAP representation of alpha, beta, delta, and gamma cells from Muraro dataset coloured by gene2vec embedding (sum of 200-dimension vectors) (top) and scBERT embedding (bottom) of alpha-specific gene LOXL4. d, The heatmap of average attention matrix obtained by taking an element-wise average across all attention matrices in multi-head multi-layer Performers. Each value A(i, j) (i and j indicate the index of row and column) represents how much attention from gene i was paid to gene j. e, Sensitivity analysis of hyperparameters includes the number of bins (top left), the dimension of scBERT embedding vector (top right), the number of attention heads (bottom left) and the number of Performer encoder layers (bottom right).


Extended Data Fig. 2 | See next page for caption.


Extended Data Fig. 2 | Performance comparison between scBERT and other cell type annotation methods on intra-datasets. a, Performance of scBERT and other automatic cell type annotation methods measured by F1-score on n = 6 datasets (Zheng68K, Baron, Muraro, Xin, Segerstolpe, and MacParland) using 5-fold cross-validation. Box plots show the median (centre lines), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). b, Performance of scBERT and marker-based methods (SCINA, Garnett, scSorter) measured by accuracy (left) and F1-score (right) on Zheng68K dataset using 5-fold cross-validation. Box plot shows the median (centre lines), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). c-d, Performance of scBERT and other automatic cell type annotation methods measured by accuracy (c) and F1-score (d) on n = 3 datasets (Tucker dataset, lung dataset and Human Cell Atlas dataset) using 5-fold cross-validation. Box plots show the median (centre lines), interquartile range (hinges), and 1.5 times the interquartile range (whiskers).


Extended Data Fig. 3 | Heatmaps for the confusion matrices of the results on Zheng68K dataset for other comparison methods. a, The tSNE plots show the cell type annotation results of comparison methods (scNym, SciBet, Seurat, SingleR, CellID_cell, CellID_group, scmap_cell, scmap_cluster, SCINA, Garnett, scSorter) on Zheng68K dataset. The colours indicate the cell type annotation results from each individual method.


Extended Data Fig. 4 | t-SNE plots of the cell type annotation results on Zheng68K dataset (n = 68,450 cells). a, Heatmaps for the prediction confusion matrices on Zheng68K dataset for scNym, SciBet, SingleR, CellID_group, scmap_cell, and scmap_cluster. b, Heatmaps for the prediction confusion matrices on the imbalanced dataset constructed from Zheng68K dataset for Seurat, SingleR, CellID_cell, CellID_group, scmap_cell, and scmap_cluster.


Extended Data Fig. 5 | Performance comparison between scBERT and other cell type annotation methods on cross-cohort dataset and cross-organ dataset. a, t-SNE representation of alpha, beta, delta, and gamma cells from four pancreas datasets (n = 10,220 cells). The top left t-SNE plot is coloured by the annotated cell types provided by the atlas from the original paper, while the other t-SNE plots are coloured by the cell type annotation results of comparison methods (SciBet, Seurat, SingleR, CellID_cell, CellID_group, scmap_cell, and scmap_cluster). b, Performance of scBERT and other cell type annotation methods measured by accuracy (left) and F1-score (right) on datasets from 3 organs (n = 17,384) using 5-fold cross-validation. Box plots show the median (centre lines), interquartile range (hinges), and 1.5 times the interquartile range (whiskers).


Extended Data Fig. 6 | See next page for caption.


Extended Data Fig. 6 | The distribution of the top attention sum genes across the four cell types of the Muraro dataset. a, UMAP representation of alpha, beta, delta, and gamma cells from Muraro dataset coloured by expression distribution of top attention sum genes that are consistent with reported marker genes for alpha, beta, delta and gamma cells, respectively. b, UMAP representation of alpha, beta, delta, and gamma cells from Muraro dataset coloured by expression distribution of top attention sum genes that have distinguishing patterns on corresponding cell types but have not been reported as markers yet.
