
brief communications

SC3: consensus clustering of single-cell RNA-seq data

Vladimir Yu Kiselev1, Kristina Kirschner2, Michael T Schaub3,4, Tallulah Andrews1, Andrew Yiu1, Tamir Chandra1,5, Kedar N Natarajan1,6, Wolf Reik1,5,7, Mauricio Barahona8, Anthony R Green2 & Martin Hemberg1

Single-cell RNA-seq enables the quantitative characterization of cell types based on global transcriptome profiles. We present single-cell consensus clustering (SC3), a user-friendly tool for unsupervised clustering, which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach (http://bioconductor.org/packages/SC3). We demonstrate that SC3 is capable of identifying subclones from the transcriptomes of neoplastic cells collected from patients.

© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.

A key advantage of single-cell RNA sequencing (scRNA-seq) is that it can be used to determine cell types in an unbiased way by submitting transcriptomes to unsupervised clustering1–3. A full characterization of the transcriptional landscape of individual cells holds enormous potential for both basic biology and clinical applications. However, de novo identification and characterization of cell types requires robust and accurate computational methods. We have developed SC3, an interactive and user-friendly R package for clustering (Supplementary Software 1 and see http://bioconductor.org/packages/SC3 for the latest version). Its integration with Bioconductor4 and scater5 makes it easy to incorporate into existing workflows.

Each step of the SC3 pipeline (Fig. 1a and Online Methods) requires the user to specify a number of parameters, which can be difficult and time-consuming to optimize. To avoid this problem, SC3 utilizes a parallelization approach whereby a significant subset of the parameter space is evaluated simultaneously to obtain a set of clusterings. SC3 then combines all the different clustering outcomes into a consensus matrix that summarizes how often each pair of cells is located in the same cluster. The final result is determined by complete-linkage hierarchical clustering of the consensus matrix into k groups.

To constrain parameter values in the SC3 pipeline, we first considered six publicly available scRNA-seq datasets6–11 featuring high-confidence cell labels (since they include cells from different stages, conditions or lines) that can be considered gold standards (Fig. 1b and Supplementary Results 1). To quantify the similarity between the reference labels and the clusters obtained by SC3, we used the adjusted Rand index (Online Methods), which ranges from 0 for a level of similarity expected by chance to 1 for identical clusterings. For the gold-standard datasets, we found that the quality of the outcome as measured by the adjusted Rand index was sensitive to the number of eigenvectors, d, retained after spectral transformation (Supplementary Figs. 1 and 2). For all six datasets, we found that the best clusterings were achieved when d was between 4% and 7% of the number of cells, N (Fig. 1c, Supplementary Fig. 3a and Online Methods). The robustness of the 4–7% range was supported by a simulation experiment in which the reads from the six gold-standard datasets were downsampled by a factor of ten (Supplementary Fig. 3a). We further tested the SC3 pipeline on six other published datasets12–17, in which the cell labels can only be considered ‘silver standard’ since they were assigned using computational methods and the authors’ knowledge of the underlying biology. Again, we found that SC3 performed well when using d in the 4–7% of N interval (Supplementary Fig. 3b). The final step, consensus clustering, improved both the accuracy and the stability of the solution. k-means-based methods typically provide different outcomes depending on the initial conditions. We found that this variability was significantly reduced with the consensus approach (Fig. 1d).

To benchmark SC3, we considered five other methods: tSNE18 followed by k-means clustering (tSNE + k-means; similar to the method used by Grün et al.1), pcaReduce19, SNN-Cliq20, SINCERA21 and SEURAT22. SC3 performed better than the five tested methods across all benchmark datasets (Wilcoxon signed-rank test, P < 0.01), with only a few exceptions (Fig. 2a). In addition to considering accuracy, we also compared the stability of SC3 with other stochastic methods (pcaReduce and tSNE + k-means but not SEURAT) by running them 100 times (Fig. 2a,b and Online Methods). In contrast to the other methods that rely on different initializations, SC3 was highly stable.

Although SC3’s consensus strategy provided high accuracy, it came at a moderate computational cost: the run time for 2,000 cells was ~20 min (Supplementary Fig. 4a). The main bottleneck was the k-means clustering. By reducing how many runs were considered, it was possible to cluster 5,000 cells in ~20 min with only a slight reduction in accuracy (Supplementary Fig. 4b). To apply SC3 to even larger datasets, we implemented a hybrid approach that combines

1Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. 2Cambridge Institute for Medical Research, Wellcome Trust/MRC Stem Cell Institute and Department of Haematology, University of Cambridge, Hills Road, Cambridge, UK. 3Department of Mathematics and naXys, University of Namur, Namur, Belgium. 4ICTEAM, Université Catholique de Louvain, Louvain-la-Neuve, Belgium. 5Epigenetics Programme, The Babraham Institute, Babraham, Cambridge, UK. 6EMBL-European Bioinformatics Institute, Hinxton, Cambridge, UK. 7Centre for Trophoblast Research, University of Cambridge, Cambridge, UK. 8Department of Mathematics, Imperial College London, London, UK. Correspondence should be addressed to M.H. ([email protected]).
Received 28 November 2016; accepted 1 March 2017; published online 27 March 2017; doi:10.1038/nmeth.4236

nature methods | VOL.14 NO.5 | MAY 2017 | 483


[Figure 1a: pipeline schematic. Stages: Input (genes × N cells) → Gene filter (filtered genes × N cells) → Distances (Euclidean, Pearson, Spearman; N × N cells) → Transformations (PCA, Laplacian; N × N cells) → d range (d1, d2, …, dD) → k-means (clusterings indexed (1,d1) … (6,dD)) → Consensus (N × N cells, color scale 0–1).]

[Figure 1b: published datasets.]

Dataset                   k    N       Units
Gold standard
Biase (ref. 6)            3    49      FPKM
Yan (ref. 7)              7    90      RPKM
Goolam (ref. 8)           5    124     CPM
Deng (ref. 9)             10   268     RPKM
Pollen (ref. 10)          11   301     TPM
Kolodziejczyk (ref. 11)   3    704     CPM
Silver standard
Treutlein (ref. 12)       5    80      FPKM
Ting (ref. 13)            7    149     CPM
Patel (ref. 14)           5    430     TPM
Usoskin (ref. 15)         11   622     RPM
Klein (ref. 16)           4    2,717   UMI
Zeisel (ref. 17)          9    3,005   UMI

[Figure 1c: x axis, d (percent of N), 0–20; y axis, number of solutions with ARI > 95% of max., 0–40.]

[Figure 1d: ARI (0–1) for individual and consensus clusterings; panels Biase, Yan, Goolam, Deng, Pollen1, Pollen2, Kolodz., Treutlein, Ting, Patel, Usoskin1, Usoskin2, Usoskin3, Klein, Zeisel.]

Figure 1 | The SC3 framework for consensus clustering of scRNA-seq data. (a) Overview of clustering with SC3. Results of the consensus step are
shown for the Treutlein12 data. (b) Published datasets used to set SC3 parameters. N, number of cells; k, number of clusters originally identified by
the authors; RPKM, reads per kilobase of transcript per million mapped reads; RPM, reads per million mapped reads; FPKM, fragments per kilobase of
transcript per million mapped reads; TPM, transcripts per million mapped reads; UMI, unique molecular identifiers; CPM, counts per million mapped
reads. (c) Eigenvector (d) values that achieve adjusted Rand index (ARI) > 0.95 on gold-standard datasets. Black vertical lines indicate the interval
d = 4–7% of N, showing high accuracy in the classification. (d) 100 realizations of the SC3 clustering of the datasets in b. Dots represent individual
clustering runs and bars represent the median. Red and gray correspond to clustering with and without consensus step, respectively. The solid black
line corresponds to ARI = 0.8. The dashed black line separates gold- and silver-standard datasets.
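
As a rough illustration of panels c and a, the following Python sketch picks a d range of 4–7% of N and averages binary co-clustering matrices into a consensus matrix. SC3 itself is an R package; the function names `d_range` and `consensus_matrix`, the floor/ceiling rounding and the fixed seed are assumptions made for this example and are not SC3's API. The cap of 15 randomly chosen d values follows the Online Methods.

```python
import math
import random


def d_range(n_cells, lower=0.04, upper=0.07, max_values=15, seed=0):
    """Candidate numbers of eigenvectors d, taken as 4-7% of the cell count.

    The floor/ceil rounding and the seed are assumptions of this sketch.
    """
    low = max(1, math.floor(lower * n_cells))
    high = max(low, math.ceil(upper * n_cells))
    values = list(range(low, high + 1))
    if len(values) > max_values:
        # Keep a uniformly chosen random subset to bound the run time.
        values = sorted(random.Random(seed).sample(values, max_values))
    return values


def consensus_matrix(labelings):
    """Average the binary same-cluster matrices of several clusterings.

    labelings: list of label sequences, one per clustering, all over the
    same N cells. Entry [i][j] is the fraction of clusterings that put
    cells i and j in the same cluster.
    """
    n = len(labelings[0])
    totals = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all clusterings must label the same cells")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    totals[i][j] += 1
    return [[count / len(labelings) for count in row] for row in totals]


if __name__ == "__main__":
    print(d_range(268))  # 268 is the Deng dataset size listed in Fig. 1b
    toy = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]  # three toy clusterings
    for row in consensus_matrix(toy):
        print(row)
```

In SC3 the resulting matrix is then cut into k groups by complete-linkage hierarchical clustering, a step this sketch leaves out.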

unsupervised and supervised methodologies. SC3 selects a subset of 5,000 cells uniformly at random and obtains clusters from this subset as described above. Subsequently, the inferred labels are used to train a support vector machine, which then assigns labels to the remaining cells (Online Methods). The hybrid approach worked well to predict cell labels (Fig. 2c and Supplementary Fig. 4c). We were able to analyze a Drop-seq dataset with 44,808 cells and 39 clusters22, generating results in good agreement with the original study (Supplementary Results, Supplementary Fig. 5 and Supplementary Table 1). The main drawback of the sampling strategy is that rare cell-types may not be identified, and when the number of cells greatly exceeds 5,000, there is a substantial risk that the sampled distribution will differ significantly from the full distribution (Online Methods). For identifying rare subpopulations (for example, cancer stem cells), methods specifically designed for this purpose, such as RaceID1 or GiniClust23, may be more appropriate.

To help users choose an optimal number of clusters, we have implemented a method based on random matrix theory (RMT)24,25 (Online Methods). Overall, we found good agreement between RMT estimates and cluster numbers suggested by the original authors (Fig. 2b). SC3 is also interactive, allowing users to explore different choices of k in real time by assessing the consensus matrix (Fig. 2d), the silhouette index26 (a measure of how tightly grouped the cells in the clusters are) or the expression matrix.

SC3 can help to interpret the results of clustering by identifying differentially expressed genes, marker genes and outlier cells (Supplementary Fig. 6, Supplementary Table 2 and Online Methods). Marker genes are particularly useful since they can be used to uniquely identify a cluster. To illustrate these features, we analyzed the Deng9 dataset tracing embryonic developmental stages. The most stable result for k = 10 generated clusters that largely agreed with known sampling timepoints (Fig. 2d). We identified ~3,000 marker genes (Supplementary Table 3), many of which had been previously reported as developmental stage-specific27,28 and several of which were stage-specific but had not been previously reported (Supplementary Table 3). Notably, when using published reference labels9, we identified nine cells with high outlier scores (Supplementary Fig. 6c), which turned


[Figure 2a: ARI (0–1) per dataset (Biase, Yan, Goolam, Deng, Pollen1, Pollen2, Kolodz., Treutlein, Ting, Patel, Usoskin1, Usoskin2, Usoskin3, Klein, Zeisel) for the methods SC3, tSNE + k-means, pcaReduce, SNN-Cliq, SINCERA and SEURAT.]

[Figure 2b: estimation of k and solution stability. Datasets with several stability rows (Pollen, Usoskin) are listed in the order shown in the figure.]

                 Estimation of k                 Solution stability
Dataset          Ref   SC3   SINCERA   SNN-Cliq  SC3    pcaReduce   tSNE + k-means
Gold standard
Biase            3     3     5         6         1      0.12        0.82
Yan              7     6     6         11        1      0.53        0.89
Goolam           5     6     4         21        1      0.8         0.57
Deng             10    9     3         20        0.99   0.81        0.54
Pollen           11    11    9         14        0.79   0.76        0.89
                                                 1      0.95        0.88
Kolodziejczyk    3     10    18        2         1      0.87        0.85
Silver standard
Treutlein        5     3     19        3         1      0.68        0.35
Ting             7     10    10        13        1      0.89        0.62
Patel            5     17    10        25        1      0.93        0.44
Usoskin          11    11    11        20        0.93   0.84        0.37
                                                 0.95   0.77        0.39
                                                 0.95   0.74        0.42
Klein            4     18    7         305       1      0.59        0.8
Zeisel           9     30    8         330       0.95   0.89        0.43

[Figure 2c: ARI (0–1) versus percent of total number of cells in a training set (1, 2, 3, 10, 20, 30, 40, 50) for the datasets Deng, Pollen2, Kolodziejczyk, Patel, Usoskin3, Klein, Zeisel and Macosko.]

[Figure 2d: consensus matrix (color scale 0–1) for clusters 1–10. Stage legend: Zygote, Early two-cell, Mid-two-cell, Late two-cell, Four-cell, Eight-cell, 16-cell, Early blastocyst, Mid-blastocyst, Late blastocyst.]

Figure 2 | Benchmarking of SC3 against existing methods. (a) SC3, tSNE + k-means and pcaReduce were applied 100 times to each dataset. SNN-Cliq
and SINCERA are deterministic and were run only once. SEURAT was also run once but was optimized over values of the density parameter G (Online
Methods). Dots represent ARI between inferred clusterings and reference labels; bars correspond to median ARI. The solid black line indicates ARI = 0.8.
The dashed black line separates gold- and silver-standard datasets. (b) The number of clusters k̂ predicted by SC3, SINCERA and SNN-Cliq for all datasets.
Ref, reference clustering reported by the authors. Stability is defined as Nc/100, where Nc is the number of times the most frequent solution was found
from 100 runs. (c) Performance of the SC3 hybrid approach. Dots represent outliers higher (or lower) than the highest (or lowest) value within 1.5× the
interquartile range (IQR). The solid black line indicates ARI = 0.8. The dashed black line in the legend separates gold- and silver-standard datasets.
(d) Consensus matrix as generated by SC3 for the Deng9 dataset, indicating how often each pair of cells was assigned to the same cluster by the
different parameter combinations (1, always; 0, never). Colors at the top represent reference labels corresponding to stages of development.
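
The two quantities reported in this figure can be sketched in a few lines of Python. `adjusted_rand_index` follows the standard contingency-table definition of the ARI given in the Online Methods, and `stability` computes Nc divided by the number of runs (Nc/100 for 100 runs). Treating two runs as the same solution when their partitions match after relabeling is an assumption of this sketch, and the function names are illustrative rather than part of SC3.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same n elements."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length >= 2")
    # Contingency table n_ij, row sums a_i and column sums b_j.
    contingency = Counter(zip(labels_a, labels_b))
    row_sums = Counter(labels_a)
    col_sums = Counter(labels_b)
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in row_sums.values())
    sum_b = sum(comb(c, 2) for c in col_sums.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (e.g. one cluster in both labelings):
        # the two partitions are necessarily identical.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def canonical(labels):
    """Relabel clusters by order of first appearance (assumption:
    solutions that differ only by label names count as identical)."""
    mapping = {}
    return tuple(mapping.setdefault(label, len(mapping)) for label in labels)


def stability(runs):
    """Fraction of runs that returned the most frequent solution."""
    counts = Counter(canonical(run) for run in runs)
    return counts.most_common(1)[0][1] / len(runs)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    print(adjusted_rand_index(reference, [1, 1, 0, 0]))
    print(stability([[0, 0, 1], [1, 1, 0], [0, 1, 1]]))
```

The ARI depends only on the partition, not on the label names, which is why the sketch applies the same convention when counting repeated solutions.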

out to have been prepared using the Smart-seq2 protocol instead of the Smart-seq protocol9,20.

Finally, we investigated the ability of SC3 to identify subclones based on transcriptomes. Myeloproliferative neoplasms, a group of diseases characterized by the overproduction of terminally differentiated myeloid cells, reflect an early stage of tumorigenesis in which multiple subclones are known to coexist in the same patient29. Myeloproliferative neoplasms are thought to originate from hematopoietic stem cells (HSCs). To gain further insight into the transcriptional landscape of patient-derived HSCs, we obtained scRNA-seq data from two patients (Supplementary Figs. 7 and 8, Supplementary Table 4 and Online Methods). For patient 1 (N = 51), the silhouette index and the RMT method suggested that three clusters were optimal, and SC3 produced three clusters of similar size (Supplementary Fig. 9). For patient 2 (N = 89), SC3 generated a single cluster (Supplementary Fig. 10), in agreement with the RMT algorithm.

Since TET2 and JAK2V617F30,31 are the only loci with known driver mutations in these two patients, we hypothesized that clusters corresponded to clones with different combinations of mutations. The genotype composition of each HSC clone was determined by growing individual HSCs into granulocyte and macrophage colonies, followed by Sanger sequencing of the TET2 and JAK2V617F loci (Supplementary Fig. 7b,c). In agreement with SC3 clustering, patient 1 was found to harbor three different subclones: (i) cells with mutations in both loci, (ii) cells with a TET2 mutation and (iii) wild-type cells. Strikingly, the SC3 clusters contained 22%, 29% and 49% of the cells, respectively, in excellent agreement with the 20%, 30% and 50% found in the patient (Supplementary Fig. 7c). The HSC compartment of patient 2 was 100% mutant for TET2 and JAK2V617F (Supplementary Fig. 7c), again consistent with SC3 clustering (Supplementary Fig. 10).

We then analyzed the pooled cells from patients 1 and 2. SC3 clustering again suggested k = 3 (Fig. 3 and Supplementary Fig. 11), in agreement with the RMT algorithm. Notably, all of the putative double-mutant cells from patient 1 were grouped with the double-mutant cells from patient 2. SC3 reported 33 marker genes for the putative TET2 mutant and 202 marker genes for the putative double mutant clone (Fig. 3 and Supplementary Table 5). Together with additional evidence (Supplementary Results and Supplementary Fig. 12), we conclude that SC3 is able to identify subclones across patients.

[Figure 3: marker-gene expression heat map (color scale 0–14), cells grouped by cluster. Genes, in plotted order: SOX4, ABHD8, NEU1, TALDO1, CR1L, CEP135, MLLT3, PRSS57, DGKZ, SYTL1, PACS2, MS4A8B, POLR3H, ACO2, EIF5B, PAX5, CD83, WDFY3, EIF2AK3, MED25, CLDN6, PSME1, EFCAB11, MLL, CYBASC3, DUSP14, SYNRG, PHC1, IRF9, KLF3. Cell legend: Patient1−TET2 + JAK2V617F; Patient1−TET2 + WT JAK2V617F; Patient1−WT TET2 + WT JAK2V617F; Patient2−TET2 + JAK2V617F.]

Figure 3 | SC3 defines subclones from two patients with myeloproliferative neoplasm. Marker-gene expression matrix (after gene filter and log-transformation; see Online Methods) of a combined dataset of patient 1 and patient 2. Clusters (separated by white vertical lines) correspond to k = 3. Only the top 10 marker genes are shown for each cluster. WT, wild type (i.e., no mutation).

Methods
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

Acknowledgments
We thank B. Vangelov, J.-C. Delvenne and R. Lambiotte for fruitful discussions and for their help with computational methods. We also thank D. Flores Santa Cruz, D. Dimitropolou and J. Grinfeld for technical assistance with experiments. We thank I. Vasquez-Garcia, D. Harmin, M. Kosicki, D. Ramsköld and M. Huch for comments on the manuscript. V.Y.K., T.A., A.Y. and M.H. are supported by Wellcome Trust Grants. K.N.N. is supported by the Wellcome Trust Strategic Award ‘Single cell genomics of mouse gastrulation’. M.T.S. acknowledges support from FRS-FNRS; the Belgian Network DYSCO (Dynamical Systems, Control and Optimisation), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian State Science Policy Office; and the ARC (Action de Recherche Concerte) on Mining and Optimization of Big Data Models, funded by the Wallonia-Brussels Federation. M.B. acknowledges support from EPSRC (grant EP/N014529/1). T.C. was funded through a core funded fellowship by the Sanger Institute and a Chancellor's fellowship from the University of Edinburgh. K.K. and A.R.G. are supported by Bloodwise (grant ref. 13003), the Wellcome Trust (grant ref. 104710/Z/14/Z), the Medical Research Council, the Kay Kendall Leukaemia Fund, the Cambridge NIHR Biomedical Research Center, the Cambridge Experimental Cancer Medicine Centre, the Leukemia and Lymphoma Society of America (grant ref. 07037) and a core support grant from the Wellcome Trust and MRC to the Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute. W.R. was supported by BBSRC (grant ref. BB/K010867/1), the Wellcome Trust (grant ref. 095645/Z/11/Z), EU BLUEPRINT and EpiGeneSys.

AUTHOR CONTRIBUTIONS
M.H. conceived the study; V.Y.K., M.H., M.T.S., M.B., T.A. and A.Y. contributed to the computational framework; K.K. and T.C. performed the experiments for the patient data; K.N.N. helped with the analysis of embryonic mouse data; M.B., W.R., A.R.G. and M.H. supervised the research; and V.Y.K. and M.H. led the writing of the manuscript with input from the other authors.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Grün, D. et al. Nature 525, 251–255 (2015).
2. Jaitin, D.A. et al. Science 343, 776–779 (2014).
3. Mahata, B. et al. Cell Rep. 7, 1130–1142 (2014).
4. Gentleman, R.C. et al. Genome Biol. 5, R80 (2004).
5. McCarthy, D.J., Campbell, K.R., Lun, A.T.L. & Wills, Q.F. Bioinformatics https://doi.org/10.1093/bioinformatics/btw777 (2017).
6. Biase, F.H., Cao, X. & Zhong, S. Genome Res. 24, 1787–1796 (2014).
7. Yan, L. et al. Nat. Struct. Mol. Biol. 20, 1131–1139 (2013).
8. Goolam, M. et al. Cell 165, 61–74 (2016).
9. Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Science 343, 193–196 (2014).
10. Pollen, A.A. et al. Nat. Biotechnol. 32, 1053–1058 (2014).
11. Kolodziejczyk, A.A. et al. Cell Stem Cell 17, 471–485 (2015).
12. Treutlein, B. et al. Nature 509, 371–375 (2014).
13. Ting, D.T. et al. Cell Rep. 8, 1905–1918 (2014).
14. Patel, A.P. et al. Science 344, 1396–1401 (2014).
15. Usoskin, D. et al. Nat. Neurosci. 18, 145–153 (2015).
16. Klein, A.M. et al. Cell 161, 1187–1201 (2015).
17. Zeisel, A. et al. Science 347, 1138–1142 (2015).
18. van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).
19. Zurauskiene, J. & Yau, C. BMC Bioinformatics http://doi.org/10.1186/s12859-016-0984-y (2016).
20. Xu, C. & Su, Z. Bioinformatics https://doi.org/10.1093/bioinformatics/btv088 (2015).
21. Guo, M., Wang, H., Potter, S.S., Whitsett, J.A. & Xu, Y. PLoS Comput. Biol. 11, e1004575 (2015).
22. Macosko, E.Z. et al. Cell 161, 1202–1214 (2015).
23. Jiang, L., Chen, H., Pinello, L. & Yuan, G.-C. Genome Biol. 17, 144 (2016).
24. Patterson, N., Price, A.L. & Reich, D. PLoS Genet. 2, e190 (2006).
25. Tracy, C.A. & Widom, H. Commun. Math. Phys. 159, 151–174 (1994).
26. Rousseeuw, P.J. J. Comput. Appl. Math. 20, 53–65 (1987).
27. Guo, G. et al. Dev. Cell 18, 675–685 (2010).
28. Boroviak, T. et al. Dev. Cell 35, 366–382 (2015).
29. Chen, E., Staudt, L.M. & Green, A.R. Immunity 36, 529–541 (2012).
30. Ortmann, C.A. et al. N. Engl. J. Med. 372, 601–612 (2015).
31. Nangalia, J. et al. N. Engl. J. Med. 369, 2391–2405 (2013).

ONLINE METHODS
SC3 clustering. SC3 takes as input an expression matrix, M, in which columns correspond to cells and rows correspond to genes/transcripts. Each element of M corresponds to the expression of a gene/transcript in a given cell. By default, SC3 does not carry out any form of normalization or correction for batch effects. SC3 is based on five elementary steps. The parameters in each of these steps can be easily adjusted by the user but are set to sensible default values, determined via the gold-standard datasets (see main text).

1. Gene filter. The gene filter removes genes/transcripts that are either expressed (expression value > 2) in less than X% of cells (rare genes/transcripts) or expressed (expression value > 0) in at least (100 – X)% of cells (ubiquitous genes/transcripts). By default, X is set at 6. The motivation for the gene filter is that ubiquitous and rare genes are most often not informative for clustering. We also explored all three parameters defined in the gene filter (expression thresholds of rare and ubiquitous genes/transcripts and the percentage X) and found that in general the gene filter did not affect the accuracy of clustering (Supplementary Fig. 3c). However, the gene filter significantly reduced the dimensionality of the data, thereby speeding up the method.

For further analysis, the filtered expression matrix M is log-transformed after adding a pseudocount of 1: M′ = log2(M + 1).

2. Distance calculations. Distances between the cells (i.e., columns) in M′ are calculated using the Euclidean, Pearson and Spearman metrics to construct distance matrices.

We investigated the impact of dropouts on distance calculations by considering a modified distance metric that ignores dropouts. This was done by excluding genes that were not expressed in at least one cell from the distance calculation. We found that this did not improve the performance (Supplementary Fig. 3d).

3. Transformations. All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian (L = I – D^{–1/2} A D^{–1/2}, where I is the identity matrix, A is a similarity matrix (A = e^{–A′/max(A′)}), where A′ is a distance matrix) and D is the degree matrix of A, a diagonal matrix that contains the row-sums of A on the diagonal (D_ii = Σ_j A_ij). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.

4. k-means. k-means clustering is performed on the first d eigenvectors of the transformed distance matrices (Fig. 1a) by using the default kmeans() R function with the Hartigan and Wong algorithm32. By default, the maximum number of iterations is set to 10^9 and the number of starts is set to 1,000.

5. Consensus clustering. SC3 computes a consensus matrix using the cluster-based similarity partitioning algorithm (CSPA)33. For each individual clustering result, a binary similarity matrix is constructed from the corresponding cell labels: if two cells belong to the same cluster, their similarity is 1; otherwise the similarity is 0 (Fig. 1a). A consensus matrix is calculated by averaging all similarity matrices of individual clusterings. To reduce computational time, if the length of the d range (D in Fig. 1a) is more than 15, a random subset of 15 values selected uniformly from the d range is used.

The resulting consensus matrix is clustered using hierarchical clustering with complete agglomeration, and the clusters are inferred at the k level of hierarchy, where k is defined by the user (Fig. 1a). In principle, the k used for the hierarchical clustering need not be the same as the k used in step 5. However, for simplicity in SC3 the two parameters are constrained to have the same value. Figure 1d shows how the quality and the stability of clustering improves after consensus clustering.

Adjusted Rand index. If cell-labels are available (for example, from a published dataset) the adjusted Rand index (ARI)34 can be used to calculate similarity between the SC3 clustering and the published clustering. ARI is defined as follows. Given a set of n elements and two clusterings of these elements, the overlap between the two clusterings can be summarized in a contingency table, in which each entry denotes the number of objects in common between the two clusterings. The ARI can then be calculated as

\[
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}
\]

where n_{ij} are values from the contingency table, a_i is the sum of the ith row of the contingency table, b_j is the sum of the jth column of the contingency table and \binom{\cdot}{\cdot} denotes a binomial coefficient. Since the reference labels are known for all published datasets, ARI is used for all comparisons throughout the paper.

Downsampling of the gold-standard datasets. For each gene i and each cell j, the downsampled expression value was generated by drawing from a binomial distribution with parameters P = 0.1 and n = round(M_ij).

Additional validation of SC3 pipeline. Additionally, we investigated the impact of dropouts by considering a modified distance metric that ignores dropouts, but we found that this did not improve the performance (Supplementary Fig. 3d and Online Methods).

Identification of a suitable number of groups k̂. Matrix Z is obtained from M′ by subtracting the mean and dividing by the s.d. for each column (z-score). Next, the eigenvalues of X = Z^T Z are calculated. The number of clusters, k̂, is determined by the number of eigenvalues that are significantly different with P < 0.001 from the Tracy–Widom distribution24,25 with mean

\[
\left(\sqrt{n-1} + \sqrt{p}\right)^{2}
\]

and s.d.

\[
\left(\sqrt{n-1} + \sqrt{p}\right)\left(\frac{1}{\sqrt{n-1}} + \frac{1}{\sqrt{p}}\right)^{1/3},
\]

where n is the number of genes/transcripts and p is the number of cells.

Benchmarking. For each dataset we used the expression units provided by the authors of that set (Fig. 1b). The gene filter was applied to all the datasets. For tSNE + k-means, SNN-Cliq and pcaReduce, the same log-transformation as in SC3 (M′ = log2(M + 1))



was applied. For SINCERA, we used the original z-score normali- Biological insights. SC3 can identify differentially expressed
zation21 instead of the log-transformation. For tSNE, the Rtsne R genes as genes that vary between two or more clusters. Accordingly,
package was used with the default parameters. For SEURAT, we marker genes are identified as genes that are highly expressed in
used the original Seurat R package (version 1.3): we performed only one of the clusters and are able to distinguish one cluster
tSNE embedding with the default parameters once (following the from all the remaining ones (Supplementary Fig. 6a). Cell out-
authors’ tutorial at http://satijalab.org/seurat/seurat_clustering_ liers are identified through the calculation of a score for each
tutorial_part1.html) and then clustered the data using the cell using the minimum covariance determinant36. Cells that fit
DBSCAN algorithm multiple times, during which we varied the well into their clusters receive an outlier score of 0, whereas high
density parameter G in the range 10−3–103 to find a maximal values indicate that the cell should be considered an outlier.
ARI (this ARI is presented in Fig. 2a). SEURAT was not able to Identification of differential expression. Differential expres-
find more than one cluster for the smallest datasets (Biase, Yan, sion is calculated using the nonparametric Kruskal–Wallis test,
Goolam, Treutlein and Ting) leading to very small ARI scores. For an extension of the Mann–Whitney test for tests of more than
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.

all methods we supplied the k used by the original authors.

Cluster stability. We calculated stability of clustering solutions by running each method 100 times and finding the most frequent solution and the number of times (Nc) it appeared. The stability measure shown in Figure 2b is then calculated as Nc/100.

Support vector machines (SVM). When using SVM, a specific fraction of the cells is selected at random with uniform probability. Next, an SVM (ref. 35) model with a linear kernel is constructed based on the obtained clustering. We used the svm function of the e1071 R package with default parameters. The cluster IDs for the remaining cells are then predicted by the SVM model.

Identification of rare cell-types. To specifically evaluate the sensitivity of SC3 for identifying rare cell-types, we carried out a synthetic experiment in which cells from one cell-type were removed iteratively from the Kolodziejczyk and Pollen datasets. For the Pollen dataset, all but 1–7 of the cells in one of the 11 clusters were removed. The limit of 7 cells corresponds to the size of the smallest cluster in the original data. Subsequently, SC3 was run using k = 11, and we asked whether or not the cells of the rare cell-type were located in a separate cluster. This was repeated 100 times for each cell-type; Supplementary Figure 4d reports the percentage of runs in which the rare cells were found together in a cluster with no other cells. Note that the ARI is a poor indicator of the ability to identify rare cells, since this measure is relatively insensitive to the behavior of a small fraction of the cells. For the Kolodziejczyk dataset, we used a similar strategy, but we allowed for 1–101 cells in the rare group. For the Pollen dataset, SC3 can detect clusters containing ~1% of the cells, whereas for the Kolodziejczyk dataset ~10% of the cells are required (Supplementary Fig. 4d). We hypothesize that the ability to identify rare cells reflects the origins of the two datasets; the Pollen data is more diverse, as it represents 11 different cell lines, while the Kolodziejczyk data comes from one cell-type grown in three different conditions.

For the hybrid SC3 approach, with 30% of cells used to train the SVM, we were able to calculate the probability of including the rare cell-types in the training set analytically by multiplying the data from Supplementary Figure 4d by the probability of all rare cells to be included in the drawn sample (30% of all cells). This probability was calculated using the hypergeometric distribution R function: phyper(n.rare.cells - 1, n.rare.cells, n.other.cells, 0.3 * (n.other.cells + n.rare.cells), lower.tail = F), where n.rare.cells is the number of rare cells and n.other.cells is the number of other cells in the dataset (Supplementary Fig. 4e).

two groups. The Kruskal–Wallis test has the advantage of being nonparametric, but as a consequence, it is not well suited for situations in which many genes have the same expression value. A significant P-value indicates that gene expression in at least one cluster stochastically dominates one other cluster. SC3 provides a list of all differentially expressed genes with P < 0.01, corrected for multiple testing (using the default ‘holm’ method of the p.adjust() R function), and plots gene expression profiles of the 50 most significant differentially expressed genes. Note that calculating differential expression after clustering can introduce a bias in the distribution of P-values, and thus we advise using the P-values for ranking the genes only.

Identification of marker genes. For each gene, a binary classifier is constructed based on the mean cluster-expression values. The area under the receiver operating characteristic (ROC) curve is used to quantify the accuracy of the prediction. A P-value is assigned to each gene using the Wilcoxon signed-rank test, comparing gene ranks in the cluster with the highest mean expression with all others (P-values are adjusted using the default ‘holm’ method of the p.adjust() R function). Genes with areas under the ROC curve (AUROC) > 0.85 and with P < 0.01 are defined as marker genes. The AUROC threshold corresponds to the 99th percentile of the AUROC distributions obtained from 100 random permutations of cluster labels for all datasets (Supplementary Table 2 and Supplementary Fig. 6b). SC3 provides a visualization of the gene expression profiles for the top 10 marker genes of each obtained cluster.

Cell outlier detection. Outlier cells are detected by first taking an expression matrix of each individual cluster (all cells with the same labels) and reducing its dimensionality using the robust method for PCA (ROBPCA; ref. 37). This method outputs a matrix with N rows (number of cells in the cluster) and P columns (retained number of principal components after running ROBPCA). SC3 then uses the P = min(P, 3) first principal components for further analysis. If ROBPCA fails to perform or P = 0, SC3 shows a warning message. We found (results not shown) that this usually happened when the distribution of gene expression in cells was too skewed toward 0. Second, robust distances (Mahalanobis) between the cells in each cluster are calculated from the reduced expression matrix using the minimum covariance determinant (MCD; ref. 36). We then used a threshold based on the Q% quantile of the chi-squared distribution (with p degrees of freedom) to define outliers. By default Q = 99.99, but it can be manually adjusted by a user. Finally, we define an outlier score as the difference between the square root of the robust distance and the square root of the Q% quantile of the chi-squared distribution (with p degrees of freedom). The outlier score is plotted as a bar plot (Supplementary Fig. 6c).
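As an illustration of the outlier score defined under Cell outlier detection, the following Python sketch computes the score from precomputed robust distances. It is not the SC3 implementation (which is written in R). The function names are hypothetical, the inputs are assumed to be squared MCD-based Mahalanobis distances, and the chi-squared quantile is obtained by bisection on closed-form CDFs, which covers only the 1 to 3 degrees of freedom that arise when at most three principal components are retained.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF in closed form; this sketch supports df = 1, 2 or 3 only."""
    if x <= 0:
        return 0.0
    if df == 1:
        return math.erf(math.sqrt(x / 2))
    if df == 2:
        return 1.0 - math.exp(-x / 2)
    if df == 3:
        return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only supports df = 1, 2 or 3")


def chi2_quantile(q, df, upper=1000.0, tol=1e-10):
    """Invert the CDF by bisection (assumes 0 < q < 1 and a quantile below `upper`)."""
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def outlier_scores(squared_robust_distances, n_components, q_percent=99.99):
    """sqrt(robust distance) minus sqrt(Q% chi-squared quantile), one score per cell."""
    df = min(n_components, 3)  # at most the first three principal components are used
    threshold = math.sqrt(chi2_quantile(q_percent / 100, df))
    return [math.sqrt(d) - threshold for d in squared_robust_distances]


# Hypothetical squared distances for four cells in one cluster
print(outlier_scores([1.2, 3.5, 9.0, 64.0], n_components=3))
```

Under this definition a cell has a positive score exactly when its robust distance exceeds the Q% quantile, so lowering Q flags more cells.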

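The hypergeometric calculation described for the hybrid SC3 approach can be checked with a short Python sketch. It mirrors the upper tail returned by the phyper call quoted in the text, which reduces to the probability that every rare cell is drawn. The function name is hypothetical, and rounding the training-set size to an integer is an assumption made here, since the text does not say how a fractional 30% is handled.

```python
from math import comb


def prob_all_rare_sampled(n_rare, n_other, train_fraction=0.3):
    """P(every rare cell is in a uniform random sample drawn without replacement)."""
    total = n_rare + n_other
    n_drawn = round(train_fraction * total)  # assumption: sample size rounded to an integer
    if n_drawn < n_rare:
        return 0.0  # the sample is too small to contain all rare cells
    # Fix all rare cells in the sample, then fill the remaining slots from the other cells
    return comb(n_other, n_drawn - n_rare) / comb(total, n_drawn)


# Hypothetical dataset: 3 rare cells among 300
print(prob_all_rare_sampled(n_rare=3, n_other=297))
```

Multiplying this probability by the detection rate from the full SC3 run gives the analytical estimate described in the text.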
nature methods doi:10.1038/nmeth.4236

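The cluster stability measure (Nc/100) can be sketched in Python as follows. The text does not state how two runs are judged to be the same solution; this sketch assumes they match when they induce the same partition of the cells regardless of how the clusters are numbered. The function names are hypothetical.

```python
from collections import Counter


def canonical(labels):
    """Renumber clusters by order of first appearance so relabelled partitions compare equal."""
    mapping = {}
    out = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        out.append(mapping[lab])
    return tuple(out)


def stability(runs):
    """Return (Nc / number of runs, most frequent solution) for a list of label vectors."""
    counts = Counter(canonical(r) for r in runs)
    solution, n_c = counts.most_common(1)[0]
    return n_c / len(runs), solution


# Three hypothetical runs on three cells; the first two are the same partition
score, solution = stability([[1, 1, 2], [2, 2, 1], [1, 2, 2]])
print(score, solution)
```

With 100 runs per method, as in the text, the returned fraction is Nc/100.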

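The marker gene criterion (AUROC > 0.85 and adjusted P < 0.01) can be illustrated with a small Python sketch. The text does not spell out the classifier, so this sketch assumes that cells of the cluster with the highest mean expression form the positive class and that a gene's expression values serve as classifier scores. The AUROC is computed as the fraction of in-cluster/out-of-cluster pairs ranked correctly, and the adjusted P-value is taken as given rather than computed. All names are hypothetical.

```python
def auroc(scores_in, scores_out):
    """Fraction of (in-cluster, out-of-cluster) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for a in scores_in:
        for b in scores_out:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_in) * len(scores_out))


def is_marker(auc, p_adjusted, auc_min=0.85, p_max=0.01):
    """Apply the default thresholds quoted in the text."""
    return auc > auc_min and p_adjusted < p_max


# Hypothetical expression of one gene inside and outside the candidate cluster
auc = auroc([5.1, 6.3, 7.0], [0.0, 1.2, 5.5])
print(auc, is_marker(auc, p_adjusted=0.004))
```

Relaxing `auc_min` and `p_max` corresponds to the looser thresholds used for the patient data later in the Methods.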
Gene and pathway enrichment analysis. We used the g:Profiler web tool (ref. 38) to perform gene and pathway enrichment analysis in all obtained sets of genes.

Analysis of the Macosko dataset. To analyze the Drop-seq dataset we followed the procedure used by Macosko et al. (ref. 22) and selected the 11,040 cells in which more than 900 genes were expressed. Moreover, due to the low read depth, the gene filter was removed. We then sampled 5,000 cells and clustered using SC3, including the SVM step, 100 times. All 100 solutions were consistent with each other, resulting in an average ARI of 0.58, and they were sufficiently accurate compared to the reference authors’ clustering, yielding an average ARI of 0.54 (Supplementary Fig. 5a). Since each of the 100 solutions was different, we added a consensus clustering step using the ‘best of k’ consensus algorithm (ref. 39). This approach provided a single solution based on the 100 different solutions and was as accurate as the individual solutions, with an ARI of 0.52 (the actual labels are presented in Supplementary Table 1). The SC3 consensus solution splits the large original cluster (cluster 24 with 29,400 cells) hierarchically into two clusters of smaller sizes (18,105 + 10,558 = 28,663 cells; clusters 4 and 8 in Supplementary Fig. 5b). Additional gene and pathway enrichment analysis for the differentially expressed genes between the two clusters is presented in Supplementary Table 1. If more than 75% of the cells from the reference cluster were shared with the SC3 cluster, we defined these two clusters as matched. In total, 31 reference clusters were matched to the SC3 clusters.

Patients. Both patients provided written informed consent. Diagnoses were made in accordance with the guidelines of the British Committee for Standards in Haematology.

Isolation of hematopoietic stem and progenitor cells. Cell populations were derived from peripheral blood enriched for hematopoietic stem and progenitor cells (CD34+, CD38–, CD45RA–, CD90+), hereafter referred to as HSCs. For single cell cultures, individual HSCs were sorted into 96-well plates (Supplementary Fig. 7a,b) and grown in a cytokine cocktail designed to promote progenitor expansion as previously described (ref. 40). For scRNA-seq studies, single HSCs were directly sorted into lysis buffer, as described in Picelli et al. (ref. 41).

Determination of mutation load. Colonies of granulocyte/macrophage composition were chosen and DNA-isolated for Sanger sequencing for JAK2V617F and TET2 mutations as previously described by Ortmann et al. (ref. 30).

Single cell RNA-sequencing. Single HSCs were sorted into 96-well plates and cDNA generated as described previously (ref. 41). The Nextera XT library-making kit was used for library generation as described by Picelli et al. (ref. 41).

Processing of scRNA-seq data from HSCs. We sequenced 96 single cell samples per patient with two sequencing lanes per sample, yielding a variable number of reads (mean = 2,180,357; s.d. = 1,342,541). FastQC (ref. 42) was used to assess the sequence quality. Foreign sequences from the Nextera Transposase agent were discovered and subsequently removed with Trimmomatic (ref. 43), using the parameters HEADCROP:19 ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 TRAILING:28 CROP:90 MINLEN:60 to trim the reads to 90 bases, before mapping with TopHat (ref. 44) to the Ensembl reference genome version GRCh38.77, augmented with the spike-in controls downloaded from the ERCC consortium. Counts of uniquely mapped reads in each protein coding gene and each ERCC spike-in were calculated using SeqMonk (http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk) and were used for further downstream analysis. Quality control of the cells comprised two steps: (i) filtering cells based on the number of expressed genes and (ii) filtering cells based on the ratio of the total number of ERCC spike-in reads to the total number of reads in protein-encoding genes. Filtering thresholds were manually chosen by visual exploration of the quality control features (Supplementary Fig. 8). After filtering, 51 and 89 cells were retained from patient 1 and patient 2, respectively. The expression values in each dataset were then normalized by first using a size-factor normalization (from the DESeq2 package; ref. 45), to account for sequencing depth variability, and then by using a normalization based on ERCC spike-ins, performed using the RUVSeq package (ref. 46; RUVg() function with parameter k = 1), to account for technical variability. For combined patient data, normalization steps were performed after pooling the cells. The resulting filtered and normalized datasets were clustered by SC3. Potential biases in cell filtering on the proportions of cells in the clusters of patient 1 are considered in Supplementary Results 2. The cluster of lower cell quality was separated from the other biologically meaningful clusters of patient 1 and did not change the total proportion of the biologically meaningful clusters. Supplementary Results 3 shows that SC3 clustering results of patient 1 did not depend on the normalization procedure.

Clustering of patient scRNA-seq data by SC3. We clustered scRNA-seq data from patient 1 and patient 2 separately, as well as a combined dataset containing data from patient 1 + patient 2. For patient 1, in agreement with the RMT algorithm, the best clustering was achieved for k = 3 (Supplementary Fig. 9). Data from patient 2 was homogeneous, and SC3 was unable to identify more than one meaningful cluster (Supplementary Fig. 10), again in agreement with the RMT algorithm. For the combined dataset for patient 1 + patient 2, the best values of the silhouette index were obtained when k was 2 or 3 (Supplementary Fig. 11). In both cases, all of the cells from cluster 1 in patient 1 were grouped with the cells from patient 2. For k = 3, clusters 2 and 3 of patient 1 were also resolved (Fig. 3). The RMT algorithm also provided k = 3 for the merged patient 1 + patient 2 dataset.

Comparison of clustering of patient 1 scRNA-seq data. Results of the clustering of patient 1 data by other methods and their comparisons with SC3 are presented in Supplementary Results 4 and 5.

Identification of differentially expressed genes from microarray data. The microarray data of patient 1 was obtained from Array Express, under accession number E-MTAB-3086 (ref. 30). One replicate (2B) was identified as an outlier and removed. The ‘limma’ R package (ref. 47) was used to identify 932 differentially expressed genes between WT and TET2/JAK2V617F double-mutants using an adjusted (by false discovery rate) P-value threshold of 0.1.

Marker genes analysis for patients. For both patients, to increase the number of marker genes, the AUROC threshold was set to 0.7 instead of the default value of 0.85 and the P-value threshold was set at 0.1.

Data availability. All datasets (in Fig. 1b and the Macosko dataset) were acquired from the accession numbers provided in the original publications. According to their respective authors, the Pollen dataset contains two distinct hierarchies and the cells can be grouped either into 4 or 11 clusters, and the Usoskin dataset contains three hierarchies and the cells can be grouped either into 4, 8 or 11 clusters. scRNA-seq data for patient 1 and 2 is available from GEO under accession code GSE79102. Source data files for Figures 1–3, and Supplementary Figures 1–7 and 12 are available online.

Software availability. SC3 is available as an R package at http://bioconductor.org/packages/SC3/. Scripts for figure generation are available at http://github.com/hemberg-lab/SC3-paper-figures. At the time of writing the manuscript, the following old versions of some of the tools were used (these tools have been updated/upgraded since then):
• SC3 (1.1.2 ≤ Version < 1.1.5). These versions of SC3 can be installed:
  º from source/binary files from Bioconductor http://bioconductor.org/packages/3.3/bioc/html/SC3.html
  º from GitHub using commands:
    ■ install.packages("devtools")
    ■ devtools::install_github("hemberg-lab/SC3", ref = "8a86b60463")
  º SC3 v.1.1.2 source and DESCRIPTION files can be found in Supplementary Software 1.
  º In the newer versions, the main SC3 pipeline has not been changed.
• SEURAT (version 1.3), which can be installed from GitHub:
  º install.packages("devtools")
  º devtools::install_github("satijalab/Seurat", ref = "da6cd08")
  º In the newer versions of SEURAT, a different algorithm is used for clustering.
• Source files used for generating Supplementary Results 2–5 can be found in Supplementary Software 2.

32. Hartigan, J.A. & Wong, M.A. J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100–108 (1979).
33. Strehl, A. & Ghosh, J. J. Mach. Learn. Res. 3, 583–617 (2003).
34. Hubert, L. & Arabie, P. J. Classif. 2, 193–218 (1985).
35. Ben-Hur, A., Horn, D., Siegelmann, H.T. & Vapnik, V. J. Mach. Learn. Res. 2, 125–137 (2001).
36. Hubert, M. & Debruyne, M. WIREs Comp Stat 2, 36–43 (2010).
37. Hubert, M., Rousseeuw, P.J. & Branden, K.V. Technometrics 47, 64–79 (2005).
38. Reimand, J. et al. Nucleic Acids Res. 44, W83–W89 (2016).
39. Goder, A. & Filkov, V. Consensus clustering algorithms: comparison and refinement. in Proceedings of the Meeting on Algorithm Engineering & Experiments 109–117 (Society for Industrial and Applied Mathematics, 2008).
40. Petzer, A.L., Zandstra, P.W., Piret, J.M. & Eaves, C.J. J. Exp. Med. 183, 2551–2558 (1996).
41. Picelli, S. et al. Nat. Protoc. 9, 171–181 (2014).
42. Andrews, S. FastQC: A quality control tool for high throughput sequence data. Reference Source (2010).
43. Bolger, A.M., Lohse, M. & Usadel, B. Bioinformatics 30, 2114–2120 (2014).
44. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics 25, 1105–1111 (2009).
45. Love, M.I., Huber, W. & Anders, S. Genome Biol. 15, 550 (2014).
46. Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Nat. Biotechnol. 32, 896–902 (2014).
47. Ritchie, M.E. et al. Nucleic Acids Res. 43, e47 (2015).

