Full-length mRNA-Seq from single-cell levels of RNA
Genome-wide transcriptome analyses are routinely used to monitor tissue-, disease- and cell type–specific gene expression, but it has been technically challenging to generate expression profiles from single cells. Here we describe a robust mRNA-Seq protocol (Smart-Seq) that is applicable down to single cell levels. Compared with existing methods, Smart-Seq has improved read coverage across transcripts, which enhances detailed analyses of alternative transcript isoforms and identification of single-nucleotide polymorphisms. We determined the sensitivity and quantitative accuracy of Smart-Seq for single-cell transcriptomics by evaluating it on total RNA dilution series. We found that although gene expression estimates from single cells have increased noise, hundreds of differentially expressed genes could be identified using few cells per cell type. Applying Smart-Seq to circulating tumor cells from melanomas, we identified distinct gene expression patterns, including candidate biomarkers for melanoma circulating tumor cells. Our protocol will be useful for addressing fundamental biological problems requiring genome-wide transcriptome profiling in rare cells.

© 2012 Nature America, Inc. All rights reserved.
Analyses of transcriptomes through massively parallel sequencing of cDNAs (mRNA-Seq) generate millions of short sequence fragments that can be analyzed to accurately quantify expression levels1, assemble new transcripts2,3 and investigate alternate RNA processing4,5. These techniques have been consistently pushed toward development of methods that require lower starting amounts of RNA, ideally as small as single cells. A protocol initially developed for single-cell microarray studies6 has been adapted for mRNA-Seq and used to generate transcriptome data for individual mouse oocytes and early embryonic cells7,8. Using the method, thousands of genes [...]

[...] is a need for a single-cell transcriptome method that can be used to both quantify gene expression and provide the coverage for efficient detection of transcript variants and alleles.

Here we introduce a single-cell RNA-sequencing protocol with markedly improved transcriptome coverage, which samples cDNAs from more than just the ends of mRNAs. Using this protocol, we sequenced the mRNAs from many individual mammalian cells, as well as well-defined dilution series of purified total RNAs, to comprehensively assess how sensitivity, variability and detection of differing [...]
1Ludwig Institute for Cancer Research, Stockholm, Sweden. 2Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden. 3Illumina, Inc.,
Hayward, California, USA. 4Department of Chemical Physiology, Center for Regenerative Medicine, The Scripps Research Institute, San Diego, La Jolla, California,
USA. 5Rebecca and John Moores Cancer Center, San Diego, La Jolla, California, USA. 6Department of Reproductive Medicine, University of California, San Diego,
La Jolla, California, USA. 7These authors contributed equally to this work. Correspondence should be addressed to R.S. ([email protected]).
Received 14 November 2011; accepted 22 May 2012; published online 22 July 2012; doi:10.1038/nbt.2282
[...] or Tn5-mediated ‘tagmentation’ using the Nextera technology (Tn5). Both of these library preparation methods enable random shotgun sequencing of cDNAs (Supplementary Fig. 1). We generated Smart-Seq libraries from 42 individual human or mouse cells, and in addition we generated 64 libraries from dilution series of total RNA derived from human brain (16 samples), mouse brain (28 samples) and universal human reference RNA (UHRR, 20 samples). We sequenced each library on the Illumina platform, typically generating >20 million uniquely mapping reads (Supplementary Table 1). For comparison, we also made several standard mRNA-Seq libraries from 100 ng to a few micrograms of total RNA.

Quantitative assessment of single-cell transcriptomics
Analyses of gene expression from millions of cells using mRNA-Seq are highly reproducible and have low technical variation1,4. To our knowledge, no single-cell mRNA-Seq study has measured the technical variation intrinsic to the cDNA pre-amplification components of single-cell methods. We therefore diluted microgram amounts of reference total RNA down to nano- and picogram levels and applied Smart-Seq to assess its sensitivity, technical variability and detection of differentially expressed transcripts on low amounts of total RNA. For comparison, we generated standard mRNA-Seq libraries from 100 ng to microgram amounts of reference total RNA.
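To relate the diluted RNA amounts to cell numbers, the rule of thumb given in the Discussion (1 ng of total RNA corresponds roughly to 50–100 cells) can be turned into a quick conversion. The sketch below is illustrative arithmetic only; the function name and the assumption that RNA content scales linearly with cell number are ours and are not part of the protocol.

```python
# Rule of thumb from the Discussion: 1 ng total RNA ~ 50-100 cells.
CELLS_PER_NG_LOW = 50
CELLS_PER_NG_HIGH = 100

def cell_equivalents(picograms):
    """Return (low, high) cell equivalents for an RNA input given in picograms.

    Assumes RNA content scales linearly with cell number, which is an
    illustrative simplification rather than a measured relationship.
    """
    low = picograms * CELLS_PER_NG_LOW / 1000.0
    high = picograms * CELLS_PER_NG_HIGH / 1000.0
    return low, high

# Input amounts that appear in the dilution series and figures.
for pg in (1000, 100, 20, 10):
    low, high = cell_equivalents(pg)
    print(f"{pg:>5} pg total RNA ~ {low:g}-{high:g} cell equivalents")
```

Under this assumption the 10 pg dilutions sit at about one cell equivalent or less, which is why they are compared with single-cell libraries.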
[Figure 2: panels a–g. Axis labels: gene expression (RPKM, log10); standard deviation (log10); mRNA-Seq (RPKM, log2). Legend entries: Smart-Seq; mRNA-Seq; 1 ng, 100 pg, 20 pg and 10 pg total RNA; 10 cells; single cells; brain versus UHRR; within cell lines; between cell lines. Panel annotations: (e) Spearman r = 0.87, Pearson r = 0.88; (f) Spearman r = 0.77; (g) Spearman r = 0.57, Pearson r = 0.58.]

Figure 2 Sensitivity and variability in Smart-Seq from few or single cells. (a) Percentage of genes [...] and Smart-Seq generated data (y axis) starting from 1 ng total RNA (e), 100 pg total RNA (f) and 10 pg total RNA (g). Correlation coefficients computed from log2 transformed relative gene expression profiles, together with nonlinear loess regression curves (green) and y = x lines (red).
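The correlation coefficients quoted for panels e–g are computed on log2 relative expression (UHRR minus brain). A minimal sketch of that calculation is given here in plain Python; the RPKM values are made up for illustration, and the pseudocount used to avoid taking the log of zero is our assumption rather than a documented detail of the analysis.

```python
import math

def log2_relative(expr_a, expr_b, pseudocount=1.0):
    """Per-gene log2(a) - log2(b); the pseudocount is an assumed safeguard for zeros."""
    return [math.log2(a + pseudocount) - math.log2(b + pseudocount)
            for a, b in zip(expr_a, expr_b)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks in which tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up RPKM values for five genes (not data from the paper).
uhrr_standard, brain_standard = [120.0, 8.0, 0.5, 40.0, 3.0], [30.0, 16.0, 2.0, 40.0, 0.2]
uhrr_smartseq, brain_smartseq = [100.0, 6.0, 0.0, 55.0, 4.0], [35.0, 20.0, 1.5, 35.0, 0.0]

x = log2_relative(uhrr_standard, brain_standard)
y = log2_relative(uhrr_smartseq, brain_smartseq)
print("Pearson:", round(pearson(x, y), 2), "Spearman:", round(spearman(x, y), 2))
```

Reporting both coefficients is useful because Spearman depends only on the ordering of the genes, whereas Pearson is also sensitive to how far individual transcripts deviate.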
[...] detection sensitivity is affected by limiting starting amounts of RNA that lead to random loss of low-abundance transcripts, but the majority of low-abundance and the vast majority of highly expressed transcripts are reliably detected even in single cells.

Second, we determined the reproducibility in expression levels generated from diluted RNA and individual cells. Comparison of Smart-Seq and previous mouse oocyte data7 demonstrated improved estimation of expression with Smart-Seq (lower variability in data from oocyte to oocyte) across the whole range of expression levels (Supplementary Fig. 2k). Correlation analyses between technical replicates of diluted RNA showed increasing concordance with larger amounts of RNA. Comparing data from the single cells against the RNA-dilution data, we observed higher correlations (Pearson correlations of 0.75–0.85) among individual cells of the same type than among dilution replicates at 10 pg (Pearson correlations of 0.65–0.75) (Supplementary Fig. 6). As variability in measurements of expression depends on transcript expression levels, we computed the variability as a function of the expression level (Fig. 2c,d). This analysis showed that Smart-Seq on 10 ng total RNA had the same technical variability as standard mRNA-Seq and that Smart-Seq on 1 ng total RNA showed only a modest increase in technical noise (Fig. 2c). When lowering the input amounts down to picogram levels, there was a clear increase in technical variability, particularly for less abundantly expressed transcripts (Fig. 2c). We compared technical variability at picogram levels of total RNA to the biological variation found in comparisons of human brain samples and UHRR using standard mRNA-Seq (Fig. 2c). Notably, analyses of gene-expression variation between individual cancer cells of different origin revealed extensive biological variation in highly expressed genes (Fig. 2d).

Finally, we assessed whether single-cell expression profiles from preamplified material were representative of the original expression profiles. Comparing relative gene expression levels (UHRR minus brain) estimated using standard mRNA-Seq to those estimated from Smart-Seq with different amounts of input RNA, we again found a high concordance (Fig. 2e–g). Starting with 1 ng or 100 pg total RNA, the relative expression in Smart-Seq and standard mRNA-Seq had Spearman correlations of 0.87 and 0.77, respectively (Fig. 2e,f). Comparisons with 10 pg input RNA showed overall good correlation (Fig. 2g) but identified two populations of transcripts with distorted expression in Smart-Seq data from either human brain sample or UHRR, reflecting
[Figure 3: panels a–d. (a) SVD component 1 versus SVD component 2 for PC3 (n = 4), LNCaP (n = 4) and T24 (n = 4) cells; counts next to the arrows: 660, 433 and 731. (b) Number of exons for total RNA inputs of 1 ng, 100 pg, 50 pg and 10 pg and for single cells; groups: mouse brain, mouse ES/other cells, human cancer cells. (c) Read coverage over NEDD4L (NM_015277, NM_001144967) for T24 cells 1–4 and LNCaP cells 1–4. (d) Number of exons versus false discovery rate (%) for LNCaP versus PC3, T24 versus PC3 and LNCaP versus T24.]

Figure 3 Transcriptional and post-transcriptional analyses of cancer cell line cells using Smart-Seq. (a) Categorization of individual cells according to cell line of origin using single-cell Smart-Seq transcriptomes. Singular-value decomposition (SVD) analysis was conducted for 12 individual cancer cells (four cells each from the PC3, LNCaP and T24 cancer cell lines) based on global gene expression profiles. Projections are shown based on the first two dimensions that capture most of the variance. The numbers of significantly differentially expressed genes per pair-wise cell line comparison are shown next to the arrows (P < 0.05, 1-way ANOVA and Tukey post-hoc test). (b) Mean number of exons with sufficient read coverage for MISO analyses of exon inclusion levels in sequence-depth matched single-cell mRNA-Seq data. Smart-Seq data from diluted mouse brain RNA (green) compared with previously published mouse ESCs8 and ESC-derived cells (red) and 12 Smart-Seq–analyzed individual prostate and bladder cell line cells (purple). Individual RNA or cell measurements are plotted. (c) Single-cell Smart-Seq reads mapping to a portion of the NEDD4L gene locus from four individual T24 and LNCaP cells. Read coverage is shown as a heatmap with darker blue indicating higher read coverage. (d) Number of differentially included exons identified among the PC3, LNCaP and T24 cell lines from single-cell Smart-Seq analysis on four cells per cell line as a function of estimated false discovery rate.
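Panel d relies on a false discovery rate that the Online Methods estimate by mixing the sample groups: half of each group is exchanged with half of the other, significant exons are counted again, and that count is divided by the count obtained with the correct groups. The sketch below illustrates the idea under simplifying assumptions: it uses a plain Welch t statistic instead of the adjusted statistic described in the Online Methods, a fixed cutoff on |t| instead of permutation P values, and made-up inclusion levels. All function names are hypothetical.

```python
import math
import random

def welch_t(a, b):
    """Plain Welch t statistic (the paper's variance adjustment is not reproduced)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / denom if denom > 0 else 0.0

def count_significant(group_a, group_b, cutoff):
    """Number of exons with |t| >= cutoff; rows are exons, columns are cells."""
    return sum(abs(welch_t(a, b)) >= cutoff for a, b in zip(group_a, group_b))

def mixed_groups(group_a, group_b, rng):
    """Exchange a random half of the cells between groups (same cells for every exon)."""
    half = min(len(group_a[0]), len(group_b[0])) // 2
    idx_a = rng.sample(range(len(group_a[0])), half)
    idx_b = rng.sample(range(len(group_b[0])), half)
    mix_a, mix_b = [], []
    for a, b in zip(group_a, group_b):
        a, b = list(a), list(b)
        for i, j in zip(idx_a, idx_b):
            a[i], b[j] = b[j], a[i]
        mix_a.append(a)
        mix_b.append(b)
    return mix_a, mix_b

def estimate_fdr(group_a, group_b, cutoff, n_rounds=200, seed=0):
    """Mean count of significant exons in mixed groups divided by the observed count."""
    rng = random.Random(seed)
    observed = count_significant(group_a, group_b, cutoff)
    if observed == 0:
        return 0.0
    null_counts = [count_significant(*mixed_groups(group_a, group_b, rng), cutoff)
                   for _ in range(n_rounds)]
    return min(1.0, (sum(null_counts) / n_rounds) / observed)

# Made-up exon inclusion levels: three exons, four cells per cell line.
group_a = [[0.90, 0.95, 0.92, 0.94], [0.50, 0.52, 0.48, 0.51], [0.30, 0.33, 0.31, 0.29]]
group_b = [[0.10, 0.20, 0.15, 0.12], [0.50, 0.49, 0.53, 0.50], [0.32, 0.30, 0.28, 0.31]]

print("significant exons:", count_significant(group_a, group_b, cutoff=5.0))
print("estimated FDR:", estimate_fdr(group_a, group_b, cutoff=5.0))
```

Sweeping the cutoff and recording the exon count at each estimated rate would give a curve of the kind shown in panel d.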
stochastic losses, mostly of low-abundance transcripts when starting with such minute RNA amounts (Fig. 2a,g). Preamplification of cDNA could also lead to disproportionate amplification of short transcripts, but we found no systematic bias (Supplementary Fig. 7). A previous microarray study had analyzed PCR-amplified cDNA (from picogram starting amounts) and found the transcriptome overall preserved but skewed10. Our data from 1 ng and 100 pg total RNA showed no skewing, that is, the loess slopes estimated from the data approximated 1 (Fig. 2e–g). Together, these results demonstrated that transcriptome analyses from few or single cells, in general, preserved relative differences in expression for detected transcripts.

Transcriptional and post-transcriptional differences
Having demonstrated the improved performance of Smart-Seq on low amounts of RNA compared with previously published methods, we focused our analyses on single-cell transcriptomes from prostate (PC3 and LNCaP) and bladder (T24) cancer cell line cells. The global gene expression of 12 individual cells (four from each cell line) clustered according to cell line of origin and we identified hundreds of differentially expressed genes among the three cell lines (Fig. 3a; q < 0.05 ANOVA; P < 0.05 post-hoc test).

The pronounced 3′-end bias of previous single-cell mRNA-Seq studies has hampered the ability to identify alternative splicing differences in single cells. We used the Bayesian mixture of isoforms framework (MISO)11 to infer exon inclusion levels for known alternatively spliced exons in the 12 individual cells. The improved read coverage with Smart-Seq resulted in a twofold increase in the number of potential alternatively spliced exons that could be assessed, compared to previously published single-cell mRNA-Seq data (Fig. 3b), substantially improving our ability to detect alternative splicing. Cell type–specific alternative splicing could be inferred from single-cell transcriptomes, as seen in read coverage across the differentially included exon 13 of the NEDD4L gene (Fig. 3c). This exon was frequently included in LNCaP cells (93% mean inclusion level) but was included at much lower levels in T24 cells (15% mean inclusion level), whereas low expression of NEDD4L in PC3 cells precluded inclusion level estimation. In this comparison of three cancer cell lines, we found 100 exons with differential inclusion levels among the three cell lines at a false discovery rate below 1% (Fig. 3d and Supplementary Table 3). We conclude that Smart-Seq considerably improves our ability to detect alternative RNA processing in single cells.

Analyses of circulating tumor cell transcriptomes
Having demonstrated that Smart-Seq generates quantitative and reproducible single-cell transcriptomes, we asked whether global transcriptome analyses of putative CTCs could reveal their tumor of origin and provide data to support the use of this method for unbiased cancer-specific biomarker identification. To this end, we generated transcriptomes from six single NG2+ putative melanoma CTCs isolated from peripheral blood drawn from a patient with recurrent melanoma using immunomagnetic purification with a MagSweeper instrument (Illumina)12. For comparison, we also generated Smart-Seq libraries from single cells derived from primary melanocytes (n = 2), melanoma cancer cell line (SKMEL5, n = 4 and UACC257, n = 3) cells and from human embryonic stem cells (ESCs, n = 8). As the NG2+ putative CTCs were isolated from blood, it was important to compare them to blood cells. The putative CTCs were distinct from lymphoma cell lines (BL41 and BJAB)13 and immune tissues (lymph node and white blood cell samples), as well as embryonic stem cells, and instead were highly similar to primary melanocytes and melanoma cell line cells. Unsupervised hierarchical clustering and correlation analyses of gene expression levels showed a clear clustering of cells according to cell type of origin (Fig. 4a and Supplementary Fig. 8), and separation from the human brain RNA samples that were previously analyzed with Smart-Seq or mRNA-Seq (data not shown). Additional support for the melanocytic origin of the putative melanoma CTCs came from analyses of melanocyte lineage–specific markers, as all NG2+ cells expressed high mRNA levels for MLANA14, TYR15 and the melanocyte-specific m-form of MITF16 but not immune markers such as PTPRC (Fig. 4b), in contrast to peripheral blood lymphocytes (Supplementary Fig. 9). Furthermore, NG2+ cells expressed high levels of melanoma-associated genes (based on our unbiased selection of the 100 transcripts most strongly associated
[Figure 4: panels a–g. (a) Clustering dendrogram with bootstrap values over immune samples, CTC, PM, SKMEL, UACC, ESC, T24, LNCaP and PC3. Axis labels in the remaining panels: expression (RPKM, log2); expression (RPKM, log10); CTC gene expression (RPKM) for immune and melanoma gene sets (***); relative expression (log2, scale −5 to 5). Heatmap columns: SKMEL, UACC, PM, CTC, ESC and BL, with additional immune columns (BJ, W, L). (g) Chr11: 89017950–89017970, TYR (UCSC genes), SNP rs1126809 (dbSNP 132, found in ≥ 1% of samples); read counts per cell, columns labeled PM then CTC: A: 1, 0, 152, 117, 47, 198, 56, 0; G: 1,133, 78, 0, 0, 0, 1, 1, 0.]

Figure 4 Single-cell transcriptomes of circulating tumor cells. (a) Hierarchical clustering of human samples based on gene expression of highly expressed genes (>100 RPKM). Coloring indicates high-order clusters and the confidence in clusters is indicated with bootstrap values (percentage). Samples analyzed include human immune samples (Burkitt’s lymphoma cell lines BL41 and BJAB, and white blood cells and lymph node samples) and cells from putative melanoma CTCs (CTC), primary melanocytes (PM), melanoma cell lines SKMEL5 (SKMEL) and UACC257 (UACC), prostate cancer cell lines (LNCaP and PC3), bladder cancer cell line (T24) and human embryonic stem cells (ESC). (b) Expression of melanocyte markers (PMEL, MITF, TYR and MLANA) and immune marker PTPRC in single-cell transcriptomes from (a), with Burkitt’s lymphoma cell lines BL41 and BJAB (BL). (c) Gene expression levels in CTCs for an unbiased set of 100 immune and melanoma markers. (d–f) Heatmaps showing relative expression of melanoma-associated tumor antigens (d), upregulated plasma-membrane proteins (e), and downregulated plasma-membrane proteins (f) in single-cell transcriptomes as in (b) with the addition of more immune samples (W, white blood cells; L, lymph node). (g) Number of reads from individual PMs and putative CTCs that support the reference (G) or risk (A) allele for the melanoma-associated SNP (rs1126809).
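Panel g gives, per cell, the number of reads supporting each allele at rs1126809, and the main text treats a site as high confidence when an alternative allele is supported in at least two CTCs. A rough sketch of that tally follows. The counts are the numbers printed in panel g; the assignment of the eight columns to two PM cells followed by six CTCs is our reading of the panel, and the minimum read threshold is a hypothetical parameter, not a value from the paper.

```python
# Read counts at rs1126809 as printed in Fig. 4g.
# Assumed column order: two primary melanocytes, then six putative CTCs.
cells = ["PM1", "PM2", "CTC1", "CTC2", "CTC3", "CTC4", "CTC5", "CTC6"]
reads_a = [1, 0, 152, 117, 47, 198, 56, 0]   # risk allele (A)
reads_g = [1133, 78, 0, 0, 0, 1, 1, 0]       # reference allele (G)

MIN_ALT_READS = 3  # hypothetical threshold for counting a cell as supporting the allele

def cells_supporting_alt(names, alt_counts, prefix, min_reads=MIN_ALT_READS):
    """Names of cells of one type with at least min_reads reads for the alternative allele."""
    return [name for name, count in zip(names, alt_counts)
            if name.startswith(prefix) and count >= min_reads]

def alt_fraction(alt, ref):
    """Fraction of reads supporting the alternative allele, or None without coverage."""
    total = alt + ref
    return None if total == 0 else alt / total

supporting = cells_supporting_alt(cells, reads_a, "CTC")
high_confidence = len(supporting) >= 2  # "at least two CTCs" criterion from the text

print("CTCs supporting A:", supporting, "high confidence:", high_confidence)
for name, a, g in zip(cells, reads_a, reads_g):
    print(name, alt_fraction(a, g))
```

Returning None for uncovered cells keeps a missing measurement distinct from a cell that was covered but showed no alternative reads.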
with melanoma; see Online Methods), but not immune cell–associated genes selected in a similar manner (Fig. 4c, P < 3.7 × 10^−15, Wilcoxon rank sum test). Thus, both their global transcriptomes and expression patterns of melanoma-associated transcripts clearly support a melanoma CTC identity for the NG2+ cells.

We next investigated whether the NG2+ putative CTCs showed signs of originating from a melanoma tumor. Comparison of their gene expression profiles with those of individual primary melanocytes identified 289 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) higher expression in the putative CTCs than the primary melanocytes, and 436 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) lower levels (Supplementary Table 4). The upregulated genes were significantly (Benjamini-Hochberg adjusted P < 0.05) enriched for melanoma-associated antigens (Fig. 4d and Supplementary Fig. 10) that have been repeatedly found to be upregulated in cancer17, mitotic cell cycle genes and additional categories (Supplementary Table 5). Downregulated genes were enriched for regulators of cell death and MHC class I genes. Notably, the preferentially expressed antigen in melanoma (PRAME) was highly expressed in NG2+ cells, which, together with elevated expression of known melanoma tumor antigens, provides strong support for the conclusion that the NG2+ cells were CTCs that originated from a melanoma.

In recent years, there has been a strong interest in identifying CTCs from different tumors using the a priori assumption that plasma-membrane proteins would be good diagnostic biomarkers. We used the CTC transcriptome analysis to screen for membrane proteins selectively expressed in melanoma-derived CTCs compared to primary melanocytes and immune cells. We identified nine upregulated plasma membrane–associated transcripts in the CTCs compared to primary melanocytes (q < 0.05 ANOVA; P < 0.05 post-hoc test), many of which are not expressed in immune cells and have not been previously associated with melanomas (Fig. 4e). Similarly, screening for loss of expression of plasma-membrane proteins identified 37 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) lower expression
in the CTCs than primary melanocytes (Fig. 4f). Of note, epithelial Cadherin 1 (CDH1) showed no expression in the CTCs, and loss of CDH1 is thought to contribute to cancer progression by increasing proliferation, invasion and metastasis18. We also found downregulation of genes associated with the escape from immune surveillance, including five HLA genes (Fig. 4f), and TRPM1, suggesting that these gene expression changes might enable the CTCs to escape from immune surveillance. Notably, low expression of TRPM1 has been shown to correlate with melanoma aggressiveness and metastasis19. Future studies of these membrane proteins will likely enhance our understanding of CTC migration and invasiveness, and these results highlight the utility of studying single CTCs with RNA-Seq.

Lastly, we investigated whether Smart-Seq transcriptome data could be mined for single-nucleotide polymorphisms (SNPs) and other genetic variants associated with melanomas or other cancers. With the improved read coverage provided by the Smart-Seq method, we identified 4,312 high-confidence genomic sites with support for an alternative allele in at least two CTCs, whereas genotype calls only supported by a single cell showed an excess of previously unidentified, likely artifactual, sites (Supplementary Fig. 11) together with a smaller subset (9%) of A-to-G RNA editing sites (data not shown). Ninety-two percent of the high-confidence sites coincided with documented SNPs, for example, the melanoma-associated SNP in the TYR gene (rs1126809)20 (Fig. 4g). We conclude that Smart-Seq enables screening for SNPs and mutations in transcribed regions using only a few cells.

DISCUSSION
Generating high-coverage transcriptomes from single cells and small numbers of cells will have many applications for studying rare cells; such cells can be either individually picked or identified through cell sorting or laser-capture techniques. Our results showed that using Smart-Seq on 10 ng of total RNA was practically indistinguishable from a standard mRNA-Seq, whereas starting with 1 ng (corresponding roughly to 50–100 cells) showed only a minor (less than twofold) increase in expression-level variability. Therefore, this method could be applied to studies on homogeneous cell populations available in quantities of tens to hundreds of cells.

However, many biologically and clinically important cell types exist in rare quantities and often in heterogeneous milieus, which necessitates single-cell approaches. Smart-Seq generates robust and quantitative transcriptome data from single cells. We found hundreds of differentially expressed genes using only a few individual cells per cell type; for example, comparing only two primary melanocytes to six melanoma CTCs identified biologically meaningful differences. Even sequencing of a single cell yielded useful information, as we, in each cell, detected most of the genes active in a culture of LNCaP cells.

Smart-Seq is a robust method for single-cell RNA-Seq with improved read coverage across transcripts, which enables more detailed analyses of alternative splicing. Based on our CTC transcriptome results, single-cell analyses using Smart-Seq are also highly informative for identifying candidate biomarkers, SNPs and mutations. In conclusion, data sets obtained with the Smart-Seq protocol provide improved representation of the transcriptomes of individual cells, which should be useful for both basic and clinical studies.

Methods
Methods and any associated references are available in the online version of the paper.

Accession code. Gene Expression Omnibus: GSE38495 (sequencing read data).

Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank C. Burge and G. Winberg for critical reading of the manuscript, T. Juarez and J. Cotton at the University of California San Diego for their help in Internal Review Board protocol preparation and acquisition of clinical samples, A.A. Talasaz and G. Cann for assistance with the MagSweeper, and members of the Science for Life laboratory (Stockholm) for assistance with the MiSeq sequencer. Y.-C.W. was supported by a fellowship from the Marie Mayer Foundation. L.C.L. was supported by US National Institutes of Health (NIH) K12HD001259. J.F.L. was supported by NIH R33MH87925 and California Institute for Regenerative Medicine (CL1-00502, RT1-01108, TR1-01250, and RN2-00931). R.S. was supported by European Research Council (starting grant 243066), Swedish Research Council (2008-4562), Foundation for Strategic Research (FFL4) and Åke Wiberg Foundation (756194131).

AUTHOR CONTRIBUTIONS
D.R. designed and performed the computational analyses of sequencing reads, prepared figures, tables and methods, and contributed manuscript text. S.L. and R.L. developed protocols and created libraries. I.K. and S.L. did primary data analysis. Y.-C.W., G.A.D. and J.F.L. prepared melanoma circulating tumor cells, melanocytes and melanoma cell line cells. O.R.F. and Q.D. contributed additional sequencing libraries. L.C.L. and G.P.S. contributed to study design and manuscript text. R.S. designed the study and prepared the manuscript, with input from other authors.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.

Published online at http://www.nature.com/doifinder/10.1038/nbt.2282.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Mortazavi, A., Williams, B., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
2. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
3. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
4. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
5. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).
6. Kurimoto, K. et al. An improved single-cell cDNA amplification method for efficient high-density oligonucleotide microarray analysis. Nucleic Acids Res. 34, e42 (2006).
7. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
8. Tang, F. et al. Tracing the derivation of embryonic stem cells from the inner cell mass by single-cell RNA-Seq analysis. Cell Stem Cell 6, 468–478 (2010).
9. Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
10. Iscove, N.N. et al. Representation is faithfully preserved in global cDNA amplified exponentially from sub-picogram quantities of mRNA. Nat. Biotechnol. 20, 940–943 (2002).
11. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
12. Talasaz, A.H. et al. Isolating highly enriched populations of circulating epithelial cells and other rare cells from blood using a magnetic sweeper device. Proc. Natl. Acad. Sci. USA 106, 3970–3975 (2009).
13. Shukla, S. et al. CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing. Nature 479, 74–79 (2011).
14. Jungbluth, A.A. et al. Expression of melanocyte-associated markers gp-100 and Melan-A/MART-1 in angiomyolipomas. An immunohistochemical and rt-PCR analysis. Virchows Arch. 434, 429–435 (1999).
15. Tomita, Y., Montague, P.M. & Hearing, V.J. Anti-T4-tyrosinase monoclonal antibodies–specific markers for pigmented melanocytes. J. Invest. Dermatol. 85, 426–430 (1985).
16. Fang, D. & Setaluri, V. Role of microphthalmia transcription factor in regulation of melanocyte differentiation marker TRP-1. Biochem. Biophys. Res. Commun. 256, 657–663 (1999).
17. Chomez, P. et al. An overview of the MAGE gene family with the identification of all human members of the family. Cancer Res. 61, 5544–5551 (2001).
18. Tang, A. et al. E-cadherin is the major mediator of human melanocyte adhesion to keratinocytes in vitro. J. Cell Sci. 107, 983–992 (1994).
19. Duncan, L.M. et al. Down-regulation of the novel gene melastatin correlates with potential for melanoma metastasis. Cancer Res. 58, 1515–1520 (1998).
20. Gudbjartsson, D.F. et al. ASIP and TYR pigmentation variants associate with cutaneous melanoma and basal cell carcinoma. Nat. Genet. 40, 886–891 (2008).
[...] cDNA (~5 ng cDNA) was used to construct Illumina sequencing libraries using either Illumina’s Ultra Low Input mRNA-Seq Guide (the ‘PE’ protocol) or a modification of Epicentre’s Nextera DNA sample preparation protocol (the ‘Tn5’ protocol). With the PE protocol, the amplified cDNA was fragmented using a Covaris acoustic shearing instrument. The resulting fragments were end-repaired, followed by the addition of a single A base, ligation to Illumina PE adaptors, and then amplification in 12–18 cycles of PCR (depending on starting amounts of RNA, see Supplementary Table 1 for detailed instructions of all libraries generated). With the Tn5 protocol, the amplified cDNA was ‘tagmentated’ at 55 °C for 5 min in a 20-µl reaction with 0.25 µl of transposase and 4 µl of 5× HMW Nextera reaction buffer. We added 35 µl of PB to the tagmentation reaction mix to strip the transposase off the DNA, and the tagmentated DNA was purified with 88 µl of SPRI XP beads (sample to beads ratio of 1:1.6). Purified DNA was then amplified by nine cycles of standard Nextera PCR. Library quality was confirmed using DNA 1000 kits on a Bioanalyzer (Agilent), and the libraries were then sequenced on either Illumina’s HiSeq 2000, GAIIx or MiSeq instruments, and all clusters that passed filter were exported into fastq files. Details on the sequence depth, sequencing platform and library construction method for each dilution replicate and single cell are included in Supplementary Table 1. All data shown in the figures of this manuscript were generated using the PE protocol unless otherwise specified in the figure legend.

Construction and sequencing of standard mRNA-Seq libraries. We generated mRNA-Seq transcriptome data following the Illumina mRNA-Seq kit from 100 ng and 1 µg of total RNA, as detailed in Supplementary Table 1.

[...] procedure ensured that mapped reads were unique across both the genome and transcriptome, while allowing for reads to map to different transcripts of the same gene in the initial transcriptome mapping. The uniquely mapped reads were converted to binary BAM files using Samtools22. The resulting transcriptome data were visualized using the Integrative Genomics Viewer (IGV, Broad Institute) using the histogram visualization for Supplementary Figure 3 and heatmap visualization for Figure 3c.

Expression level estimation and technical comparisons of sensitivity and variation. Gene expression levels for RefSeq transcripts were summarized as RPKM values and read counts using rpkmforgenes23. RefSeq annotations for human and mouse were downloaded on 31 August 2011 and 13 December 2010, respectively. RPKM calculations only considered uniquely mappable positions for transcript length normalizations using the ENCODE Mappability track (wgEncodeCrgMapabilityAlign50mer.bigWig) for human and in-house–computed uniqueness files for mouse. Overlapping RefSeq transcripts were collapsed giving one expression value per gene locus. Only 10 million randomly selected mapped reads were used per sample to compare sensitivity and variation in gene and exon levels. Samples with fewer than 10 million uniquely mappable reads (a few ESCs8) were therefore discarded from analyses. Samples with 20 pg of total RNA (used in Fig. 2b,d) were simulated by using 5 million reads each of two different 10 pg samples. Analyses of gene detection (Fig. 2a,b and Supplementary Fig. 4b,c) were calculated over pairs of technical replicates or individual cells. Genes were binned by the highest expression level of the two samples, and a gene was considered detected if it had an [...]
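The expression estimates described above rest on two simple operations: RPKM normalization over uniquely mappable transcript length, and pooling equal numbers of reads from two 10 pg samples to simulate a 20 pg sample. The sketch below shows both using the standard RPKM formula; it is not the rpkmforgenes implementation. The helper names, the toy read lists (each read is represented only by the gene it maps to) and the mappable lengths are hypothetical.

```python
import random
from collections import Counter

def rpkm(read_count, mappable_length, total_mapped_reads):
    """Reads per kilobase of uniquely mappable transcript per million mapped reads."""
    if mappable_length <= 0 or total_mapped_reads <= 0:
        return 0.0
    return read_count * 1e9 / (mappable_length * total_mapped_reads)

def pool_reads(reads_1, reads_2, n_each, rng):
    """Simulate a doubled input by pooling n_each randomly chosen reads from two samples."""
    return rng.sample(reads_1, n_each) + rng.sample(reads_2, n_each)

# Toy inputs (hypothetical): 1,000 reads per sample instead of millions.
rng = random.Random(1)
sample_1 = ["geneA"] * 600 + ["geneB"] * 400
sample_2 = ["geneA"] * 500 + ["geneB"] * 500
pooled = pool_reads(sample_1, sample_2, 500, rng)

# Made-up counts of uniquely mappable bases per gene.
mappable_length = {"geneA": 2000, "geneB": 1000}

counts = Counter(pooled)
for gene in sorted(counts):
    value = rpkm(counts[gene], mappable_length[gene], len(pooled))
    print(gene, counts[gene], round(value, 1))
```

Drawing the same number of reads from every sample, as the text does with 10 million reads, keeps sequencing depth from confounding comparisons of sensitivity and variation.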
were based on human and mouse RefSeq transcripts. Reads were mapped to RefSeq transcripts directly rather than to the genome, using Bowtie allowing for up to 10 hits per read. Each transcript was divided into 40 equally sized bins, and the number of reads was counted for each bin and gene. The read count per bin for each gene was divided by the total read count for that gene before the bins for all the different genes were summed. The calculated read coverage per bin was then normalized by dividing by the read coverage of the bin with the largest read coverage. The mean and s.d. over replicates are shown in Figure 1 and Supplementary Figure 2, including all transcripts with at least ten mapped reads. Analyses of full-length transcript reconstructions were based on RefSeq annotations, and we defined full-length reconstructed genes as those for which we obtained correct exon-intron structure throughout all annotated exons of at least one isoform. We limited the analyses to expressed (≥0.1 RPKM) and multi-exon (≥2 exons) genes.

Singular value decomposition. The global transcript expression values for cancer cells were analyzed using singular value decomposition (SVD) to determine the fundamental patterns in the transcriptomes. The expression levels in RPKM were normalized to unit length and the SVD computed using SVDMAN (ref. 28). Each cell was then projected onto the two strongest SVD components to visualize the overall similarity in gene expression (Fig. 3a).

Analyses of differential expression. One-way analysis of variance (ANOVA) was performed on expression levels (RPKM, log2) followed by Tukey's post-hoc test in R/Bioconductor. Only genes significant after multiple testing corrections (5% FDR, Benjamini-Hochberg) were evaluated with the post-hoc test (P < 0.05). Lists of significantly differentially expressed genes are available in Supplementary Table 4 for CTC, primary melanocyte and melanoma cell line comparisons, and in Figure 3a for comparisons between prostate and bladder cancer cell line cells.

Selection of marker genes for melanoma and immune cells. To identify the 100 transcripts most strongly associated with melanoma and immune cells, respectively, we initially calculated the difference in mean gene expression between melanoma samples (ref. 29) and a combination of monocytes (ref. 30), T cells (ref. 31), white blood cells and lymph node samples (Fig. 4a). The differences were divided by the highest expression value in any of the samples, to avoid differences driven by outlier expression values in one replicate only. We ranked genes according to this metric and selected the 100 strongest markers for the melanoma and for the immune cell combination. We then evaluated the mean expression values of each gene in the individual putative CTCs. To include the monocyte SAGE data, we converted 1.5 RPM to 1 RPKM, assuming an average transcript length of 1.5 kb (ref. 23).

gene variation)/2) when calculating the t statistic. The null distribution of the t statistic was calculated by repeatedly shuffling the sample labels (cell-to-cell line mapping) and computing the t statistics for each shuffle, thus allowing the conversion of t statistics to P values for the cancer-cell comparison. To estimate false discovery rates, the sample groups were randomly split in half and combined with half from the other sample group, and the number of significant exons was counted using the t statistics introduced above (repeatedly, to vary the random splitting of sample groups). The false discovery rate was then estimated as the number of significant exons in random shuffles divided by the number of significant events with correct sample groups. The numbers of significant exons at different false discovery rates are presented in Figure 3d.

SNP and mutation detection. CTC RNA-Seq Fastq files were mapped to transcriptome (Ensembl, annotations downloaded 16 May 2011) and genome with BWA (ref. 33), allowing for no indels and removing multi-mapping reads. SAMtools rmdup (ref. 22) was used to filter PCR duplicates, and BAM files were reordered by Picard (http://picard.sourceforge.net/). Variant sites were called by the Genome Analysis Toolkit (ref. 34) jointly on reads from all six CTC samples, with a quality score threshold for sites of 500 and requiring detection in two or more samples (see Supplementary Fig. 11 for more detailed information on varying these thresholds). We limited the analyses to transcribed regions using RefSeq gene models, and the last 35 base pairs of transcripts were not considered, to remove false positives arising from mapping of reads with a partial poly(A) tail. Analyses of overlap with known SNPs were based on dbSNP build 132 (ref. 35).

21. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
22. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
23. Ramsköld, D., Wang, E.T., Burge, C.B. & Sandberg, R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598 (2009).
24. Bengtsson, M., Ståhlberg, A., Rorsman, P. & Kubista, M. Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels. Genome Res. 15, 1388–1392 (2005).
25. Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
26. Bullard, J.H., Purdom, E., Hansen, K.D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010).
27. Sam, L.T. et al. A comparison of single molecule and amplification based sequencing of cancer transcriptomes. PLoS ONE 6, e17305 (2011).
28. Wall, M.E., Dyck, P.A. & Brettin, T.S. SVDMAN–singular value decomposition analysis of microarray data. Bioinformatics 17, 566–568 (2001).
29. Berger, M.F. et al. Integrative analysis of the melanoma transcriptome. Genome Res. 20, 413–427 (2010).