
Full-length mRNA-Seq from single-cell levels of RNA

The document presents a novel mRNA-Seq protocol called Smart-Seq, which allows for robust transcriptome analysis at the single-cell level. This method improves read coverage across transcripts, enabling detailed studies of gene expression and identification of biomarkers in rare cells, such as circulating tumor cells from melanoma patients. Smart-Seq demonstrates increased sensitivity and accuracy compared to previous single-cell transcriptome methods, making it a valuable tool for biological research.

Articles

Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells
Daniel Ramsköld1,2,7, Shujun Luo3,7, Yu-Chieh Wang4, Robin Li3, Qiaolin Deng1, Omid R Faridani1,
Gregory A Daniels5, Irina Khrebtukova3, Jeanne F Loring4, Louise C Laurent6, Gary P Schroth3 &
Rickard Sandberg1,2

Genome-wide transcriptome analyses are routinely used to monitor tissue-, disease- and cell type–specific gene expression, but it has been technically challenging to generate expression profiles from single cells. Here we describe a robust mRNA-Seq protocol (Smart-Seq) that is applicable down to single cell levels. Compared with existing methods, Smart-Seq has improved read coverage across transcripts, which enhances detailed analyses of alternative transcript isoforms and identification of single-nucleotide polymorphisms. We determined the sensitivity and quantitative accuracy of Smart-Seq for single-cell transcriptomics by evaluating it on total RNA dilution series. We found that although gene expression estimates from single cells have increased noise, hundreds of differentially expressed genes could be identified using few cells per cell type. Applying Smart-Seq to circulating tumor cells from melanomas, we identified distinct gene expression patterns, including candidate biomarkers for melanoma circulating tumor cells. Our protocol will be useful for addressing fundamental biological problems requiring genome-wide transcriptome profiling in rare cells.

© 2012 Nature America, Inc. All rights reserved.

Analyses of transcriptomes through massively parallel sequencing of cDNAs (mRNA-Seq) generate millions of short sequence fragments that can be analyzed to accurately quantify expression levels1, assemble new transcripts2,3 and investigate alternative RNA processing4,5. These techniques have been consistently pushed toward development of methods that require lower starting amounts of RNA, ideally as small as single cells. A protocol initially developed for single-cell microarray studies6 has been adapted for mRNA-Seq and used to generate transcriptome data for individual mouse oocytes and early embryonic cells7,8. Using the method, thousands of genes expressed in mouse oocytes were detected, and it yielded increased sensitivity compared with microarrays7. However, this first single-cell mRNA-Seq experiment lacked technical controls, making it impossible to distinguish biological variation between different cells from the technical variation that is intrinsic to cDNA amplification protocols when starting with small amounts of RNA. Therefore, the question remained whether single-cell transcriptomes faithfully represent the RNA population before amplification and how technical variation limits the power to find differences in expression. This initial mRNA-Seq method also preferentially amplified the 3′ ends of mRNAs, and hence the data could only be used to identify distal splicing events. Recently, a method for multiplexed single-cell RNA-Seq has been introduced that quantifies transcripts through reads mapping to mRNA 5′ ends9. Neither of these methods generates read coverage across full transcripts. As most mammalian multi-exon genes are subject to alternative RNA processing4,5, there is a need for a single-cell transcriptome method that can be used to both quantify gene expression and provide the coverage for efficient detection of transcript variants and alleles.

Here we introduce a single-cell RNA-sequencing protocol with markedly improved transcriptome coverage, which samples cDNAs from more than just the ends of mRNAs. Using this protocol, we sequenced the mRNAs from many individual mammalian cells, as well as well-defined dilution series of purified total RNAs, to comprehensively assess how sensitivity, variability and detection of differing expression vary with different amounts of starting material. Our results demonstrate the power of single-cell RNA-Seq for both transcriptional and post-transcriptional studies, and provide valuable insights into the design of experiments that start from few or single cells. To demonstrate the biological importance of this method, we applied this assay to putative circulating tumor cells (CTCs) captured from the blood of a melanoma patient to demonstrate how Smart-Seq enables high-quality transcriptome mapping in individual, clinically important cells.

RESULTS

Efficient and robust single-cell RNA sequencing

For Smart-Seq, first we lysed each cell in hypotonic solution and converted poly(A)+ RNA to full-length cDNA using oligo(dT) priming and SMART template switching technology, followed by 12–18 cycles of PCR preamplification of cDNA. We used the amplified cDNA to construct standard Illumina sequencing libraries using either Covaris shearing followed by ligation of adaptors (PE)

1Ludwig Institute for Cancer Research, Stockholm, Sweden. 2Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden. 3Illumina, Inc., Hayward, California, USA. 4Department of Chemical Physiology, Center for Regenerative Medicine, The Scripps Research Institute, San Diego, La Jolla, California, USA. 5Rebecca and John Moores Cancer Center, San Diego, La Jolla, California, USA. 6Department of Reproductive Medicine, University of California, San Diego, La Jolla, California, USA. 7These authors contributed equally to this work. Correspondence should be addressed to R.S. ([email protected]).

Received 14 November 2011; accepted 22 May 2012; published online 22 July 2012; doi:10.1038/nbt.2282

nature biotechnology VOLUME 30 NUMBER 8 AUGUST 2012 777


Figure 1 Smart-Seq read coverage across transcripts. (a) Comparison of read coverage over transcripts for Smart-Seq–analyzed mouse oocytes (n = 3) and previously published mouse oocyte transcriptome data (ref. 7; n = 2). Transcripts were grouped according to annotated lengths and analyzed separately, with the transcript length ranges indicated (top right). We display the read coverage over the transcripts as a distance from the 3′ end, with the vertical dashed gray line showing the length of the shortest included transcripts after which a decline in read coverage is expected. Error bars represent s.d. among biological replicates. (b) Mean read coverage over transcripts for Smart-Seq data generated from diluted amounts of mouse brain RNA. Independent dilution series (including data from different laboratories) are shown as separate data sets, and sample numbers are listed from uppermost line down. For comparison, we included data from standard mRNA-Seq on 100 ng of mouse brain RNA (non-amplified). Error bars, s.d. (n = 5, 3, 4 and 4 for lines top to bottom). (c) Read coverage (as in b) for 12 individual human cells of prostate and bladder cancer line origin, analyzed using Smart-Seq (cancer cells; n = 12) and for prostate cell line LNCaP analyzed with standard mRNA-Seq (non-amplified; n = 4). Error bars, s.d.

Figure 1 panels: (a) read coverage (%) versus distance from 3′ end (kb) for oocytes (Smart-Seq) and oocytes (ref. 7), in transcript length bins of 0–1, 1–2, 2–3, 3–4, 4–5, 5–7, 7–9 and 9–15 kb. (b) Read coverage (%) versus position along the transcript (%), 5′ end to 3′ end, for non-amplified mRNA-Seq and Smart-Seq on 0.5–10 ng, 100 pg and 10 pg. (c) Read coverage (%) versus position along the transcript (%), 5′ end to 3′ end, for non-amplified mRNA-Seq and Smart-Seq cancer cells.
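The read coverage profiles in Figure 1b,c can be illustrated with a minimal Python sketch. It assumes one plausible definition of read coverage (the percentage of bases in each positional bin of a transcript that are overlapped by at least one read), which may differ from the definition used in the paper. The function name, the input format and the number of bins are hypothetical choices made for illustration only.

```python
def coverage_profile(transcript_length, reads, n_bins=10):
    """Percentage of covered bases in each of n_bins equal-width bins.

    Bin 0 is the 5' end of the transcript and the last bin is the 3' end.
    reads: iterable of (start, end) in transcript coordinates, 0-based and
    half-open. This input format is an assumption; a real pipeline would
    derive the intervals from an alignment file.
    """
    covered = [False] * transcript_length
    for start, end in reads:
        # Clip each read to the transcript and mark every base it overlaps.
        for pos in range(max(0, start), min(transcript_length, end)):
            covered[pos] = True

    profile = []
    for b in range(n_bins):
        lo = b * transcript_length // n_bins
        hi = (b + 1) * transcript_length // n_bins
        width = hi - lo
        pct = 100.0 * sum(covered[lo:hi]) / width if width else 0.0
        profile.append(pct)
    return profile


# Hypothetical 1-kb transcript with reads only near the 3' end,
# mimicking the 3'-end bias discussed in the text.
biased = coverage_profile(1000, [(600, 1000), (900, 1000)], n_bins=5)
```

Averaging such profiles over many transcripts, grouped by annotated length, would give curves comparable in spirit to those in the figure.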

or Tn5-mediated ‘tagmentation’ using the Nextera technology (Tn5). Both of these library preparation methods enable random shotgun sequencing of cDNAs (Supplementary Fig. 1). We generated Smart-Seq libraries from 42 individual human or mouse cells, and in addition we generated 64 libraries from dilution series of total RNA derived from human brain (16 samples), mouse brain (28 samples) and universal human reference RNA (UHRR, 20 samples). We sequenced each sequencing library on the Illumina platform, typically generating >20 million uniquely mapping reads (Supplementary Table 1). For comparison, we also made several standard mRNA-Seq libraries from 100 ng to a few micrograms of total RNA.

Smart-Seq improves coverage across transcripts

In previous single-cell mRNA-sequencing studies7,8, the data suffered from a pronounced 3′-end bias that limited analysis across full-length transcripts. We sequenced single-cell transcriptomes from mouse oocytes to enable a direct comparison with published mouse oocyte single-cell data7. Analyses of read coverage across transcripts demonstrated that Smart-Seq has considerably improved full-length coverage of all transcripts longer than 1 kb (Fig. 1a and Supplementary Fig. 2a–h). Smart-Seq analyses of mouse brain RNA at different dilutions showed that even better coverage was obtained with increased starting amounts, with nanogram dilutions reaching close to the coverage observed using standard mRNA-Seq from 100 ng to 1 µg of total RNA (Fig. 1b). From only 10 pg input amounts (the estimated amount of RNA in a small eukaryotic cell, Supplementary Table 2), we achieved close to 40% coverage at the 5′ end. Analyses of single-cell transcriptomes from cancer cell lines (four cells each from LNCaP, PC3 and T24) obtained equally good read coverage (Fig. 1c) and, indeed, for 25% of all expressed, multi-exon genes our read coverage enabled full-length transcript reconstruction (Supplementary Fig. 3). We conclude that Smart-Seq has substantially improved read coverage compared with previous single-cell transcriptome methods.

Quantitative assessment of single-cell transcriptomics

Analyses of gene expression from millions of cells using mRNA-Seq are highly reproducible and have low technical variation1,4. To our knowledge, no single-cell mRNA-Seq study has measured the technical variation intrinsic to the cDNA pre-amplification components of single-cell methods. We therefore diluted microgram amounts of reference total RNA down to nano- and picogram levels and applied Smart-Seq to assess sensitivity, technical variability and detection of differentially expressed transcripts of Smart-Seq on low amounts of total RNA. For comparison, we generated standard mRNA-Seq libraries from 100 ng to microgram amounts of reference total RNA.

First, we addressed the sensitivity of the method in detecting transcripts expressed at different levels. Starting with 10 ng or 1 ng of total RNA, we found no or minimal decline in sensitivity compared with standard mRNA-Seq. However, lowering the starting amounts to single-cell levels decreased the detection rate of less abundant transcripts (Fig. 2a). Analyses of the 12 cancer cell line cells (four cells each from the LNCaP, PC3 and T24 lines) showed that ~76% of transcripts expressed at 10 RPKM (reads per kilobase exon model and million mappable reads), which roughly equals the median expression for detected transcripts, were reproducibly detected in all single-cell profiles (Fig. 2b). We found that the sensitivity of gene detection for the individual cancer cells was similar to that obtained with ~20 pg of starting total RNA (Fig. 2b), with ~8,000 genes detected per cell and increasing with the number of analyzed cells (Supplementary Fig. 4a). Furthermore, we observed that the starting amount of total RNA had a larger impact on sensitivity than the number of PCR cycles used (Supplementary Fig. 5) and that the sequence depth had little effect on transcript detection at levels above a million uniquely mapping reads per cell, with expression levels stabilizing after 3 million uniquely mapped reads (Supplementary Fig. 4c,d). Comparisons of Smart-Seq and previous mouse oocyte data7 demonstrated similar sensitivity (Supplementary Fig. 2i,j). We conclude that transcript


Figure 2 Sensitivity and variability in Smart-Seq from few or single cells. (a) Percentage of genes reproducibly detected in replicate pairs, binned according to expression level. We performed all pair-wise comparisons within groups of replicates and report the mean and 90% confidence interval. We used Smart-Seq data generated from diluted amounts of human UHRR total RNA as indicated. As controls, we added both a comparison of technical replicates of human UHRR analyzed using standard mRNA-Seq protocols with 100 ng input RNA (non-amplified) as well as a comparison of human UHRR and brain RNA from standard mRNA-Seq data. (b) Percentage of genes reproducibly detected within replicate pairs, binned according to expression level (as in a) for human LNCaP, PC3 and T24 cells. We show pair-wise comparisons among single cells from the same cancer cell line (blue), among multiple cells of the same cell line (purple and blue), and comparisons among single cells from different cancer cell lines (yellow). (c) Standard deviation in gene expression estimates within replicates in bins of genes sorted according to expression levels. Error bars, s.e.m. (n ≥ 10). (d) Standard deviation in gene expression estimates in replicates (as in c). (e–g) Scatter plots showing the relative differences between human UHRR and brain gene expression levels estimated from standard mRNA-Seq data on 100 ng input RNA (x axis) and Smart-Seq generated data (y axis) starting from 1 ng total RNA (e), 100 pg total RNA (f) and 10 pg total RNA (g). Correlation coefficients computed from log2 transformed relative gene expression profiles, together with nonlinear loess regression curves (green) and y = x lines (red).

Figure 2 panels: (a, b) gene detection (%) versus gene expression (RPKM), log10. (c, d) Standard deviation, log10, versus gene expression (RPKM), log10. Series in a and c: Smart-Seq on 10 ng, 1 ng, 100 pg, 20 pg and 10 pg; non-amplified mRNA-Seq; brain versus UHRR. Series in b and d: within cell lines (80 cells, 10 cells, single cells); between cell lines (single cells); dilution series. (e–g) Smart-Seq (RPKM, log2) versus mRNA-Seq (RPKM, log2): (e) 1 ng, Spearman r = 0.87, Pearson r = 0.88; (f) 100 pg, Spearman r = 0.77, Pearson r = 0.79; (g) 10 pg, Spearman r = 0.57, Pearson r = 0.58.
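The quantities used in Figure 2 can be sketched in a few lines of Python. The RPKM formula follows the definition given in the text (reads per kilobase of exon model and million mappable reads). The correlation of log2-transformed relative expression follows the description in the caption for panels e–g. The pseudocount, the function names and the example numbers are assumptions made for illustration, not values or code from the paper.

```python
import math


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def log2_relative(a_rpkm, b_rpkm, pseudocount=1.0):
    """log2 ratio of two expression profiles (e.g. UHRR over brain).

    The pseudocount avoids division by zero; its value is an assumption.
    """
    return [math.log2((a + pseudocount) / (b + pseudocount))
            for a, b in zip(a_rpkm, b_rpkm)]


def pearson(x, y):
    """Pearson correlation coefficient; assumes non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical gene: 100 reads on a 2-kb exon model, 10 million mapped reads.
example_rpkm = rpkm(100, 2000, 10_000_000)
```

In this sketch, the relative expression profile from standard mRNA-Seq and the one from Smart-Seq would each be computed with `log2_relative` and then compared with `pearson` and `spearman`.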

detection sensitivity is affected by limiting starting amounts of RNA that lead to random loss of low-abundance transcripts, but the majority of low-abundance and the vast majority of highly expressed transcripts are still reliably detected even in single cells.

Second, we determined the reproducibility in expression levels generated from diluted RNA and individual cells. Comparison of Smart-Seq and previous mouse oocyte data7 demonstrated improved estimation of expression with Smart-Seq (lower variability in data from oocyte to oocyte) across the whole range of expression levels (Supplementary Fig. 2k). Correlation analyses between technical replicates of diluted RNA showed increasing concordance with larger amounts of RNA. Comparing data from the single cells against the RNA-dilution data, we observed higher correlations (Pearson correlations of 0.75–0.85) among individual cells of the same type than among dilution replicates at 10 pg (Pearson correlations of 0.65–0.75) (Supplementary Fig. 6). As variability in measurements of expression depends on transcript expression levels, we computed the variability as a function of the expression level (Fig. 2c,d). This analysis showed that Smart-Seq on 10 ng total RNA had the same technical variability as standard mRNA-Seq and that Smart-Seq on 1 ng total RNA showed only a modest increase in technical noise (Fig. 2c). When lowering the input amounts down to picogram levels, there was a clear increase in technical variability, particularly for less abundantly expressed transcripts (Fig. 2c). We compared technical variability at picogram levels of total RNA to the biological variation found in comparisons of human brain samples and UHRR using standard mRNA-Seq (Fig. 2c). Notably, analyses of gene-expression variation between individual cancer cells of different origin revealed extensive biological variation in highly expressed genes (Fig. 2d).

Finally, we assessed whether single-cell expression profiles from preamplified material were representative of the original expression profiles. Comparing relative gene expression levels (UHRR minus brain) estimated using standard mRNA-Seq to those estimated from Smart-Seq with different amounts of input RNA, we again found a high concordance (Fig. 2e–g). Starting with 1 ng or 100 pg total RNA, the relative expression in Smart-Seq and standard mRNA-Seq had Spearman correlations of 0.87 and 0.77, respectively (Fig. 2e,f). Comparisons with 10 pg input RNA showed overall good correlation (Fig. 2g) but identified two populations of transcripts with distorted expression in Smart-Seq data from either human brain sample or UHRR, reflecting


Figure 3 Transcriptional and post-transcriptional analyses of cancer cell line cells using Smart-Seq. (a) Categorization of individual cells according to cell line of origin using single-cell Smart-Seq transcriptomes. Singular-value decomposition (SVD) analysis was conducted for 12 individual cancer cells (four cells each from the PC3, LNCaP and T24 cancer cell lines) based on global gene expression profiles. Projections are shown based on the first two dimensions that capture most of the variance. The numbers of significantly differentially expressed genes per pair-wise cell line comparison are shown next to the arrows (P < 0.05, 1-way ANOVA and Tukey post-hoc test). (b) Mean number of exons with sufficient read coverage for MISO analyses of exon inclusion levels in sequence-depth matched single-cell mRNA-Seq data. Smart-Seq data from diluted mouse brain RNA (green) compared with previously published mouse ESCs8 and ESC-derived cells (red) and 12 Smart-Seq–analyzed individual prostate and bladder cell line cells (purple). Individual RNA or cell measurements are plotted. (c) Single-cell Smart-Seq reads mapping to a portion of the NEDD4L gene locus from four individual T24 and LNCaP cells. Read coverage is shown as a heatmap with darker blue indicating higher read coverage. (d) Number of differentially included exons identified among the PC3, LNCaP and T24 cell lines from single-cell Smart-Seq analysis on four cells per cell line as a function of estimated false discovery rate.

Figure 3 panels: (a) SVD component 2 versus SVD component 1 for PC3 (n = 4), LNCaP (n = 4) and T24 (n = 4) cells, with pair-wise differentially expressed gene counts of 660, 433 and 731. (b) Number of detected exons (×1,000) for total RNA (1 ng, 100 pg, 50 pg, 10 pg) and single cells (mouse brain, mouse ES/other cells, human cancer cells). (c) NEDD4L locus, 55,996–56,008 kb, with transcripts NM_015277 and NM_001144967 and four cells each of T24 and LNCaP. (d) Number of exons versus false discovery rate (%) for LNCaP versus PC3, T24 versus PC3 and LNCaP versus T24.
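The notion of an exon inclusion level with a read coverage requirement, as in Figure 3b,c, can be illustrated with a deliberately simplified Python sketch. This is not MISO, which is a Bayesian mixture model. It is a naive read-count ratio that ignores the different effective lengths of inclusion and exclusion isoforms. The function name, the minimum read threshold and the read counts in the example are hypothetical.

```python
def naive_inclusion_level(inclusion_reads, exclusion_reads, min_reads=10):
    """Naive exon inclusion estimate from read counts.

    inclusion_reads: reads supporting the isoform that contains the exon.
    exclusion_reads: reads supporting the isoform that skips the exon.
    Returns None when fewer than min_reads informative reads are available,
    mirroring the idea that low expression precludes an estimate.
    The threshold of 10 is an assumed value, not one taken from the paper.
    """
    total = inclusion_reads + exclusion_reads
    if total < min_reads:
        return None
    return inclusion_reads / total


# Hypothetical counts chosen to mimic a mostly included exon,
# a mostly skipped exon, and an exon with too few reads to assess.
mostly_included = naive_inclusion_level(93, 7)
mostly_skipped = naive_inclusion_level(15, 85)
not_assessable = naive_inclusion_level(3, 2)
```

A per-cell estimate of this kind could then be averaged over the cells of one line, although a model-based tool remains preferable for real analyses because it accounts for uncertainty at low read counts.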

stochastic losses, mostly of low-abundance transcripts when starting with such minute RNA amounts (Fig. 2a,g). Preamplification of cDNA could also lead to disproportionate amplification of short transcripts, but we found no systematic bias (Supplementary Fig. 7). A previous microarray study had analyzed PCR-amplified cDNA (from picogram starting amounts) and found the transcriptome overall preserved but skewed10. Our data from 1 ng and 100 pg total RNA showed no skewing, that is, the loess slopes estimated from the data approximated 1 (Fig. 2e–g). Together, these results demonstrated that transcriptome analyses from few or single cells, in general, preserved relative differences in expression for detected transcripts.

Transcriptional and post-transcriptional differences

Having demonstrated the improved performance of Smart-Seq on low amounts of RNA compared with previously published methods, we focused our analyses on single-cell transcriptomes from prostate (PC3 and LNCaP) and bladder (T24) cancer cell line cells. The global gene expression of 12 individual cells (four from each cell line) clustered according to cell line of origin and we identified hundreds of differentially expressed genes among the three cell lines (Fig. 3a; q < 0.05 ANOVA; P < 0.05 post-hoc test).

The pronounced 3′-end bias of previous single-cell mRNA-Seq studies has hampered the ability to identify alternative splicing differences in single cells. We used the Bayesian mixture of isoforms framework (MISO)11 to infer exon inclusion levels for known alternatively spliced exons in the 12 individual cells. The improved read coverage with Smart-Seq resulted in a twofold increase in the number of potential alternatively spliced exons that could be assessed, compared to previously published single-cell mRNA-Seq data (Fig. 3b), substantially improving our ability to detect alternative splicing. Cell type–specific alternative splicing could be inferred from single-cell transcriptomes, as seen in read coverage across the differentially included exon 13 of the NEDD4L gene (Fig. 3c). This exon was frequently included in LNCaP cells (93% mean inclusion level) but was included at much lower levels in T24 cells (15% mean inclusion level), whereas low expression of NEDD4L in PC3 cells precluded inclusion level estimation. In this comparison of three cancer cell lines, we found 100 exons with differential exon inclusion levels among the three cell lines, with a less than 1% false discovery rate (Fig. 3d and Supplementary Table 3). We conclude that Smart-Seq considerably improves our ability to detect alternative RNA processing in single cells.

Analyses of circulating tumor cell transcriptomes

Having demonstrated that Smart-Seq generates quantitative and reproducible single-cell transcriptomes, we asked whether global transcriptome analyses of putative CTCs could reveal their tumor of origin and provide data to support the use of this method for unbiased cancer-specific biomarker identification. To this end, we generated transcriptomes from six single NG2+ putative melanoma CTCs isolated from peripheral blood drawn from a patient with recurrent melanoma using immunomagnetic purification with a MagSweeper instrument (Illumina)12. For comparison, we also generated Smart-Seq libraries from single cells derived from primary melanocytes (n = 2), melanoma cancer cell line (SKMEL5, n = 4 and UACC257, n = 3) cells and from human embryonic stem cells (ESCs, n = 8). As the NG2+ putative CTCs were isolated from blood, it was important to compare them to blood cells. The putative CTCs were distinct from lymphoma cell lines (BL41 and BJAB)13 and immune tissues (lymph node and white blood cell samples), as well as embryonic stem cells, and instead were highly similar to primary melanocytes and melanoma cell line cells. Unsupervised hierarchical clustering and correlation analyses of gene expression levels showed a clear clustering of cells according to cell type of origin (Fig. 4a and Supplementary Fig. 8), and separation from the human brain RNA samples that were previously analyzed with Smart-Seq or mRNA-Seq (data not shown). Additional support for the melanocytic origin of the putative melanoma CTCs came from analyses of melanocyte lineage–specific markers, as all NG2+ cells expressed high mRNA levels for MLANA14, TYR15 and the melanocyte-specific m-form of MITF16 but not immune markers such as PTPRC (Fig. 4b), in contrast to peripheral blood lymphocytes (Supplementary Fig. 9). Furthermore, NG2+ cells expressed high levels of melanoma-associated genes (based on our unbiased selection of the 100 transcripts most strongly associated


Figure 4 Single-cell transcriptomes of circulating tumor cells. (a) Hierarchical clustering of human samples based on gene expression of highly expressed genes (>100 RPKM). Coloring indicates high-order clusters and the confidence in clusters is indicated with bootstrap values (percentage). Samples analyzed include human immune samples (Burkitt’s lymphoma cell lines BL41 and BJAB, and white blood cells and lymph node samples) and cells from putative melanoma CTCs (CTC), primary melanocytes (PM), melanoma cell lines SKMEL5 (SKMEL) and UACC257 (UACC), prostate cancer cell lines (LNCaP and PC3), bladder cancer cell line (T24) and human embryonic stem cells (ESC). (b) Expression of melanocyte markers (PMEL, MITF, TYR and MLANA) and immune marker PTPRC in single-cell transcriptomes from the samples in a, with Burkitt’s lymphoma cell lines BL41 and BJAB (BL). (c) Gene expression levels in CTCs for an unbiased set of 100 immune and melanoma markers. (d–f) Heatmaps showing relative expression of melanoma associated tumor antigens (d), upregulated plasma-membrane proteins (e), and downregulated plasma-membrane proteins (f) in single-cell transcriptomes as in b with the addition of more immune samples (W, white blood cells; L, lymph node). (g) Number of reads from individual PMs and putative CTCs that support the reference (G) or risk (A) allele for the melanoma-associated SNP (rs1126809).

Figure 4 panels: (a) dendrogram on a correlation scale (0.2–1.0) with bootstrap values, for immune samples, CTC, PM, SKMEL, UACC, ESC, T24, LNCaP and PC3. (b) Expression (RPKM, log2 and log10) of PMEL, MITF, TYR, MLANA and PTPRC across SKMEL, UACC, PM, CTC, ESC and BL. (c) CTC gene expression (RPKM) for immune and melanoma marker sets. (d–f) Relative expression (log2) heatmaps across SKMEL, UACC, PM, CTC, ESC and immune samples (BL, BJ, W, L). Genes labeled in d–f: MAGEB2, MAGEC2, MAGEA10/A5, MAGEA2/2B, MAGEA12, MAGEA6, MAGEA3, PRAME, GPR126, CRIM1, ABCG5, SCL20A1, ADAM17, RPS3, PSMB1, TTYH3, HDAC6, HRAS, APP, DSG2, NF2, ANXA6, CD81, VAMP8, GPR143, SEMA5A, ABCB5, TRPM1, TGFB1l1, PLIN2, SLC16A6, MGST2, SLC7A8, GJB1, CDH1, DPP4, RAB31, CYSLTR2, GNAL, IFITM2, HLA-G, HLA-H, HLA-C and HLA-B. (g) TYR locus on chr11 (89017950–89017970) at SNP rs1126809 (dbSNP 132), with read counts per cell. A allele: PM 1, 0; CTC 152, 117, 47, 198, 56, 0. G allele: PM 1,133, 78; CTC 0, 0, 0, 1, 1, 0.
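The allele-specific read counts in Figure 4g lend themselves to a short worked example. The Python sketch below computes, for each cell, the fraction of reads supporting the risk (A) allele of rs1126809. The counts are those shown in the figure. The assignment of the first two columns to primary melanocytes and the remaining six to putative CTCs is an assumption based on the sample numbers given in the text (n = 2 and six cells, respectively). The cell labels and the function name are hypothetical.

```python
def allele_fraction(alt_reads, ref_reads):
    """Fraction of reads supporting the alternative allele.

    Returns None when no reads cover the site, since no call is possible.
    """
    total = alt_reads + ref_reads
    return None if total == 0 else alt_reads / total


# Read counts from Figure 4g; the column-to-cell assignment is assumed.
cells = ["PM1", "PM2", "CTC1", "CTC2", "CTC3", "CTC4", "CTC5", "CTC6"]
risk_a = [1, 0, 152, 117, 47, 198, 56, 0]
reference_g = [1133, 78, 0, 0, 0, 1, 1, 0]

fractions = {
    cell: allele_fraction(a, g)
    for cell, a, g in zip(cells, risk_a, reference_g)
}

for cell, fraction in fractions.items():
    label = "no coverage" if fraction is None else f"{fraction:.3f}"
    print(cell, label)
```

Under this reading of the figure, the primary melanocytes are dominated by reads for the reference allele, whereas the covered putative CTCs are dominated by reads for the risk allele, and one cell has no reads at the site.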

with melanoma; see Online Methods), but not immune cell–associated genes selected in a similar manner (Fig. 4c, P < 3.7 × 10^-15, Wilcoxon rank sum test). Thus, both their global transcriptomes and expression patterns of melanoma-associated transcripts clearly support a melanoma CTC identity for the NG2+ cells.

We next investigated whether the NG2+ putative CTCs showed signs of originating from a melanoma tumor. Comparison of their gene expression profiles with those of individual primary melanocytes identified 289 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) higher expression in the putative CTCs than the primary melanocytes, and 436 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) lower levels (Supplementary Table 4). The upregulated genes were significantly (Benjamini-Hochberg adjusted P < 0.05) enriched for melanoma-associated antigens (Fig. 4d and Supplementary Fig. 10) that have been repeatedly found to be upregulated in cancer17, mitotic cell cycle genes and additional categories (Supplementary Table 5). Downregulated genes were enriched for regulators of cell death and MHC class I genes. Notably, the preferentially expressed antigen in melanoma (PRAME) was highly expressed in NG2+ cells, which, together with elevated expression of known melanoma tumor antigens, provides strong support for the conclusion that the NG2+ cells were CTCs that originated from a melanoma.

In recent years, there has been a strong interest in identifying CTCs from different tumors using the a priori assumption that plasma-membrane proteins would be good diagnostic biomarkers. We used the CTC transcriptome analysis to screen for membrane proteins selectively expressed in melanoma-derived CTCs compared to primary melanocytes and immune cells. We identified nine upregulated plasma membrane–associated transcripts in the CTCs compared to primary melanocytes (q < 0.05 ANOVA; P < 0.05 post-hoc test), many of which are not expressed in immune cells and have not been previously associated with melanomas (Fig. 4e). Similarly, screening for loss of expression of plasma-membrane proteins identified 37 genes with significantly (q < 0.05 ANOVA; P < 0.05 post-hoc test) lower expression


in the CTCs than primary melanocytes (Fig. 4f). Of note, epithelial Cadherin 1 (CDH1) showed no
expression in the CTCs, and loss of CDH1 is thought to contribute to cancer progression by
increasing proliferation, invasion and metastasis (ref. 18). We also found downregulation of genes
associated with the escape from immune surveillance, including five HLA genes (Fig. 4f), and
TRPM1, suggesting that these gene expression changes might enable the CTCs to escape from immune
surveillance. Notably, low expression of TRPM1 has been shown to correlate with melanoma
aggressiveness and metastasis (ref. 19). Future studies of these membrane proteins will likely
enhance our understanding of CTC migration and invasiveness, and these results highlight the
utility of studying single CTCs with RNA-Seq.

Lastly, we investigated whether Smart-Seq transcriptome data could be mined for single-nucleotide
polymorphisms (SNPs) and other genetic variants associated with melanomas or other cancers. With
the improved read coverage provided by the Smart-Seq method, we identified 4,312 high-confidence
genomic sites with support for an alternative allele in at least two CTCs, whereas genotype calls
only supported by a single cell showed an excess of previously unidentified, likely artifactual,
sites (Supplementary Fig. 11) together with a smaller subset (9%) of A-to-G RNA editing sites
(data not shown). Ninety-two percent of the high-confidence sites coincided with documented SNPs,
for example, the melanoma-associated SNP in the TYR gene (rs1126809; ref. 20) (Fig. 4g). We
conclude that Smart-Seq enables screening for SNPs and mutations in transcribed regions using
only a few cells.

DISCUSSION
Generating high-coverage transcriptomes from single cells and small numbers of cells will have
many applications for studying rare cells; such cells can be either individually picked or
identified through cell sorting or laser-capture techniques. Our results showed that using
Smart-Seq on 10 ng of total RNA was practically indistinguishable from standard mRNA-Seq, whereas
starting with 1 ng (corresponding roughly to 50–100 cells) showed only a minor (less than twofold)
increase in expression-level variability. Therefore, this method could be applied to studies on
homogeneous cell populations available in quantities of tens to hundreds of cells.

However, many biologically and clinically important cell types exist in rare quantities and often
in heterogeneous milieus, which necessitates single-cell approaches. Smart-Seq generates robust
and quantitative transcriptome data from single cells. We found hundreds of differentially
expressed genes using only a few individual cells per cell type; for example, comparing only two
primary melanocytes to six melanoma CTCs identified biologically meaningful differences. Even
sequencing of a single cell yielded useful information, as in each cell we detected most of the
genes active in a culture of LNCaP cells.

Smart-Seq is a robust method for single-cell RNA-Seq with improved read coverage across
transcripts, which enables more detailed analyses of alternative splicing. Based on our CTC
transcriptome results, single-cell analyses using Smart-Seq are also highly informative for
identifying candidate biomarkers, SNPs and mutations. In conclusion, data sets obtained with the
Smart-Seq protocol provide improved representation of the transcriptomes of individual cells,
which should be useful for both basic and clinical studies.

Methods
Methods and any associated references are available in the online version of the paper.

Accession code. Gene Expression Omnibus: GSE38495 (sequencing read data).

Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank C. Burge and G. Winberg for critical reading of the manuscript, T. Juarez and J. Cotton
at the University of California San Diego for their help in Internal Review Board protocol
preparation and acquisition of clinical samples, A.A. Talasaz and G. Cann for assistance with the
Magsweeper, and members of the Science for Life Laboratory (Stockholm) for assistance with the
MiSeq sequencer. Y.-C.W. was supported by a fellowship from the Marie Mayer Foundation. L.C.L. was
supported by US National Institutes of Health (NIH) K12HD001259. J.F.L. was supported by NIH
R33MH87925 and California Institute for Regenerative Medicine (CL1-00502, RT1-01108, TR1-01250,
and RN2-00931). R.S. was supported by European Research Council (starting grant 243066), Swedish
Research Council (2008-4562), Foundation for Strategic Research (FFL4) and Åke Wiberg Foundation
(756194131).

AUTHOR CONTRIBUTIONS
D.R. designed and performed the computational analyses of sequencing reads, prepared figures,
tables and methods, and contributed manuscript text. S.L. and R.L. developed protocols and created
libraries. I.K. and S.L. did primary data analysis. Y.-C.W., G.A.D. and J.F.L. prepared melanoma
circulating tumor cells, melanocytes and melanoma cell line cells. O.R.F. and Q.D. contributed
additional sequencing libraries. L.C.L. and G.P.S. contributed to study design and manuscript
text. R.S. designed the study and prepared the manuscript, with input from other authors.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of
the paper.

Published online at http://www.nature.com/doifinder/10.1038/nbt.2282.
Reprints and permissions information is available online at
http://www.nature.com/reprints/index.html.

1. Mortazavi, A., Williams, B., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying
   mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
2. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse
   reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
3. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated
   transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515
   (2010).
4. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456,
   470–476 (2008).
5. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative
   splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40,
   1413–1415 (2008).
6. Kurimoto, K. et al. An improved single-cell cDNA amplification method for efficient
   high-density oligonucleotide microarray analysis. Nucleic Acids Res. 34, e42 (2006).
7. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6,
   377–382 (2009).
8. Tang, F. et al. Tracing the derivation of embryonic stem cells from the inner cell mass by
   single-cell RNA-Seq analysis. Cell Stem Cell 6, 468–478 (2010).
9. Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly
   multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
10. Iscove, N.N. et al. Representation is faithfully preserved in global cDNA amplified
    exponentially from sub-picogram quantities of mRNA. Nat. Biotechnol. 20, 940–943 (2002).
11. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Analysis and design of RNA sequencing
    experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
12. Talasaz, A.H. et al. Isolating highly enriched populations of circulating epithelial cells
    and other rare cells from blood using a magnetic sweeper device. Proc. Natl. Acad. Sci. USA
    106, 3970–3975 (2009).
13. Shukla, S. et al. CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing.
    Nature 479, 74–79 (2011).
14. Jungbluth, A.A. et al. Expression of melanocyte-associated markers gp-100 and
    Melan-A/MART-1 in angiomyolipomas. An immunohistochemical and rt-PCR analysis. Virchows Arch.
    434, 429–435 (1999).
15. Tomita, Y., Montague, P.M. & Hearing, V.J. Anti-T4-tyrosinase monoclonal antibodies–specific
    markers for pigmented melanocytes. J. Invest. Dermatol. 85, 426–430 (1985).
16. Fang, D. & Setaluri, V. Role of microphthalmia transcription factor in regulation of
    melanocyte differentiation marker TRP-1. Biochem. Biophys. Res. Commun. 256, 657–663 (1999).
17. Chomez, P. et al. An overview of the MAGE gene family with the identification of all human
    members of the family. Cancer Res. 61, 5544–5551 (2001).
18. Tang, A. et al. E-cadherin is the major mediator of human melanocyte adhesion to
    keratinocytes in vitro. J. Cell Sci. 107, 983–992 (1994).
19. Duncan, L.M. et al. Down-regulation of the novel gene melastatin correlates with potential
    for melanoma metastasis. Cancer Res. 58, 1515–1520 (1998).
20. Gudbjartsson, D.F. et al. ASIP and TYR pigmentation variants associate with cutaneous
    melanoma and basal cell carcinoma. Nat. Genet. 40, 886–891 (2008).

© 2012 Nature America, Inc. All rights reserved.



ONLINE METHODS
Generation and amplification of Smart-Seq cDNA. The Smart-Seq cDNA generation and amplification
methods developed for this manuscript have recently become available in a kit marketed by Clontech
called the SMARTer Ultra Low RNA Kit for Illumina sequencing. Although all the libraries in this
manuscript were generated before the kit became commercially available, our protocol is reflected
in the detailed instructions for generating cDNA from cell(s) or 100 pg–10 ng of total RNA that is
now included in the manual for this kit. For single cell applications, each cell (or control RNA)
was added in a maximum of 1 µl of media to 4 µl of hypotonic lysis buffer consisting of 0.2% Triton
X-100 and 2 U/µl of ribonuclease (RNase) inhibitors (Clontech, 2313B) in RNase free water. The
deposition of an intact cell in the hypotonic lysis buffer leads to immediate lysis and
stabilization of the RNA through RNase inhibitors. Then, poly(A)+ RNA was reverse-transcribed
through tailed oligo(dT) priming using the CDS primer (5′–AAGCAGTGGTATCAACGCAGAGTACT(30)VN–3′,
where V represents A, C or G) directly in total RNA or a whole cell lysate using Moloney murine
leukemia virus reverse transcriptase (MMLV RT). The first-strand cDNA generation was carried out
with the addition of 5× First Strand Buffer (250 mM Tris-HCl pH 8.3, 375 mM KCl and 30 mM MgCl2),
dithiothreitol (100 mM), dNTP mix (10 mM), RNase inhibitor, oligos (CDS primer and SMARTer II A
oligo) and SmartScribe Reverse Transcriptase in a total volume of 10 µl (see Clontech manual for
details). Once the reverse transcription reaction reaches the 5′ end of an RNA molecule, the
terminal transferase activity of MMLV adds a few nontemplated C nucleotides to the 3′ end of the
cDNA. The carefully designed SMARTer II A oligo (5′-AAGCAGTGGTATCAACGCAGAGTACATrGrGrG-3′, where r
indicates ribonucleotide bases) then base-pairs with these additional C nucleotides, creating an
extended template. The reverse transcriptase then switches templates and continues transcribing to
the end of the oligonucleotide. The resulting full-length cDNA contains the complete 5′ end of the
mRNA as well as an anchor sequence that serves as a universal priming site for second-strand
synthesis. The cDNA was then amplified using 12 cycles for 1 ng of total RNA, 15 cycles for 100 pg
of total RNA, and 18 cycles for 10 pg total RNA or from single cells. The exact number of cycles
used for each dilution replicate or single cell is detailed in Supplementary Table 1. The PCR was
performed in 50 µl reaction volumes with Advantage 2 PCR Buffer (Clontech), dNTP mix, PCR primer
(5′-AAGCAGTGGTATCAACGCAGAGT-3′), Advantage 2 Polymerase Mix (Clontech) and Nuclease-Free water,
resulting in a few nanograms of amplified cDNA. The length distribution of amplified cDNA was
monitored using High Sensitivity kits on a Bioanalyzer (Agilent), expecting a distinct peak around
500–5,000 bp (although lengths of mRNAs differ between cell types).

Construction and sequencing of Smart-Seq sequencing libraries. Amplified cDNA (~5 ng cDNA) was
used to construct Illumina sequencing libraries using either Illumina’s Ultra Low Input mRNA-Seq
Guide (the ‘PE’ protocol) or a modification of Epicentre’s Nextera DNA sample preparation protocol
(the ‘Tn5’ protocol). With the PE protocol, the amplified cDNA was fragmented using a Covaris
acoustic shearing instrument. The resulting fragments were end-repaired, followed by the addition
of a single A base, ligation to Illumina PE adaptors, and then amplification in 12–18 cycles of
PCR (depending on starting amounts of RNA, see Supplementary Table 1 for detailed instructions of
all libraries generated). With the Tn5 protocol, the amplified cDNA was ‘tagmentated’ at 55 °C for
5 min in a 20-µl reaction with 0.25 µl of transposase and 4 µl of 5× HMW Nextera reaction buffer.
We added 35 µl of PB to the tagmentation reaction mix to strip the transposase off the DNA, and the
tagmentated DNA was purified with 88 µl of SPRI XP beads (sample to beads ratio of 1:1.6). Purified
DNA was then amplified by nine cycles of standard Nextera PCR. Library quality was confirmed using
DNA 1000 kits on a Bioanalyzer (Agilent), and the libraries were then sequenced on either
Illumina’s HiSeq 2000, GAIIx or MiSeq instruments, and all clusters that passed filter were
exported into fastq files. Details on the sequence depth, sequencing platform and library
construction method for each dilution replicate and single cell are included in Supplementary
Table 1. All data shown in the figures of this manuscript were generated using the PE protocol
unless otherwise specified in the figure legend.

Construction and sequencing of standard mRNA-Seq libraries. We generated mRNA-Seq transcriptome
data following the Illumina mRNA-Seq kit from 100 ng and 1 µg of total RNA, as detailed in
Supplementary Table 1.

Isolation of individual CTCs from peripheral blood. Ten milliliters of peripheral blood was
collected from a male patient with recurrent, metastatic melanoma using K2 EDTA blood collection
tubes (Becton Dickinson). Melanoma CTCs were collected under UCSD IRB #101330, ‘Detection and
Molecular Characterization of Circulating Melanoma Cells’. The blood sample was processed within
3 h of collection. The erythrocytes in 4.5 ml of the blood sample were lysed with BD Pharm Lyse
lysing solution (Becton Dickinson) for 10 min at room temperature. The nucleated cells were
pelleted, resuspended in HBSS containing 1% BSA and 5 mM EDTA, pelleted, resuspended in 1 ml of
HBSS containing 1% BSA and 5 mM EDTA. The nucleated cells were stained with
Phycoerythrin-conjugated anti-human CD45 IgG to label leukocytes. The cells were subsequently
reacted with biotinylated anti-human CSPG4 (also known as NG2) mouse IgG at 4 °C for 2 h, washed
with HBSS, and reacted with streptavidin-conjugated MG980A magnetic beads at 4 °C for 2 h. The
cells were captured based on magnetic sweeping to harvest the beads from cell suspension using the
MagSweeper instrument (Illumina) as previously described (ref. 12). The collected cells were
stained with 5 µg/ml Calcein AM (Life Technologies) in HBSS for 20 min to identify viable cells.
Manual picking of viable cells showing desired Calcein-positive/CD45-negative/bead-attached
profile was performed to isolate cells for molecular profiling. The individual cells were placed
into 2.5 µl of Superblock (Thermo Scientific) containing 4,000 unit/ml RNase inhibitor (New
England Biolabs) and stored at −80 °C until preparation of Smart-Seq libraries.

Isolation of mouse oocytes and human lymphocytes. MII oocytes were isolated from 4-week old
CAST/EiJ female mice. Mice were superovulated by injection of 5 IU PMSG, followed by injection of
5 IU of hCG 48 h later. MII oocytes were isolated 14–15 h after hCG treatment by dissection of the
ampulla of the oviduct and cumulus cells were removed by hyaluronidase digestion. Single oocytes
were manually picked, lysed in dilution buffer, and cDNA constructed as described above.
Peripheral blood lymphocytes from healthy human volunteers were isolated on Ficoll gradients using
LymphoPrep (Fresenius Kabi, Norway). Individual cells were manually picked into lysis buffer and
cDNA constructed as described above.

Alignment of short reads to genome and transcriptome. Reads were independently aligned using
Bowtie (ref. 21) against the respective genome assembly (hg19 or mm9) and transcriptome sequences
(Ensembl, human and mouse annotations were downloaded 16 May 2011 and 13 December 2010,
respectively). Transcriptome mapped reads were converted from transcriptome coordinates to genomic
coordinates and thereafter compared with the genome mapped reads to identify reads that map to a
unique genomic location. This procedure ensured that mapped reads were unique across both the
genome and transcriptome, while allowing for reads to map to different transcripts of the same
gene in the initial transcriptome mapping. The uniquely mapped reads were converted to binary BAM
files using Samtools (ref. 22). The resulting transcriptome data were visualized using the
Integrative Genomics Viewer (IGV, Broad Institute) using the histogram visualization for
Supplementary Figure 3 and heatmap visualization for Figure 3c.

Expression level estimation and technical comparisons of sensitivity and variation. Gene
expression levels for RefSeq transcripts were summarized as RPKM values and read counts using
rpkmforgenes (ref. 23). RefSeq annotations for human and mouse were downloaded on 31 August 2011
and 13 December 2010, respectively. RPKM calculations only considered uniquely mappable positions
for transcript length normalizations using the ENCODE Mappability track
(wgEncodeCrgMapabilityAlign50mer.bigWig) for human and in-house–computed uniqueness files for
mouse. Overlapping RefSeq transcripts were collapsed giving one expression value per gene locus.
Only 10 million randomly selected mapped reads were used per sample to compare sensitivity and
variation in gene and exon levels. Samples with fewer than 10 million uniquely mappable reads (a
few ESCs; ref. 8) were therefore discarded from analyses. Samples with 20 pg of total RNA (used in
Fig. 2b,d) were simulated by using 5 million reads each of two different 10 pg samples. Analyses of
gene detection (Fig. 2a,b and Supplementary Fig. 4b,c) were calculated over pairs of technical
replicates or individual cells. Genes were binned by the highest expression level of the two
samples, and a gene was considered detected if it had an

doi:10.1038/nbt.2282 nature biotechnology


RPKM above 0.1 in both samples. The mean for all possible pairs of technical replicates within a
group was used together with standard deviation using the adjusted Wald method. Analyses of
variation (Fig. 2b,d) were also calculated on pairs of samples, binning genes by the mean of log
expression, excluding genes below 0.1 RPKM in either sample. As gene expression levels across
single cells are often log-normally distributed (ref. 24), we calculated the absolute difference
in log10 expression values and estimated the s.d. by multiplying the mean variation in a bin by
0.886. Scatter plots were generated in R using smoothScatter (geneplotter package) and loess
nonlinear regression using the graphics package. Pearson and Spearman correlations were computed
using absolute or relative expression levels as log2 RPKM values. We included publicly available
human UHRR, brain and LNCaP data for comparison (refs. 4,25–27). To analyze how sensitivity and
variation improve with larger numbers of cells, we used Smart-Seq data generated from 10 LNCaP
cells (Supplementary Table 1). To obtain estimates for the effect of using larger numbers of cells
(used in Fig. 2b,d), we created two combined samples using 25, 10 and 3 cell samples from picked
LNCaP cells, 25, 10 and 5 cell samples from LNCaP cells spiked into healthy donor’s blood and
isolated using the EPCAM marker, and 2 single-cell LNCaP samples, achieving a total of 80 cells in
each of the two sample pools. These were sequence-depth matched to 10 million reads, by using
125,000 random reads from single-cell samples, 375,000 from 3-cell samples and so on.

Analyses of read coverage across transcriptome. The read coverage analyses were based on human
and mouse RefSeq transcripts. Reads were mapped to RefSeq transcripts directly rather than to the
genome, using Bowtie allowing for up to 10 hits per read. Each transcript was divided into 40
equally sized bins, and the number of reads was counted for each bin and gene. The read count per
bin for each gene was divided by total read count for that gene before the bins for all the
different genes were summed up. The calculated read coverage per bin was later normalized through
the division by the bin with the largest read coverage. The mean and s.d. over replicates were
shown in Figure 1 and Supplementary Figure 2, including all transcripts with at least ten mapped
reads. Analyses of full-length transcript reconstructions were based on RefSeq annotations, and we
defined full-length reconstructed genes as those for which we obtained correct exon-intron
structure throughout all annotated exons of at least one isoform. We limited the analyses to
expressed (≥0.1 RPKM) and multi-exon (≥2 exons) genes.

Singular value decomposition. The global transcript expression values for cancer cells were
analyzed using singular value decomposition (SVD) to determine the fundamental patterns in the
transcriptomes. The expression levels in RPKM were normalized to unit length and the SVD computed
using SVDMAN (ref. 28). Each cell was then projected onto the two strongest SVD components to
visualize the overall similarity in gene expression (Fig. 3a).

Analyses of differential expression. One-way analysis of variance (ANOVA) was performed on
expression levels (RPKM, log2) followed by Tukey post-hoc test in R/Bioconductor. Only genes
significant after multiple testing corrections (5% FDR, Benjamini-Hochberg) were evaluated with
post-hoc test (P < 0.05). Lists of significantly differentially expressed genes are available in
Supplementary Table 4 for CTC, primary melanocyte and melanoma cell line comparisons, and in
Figure 3a for comparisons between prostate and bladder cancer cell line cells.

Selection of marker genes for melanoma and immune cells. To identify the 100 transcripts most
strongly associated with melanoma and immune cells, respectively, we initially calculated the
difference in mean gene expression between melanoma samples (ref. 29) and a combination of
monocytes (ref. 30), T cells (ref. 31), white blood cells and lymph node samples (Fig. 4a). The
differences were divided by the highest expression value in any of the samples, to avoid
differences driven by outlier expression values in one replicate only. We ranked genes according
to this metric and selected the 100 strongest markers for the melanoma and for the immune cell
combination. We then evaluated the mean expression values of each gene in the individual putative
CTCs. To include the monocyte SAGE data, we converted 1.5 RPM to 1 RPKM, assuming an average
transcript length of 1.5 kb (ref. 23).

Detection of alternatively spliced exons. We analyzed exon inclusion levels for a collection of
alternatively skipped exons previously identified from EST and cDNA data (ref. 4). We used the
mixture of isoforms (MISO) framework (ref. 11) to calculate exon inclusion levels with confidence
intervals. We used the default MISO settings, which require at least 20 reads mapping to the
alternative exon or the immediate upstream or downstream exon or exon-exon junctions between them.
For a fair assessment of read coverage across exons (Fig. 3b), we matched the sequence depth by
randomly sampling 10 million uniquely mapped reads per sample.

Hierarchical clustering analyses. Genes with average expression above 20 RPKM (3,690 genes) were
clustered by Spearman correlation and complete linkage using Python SciPy (hcluster). To evaluate
the significance (or robustness) of each branchpoint, we generated 1,000 bootstrap gene set
replicates that were independently clustered, and from these we counted the percentage of times
each branch was recovered.

Analyses of differential exon inclusion. To find significant differences in inclusion levels of
alternative exons we applied a t-test with variance shrinkage, known to counteract false positives
in microarray analyses (ref. 32). A variance was calculated for each alternative exon based on the
exon inclusion levels across biological replicates. For each sample group (cell line) the 90th
percentile of the variation was included in the variance term ((90th percentile variation + gene
variation)/2) when calculating the t statistic. The null distribution of the t statistic was
calculated by shuffling the sample labels (cell-to-cell line mapping) repeatedly and computing the
t statistic for each shuffle, thus allowing the conversion of t statistics to P values for the
cancer-cell comparison. To estimate false discovery rates, the sample groups were randomly split
in half and combined with half from the other sample group, and the number of significant exons
was counted using the t statistics introduced above (repeatedly, to vary the random splitting of
sample groups). The false discovery rate was then estimated as the number of significant exons in
random shuffles divided by the number of significant events with correct sample groups. The
numbers of significant exons at different false discovery rates are presented in Figure 3d.

SNP and mutation detection. CTC RNA-Seq FASTQ files were mapped to transcriptome (Ensembl,
annotations downloaded 16 May 2011) and genome with BWA (ref. 33), allowing for no indels and
removing multi-mapping reads. Samtools rmdup (ref. 22) was used to filter PCR duplicates, and BAM
files were reordered by Picard (http://picard.sourceforge.net/). Variant sites were called by the
Genome Analysis Toolkit (ref. 34) jointly on reads from all six CTC samples, with a quality score
threshold for sites of 500 and requiring detection in two or more samples (see Supplementary
Fig. 11 for more detailed information on varying these thresholds). We limited the analyses to
transcribed regions using RefSeq gene models, and the last 35 base pairs of transcripts were not
considered to remove false positives arising from mapping of reads with partial poly(A) tail.
Analyses of overlap with known SNPs were based on dbSNP build 132 (ref. 35).

21. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment
    of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
22. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079
    (2009).
23. Ramsköld, D., Wang, E.T., Burge, C.B. & Sandberg, R. An abundance of ubiquitously expressed
    genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598 (2009).
24. Bengtsson, M., Ståhlberg, A., Rorsman, P. & Kubista, M. Gene expression profiling in single
    cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels.
    Genome Res. 15, 1388–1392 (2005).
25. Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from
    paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
26. Bullard, J.H., Purdom, E., Hansen, K.D. & Dudoit, S. Evaluation of statistical methods for
    normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94
    (2010).
27. Sam, L.T. et al. A comparison of single molecule and amplification based sequencing of cancer
    transcriptomes. PLoS ONE 6, e17305 (2011).
28. Wall, M.E., Dyck, P.A. & Brettin, T.S. SVDMAN–singular value decomposition analysis of
    microarray data. Bioinformatics 17, 566–568 (2001).
29. Berger, M.F. et al. Integrative analysis of the melanoma transcriptome. Genome Res. 20,
    413–427 (2010).



30. Zawada, A.M. et al. SuperSAGE evidence for CD14++CD16+ monocytes as a third monocyte subset.
    Blood 118, e50–e61 (2011).
31. Bernstein, B.E. et al. The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 28,
    1045–1048 (2010).
32. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from disarray to
    consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006).
33. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform.
    Bioinformatics 25, 1754–1760 (2009).
34. McKenna, A. et al. The genome analysis toolkit: a mapreduce framework for analyzing
    next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
35. Sherry, S.T., Ward, M. & Sirotkin, K. dbSNP-database for single nucleotide polymorphisms and
    other classes of minor genetic variation. Genome Res. 9, 677–679 (1999).
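
Illustrative sketch: expression-level estimation. The expression-level estimation described in Online Methods can be illustrated with a short Python sketch. It assumes the standard RPKM definition (reads per kilobase of transcript per million mapped reads), with the uniquely mappable length as the denominator as the text describes. The function names and toy numbers are hypothetical, and the code does not reproduce rpkmforgenes.

```python
def rpkm(read_count, mappable_length_bp, total_mapped_reads):
    """Reads per kilobase of uniquely mappable transcript per million mapped reads."""
    if mappable_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (mappable_length_bp * total_mapped_reads)


def detected_in_both(rpkm_a, rpkm_b, threshold=0.1):
    """A gene counts as detected only if it exceeds the threshold in both samples."""
    return rpkm_a > threshold and rpkm_b > threshold


def rpm_to_rpkm(rpm, mean_transcript_length_kb=1.5):
    """Convert tag-based RPM (e.g. SAGE) to RPKM, assuming an average transcript length."""
    return rpm / mean_transcript_length_kb


# Toy values (hypothetical): 150 reads on a transcript with 1,500 uniquely
# mappable bases, in a sample subsampled to 10 million mapped reads.
example_rpkm = rpkm(150, 1500, 10_000_000)  # 10.0 RPKM
```

With the 1.5-kb average transcript length quoted in the marker-gene selection, `rpm_to_rpkm(1.5)` gives 1 RPKM, matching the stated conversion.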

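
Illustrative sketch: read coverage across transcripts. The binning procedure in "Analyses of read coverage across transcriptome" (40 equally sized bins, per-gene normalization, summation over genes, scaling by the largest bin) can be sketched as follows. The input layout (read start positions per transcript) and all names are assumptions made for illustration; the original analysis worked from Bowtie alignments to RefSeq transcripts.

```python
def coverage_profile(read_positions_by_gene, gene_lengths, n_bins=40, min_reads=10):
    """Aggregate relative read coverage along transcripts (5' to 3').

    read_positions_by_gene: dict of gene -> list of 0-based read start positions
    gene_lengths: dict of gene -> transcript length in bases
    """
    profile = [0.0] * n_bins
    for gene, positions in read_positions_by_gene.items():
        if len(positions) < min_reads:
            continue  # only transcripts with at least ten mapped reads are included
        length = gene_lengths[gene]
        counts = [0] * n_bins
        for pos in positions:
            # Assign the read to one of n_bins equally sized bins.
            bin_index = min(pos * n_bins // length, n_bins - 1)
            counts[bin_index] += 1
        total = len(positions)
        # Divide by the gene's total so highly expressed genes do not dominate.
        for i in range(n_bins):
            profile[i] += counts[i] / total
    peak = max(profile)
    if peak == 0:
        return profile
    # Scale so that the bin with the largest coverage equals 1.
    return [value / peak for value in profile]


# Toy data (hypothetical).
toy_reads = {
    "geneA": list(range(0, 400, 10)),  # 40 reads spread evenly over 400 bases
    "geneB": [790] * 20,               # 20 reads piled up at the 3' end of 800 bases
    "geneC": [5, 50, 500],             # fewer than ten reads, so it is skipped
}
toy_lengths = {"geneA": 400, "geneB": 800, "geneC": 1000}
toy_profile = coverage_profile(toy_reads, toy_lengths)
```

In this toy case the 3′-biased gene pulls the last bin up to 1 while the remaining bins stay low, which is the kind of 3′ bias the normalized profile is meant to expose.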

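
Illustrative sketch: the 0.886 factor in the variation analysis. The text does not derive the factor 0.886 used to turn the mean absolute difference in log10 expression into an s.d. It is consistent with a standard result for normally distributed values: for two independent draws with the same s.d. σ, the expected absolute difference is 2σ/√π, so σ = (√π/2) × mean absolute difference ≈ 0.886 × mean absolute difference. The sketch below treats that rationale as an assumption; the function name and toy values are hypothetical.

```python
import math

# sqrt(pi) / 2: converts a mean absolute pairwise difference into an s.d.,
# assuming normally distributed log expression values (an assumed rationale).
ABS_DIFF_TO_SD = math.sqrt(math.pi) / 2


def sd_from_pairs(log10_a, log10_b):
    """Estimate the s.d. from paired log10 expression values within one expression bin."""
    if len(log10_a) != len(log10_b) or not log10_a:
        raise ValueError("need two equal-length, non-empty lists")
    mean_abs_diff = sum(abs(a - b) for a, b in zip(log10_a, log10_b)) / len(log10_a)
    return ABS_DIFF_TO_SD * mean_abs_diff


# Toy values (hypothetical): three genes from the same expression bin,
# measured in two technical replicates.
toy_sd = sd_from_pairs([1.0, 2.0, 0.5], [1.2, 1.9, 0.8])
```

Here the mean absolute difference is 0.2 log10 units, so the estimated s.d. is about 0.18.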