Advanced Applications of RNA Sequencing
and Challenges
Yixing Han1, Shouguo Gao2, Kathrin Muegge1,3, Wei Zhang4 and Bing Zhou5
1Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD,
USA. 2Bioinformatics and Systems Biology Core, National Heart Lung Blood Institute, National Institutes of Health, Rockville Pike, Bethesda,
MD, USA. 3Leidos Biomedical Research, Inc., Basic Science Program, Frederick National Laboratory, Frederick, MD, USA. 4Department of
Medicine, University of California, San Diego, La Jolla, CA, USA. 5Department of Cellular and Molecular Medicine, University of California,
San Diego, La Jolla, CA, USA.
Abstract: Next-generation sequencing technologies have revolutionized sequence-based research with the advantages of high throughput, high sensitivity, and high speed. RNA-seq is now widely used to uncover multiple facets of the transcriptome and to facilitate biological applications. However, the large-scale data analyses associated with RNA-seq pose challenges. In this review, we present a detailed overview of the applications of this technology and the challenges that need to be addressed, including data preprocessing, differential gene expression analysis, alternative splicing analysis, variant detection and allele-specific expression, pathway analysis, co-expression network analysis, and applications combining various experimental procedures beyond the achievements that have been made. Specifically, we discuss essential principles of the computational methods required to meet the key challenges of RNA-seq data analysis, the development of various bioinformatics tools, challenges associated with RNA-seq applications, and examples that represent the advances made so far in the characterization of the transcriptome.
Keywords: RNA-seq, data preprocessing, differential gene expression, alternative splicing, variants detection, pathway analysis, co-expression
network, systems biology
SUPPLEMENT: Current Developments in RNA Sequence Analysis
Citation: Han et al. Advanced Applications of RNA Sequencing and Challenges. Bioinformatics and Biology Insights 2015:9(S1) 29–46 doi: 10.4137/BBI.S28991.
TYPE: Review
Received: July 16, 2015. Resubmitted: September 30, 2015. Accepted for publication: October 02, 2015.
Academic editor: J.T. Efird, Associate Editor
Peer Review: Three peer reviewers contributed to the peer review report. Reviewers’ reports totaled 1186 words, excluding any confidential comments to the academic editor.
Funding: The Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research and National Heart Lung Blood Institute, NIH supported this work. WZ is sponsored by NIH grant ES014811 funded to Dr. Trey Ideker. BZ is supported by NIH grants GM049369, HG004659, and GM052872 funded to Dr. Xiangdong Fu. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.
Competing Interests: Authors disclose no potential conflicts of interest.
Correspondence: [email protected]
Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License.
Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).
Published by Libertas Academica.
genes.6 Thus, RNA-seq has been instrumental in cataloging the diversity of novel transcript species, including long non-coding RNA, miRNA, siRNA, and other small RNA classes (eg, snRNA and piRNA) involved in the regulation of RNA stability, protein translation, or the modulation of chromatin states.7,8 For instance, RNA-seq has been used to discover enhancer RNA, a class of short transcripts directly transcribed from enhancer regions, which contributes to our knowledge of epigenetic gene regulation.9,10 In addition, RNA-seq can give information about transcriptional start sites, revealing alternative promoter usage; about mRNA isoforms derived from alternative splicing; and about premature transcription termination at the 3’ end, which is critical for mRNA stability.11–15 Most recently, RNA-seq was used to study biological problems including the precise location of regulatory elements.16,17 RNA-seq information can also identify allele-specific expression, disease-associated single nucleotide polymorphisms (SNP), and gene fusions, contributing to our understanding of disease-causal variants in cancer.18–21 Furthermore, RNA-seq can provide information about the transcription of endogenous retrotransposons and other parasitic repeat elements that may influence the tran-

data analysis, describe the challenges associated with the RNA-seq application, and discuss examples that represent the major advances in transcriptome characterization.

RNA-seq Workflow
[Figure 1. RNA-seq workflow. Experimental biology panel: RNA extraction; RNA fragmentation and reverse transcription; library construction and sequencing.]
An overview of a typical RNA-seq workflow is outlined in Figure 1. Three main sections are presented: Experimental Biology, Computational Biology, and Systems Biology. The experimental part covers the choice of methods for RNA collection, first-strand synthesis, and library construction, resulting in millions of short reads from the NGS sequencer. Multiple platforms (Table 1) have been applied for RNA-seq, including the sequencing-by-synthesis approach Illumina GA IIx
| Platform | Illumina GAIIx | Illumina HiSeq 2000 | Illumina MiSeq v2 | SOLiD-5500xl | 454 GS FLX+ | Ion Torrent PGM | PacBio RS |
|---|---|---|---|---|---|---|---|
| Chemistry principle | Sequencing-by-synthesis | Sequencing-by-synthesis | Sequencing-by-synthesis | Ligation and two-base coding | Pyrosequencing | Proton detection | Real-time sequencing |
| Instrument price | $256 K | $654 K | $128 K | $251 | $450 K | $80 K (system price including PGM, server, OneTouch and OneTouch ES) | $695 K |
| Sequence yield per run | 30 Gb | 600 Gb | 1.5–2 Gb | 150 Gb | 0.7 Gb | 50 Mb on 314 chip, 400 Mb on 316 chip, 1.5 Gb on 318 chip | 100 Mb |
| Sequence cost per GB | $148 | $45 | $502 | $67.00 | $50 | $800 (318 chip) | $2,000 |
| Reagent cost per run** | $17,575 | $23,470 | $1,070 | $10,503 | $4,842 | $349 on 314 chip, $549 on 316 chip, $749 on 318 chip | ≥$300 |
| Reagent cost per MB | $0.19 | >$0.04 | $0.14 | <$0.07 | $7 | $5 on 314 chip, $1.2 on 316 chip, $0.6 on 318 chip | $2–17 |
| Run time | 10 days | 11 days | 27 hours*** | 7 days for SE, 14 days for PE | 20 hours | 2–5 hours | 2 hours |
| Observed raw error rate | 0.76% | 0.26% | 0.80% | <0.1% | 1% | ∼1% | ∼10% |
| Read length | Up to 150 bases | Up to 150 bases | Up to 150 bases | 85 bases | 700 bases | ∼200 bases | 3000 bases, up to 15000 bases |
| Read type | PE | PE | PE | PE | SR | PE | SR |
| Insert size | Up to 700 bases | Up to 700 bases | Up to 700 bases | 300 bases | Up to 40 kb | Up to 250 bases | Up to 10 kb |
| Typical DNA amount requirement | 50–1000 ng | 50–1000 ng | 50–1000 ng | 400–4000 ng | 25–1000 ng | 100–1000 ng | ∼1 µg |
| Computation resources | $222 cluster | $222 cluster | Desktop/cloud | $35 cluster | $5 (desktop) | $16.5 (desktop) | $65 cluster |
| Data file sizes (GB)*** | 600 | <600 | 1 | 148 | 40 images, 8 sff | 0.1 sff, 0.2 fastq on 314 chip; 5 sff, 1 fastq on 316 chip; 10 sff, 2.5 fastq on 318 chip | 2 (basecalls, QV, kinetics) |

Notes: *Information based on company sources alone, data updated to 2013–2014. **Cost counts only the cost per run; it does not include general-purpose and library preparation equipment, annual maintenance agreements, and extra services. ***New compressed binary data format saves base and quality-value data in a 1 byte : 1 base ratio.
and HiSeq,29,30 Applied Biosystems SOLiD,31 Roche 454 Life Science,32 the semiconductor technology-driven Ion Torrent Personal Genome Machine,33 the single-molecule real-time sequencing machine PacBio,34 and the nanopore technology-driven portable device MinION, and PromethION (http://allseq.com/blog/minion-and-promethion-oxford-nanopore-s-present-and-future).

RNA preparation methods may vary for different sequencing platforms, RNA subtypes, and sequencing purposes. However, sample quality is always the determinant of acquiring qualified data and deriving biological insights from unbiased analysis. Poly-A selection of sufficient mRNA is widely used in a variety of whole-transcriptome analyses, including gene expression, alternative splicing, and variant detection.35,36 For single-cell sequencing, molecular labeling followed by random sequencing of the labeled molecules on the Illumina platform can achieve remarkable mRNA capture efficiency.36 With more RNA-seq applications in clinical samples, formalin-fixed paraffin-embedded (FFPE) tissue samples have become an invaluable resource for transcriptomic studies. Ribosomal RNA (rRNA) depletion is the preferred method for archival and long-aged FFPE samples.37

The raw reads serve as the starting material of the second part, the computational biology. First, technical and biological contaminations are removed in the preprocessing steps, followed by mapping the qualified reads to the genome or transcriptome. The mapped reads for each sample are subsequently summarized at the gene, exon, or transcript level to assess the abundance of each category, depending on the experimental purpose. The summarized data are then assessed with statistical models to obtain lists of differentially expressed genes and alternative splicing events, or regulatory mechanisms are evaluated via integrative analysis with other datasets such as epigenomic or proteomic data. Finally, pathway- or network-level analyses are implemented to gain biological insight through systems biology approaches.

Data Preprocessing
Quality assessment. Since RNA-seq is a complicated, multiple-step process involving sample preparation, fragmentation, purification, amplification, and sequencing, it is not straightforward to identify and quantify all RNA species from the sequenced reads. Hence, quality assessment is the first step of the RNA-seq bioinformatics pipeline and an important step before any analysis. Often, it is necessary to filter data by removing (trimming) low-quality sequences or bases, adaptors, contaminations, or overrepresented sequences to ensure a coherent final result. An array of tools that visualize read quality graphically is available for this purpose, such as FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc) and HTQC,38 as listed in Table 2. Recently, more flexible and efficient preprocessing tools were developed: Trimmomatic removes adapters, scans every read with a 4-base sliding window, and trims the lower-scored bases and low-quality N bases to enhance the quality of reads,39 and HTSeq summarizes base calls and evaluates base quality by position as well as overall read features.40

Before alignment to the reference genome, RNA-seq data can be further preprocessed to meet the expectations of the subsequent mapping steps. There are multiple tools available for this purpose; for example, BBMerge from the BBMap package (http://sourceforge.net/projects/bbmap/) merges paired reads based on overlap to create longer reads and creates an insert-size histogram. FLASH41 combines overlapping paired-end reads and converts them to single long reads. It is also good practice to assess RNA-seq data quality after preprocessing, and packages such as RSeQC comprehensively evaluate the reads that will go on to analysis.42

Reads mapping. Once high-quality data are obtained from preprocessing, the next step is to map the short reads to the reference genome, or to assemble them into contigs and align these to the reference genome. This procedure is the classic bioinformatics problem of finding the most reliable original sources of a large number of short DNA sequences in the genome in a speed- and memory-efficient manner.43,44 There are many popular bioinformatics programs that can be used for this purpose, including ELAND (http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_ReferenceFiles.htm), SOAP,45 SOAP2,46 MAQ,47 Bowtie,48 BWA,49 ZOOM,50 STAR,51 etc. Comparative analyses on real data have been done to assess most mapping tools.52 These programs are typically suitable for reads that are not located at poly(A) tails or exon–intron splicing junctions. Poly(A) tails can be easily identified by the presence of multiple As or Ts, and a partial junction library that contains the known junction sequences has been compiled to allow the alignment of difficult-to-map reads.23,53 From a different point of view, the reads located at exon–intron boundaries help determine the alternative splicing pattern, and the advent of RNA-seq has promoted the development of a new generation of splice-alignment software such as BLAT,54,55 TopHat,56,57 GEM,58 and MapSplice.59

Another problem in reads mapping is that of multi-mapped reads, which occur when sequence reads align to multiple locations of the genome. Multi-mapped reads are especially common for large and complex transcriptomes. For reads of lower repetitiveness, one can assign the reads to multiple locations proportionally, based on the neighboring unique reads.31,53 However, for short reads that have a very high copy number and repetitive sequences, multi-mapping is still a great challenge. A longer-read sequencer such as the Roche 454 or PacBio sequence analyzer might be required. Alternatively, there are bioinformatics solutions that extend the short paired-end reads into 200–500 bp fragments before deciding upon the multiple-aligned reads.60–62
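As a minimal sketch of the proportional strategy described above, the following Python function splits each read that aligns to several loci among those loci in proportion to their uniquely mapped read counts. The function name, the dictionary-based inputs, and the even split used when no candidate locus has unique support are illustrative assumptions, not the procedure of any particular aligner or of the cited studies.

```python
from collections import defaultdict


def assign_multireads(unique_counts, multireads):
    """Distribute reads that align to several loci across those loci.

    unique_counts: dict mapping locus -> number of uniquely mapped reads.
    multireads: list of lists; each inner list holds the candidate loci
        of one read that aligned to more than one location.
    Returns a dict mapping locus -> unique count plus fractional shares.
    """
    totals = defaultdict(float, {k: float(v) for k, v in unique_counts.items()})
    for loci in multireads:
        # Weight each candidate locus by its uniquely mapped read count.
        weights = [unique_counts.get(locus, 0) for locus in loci]
        weight_sum = sum(weights)
        for locus, weight in zip(loci, weights):
            if weight_sum > 0:
                share = weight / weight_sum
            else:
                # Assumption: split evenly when no locus has unique support.
                share = 1.0 / len(loci)
            totals[locus] += share
    return dict(totals)


if __name__ == "__main__":
    unique = {"geneA": 30, "geneB": 10}
    ambiguous = [["geneA", "geneB"]] * 4
    print(assign_multireads(unique, ambiguous))
```

In this toy input, each ambiguous read contributes 0.75 to geneA and 0.25 to geneB, because geneA carries three times as many unique reads. Weights are taken from the unique counts only, so the order in which ambiguous reads are processed does not change the result.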
Table 2. Selected list of packages and tools for RNA-seq data analysis.
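Among the preprocessing operations performed by tools of this kind, the sliding-window quality trimming mentioned under Quality assessment can be illustrated with a short sketch. The code below is a simplified illustration, not Trimmomatic's actual implementation: the function names, the default threshold of 15, and the Phred+33 quality encoding are assumptions made for the example.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq, quals, window=4, min_avg_q=15):
    """Cut a read at the first window whose mean quality is too low.

    seq: read sequence (str).
    quals: per-base Phred scores (list of int), same length as seq.
    window: number of bases averaged at each step (4 in the text).
    min_avg_q: hypothetical threshold on the mean window quality.
    Returns the retained (sequence, qualities) prefix.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg_q:
            # Keep everything before the first failing window.
            return seq[:start], quals[:start]
    # No window failed (or the read is shorter than the window).
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    scores = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
    print(sliding_window_trim(read, scores))
```

Averaging over a window rather than testing single bases tolerates an isolated low-quality base inside an otherwise good region, which is the usual motivation for this design.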
Reads counting. The number of RNA-seq reads that map to a gene is the measure of that gene’s expression level. After mapping the reads to the reference genome, counting the reads that map to each gene body facilitates the next steps. Features of the library preparation method, such as whether the protocol is strand-specific and whether the first read is on the same strand as the transcript or the opposite strand, are determining factors for read counting. One example of a tool for read counting from BAM files is the multicov command in bedtools, which takes a feature file (GFF) and counts reads in certain regions, such as all exons of a gene.63 By default, it counts reads on both strands within the regions of interest, but it can work in a strand-specific manner if necessary. HTSeq is a specialized utility for counting reads, although its speed needs to be improved in the future.40 However, it allows more fine-grained control of read counting through its parameters. This is very useful, especially when a read overlaps more than one gene and a customized strategy is wanted. Note that htseq-count assumes by default that the RNA-seq data are strand-specific; it will only count reads that map to the strand that the feature is on. R packages for read counting include easyRNASeq, summarizeOverlaps, and featureCounts. easyRNASeq hides the complex interplay of the required packages and thus can be easily used. summarizeOverlaps, which is a function in the GenomicRanges package in Bioconductor, and featureCounts, which implements highly efficient chromosome hashing and feature blocking methods, are suitable for RNA-seq or genomic DNA sequencing data. Different tools, and different parameter settings, generate different read counts and thus affect downstream analysis, because they use different strategies to assign reads to features.

In addition, the gene model, which hypothesizes the structure of the transcripts produced by a gene, also affects the analysis. Among multiple genome annotation databases, RefGene, Ensembl, and the UCSC annotation databases are the most popular. The choice of genome annotation directly affects gene expression estimation. Recently, Zhao and Zhang systematically characterized the impact of the choice of genome annotation dataset on read mapping and transcriptome quantification.67 They found that the impact of a gene model on the mapping of nonjunction reads differs from its impact on junction reads. The percentage of correctly mapped nonjunction reads was much higher than that of junction reads for all gene models. Surprisingly, although there are 21,958 common genes among the RefGene, Ensembl, and UCSC annotations, only 16.3% of genes obtained identical quantification results. Approximately 28.1% of genes’ expression levels differed by ≥5% when using
different annotations, and of those, the relative expression levels for 9.3% of genes differed by ≥50%. This study revealed that the different gene definitions of gene models frequently result in inconsistent gene quantification.

Normalization. After the read counts are obtained, data normalization is one of the most crucial steps of data processing, and it must be carefully considered, as it is essential to ensure accurate inference of gene expression and subsequent analyses thereof. Multiple facets of RNA-seq data need to be taken into account, including transcript size, GC content, sequencing depth, sequencing error rate, insert size, etc.68,69 Multiple normalization methods should be compared for the elimination of the specific biases of a dataset, which can be done by comparing their corresponding estimated performance parameters using measurement error models.69,70 Many comparative or integrative analyses have identified the best approach for different types of RNA-seq data analysis. For instance, quantile normalization can improve mRNA-seq data quality, including data from low amounts of RNA.71,72 The R package EDASeq, which uses within-lane normalization procedures followed by between-lane normalization, can reduce GC-content bias.73 Lowess normalization and quantile normalization worked well in microRNA-seq data normalization.74 Further advancement of RNA-seq applications calls for the development of effective statistical and computational methods for RNA-seq data normalization.

There are other bioinformatics challenges for RNA-seq reads mapping, for example, reducing the errors in image analysis and base calling to enhance sequencing accuracy; removing low-quality reads; and developing applicable approaches to store, retrieve, and process large datasets in a time- and energy-efficient manner.

Differential Gene Expression
The transcriptome is the complete set of transcripts in a cell or cell population, and transcriptome analysis provides information about the identity and quantity of all RNA molecules. An important application of RNA-seq is the comparison of transcriptomes across different developmental stages, between a disease state and normal cells, or between specific experimental stimuli and physiologic conditions. This type of analysis requires identification of genes along with their isoforms and precise assessment of their abundance in two or more samples. It is essential for interpreting the functional elements of the genome and uncovering the molecular constitution, providing important insights into the biological mechanisms of development and disease.

After preprocessing of the RNA-seq reads, an important question is how transcript levels differ across samples, which is addressed by differential gene expression (DGE) analysis. Numerous statistical methods that use read coverage to quantify transcript abundance have been developed since the microarray era.75,76 RPKM (reads per kilobase per million mapped reads) is a widely used measure of expression that normalizes read counts with respect to the overall mapped read number and gene length.32,53 However, besides the read coverage, there are other factors that determine the estimated transcript abundance, including sequencing depth, gene length, and isoform abundance.72,77,78 Since the RPKM method handles all RNA-seq reads almost equally, for example, without concern for isoforms, it has been criticized. RNA-Seq by Expectation Maximization (RSEM) is a newly developed software tool that gives accurate estimates of gene and isoform expression levels and can be used even for species without a reference genome assembly.79

Most algorithms to date for differential gene expression analysis apply simple count-based probability distributions (eg, Poisson distribution) followed by Fisher’s exact test, without accounting for biological variability among samples.32,53,80 While the technical variability of RNA-seq is extremely low compared with microarray data,32 the biological variability could be significantly reduced by analyzing several replicates through permutation-derived methods.75 Serial analysis of gene expression has been developed for biological variability assessment, in which larger-scale datasets are used so that an additional dispersion parameter can be estimated based on an extended Poisson distribution, allowing extensive molecular characterization capability.81,82

However, for most applications, a large number of replicates may be too costly, and many methods have overcome the problem by modeling biological variability and measuring significance with a limited number of samples, applying pairwise or multiple-group comparisons.75 Several programs offer good solutions for this purpose and have been applied in numerous biomedical and clinical studies. Examples of these programs are Cuffdiff from the Cufflinks package,7,83,84 DESeq,31 DESeq2,85 and edgeR.82 Since RNA-seq read counts are integers that range from zero to millions and are highly skewed, many kinds of transformation have been applied to the counts so that the numbers can be fit to statistical distribution models for differential expression detection. For instance, Li et al developed PoissonSeq, a Poisson log-linear model for differential gene expression analysis.86 Approaches developed for microarray data analysis based on continuous distributions have been adapted for RNA-seq counts. An excellent example is the voom function in the limma package, which offers a way to transform count data into Gaussian-distributed data so that significance can be tested statistically.87–89 Extensive comparisons evaluating the performance of several DGE packages have recently been reported.90,91 However, to the best of our knowledge, there is no one-size-fits-all strategy. Also, there is room to refine existing pipelines and develop effective strategies for the following questions: how to make read coverage uniform along the genome given variation in nucleotide composition; how to detect “within-sample” variations without simply assuming that the underlying conditions or treatments affect all individual genes equally; how to improve current methods to
detect differences in gene isoform preferences and abundance levels under varying conditions; and how to account for the different probabilities of read coverage in long versus short genes, given the great sequencing depth that can now be achieved.

Alternative Splicing
Biological complexity and genomic diversity are determined, to a large degree, by alternative splicing events.92 Alternative splicing shapes the control of numerous pivotal cellular processes, and abnormal splicing events are involved in 15%–50% of disease-causing mutations in humans.93 In contrast to constitutive splicing, alternative splicing refers to the differential inclusion or exclusion of exons in the processed RNA product after splicing of a precursor RNA segment.94 It is a crucial step in controlling the expression of ∼95% of all multiexon genes, and an increasing number of diseases are found to be associated with “wrong” splice site usage while the overall transcript abundance does not change.95,96 Spliceosomes, composed of intricate structures with RNA–RNA, protein–protein, and RNA–protein interactions, carry out the splicing reaction.97 Studies of splicing mechanisms in model genes have deduced many regulatory principles, including the role of negative intrinsic site binding and positive enhancement of splice site selection in spliceosome assembly.94

However, given the variety of cis-acting elements and trans-acting factors involved in splicing, acting either cooperatively or competitively, the “code” that controls alternative splicing still needs further deciphering using high-throughput approaches.98 RNA-seq technology allows us to estimate alternative splicing events on genome-wide scales and in an unbiased manner. Deep surveying of alternative splicing by RNA-seq has revealed an unprecedented wealth of splice junctions and RNA-binding motifs and provides more reliable measurements than microarray technology.99–101 Furthermore, alternative splicing is tissue-specific, with hundreds of context-sensitive RNA features and tissue-dependent splicing regulatory elements, which generate thousands of combinations of alternative splicing events.102,103 In-depth RNA-sequencing analysis yields a digital inventory of gene and mRNA isoform expression with tissue specificity and high sensitivity at the level of single cells, and provides a framework for understanding alternative splicing patterns on genome-wide scales.15,104

With the rapid accumulation of RNA-seq data, many methods and tools have been developed to infer alternative splicing events. These tools generally focus on either gapped alignment of short reads or de novo assembly and characterization of transcript models. Examples of these methods are MISO for identification and regulation of isoforms from CLIP-seq data,105 and SpliceMap,106 SplitSeek,107 spliceR,108 and SplicingCompass109 for detection of splice junctions and exon usage from paired-end RNA-seq. GLiMMPS provides a useful tool for elucidating the genetic variation of alternative splicing in humans and model organisms.110 MATS was developed from a statistical method and is used for detecting differential alternative splicing events from RNA-seq data.111 rMATS is a statistical method for robust and flexible detection of genome-wide differential alternative splicing from paired or unpaired replicates.112 ALEXA-Seq assesses the differential and alternative expression of mRNA isoforms after cataloging transcripts.12 An integrative analysis approach constructed an exon co-splicing network based on distances combined with matrix correlations and found that the co-splicing network was distinct from and complementary to the co-expression network, although both possess scale-free properties.113,114

The field of alternative splicing analysis using RNA-seq data is still in its infancy and would benefit from new strategies. An extensive evaluation and comparison of the existing methods would be desirable; to date, there is no general consensus regarding which method performs best under given conditions. We expect novel, exploratory methods to be developed in this flourishing field in the near future.

Variants Detection and Allele-Specific Expression
The main applications of RNA-seq analysis are novel gene identification, expression analysis, and splicing analysis. However, RNA-seq data can also be used, as a by-product, for sequence-based mutation analysis, though there are many limitations, such as highly differential coverage between different genes. Among the many variant calling and annotation methods, such as ANNOVAR,115 SNPiR,116 and SNiPlay3,117 the best-practice workflow provided by GATK may still be the best pipeline to identify mutations from RNA-seq data, although it is still far from perfect and under heavy development (http://gatkforums.broadinstitute.org/discussion/3891/calling-variants-in-rnaseq). In the GATK pipeline, the sequence reads are first mapped to the reference using the STAR aligner (2-pass protocol) to produce a coordinate-sorted file in SAM/BAM format. After duplicates are marked and removed, GATK splits reads with N operators in their CIGAR strings into component reads and trims any overhangs into splice junctions to reduce the occurrence of artifacts. The remaining steps are similar to DNA-seq variant calling, such as local realignment and haplotype-based variant calling.

A heterozygous SNP, meaning two different alleles at the same position in the DNA, may lead to the following: one of the two alleles is highly transcribed into mRNA, while the other is lowly transcribed or not transcribed at all. This is called allele-specific expression (ASE). Both genetic and epigenetic determinants govern transcriptional activity at the different alleles of a gene in a non-haploid genome, and impairment of this highly regulated process can lead to disease.118,119 Whole genome DNA sequencing (WGS) allows identification of single nucleotide mutations or polymorphisms in the entire human genome. The expression state of the heterozygous loci can be investigated in the matched RNA-Seq and
WGS sample from the same individual, and ASE activity can be identified to uncover instances of allele silencing.120 Though conceptually simple, ASE identification remains a challenge due to many problems, such as read bias and the lack of a sophisticated statistical model.121 Recently, Mayba et al developed a pipeline for ASE detection, MBASED, which aggregates information across multiple single nucleotide variation loci to obtain a gene-level ASE estimate.122 More sophisticated software is needed for ASE identification.

Beyond the Differentially Expressed Gene Lists
Creating lists of differentially expressed genes is only the starting point for gaining biological insights into experimental systems, developmental stages, or specific disease scenarios. To understand the biologic context of differentially expressed genes, many advanced analyses draw on gene ontology,123,124 gene sets,125 network inference, and knowledge databases.126,127

Pathway Analysis. The interpretation of gene expression data is based on the function of individual genes as well as their role in pathways, since genes work together in all biological processes. In addition, for some genes, a small expression change may not be significant at the single-gene level, but minor changes in several genes may be relevant in a pathway and may have dramatic biological consequences. Thus, differentially expressed biological pathways provide better explanatory results than a long list of seemingly unrelated genes.128

One traditional analysis works with a gene list of interest, identified with genomics methods or curated by biologists, and applies statistical methods, such as Fisher’s exact test, on contingency tables to test for enrichment of each annotated gene set.129 Such approaches can be applied directly to the differentially expressed gene list identified from RNA-seq data. Another class of analysis ranks all expressed genes according to metrics of expression difference and then uses Kolmogorov–Smirnov-like tests to obtain enrichment sig-

is unaffected by sample sizes, experimental designs, assay platforms, or other types of heterogeneity.133 GSAASeqSP offers a variety of statistical procedures by adapting and combining multiple gene-level and gene set-level statistics for RNA-seq count-based data. Such statistics include Weighted_KS, L2Norm, Mean, WeightedSigRatio, SigRatio, GeometricMean, TruncatedProduct, FisherMethod, MinP, and RankSum.134 GSAASeqSP is a powerful platform for investigating molecular differential activity within biological pathways.

The limitations of the gene set analysis methods developed for microarrays in the context of RNA-seq data have been comprehensively investigated.128 Several frequently used RNA-seq normalization strategies were studied to examine the performance of multivariate tests. Data transformations were also investigated in an attempt to extend other approaches beyond microarray data analysis. It was found that using log counts normalized for sequencing depth is a good strategy for data transformation prior to pathway analysis.

Previously, pathway analysis methods had been developed based on algorithms that consider pathways as simple gene lists and ignore pathway structure. Recently, methods have been developed that incorporate various aspects of pathway topology. For example, SPIA captures pathway topology through its scoring system, in which the positions and the interactions of the genes in the pathway are considered.135 Accordingly, interacting differentially expressed gene pairs are preferentially weighted over two non-interacting genes. Similarly, TAPPA is a scoring method in which higher weights are automatically assigned to hub genes and interacting gene pairs.136 DEAP identifies the most differentially expressed path to provide a refined focus for further biological exploration.137 Here, biological pathways are represented by directed graphs, where nodes are biological compounds and edges represent catalytic or inhibitory regulatory relationships.

Applying methods developed for microarray data analysis without considering specific data features of RNA-seq data
nificance. Gene set enrichment analysis (GSEA) is one such may lead to biases. For example, long or highly expressed tran-
highly effective method that has been widely used in studying scripts are more likely to be detected as differentially expressed
functional enrichment between two biological groups.130 than are the short and/or lowly expressed ones. By developing
Many studies have adapted pathway analysis tools from new statistical framework, the new problem of gene length
microarray data analysis and developed new tools applicable bias and total reads number bias from RNA-seq could be well
to RNA-seq data. For example, a non-parametric competitive corrected. One good example is the GOseq package for gene
GSA approach named Gene Set Variation Analysis has been ontology analysis. It considered the read counts bias by estimat-
developed to fit RNA-seq data characteristics. Such analy- ing the probability weighting function and used resampling
ses have given highly correlated results between microarrays strategy beyond the differentially expressed gene expression
and RNA-Seq sample sets of lymphoblastoids cell lines that so that it can highlight GO categories more consistent with
have been profiled using both technologies.131 SeqGSEA uses the known biology.138 Development of good methods to cor-
count data modeling with negative binomial distributions to rect the biases in pathway analysis brought by GC content,
score differential expression and then executes gene set enrich- dinucleotide distribution, and other factors is challenging.139
ment analysis to achieve biological insights. In real applica- Although many pathway databases are available, high-
tions, SeqGSEA detects more biologically meaningful gene resolution annotation of such knowledge bases is still lack-
sets without biases toward longer or more highly expressed ing. For example, .90% of the human genome is alternatively
genes.132 GAGE is another method for pathway analysis spliced and transcripts from the same gene may have distinct,
that is applicable to both microarray and RNA-seq data. It even opposing functions. However, current knowledge bases
are curated only at the gene level. It is essential to also include knowledge about pathway-specific transcript activity. In addition, high-quality annotations for genes are still needed, although there are enormous numbers of annotations available in the public domain.140 We expect to see more sophisticated data mining and machine learning algorithms applicable to RNA-seq data, especially methods that consider a gene in the context of its pathway.
Co-expression network analysis. Co-expression network analysis is an important complement to DGE analysis. A gene co-expression network is represented as an undirected graph, in which each node corresponds to a gene, and two nodes are linked if there is a significant co-expression relationship between them. Because co-expressed genes are often functionally related, controlled by the same set of transcription factors, or work together within the same pathway, building co-expression networks can help to extract meaningful biological modules that are tightly associated within a specific biological process.141
Co-expression networks have been extensively studied since the microarray era and, with the emergence of NGS technology, have also been examined using RNA-seq data. Comparison studies between RNA-seq co-expression networks and microarray data-derived networks revealed that correlations from RNA-seq data are much higher because RNA-seq data have greater sensitivity and a larger dynamic range. Although both types of co-expression networks show scale-free properties, there is low overlap between hub-like genes. This phenomenon can be explained by the low correlation between microarray and RNA-seq data, especially for high- and low-abundance transcripts.142
Both sample size and read depth affect the quality of RNA-seq-derived co-expression networks.143 Larger sample sizes and greater read depth can increase the functional connectivity of the networks. The minimal suggested experimental criteria to obtain performance on par with microarrays are at least 20 samples with a total number of reads greater than 10 million per sample. Meta-analysis across multiple data sets is a good solution to improve the relatively poor performance of individual co-expression networks. Aggregation across different experiments can improve performance significantly beyond that attained by even the largest individual co-expression networks in one experiment. However, thousands of samples from different conditions are necessary to obtain the “gold standard” co-expression networks.
The high quality of co-expression networks obtained by large meta-analysis promises a powerful functional genomics tool for biologists and clinicians. The GeneFriends project team has constructed co-expression maps for human and mouse with RNA-seq datasets of 4,000 and 2,500 samples from different experiments, respectively.144 This information can be used statistically, for example with a guilt-by-association approach, to predict gene function and to identify and prioritize novel candidate genes involved in biological processes. COXPRESdb is another database of RNA-seq-based gene co-expression networks.145 The co-expressed gene list in COXPRESdb provides a comparable view of orthologous genes among several species (human, mouse, rat, chicken, fly, zebrafish, nematode, monkey, dog, and yeast) and the numbers of common edges for all pairs of species.
Besides building gene co-expression networks under defined conditions, finding co-expression modules in one condition and then testing whether these modules show different co-expression in other conditions can assist in understanding the regulatory changes under disease conditions. Gene set co-expression analysis was proposed to test differential co-expression of known pathways by testing the changes in co-expression over all gene pairs in the pathway.146 Based on theoretical analysis, a small, highly co-expressed subnetwork was found to be a good indicator of disease onset or other biological processes. This finding has been validated with real data, which confirmed that a small set of genes clustered within a strongly correlated subnetwork is able to provide a significant warning signal just before the onset of disease.147 The current approach to building dynamic network biomarkers is based on population data. It might be interesting to build a co-expression network with time series data from the same subject using self-correlation or synchronization,148,149 so that it can be used to predict disease onset for diagnosis and personalized medicine.
Interestingly, in biological systems, antagonistic and self-reinforcing co-expressed modules have been found to contribute to system stability and adaptability.150 Algorithms have been designed to model this phenomenon, for instance, DICER, in which the expression profiles of genes within each module of a pair are correlated across all samples, while the correlation between the two modules differs dramatically between the disease and normal samples.151 Weighted gene co-expression network analysis is a powerful method to extract co-expressed groups of genes from large microarray data sets and has been successfully applied to RNA-seq data. It is suggested to remove genes whose read counts are consistently low and to normalize the data with a variance-stabilizing transformation before calculating pairwise similarity of expression patterns. It can perform various aspects of weighted correlation network analysis, including network construction, module detection, gene selection, calculation of topological properties, data simulation, visualization, and interfacing with external software.152
As more and more RNA-seq data become publicly available, there is a great need to develop new algorithms to formulate both the global and local characteristics of co-expression networks, especially the dynamic changes associated with biological processes. Much work still remains for the development of RNA-seq co-expression methodologies. So far, there have been few published statistical studies that have examined metrics for similarity of expression profiles with RNA-seq data. Si et al designed an algorithm to cluster genes by measuring the differential expression patterns across
Centrality and network flow have been successful for
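As an illustration of the co-expression network construction discussed in this section, the following is a minimal Python sketch and not the procedure of any of the cited tools. It assumes a genes-by-samples matrix of raw read counts, uses log2(CPM + 1) as a simple stand-in for a variance-stabilizing transformation, and links two genes when the absolute Pearson correlation reaches a hard cutoff. The function name and the thresholds (min_mean_count, r_cutoff) are hypothetical choices made only for this example.

```python
import numpy as np


def coexpression_network(counts, min_mean_count=10.0, r_cutoff=0.8):
    """Build an unweighted, undirected gene co-expression network.

    counts: array of shape (n_genes, n_samples) holding raw read counts.
    Returns (kept, adj): indices of the retained genes and a boolean
    adjacency matrix over those genes. Thresholds are illustrative
    assumptions, not recommended values.
    """
    counts = np.asarray(counts, dtype=float)

    # Normalize for sequencing depth (counts per million), then log-transform.
    library_size = counts.sum(axis=0)
    log_cpm = np.log2(counts / library_size * 1e6 + 1.0)

    # Drop genes with consistently low counts or no variation across samples.
    keep = (counts.mean(axis=1) >= min_mean_count) & (log_cpm.std(axis=1) > 0)
    kept = np.flatnonzero(keep)

    # Pairwise Pearson correlation between the retained genes (rows).
    corr = np.atleast_2d(np.corrcoef(log_cpm[keep]))

    # Link two genes when |r| reaches the cutoff; remove self-loops.
    adj = np.abs(corr) >= r_cutoff
    adj = adj | adj.T
    np.fill_diagonal(adj, False)
    return kept, adj
```

The node degree, adj.sum(axis=1), then gives a crude way to flag hub-like genes. Weighted approaches such as WGCNA replace the hard cutoff used here with soft thresholding of the correlations.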
Systems Biology
High-throughput sequencing technologies are now routinely being applied to a wide range of topics in biology and medicine, allowing scientists to address important questions and make discoveries that were impossible before. Advances in genome sequencing and data analysis play critical roles, while the procedures for preparing samples selectively and generating high-quality data require sophisticated experimental design, which is an essential part of systems biology (Fig. 2).
Figure 2. The STARR-seq pipeline and the corresponding ‘systems biology’ steps. The sonicated genomic DNA fragments are PCR amplified and placed downstream of a minimal promoter in reporter vectors. The desired measurements are embedded in the genome. The reporter library is transfected into cultured cell lines, and poly-A RNAs are isolated from the pool of total RNA. These steps selectively enrich the targets of interest. After RNA-seq is performed, the reads are mapped to the reference genome and their enrichment over input is measured to reflect enhancer activity. The steps of systems biology, including mathematics and computational biology analysis, help with the interpretation.
Concerning gene expression analysis, the integration of datasets from diverse platforms in this next-generation genomics era, including genomics, epigenomics, and proteomics with transcriptomics, is critical in the effort to understand complex biological systems. A wide range of integrative analysis projects, such as the Roadmap Epigenomics Project, the ENCODE Project, and The Cancer Genome Atlas, have been established to provide a more complete picture of gene regulation.154 RNA-seq has been used in combination with transcription factor (TF) binding,155,156 histone modification,157,158 DNA methylation,159,160 genotyping data,161,162 and RNA interference.163 Here, we summarize two excellent examples to illustrate RNA-seq applications in the framework of systems biology.
STARR-seq: whole genome functional readout of enhancers. Enhancers are functional non-coding DNA sequences that can recruit TFs, physically interact with promoters, and regulate the timing and tissue specificity of gene expression.164–169 Despite their important roles during development, in response to stimuli, and in various diseases, a genome-wide approach to identify functional enhancer regions is still lacking. Current high-throughput enhancer detection methods can be grouped into three categories: (1) identification of open chromatin, including deep sequencing of DNase hypersensitive sites (DHS-seq)170 and formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq)171; (2) chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) on enhancer-associated histone modifications (H3K4me1, H3K27ac, H3K18ac, etc.)172–175; and (3) ChIP-seq on TFs or cofactors (p300, CEBPB, etc.).176 However, the mapping of open chromatin and histone modifications usually lacks sufficient resolution and specificity to detect precise enhancer locations, and the binding of specific TFs or cofactors can hardly cover all the active enhancers. Moreover, none of these methods can provide a quantitative measurement of enhancers’ activities. The traditional quantitative reporter assays, on the other hand, cannot be scaled up to a high-throughput, genome-wide manner.177,178
To address this question, Arnold and colleagues developed a method, named self-transcribing active regulatory region sequencing (STARR-seq),17 which quantitatively measures the activity of enhancers in the whole genome. They sheared Drosophila melanogaster genomic DNA and selected ∼600 bp fragments. These random fragments were PCR amplified and
placed downstream of a minimal promoter in reporter vectors (Fig. 2). The reporter library contained 11.3 million candidate fragments, which covered 96% of the non-repetitive genome 10-fold. In these constructs, candidate DNA fragments that are enhancers have the opportunity to activate their own transcription. Furthermore, by transfecting the reporter library into Drosophila cell lines, isolating polyadenylated RNA, and performing RNA-seq, the authors were able to quantitatively estimate the enhancers’ strength based on the amount of their transcription.
Computational analyses include mapping the STARR-seq data to the genome and examining their enrichment over input. From this, the authors identified 5,499 enhancers in Drosophila S2 cells and tested 77 of them, together with 65 negative controls, by luciferase assays. As a result, 81% of the predicted enhancers and 14% of the negative controls showed enhancer activity. There was a strong linear correlation (r = 0.83) between the levels of luciferase activity and the STARR-seq transcription readouts, indicating that STARR-seq is a reliable quantitative measurement of enhancers’ strength.
STARR-seq is a high-throughput application of the traditional enhancer reporter assay that directly and quantitatively assesses enhancers’ activities in a genome-wide manner. It complements existing enhancer detection methods based mainly on chromatin features. One of the limitations of STARR-seq, as the authors pointed out, is that it only assesses the potential enhancer ability of DNA sequences irrespective of the endogenous genomic context, such as DNA accessibility and histone modification.
Structural genome (re)annotation. The task of defining the complete set of transcripts is complicated by the fact that transcriptomes are highly dynamic entities, which change in response to both intracellular signals and the extracellular environment. In addition, expression level, allelic expression, and alternative splicing events add to the complexity of defining the transcriptome with regard to developmental stage, growth condition, or disease status.
Genomic studies, including gene expression profiling by microarray and chromatin feature assays by tiling array, are based on genome annotations. However, genome annotation is continuously being updated, and even the current annotation is incomplete, indicating that previous studies might have missed important information or might not be precise enough to uncover the biological insight. An accumulating number of studies using RNA-seq to structurally annotate genomes and transcriptomes have been generating a more complete and more precise map to facilitate our understanding of gene transcription. We highlight here examples that finely annotated the transcriptional landscape of a major invasive fungal pathogen by combining elegant experimental design and RNA-seq with comprehensive data analysis.
Candida species are major invasive fungal pathogens of humans, responsible for diseases ranging from superficial skin infections to deep-seated systemic candidiasis with high mortality rates, for which progression and severity are determined by the host immune system.179 The disease caused by Candida albicans largely depends on its ability to change its transcriptional landscape and thus switch its morphology in response to different host niches or environmental stimuli.180,181 Because of this clinical significance, and building on the way the transcriptome changes with environmental cues, Bruno and colleagues generated RNA-seq data from in vitro-cultured C. albicans under diverse growth conditions, including hyphae-inducing conditions, high/low oxidative stress/pH conditions, nitrosative stress, and cell wall damage-inducing conditions.182 From a total of 177 million mapped reads, they substantially refined the primary genome annotation by determining transcript positions, identifying new genes and new introns, and determining expression levels under each growth condition as well as condition-specific expression of novel transcripts. With a similar experimental design strategy, Linde et al depicted an even more detailed transcriptional map by annotating protein-coding and non-coding genes, introns, and UTRs in another Candida species, Candida glabrata, under pH and nitrosative stress.183 Comparative genomics also fueled this study to determine how species-specific and condition-specific adaptations are regulated by individual genetic repertoires and conserved orthologs at the transcriptional level.184
Outlook/Perspective
In this review, we have outlined major applications of RNA-seq in biomedical research, highlighted the computational approaches in data preprocessing, differential gene expression, alternative splicing, pathway analysis, and co-expression networks, and presented examples to show how this technology can be applied in the field of systems biology to advance our understanding at the genomic level. Because RNA-seq is a powerful tool for investigating the transcriptome in a highly quantitative manner at single-nucleotide resolution, for complex disease diagnosis, and for precision medicine, the rapidly accumulating sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. Although much progress has been made since the initial application of this technology, more applications remain possible with further refinement of each of these topics.
Single-cell RNA-seq. RNA-seq in single cells has provided a powerful new approach to study complex biological processes, for instance, promoting advances in cancer studies from qualitative microscopic images to quantitative genomic datasets in recent years.185 Single-cell genome and exome sequencing has fueled the investigation of fundamental questions including resolving solid tumor heterogeneity, identifying stem cells, tracking cell lineages and population consumption, measuring mutation rates, and detecting fusion gene events.19,186–188 Although single-cell sequencing can provide far more accurate measurements, the challenges of the
21. Conde L, Bracci PM, Richardson R, Montgomery SB, Skibola CF. Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma. Am J Hum Genet. 2013;92(1):126–30.
22. Erwin JA, Marchetto MC, Gage FH. Mobile DNA elements in the generation of diversity and complexity in the brain. Nat Rev Neurosci. 2014;15(8):497–506.
23. Wilhelm BT, Marguerat S, Watt S, et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453(7199):1239–43.
24. Saliba AE, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res. 2014;42(14):8845–60.
25. Wilson NK, Kent DG, Buettner F, et al. Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations. Cell Stem Cell. 2015;16(6):712–24.
26. Marzluff WF, Wagner EJ, Duronio RJ. Metabolism and regulation of canonical histone mRNAs: life without a poly(A) tail. Nat Rev Genet. 2008;9(11):843–54.
27. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. Genomewide characterization of non-polyadenylated RNAs. Genome Biol. 2011;12(2):R16.
28. Parkhomchuk D, Borodina T, Amstislavskiy V, et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 2009;37(18):e123.
29. Nagalakshmi U, Wang Z, Waern K, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–9.
30. Liu L, Li Y, Li S, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol. 2012;2012:251364.
31. Cloonan N, Forrest AR, Kolle G, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008;5(7):613–9.
32. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.
33. Rothberg JM, Hinz W, Rearick TM, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):348–52.
34. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–8.
35. Tariq MA, Kim HJ, Jejelowo O, Pourmand N. Whole-transcriptome RNAseq analysis from minute amount of total RNA. Nucleic Acids Res. 2011;39(18):e120.
36. Carrara M, Lum J, Cordero F, et al. Alternative splicing detection workflow needs a careful combination of sample prep and bioinformatics analysis. BMC Bioinformatics. 2015;16(suppl 9):S2.
37. Webster AF, Zumbo P, Fostel J, et al. Mining the archives: a cross-platform analysis of gene expression profiles in archival formalin-fixed paraffin-embedded (FFPE) tissue. Toxicol Sci. 2015.
38. Yang X, Liu D, Liu F, et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics. 2013;14:33.
39. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
40. Anders S, Pyl PT, Huber W. HTSeq – a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9.
41. Magoc T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27(21):2957–63.
42. Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184–5.
43. Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009;27(5):455–7.
44. Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009;6(11 suppl):S6–12.
45. Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24(5):713–4.
46. Li R, Yu C, Li Y, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25(15):1966–7.
47. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18(11):1851–8.
48. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
49. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
50. Lin H, Zhang Z, Zhang MQ, Ma B, Li M. ZOOM! Zillions of oligos mapped. Bioinformatics. 2008;24(21):2431–7.
51. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.
52. Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011;56(6):406–14.
53. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.
54. Kent WJ. BLAT – the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
55. Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012;28(24):3169–77.
56. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
57. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.
58. Marco-Sola S, Sammeth M, Guigo R, Ribeca P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods. 2012;9(12):1185–8.
59. Wang K, Singh D, Zeng Z, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.
60. Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res. 2008;18(6):839–46.
61. Hillier LW, Marth GT, Quinlan AR, et al. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008;5(2):183–8.
62. Campbell PJ, Stephens PJ, Pleasance ED, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008;40(6):722–9.
63. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
64. Delhomme N, Padioleau I, Furlong EE, Steinmetz LM. easyRNASeq: a bioconductor package for processing RNA-seq data. Bioinformatics. 2012;28(19):2532–3.
65. Lawrence M, Huber W, Pagès H, et al. Software for computing and annotating genomic ranges. PLoS Comput Biol. 2013;9(8):e1003118.
66. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.
67. Zhao S, Zhang B. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics. 2015;16:97.
68. Li S, Łabaj PP, Zumbo P, et al. Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat Biotechnol. 2014;32(9):888–95.
69. Filloux C, Cédric M, Romain P, et al. An integrative method to normalize RNA-seq data. BMC Bioinformatics. 2014;15:188.
70. Sun Z, Zhu Y. Systematic comparison of RNA-Seq normalization methods using measurement error models. Bioinformatics. 2012;28(20):2584–91.
71. Ager-Wick E, Henkel CV, Haug TM, Weltzien FA. Using normalization to resolve RNA-Seq biases caused by amplification from minimal input. Physiol Genomics. 2014;46(21):808–20.
72. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
73. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-seq data. BMC Bioinformatics. 2011;12:480.
74. Garmire LX, Subramaniam S. Evaluation of normalization methods in mammalian microRNA-Seq data. RNA. 2012;18(6):1279–88.
75. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21.
76. Grant GR, Manduchi E, Stoeckert CJ Jr. Analysis and management of microarray gene expression data. Curr Protoc Mol Biol. 2007;Chapter 19:Unit19.6.
77. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009;4:14.
78. Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010;8(suppl 1):177–92.
79. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
80. Jiang H, Wong WH. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009;25(8):1026–32.
81. Sengoelge G, Winnicki W, Kupczok A, et al. A SAGE based approach to human glomerular endothelium: defining the transcriptome, finding a novel molecule and highlighting endothelial diversity. BMC Genomics. 2014;15:725.
82. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
83. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013;31(1):46–53.
84. Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78.
85. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.
86. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13(3):523–38.
87. Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.
88. Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
89. Ritchie ME, Silver J, Oshlack A, et al. A comparison of background correction 121. Degner JF, Marioni JC, Pai AA, et al. Effect of read-mapping biases on detecting
methods for two-colour microarrays. Bioinformatics. 2007;23(20):2700–7. allele-specific expression from RNA-sequencing data. Bioinformatics. 2009;25(24):
90. Soneson C, Delorenzi M. A comparison of methods for differential expression 3207–12.
analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91. 122. Mayba O, Gilbert HN, Liu J, et al. MBASED: allele-specific expression detec-
91. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting dif- tion in cancer tissues and cell lines. Genome Biol. 2014;15(8):405.
ferentially expressed genes from RNA-seq data. Am J Bot. 2012;99(2):248–56. 123. Dennis G Jr, Sherman BT, Hosack DA, et al. DAVID: database for annotation,
92. Graveley BR. Alternative splicing: increasing diversity in the proteomic world. visualization, and integrated discovery. Genome Biol. 2003;4(5):3.
Trends Genet. 2001;17(2):100–7. 124. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unifica-
93. Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and tion of biology. The Gene Ontology Consortium. Nature Genetics. 2000;25(1):
the decoding machinery. Nat Rev Genet. 2007;8(10):749–61. 25–9.
94. Fu XD, Ares M Jr. Context-dependent control of alternative splicing by RNA- 125. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis:
binding proteins. Nat Rev Genet. 2014;15(10):689–701. a knowledge-based approach for interpreting genome-wide expression profiles.
95. Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Proc Natl Acad Sci USA. 2005;102(43):15545–50.
Biochem. 2003;72:291–336. 126. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto
96. Wahl MC, Will CL, Luhrmann R. The spliceosome: design principles of a encyclopedia of genes and genomes. Nucleic Acids Res. 1999;27(1):29–34.
dynamic RNP machine. Cell. 2009;136(4):701–18. 127. Du J, Yuan Z, Ma Z, Song J, Xie X, Chen Y. KEGG-PATH: Kyoto encyclope-
97. Chen M, Manley JL. Mechanisms of alternative splicing regulation: insights from dia of genes and genomes-based pathway analysis using a path analysis model.
molecular and genomics approaches. Nat Rev Mol Cell Biol. 2009;10(11):741–54. Mol Biosyst. 2014;10(9):2441–7.
98. Pandit S, Zhou Y, Shiue L, et al. Genome-wide analysis reveals SR protein coope 128. Rahmatallah Y, Emmert-Streib F, Glazko G. Comparative evaluation of gene set
ration and competition in regulated splicing. Mol Cell. 2013;50(2):223–35. analysis approaches for RNA-Seq data. BMC Bioinformatics. 2014;15:397.
99. Pan Q , Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative 129. Huang D, Sherman BT, Lempicki RA. Systematic and integrative analysis of
splicing complexity in the human transcriptome by high-throughput sequencing. large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:47.
Nat Genet. 2008;40(12):1413–5. 130. Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1[alpha]-responsive genes
100. Ray D, Kazan H, Cook KB, et al. A compendium of RNA-binding motifs for involved in oxidative phosphorylation are coordinately downregulated in human
decoding gene regulation. Nature. 2013;499(7457):172–7. diabetes. Nat Genet. 2003;34(3):267–73.
101. Sultan M, Schulz MH, Richard H, et al. A global view of gene activity and alter- 131. Hanzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for
native splicing by deep sequencing of the human transcriptome. Science. 2008; microarray and RNA-Seq data. BMC Bioinformatics. 2013;14(1):7.
321(5891):956–60. 132. Wang X, Cairns M. Gene set enrichment analysis of RNA-Seq data: integrating
102. Reddy AS, Rogers MF, Richardson DN, Hamilton M, Ben-Hur A. Deciphering differential expression and splicing. BMC Bioinformatics. 2013;14(suppl 5):S16.
the plant splicing code: experimental and computational approaches for predicting 133. Luo W, Friedman M, Shedden K, Hankenson K, Woolf P. GAGE: generally applica-
alternative splicing and splicing regulatory elements. Front Plant Sci. 2012;3:18. ble gene set enrichment for pathway analysis. BMC Bioinformatics. 2009;10(1):161.
103. Barash Y, Calarco JA, Gao W, et al. Deciphering the splicing code. Nature. 2010; 134. Xiong Q , Mukherjee S, Furey TS. GSAASeqSP: a toolset for gene set associa-
465(7294):53–9. tion analysis of RNA-seq data. Sci Rep. 2014;4:6347.
104. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis 135. Tarca AL, Draghici S, Khatri P, et al. A novel signaling pathway impact analysis.
of a single cell. Nat Methods. 2009;6(5):377–82. Bioinformatics. 2009;25(1):75–82.
105. Katz Y, Wang ET, Airoldi EM, Burge CB. Analysis and design of RNA sequencing 136. Gao S, Wang X. TAPPA: topological analysis of pathway phenotype association.
experiments for identifying isoform regulation. Nat Methods. 2010;7(12):1009–15. Bioinformatics. 2007;23(22):3100–2.
106. Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from 137. Haynes WA, Higdon R, Stanberry L, Collins D, Kolker E. Differential expres-
paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 2010;38(14):4570–8. sion analysis for pathways. PLoS Comput Biol. 2013;9(3):e1002967.
107. Ameur A, Wetterbom A, Feuk L, Gyllensten U. Global and unbiased detection 138. Young M, Wakefield M, Smyth G, Oshlack A. Gene ontology analysis for RNA-
of splice junctions from RNA-seq data. Genome Biol. 2010;11(3):R34. seq: accounting for selection bias. Genome Biol. 2010;11(2):R14.
108. Vitting-Seerup K, Porse BT, Sandelin A, Waage J. spliceR: an R package for 139. Zheng W, Chung LM, Zhao H. Bias detection and correction in RNA-sequencing
classification of alternative splicing and prediction of coding potential from data. BMC Bioinformatics. 2011;12:290.
RNA-seq data. BMC Bioinformatics. 2014;15:81. 140. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches
109. Aschoff M, Hotz-Wagenblatt A, Glatting KH, Fischer M, Eils R, Konig R. and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
SplicingCompass: differential splicing detection using RNA-seq data. Bioinfor- 141. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global
matics. 2013;29(9):1141–8. discovery of conserved genetic modules. Science. 2003;302(5643):249–55.
110. Zhao K, Lu ZX, Park JW, Zhou Q , Xing Y. GLiMMPS: robust statistical model 142. Giorgi FM, Fabbro CD, Licausi F. Comparative study of RNA-seq- and
for regulatory variation of alternative splicing using RNA-seq data. Genome Biol. microarray-derived coexpression networks in Arabidopsis thaliana. Bioinformatics.
2013;14(7):R74. 2013;29(6):717–24.
111. Shen S, Park JW, Huang J, et al. MATS: a Bayesian framework for flexible 143. Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network con-
detection of differential alternative splicing from RNA-Seq data. Nucleic Acids struction and analysis: safety in numbers. Bioinformatics. 2015;31(13):2123–30.
Res. 2012;40(8):e61. 144. van Dam S, Craig T, de Magalhães JP. GeneFriends: a human RNA-seq-based gene
112. Shen S, Park JW, Lu ZX, et al. rMATS: robust and flexible detection of differ- and transcript co-expression database. Nucleic Acids Res. 2015;43(D1):D1124–32.
ential alternative splicing from replicate RNA-Seq data. Proc Natl Acad Sci U S A. 145. Obayashi T, Okamura Y, Ito S, Tadaka S, Motoike IN, Kinoshita K. COX-
2014;111(51):E5593–601. PRESdb: a database of comparative gene coexpression networks of eleven species
113. Li W, Dai C, Kang S, Zhou XJ. Integrative analysis of many RNA-seq datasets for mammals. Nucleic Acids Res. 2013;41(D1):D1014–20.
to study alternative splicing. Methods. 2014;67(3):313–24. 146. Choi Y, Kendziorski C. Statistical methods for gene set co-expression analysis.
114. Iancu OD, Colville A, Darakjian P, Hitzemann R. Chapter four – coexpres- Bioinformatics. 2009;25(21):2780–6.
sion and cosplicing network approaches for the study of mammalian brain tran- 147. Chen L, Liu R, Liu Z-P, Li M, Aihara K. Detecting early-warning signals for sud-
scriptomes. In: Robert H, Shannon M, eds. International Review of Neurobiology. den deterioration of complex diseases by dynamical network biomarkers. Sci Rep.
Vol 116. Waltham: Academic Press; 2014:73–93. 2012;6:2.
115. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic vari- 148. Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biologi-
ants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164. cal processes using time-series gene expression data. Nat Rev Genet. 2012;13(8):
116. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from 552–64.
RNA-seq data. Am J Hum Genet. 2013;93(4):641–51. 149. Gao S, Wang X. Identification of highly synchronized subnetworks from gene
117. Dereeper A, Homa F, Andres G, et al. SNiPlay3: a web-based application for expression data. BMC Bioinformatics. 2013;14(suppl 9):S5.
exploration and large scale analyses of genomic variations. Nucleic Acids Res. 150. Yosef N, Shalek AK, Gaublomme JT, et al. Dynamic regulatory network con-
2015;43(W1):W295–300. trolling TH17 cell differentiation. Nature. 2013;496(7446):461–8.
118. Chuang LC, Kao CF, Shih WL, Kuo PH. Pathway analysis using information 151. Amar D, Safer H, Shamir R. Dissection of regulatory networks that are altered in
from allele-specific gene methylation in genome-wide association studies for disease via differential co-expression. PLoS Comput Biol. 2013;9(3):e1002955.
bipolar disorder. PLoS One. 2013;8(1):e53092. 152. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network
119. Pastinen T. Genome-wide allele-specific analysis: insights into regulatory varia- analysis. BMC Bioinformatics. 2008;9:559.
tion. Nat Rev Genet. 2010;11(8):533–8. 153. Si Y, Liu P, Li P, Brutnell TP. Model-based clustering for RNA-seq data.
120. Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex Bioinformatics. 2013;30(2):197–205.
diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 154. Hawkins RD, Hon GC, Ren B. Next-generation genomics: an integrative
2013;21(2):134–42. approach. Nat Rev Genet. 2010;11(7):476–86.
155. Wei G, Abraham BJ, Yagi R, et al. Genome-wide analyses of transcription factor 178. Patwardhan RP, Hiatt JB, Witten DM, et al. Massively parallel functional
GATA3-mediated gene regulation in distinct T cell types. Immunity. 2011; dissection of mammalian enhancers in vivo. Nat Biotechnol. 2012;30(3):265–70.
35(2):299–311. 179. Klepser ME. Candida resistance and its clinical relevance. Pharmacotherapy.
156. Ouyang Z, Zhou Q , Wong WH. ChIP-Seq of transcription factors predicts 2006;26(6 pt 2):68S–75S.
absolute and differential gene expression in embryonic stem cells. Proc Natl Acad 180. Biswas S, Van Dijck P, Datta A. Environmental sensing and signal transduc-
Sci U S A. 2009;106(51):21521–6. tion pathways regulating morphopathogenic determinants of Candida albicans.
157. Han Y, Han D, Yan Z, et al. Stress-associated H3K4 methylation accumulates dur- Microbiol Mol Biol Rev. 2007;71(2):348–76.
ing postnatal development and aging of Rhesus macaque brain. Aging Cell. 2012; 181. Cutler JE. Putative virulence factors of Candida albicans. Annu Rev Microbiol.
11(6):1055–64. 1991;45:187–218.
158. Wei G, Hu G, Cui K, Zhao K. Genome-wide mapping of nucleosome occupancy, 182. Bruno VM, Wang Z, Marjani SL, et al. Comprehensive annotation of the
histone modifications, and gene expression using next-generation sequencing transcriptome of the human fungal pathogen Candida albicans using RNA-seq.
technology. Methods Enzymol. 2012;513:297–313. Genome Res. 2010;20(10):1451–8.
159. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolu- 183. Linde J, Duggan S, Weber M, et al. Defining the transcriptomic landscape of
tion show widespread epigenomic differences. Nature. 2009;462(7271):315–22. Candida glabrata by RNA-Seq. Nucleic Acids Res. 2015;43(3):1392–406.
160. Yu W, McIntosh C, Lister R, et al. Genome-wide DNA methylation patterns in 184. Grumaz C, Lorenz S, Stevens P, et al. Species and condition specific adaptation
LSH mutant reveals de-repression of repeat elements and redundant epigenetic of the transcriptional landscapes in Candida albicans and Candida dubliniensis.
silencing pathways. Genome Res. 2014;24(10):1613–23. BMC Genomics. 2013;14:212.
161. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, et al. Transcriptome 185. Navin NE. Cancer genomics: one cell at a time. Genome Biol. 2014;15(8):452.
genetics using second generation sequencing in a Caucasian population. Nature. 186. Gerlinger M, Rowan AJ, Horswell S, et al. Intratumor heterogeneity and branched
2010;464(7289):773–7. evolution revealed by multiregion sequencing. N Engl J Med. 2012;366(10):883–92.
162. Pickrell JK, Marioni JC, Pai AA, et al. Understanding mechanisms underlying 187. Van Loo P, Campbell PJ. ABSOLUTE cancer genomics. Nat Biotechnol.
human gene expression variation with RNA sequencing. Nature. 2010;464(7289): 2012;30(7):620–1.
768–72. 188. Shah SP, Roth A, Goya R, et al. The clonal and mutational evolution spectrum
163. Solana J, Kao D, Mihaylova Y, et al. Defining the molecular profile of planar- of primary triple-negative breast cancers. Nature. 2012;486(7403):395–9.
ian pluripotent stem cells using a combinatorial RNAseq, RNA interference and 189. Sirbu A, Kerr G, Crane M, Ruskin HJ. RNA-seq vs dual- and single-channel
irradiation approach. Genome Biol. 2012;13(3):R19. microarray data: sensitivity analysis for differential expression and clustering.
164. Levine M, Tjian R. Transcription regulation and animal diversity. Nature. PLoS One. 2012;7(12):e50986.
2003;424(6945):147–51. 190. Westermann AJ, Gorski SA, Vogel J. Dual RNA-seq of pathogen and host.
165. Levine M. Transcriptional enhancers in animal development and evolution. Curr Nat Rev Microbiol. 2012;10(9):618–30.
Biol. 2010;20(17):R754–63. 191. Dodsworth BT, Flynn R, Cowley SA. The current state of naive human pluripo-
166. Levine M, Cattoglio C, Tjian R. Looping back to leap forward: transcription tency. Stem Cells. 2015.
enters a new era. Cell. 2014;157(1):13–25. 192. Das A, Chai JC, Kim SH, et al. Dual RNA sequencing reveals the expression of
167. Buecker C, Wysocka J. Enhancers as information integration hubs in develop- unique transcriptomic signatures in lipopolysaccharide-induced BV-2 microglial
ment: lessons from genomics. Trends Genet. 2012;28(6):276–84. cells. PLoS One. 2015;10(3):e0121117.
168. Calo E, Wysocka J. Modification of enhancer chromatin: what, how, and why? 193. Pittman KJ, Aliota MT, Knoll LJ. Dual transcriptional profiling of mice and Tox-
Mol Cell. 2013;49(5):825–37. oplasma gondii during acute and chronic infection. BMC Genomics. 2014;15:806.
169. Bulger M, Groudine M. Functional and mechanistic diversity of distal transcrip- 194. Choi YJ, Aliota MT, Mayhew GF, Erickson SM, Christensen BM. Dual RNA-
tion enhancers. Cell. 2011;144(3):327–39. seq of parasite and host reveals gene expression dynamics during filarial worm-
170. Boyle AP, Davis S, Shulha HP, et al. High-resolution mapping and characteriza- mosquito interactions. PLoS Negl Trop Dis. 2014;8(5):e2905.
tion of open chromatin across the genome. Cell. 2008;132(2):311–22. 195. Lu M, Zhang PJ, Li CH, Lv ZM, Zhang WW, Jin CH. miRNA-133 augments
171. Gaulton KJ, Nammo T, Pasquali L, et al. A map of open chromatin in human coelomocyte phagocytosis in bacteria-challenged Apostichopus japonicus via targeting
pancreatic islets. Nat Genet. 2010;42(3):255–9. the TLR component of IRAK-1 in vitro and in vivo. Sci Rep. 2015;5:12608.
172. Heintzman ND, Stuart RK, Hon G, et al. Distinct and predictive chromatin 196. Camilios-Neto D, Bonato P, Wassem R, et al. Dual RNA-seq transcriptional
signatures of transcriptional promoters and enhancers in the human genome. Nat analysis of wheat roots colonized by Azospirillum brasilense reveals up-regulation
Genet. 2007;39(3):311–8. of nutrient acquisition and cell cycle genes. BMC Genomics. 2014;15:378.
173. Heintzman ND, Hon GC, Hawkins RD, et al. Histone modifications at human 197. Lange M, Eisenhauer N, Sierra CA, et al. Plant diversity increases soil microbial
enhancers reflect global cell-type-specific gene expression. Nature. 2009;459(7243): activity and soil carbon storage. Nat Commun. 2015;6:6707.
108–12. 198. Schulze S, Henkel SG, Driesch D, Guthke R, Linde J. Computational predic-
174. Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. tion of molecular pathogen-host interactions based on dual transcriptome data.
A unique chromatin signature uncovers early developmental enhancers in Front Microbiol. 2015;6:65.
humans. Nature. 2011;470(7333):279–83. 199. Torres-García W, Zheng S, Sivachenko A, et al. PRADA: pipeline for RNA
175. Bonn S, Zinzen RP, Girardot C, et al. Tissue-specific analysis of chromatin state sequencing data analysis. Bioinformatics. 2014;30(15):2224–6.
identifies temporal signatures of enhancer activity during embryonic develop- 200. Xu G, Strong MJ, Lacey MR, Baribault C, Flemington EK, Taylor CM. RNA
ment. Nat Genet. 2012;44(2):148–56. CoMPASS: a dual approach for pathogen and host transcriptome analysis of
176. Visel A, Blow MJ, Li Z, et al. ChIP-seq accurately predicts tissue-specific activ- RNA-seq datasets. PLoS One. 2014;9(2):e89445.
ity of enhancers. Nature. 2009;457(7231):854–8.
177. Melnikov A, Murugan A, Zhang X, et al. Systematic dissection and optimiza-
tion of inducible enhancers in human cells using a massively parallel reporter
assay. Nat Biotechnol. 2012;30(3):271–7.