ROGATO - Dynamics of Mathematical Models in Biology - Bringing Mathematics To Life
Dynamics of Mathematical Models in Biology
Bringing Mathematics to Life
Alessandra Rogato • Valeria Zazzu
Mario Guarracino
Editors
Alessandra Rogato
National Research Council
Institute of Biosciences and Bioresources
Naples, Italy
Integrative Marine Ecology
Stazione Zoologica Anton Dohrn
Napoli, Italy

Valeria Zazzu
National Research Council
Institute of Genetics and Biophysics, ABT
Naples, Italy

Mario Guarracino
Laboratory for Genomics, Transcriptomics
and Proteomics (Lab-GTP)
High Performance Computing
and Networking Institute (ICAR)
National Research Council (CNR)
Naples, Italy
Despite Galileo’s claim that mathematics is the language of nature, the two
disciplines of mathematics and life sciences were long considered two planets
belonging to two very distant galaxies that would never meet. The two communities
were vastly different and it seemed impossible for them to collaborate. Only recently,
when life scientists began producing experimental data at an unprecedentedly high
pace, did it become clear that mathematical models were necessary to interpret
such data and to structure them, with the ultimate goal of unveiling biological
mechanisms, making new discoveries, and making predictions.
There are very few examples of events that bring the two communities together
to discuss research questions. For this reason we decided to create a series of annual
workshops to gather a multidisciplinary and international community. “Bringing
Maths to Life” enabled the two communities of life scientists and mathematicians
to exchange a bidirectional flow of ideas. The broad community of mathematicians
introduced life scientists to new algorithms, methods, and software that may
be useful to model life. Biologists, in turn, posed new challenges for
mathematicians, thereby bringing to life novel opportunities for mathematicians to
explore interesting problems. Many ideas and collaborations began from this
workshop. In the second year of the workshop, the leitmotiv revolved around the concepts
of time and the dynamicity of nature. In the necessary simplifications applied during
the modeling process, time is sometimes not accounted for in an attempt to avoid
exponential complexity in computations. Nevertheless, time imposes a different
thought paradigm, which in turn can create more elegant mathematical models.
The second workshop, held on October 19–21, 2015, in Naples (Italy),
featured three main sessions. “Dynamics of genomes and genetic variation” was the
topic of the first session. In this session, we discussed the molecular mechanisms
and evolutionary processes that shape the structure and function of genomes and
that govern genome dynamics. The session on “Dynamics of motifs” provided an
overview of current methods for motif searching in DNA, RNA, and proteins, a
key process to discover emergent properties of cells, tissues, and organisms. The
third session was dedicated to the “Dynamics of biological networks.” Networks
representing complex biological functions and activities are useful to interpret
processes in the cell, and several mathematical models and algorithms are now
available for their integration, analysis, and characterization. As mentioned above,
in the necessary simplifications applied during the modeling process, time was often
not accounted for in an effort to avoid exponential complexity in computations.
In this volume we collect many of the important ideas that derived from the
workshop which are representative of the research questions that can be posed
within such multidisciplinary applications. In the first chapter, Verena Thormann
and colleagues describe transcriptional regulation (when 1 + 1 ≠ 2). In the chapter
“Differential Network Analysis and Graph Classification: A Glocal Approach”,
a glocal approach to differential network analysis and graph classification is
introduced by Giuseppe Jurman and colleagues. Maria Pia Saccomani and Karl
Thomaseth discuss the identifiability of differential equation models that are used
in systems biology. In the chapter “Boolean Dynamics of Regulatory Compound
Circuits”, Elisabeth Remy et al. discuss the regulatory circuits and their dynamics.
Target genes of homologous transcription factors are differentially analyzed by
Elijah K. Lowe and colleagues. In the chapter “Reconstructing a Genetic Network
from Gene Perturbations in Secretory Pathway of Cancer Cell Lines”, a pipeline
for gene regulatory networks reconstruction is proposed by Marina Piccirillo
et al., and in the chapter “Dissecting the Functions of the Secretory Pathway
by Transcriptional Profiling” the functions of secretory pathways are analyzed
starting from transcriptional profiling by Sonali Gopichand Chavan and colleagues.
Saraunas Germanas et al. propose a Beta-Binomial model to detect rare mutations
in NGS experiments. In the chapter “An Overview of Genotyping by Sequencing
in Crop Species and Its Application in Pepper”, pepper genotyping by sequencing
is discussed by Francesca Taranto et al. Irma Terracciano and colleagues describe
in the chapter “Hybridization-Based Enrichment and Next Generation Sequencing
to Explore Genetic Diversity in Plants” how to explore genetic diversity in plants.
Lastly, in the chapter “DecontaMiner: A Pipeline for the Detection and Analysis
of Contaminating Sequences in Human NGS Sequencing Data”, Ilaria Granata
and colleagues describe a pipeline for the detection of sequences belonging to
contaminating organisms in human NGS sequencing data.
We would like to acknowledge the work and support we have received for
realizing this volume.
The workshop was organized by Alessandra Rogato (Institute of Biosciences
and Bioresources), Valeria Zazzu and Enza Colonna (Institute of Genetics
and Biophysics “Adriano Buzzati-Traverso”), and Mario Guarracino (High Performance
Computing and Networking Institute and Institute for Higher Mathematics
“F. Severi”) from the Italian National Research Council (CNR), Italy.
Gerardo Toraldo from the Department of Mathematics and Applications “Renato
Caccioppoli,” University of Naples Federico II, contributed to the organization.
The initiative has been supported by the Italian National Research Council (CNR),
the Institute for Higher Mathematics “F. Severi” (INDAM), the High Performance
Contributors
Beacon Center for Evolution in Action, Michigan State University, East Lansing,
MI, USA
Alberto Luini Institute of Protein Biochemistry at National Research Council
(CNR), Naples, Italy
Istituto di Ricovero e Cura a Carattere Scientifico SDN, Naples, Italy
Sebastiaan H. Meijsing Max Planck Institute for Molecular Genetics, Berlin,
Germany
Brigitte Mossé Aix Marseille Université, CNRS, Centrale Marseille, I2M, UMR
7373, Marseille, France
Seetharaman Parashuraman Institute of Protein Biochemistry, National
Research Council (CNR-IBP), Naples, Italy
Marina Piccirillo Laboratory for Genomics, Transcriptomics and Proteomics
(LAB-GTP), High Performance Computing and Networking Institute (ICAR) at
National Research Council (CNR), Naples, Italy
Elisabeth Remy Aix Marseille Université, CNRS, Centrale Marseille, I2M, UMR
7373, Marseille, France
Samantha Riccadonna Centro Ricerca e Innovazione, Fondazione Edmund Mach,
San Michele all’Adige, Italy
Prathyush Deepth Roy Institute of Protein Biochemistry at National Research
Council (CNR), Naples, Italy
Maria Pia Saccomani Department of Information Engineering, University of
Padova, Padova, Italy
Mara Sangiovanni ICAR-CNR, Naples, Italy
Francesca Taranto Consiglio per la ricerca in agricoltura e l’analisi dell’economia
agraria (CREA) – Centro di ricerca per l’orticoltura, Pontecagnano Faiano (SA),
Italy
Irma Terracciano Consiglio per la ricerca in agricoltura e l’analisi dell’economia
agraria (CREA) - Centro di ricerca per l’orticoltura, Pontecagnano Faiano (SA),
Italy
Denis Thieffry Computational Systems Biology team, Institut de Biologie de
l’Ecole Normale Supérieure (IBENS), CNRS UMR8197, INSERM U1024, Ecole
Normale Supérieure, PSL Research University, Paris, France
Karl Thomaseth Institute of Electronics, Computer and Telecommunication Engi-
neering (IEIIT-CNR) c/o DEI, Padova, Italy
Verena Thormann Max Planck Institute for Molecular Genetics, Berlin, Germany
1 Introduction
How can muscle cells have a distinct phenotype compared to blood cells although
both share the same genetic information? One answer to this fundamental question is
that different sets of genes are expressed in different cell types. Therefore, a detailed
understanding of the mechanisms that control the expression of genes is needed
to better understand how cells adopt and change their identity. Two key players
in regulating the expression of genes are cis- and trans-acting elements. The cis
elements are DNA sequences encoded in the genome that can be bound by trans-
acting transcription factors (TFs), which in turn can influence the recruitment or
activity of the RNA polymerase to influence the expression of genes. Notably, only
about 1 % of the genome codes for proteins, which leaves a large fraction of the
genome available for potential regulatory functions.
Activation of the right genes at the right place and at the right time is critical,
as the misexpression of genes can have pathological consequences. For example,
Sonic hedgehog is an essential gene involved in embryonic development of the limbs
and a failure to express this gene results in severe limb malformations, e.g., hands
with only one digit [1]. Similarly, expressing a gene in the wrong place can have
detrimental effects as was shown in the fruit fly Drosophila melanogaster where
misexpression of the Antp gene in the head leads to the growth of legs instead of
antennas [2]. In addition to expressing the right genes at the right place, proper
development requires genes to be expressed at the right level and a failure to
express genes at the right dosage can lead to impaired development and disease.
One well-known example is Down syndrome, where an extra copy of chromosome
21 and the resulting increased gene-dosage results in several severe developmental
defects. Similarly, expressing too little of the tumor suppressor gene p53 results in
an increased chance to develop cancer [3].
To regulate the expression of genes, TFs are recruited to specific regulatory
sequences, encoded in the genome (Fig. 1) [4]. These transcription factor binding
sites (TFBS) are specific DNA recognition sequences located in regulatory regions.
Typically, TFBSs for different TFs are found in clusters that can be referred to
as enhancers. These enhancers act on the promoter of genes to influence the
recruitment or activity of RNA polymerase and ultimately influence if and how
much of a gene is expressed (Fig. 1). Enhancers can be located proximal to the
promoter or at a large distance from the transcriptional start site (TSS) of genes [5],
which raises the question of how enhancers that are remote from the promoter in linear
space can influence events at the promoter of genes. One explanation is
that looping of the DNA and its three-dimensional organization in the nucleus can
bring together sequences that are remote in linear space [6]. Other levels of genome
organization that influence the functioning of enhancers include the fact that the
DNA in the nucleus is wrapped around histone proteins to form nucleosomes. The
tails of these histones can be post-translationally modified and specific modifications
were shown to correlate with the activity of enhancer elements [7].
Transcriptional Regulation: When 1 + 1 ≠ 2
Fig. 1 Signaling pathway of the glucocorticoid receptor. Unbound glucocorticoid receptor (GR)
resides in the cytoplasm and upon binding to its cognate steroid hormone (dark red) translocates
to the nucleus where it interacts with GR binding sites (GBS) in the promoter and/or in
enhancer regions. Genome-bound GR, together with cofactors and other transcription factors
bound to transcription factor binding sites (TFBS), influences the recruitment and activity of RNA
polymerase II to ultimately regulate the expression of its target gene
predictions. Second, we will discuss attempts to link the binding of TFs to the
regulation of genes, the role of the three-dimensional organization of the genome
in the nucleus, and how the sequence of TFBSs influences how much of a target gene
is expressed. Finally, we will present an outlook of how newly developed methods
can contribute to our understanding of the role of TFs and TFBSs in orchestrating
the expression of genes.
One of the crucial steps in the regulation of gene expression is the binding of
sequence-specific TFs to regulatory DNA sequences associated with their target
genes. In principle, the binding of TFs can be predicted from sequence. In practice,
however, sequence alone is a poor predictor of TF binding. This is in part a
consequence of the fact that TFBSs are typically short and degenerate and thus
potential binding sites are ubiquitously present in the genome and only a minority
of these potential binding sites is actually bound by TFs. Furthermore, TF binding is
often highly cell-type specific despite the fact that these cell types harbor the same
genome.
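A rough calculation illustrates why short motifs occur so often by chance. The sketch below assumes a random sequence with equal base frequencies and a single fixed k-mer without degenerate positions; the function name and its parameters are illustrative and not taken from the chapter.

```python
def expected_chance_matches(genome_length: int, motif_length: int,
                            both_strands: bool = True) -> float:
    """Expected number of exact matches of one fixed k-mer in a random sequence.

    Assumption: each base occurs with probability 0.25, independently of its
    neighbors. Real genomes deviate from this, so the result is only an
    order-of-magnitude estimate.
    """
    positions = genome_length - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length            # chance of a match at one start
    strands = 2 if both_strands else 1             # non-palindromic motif assumed
    return positions * per_position * strands


if __name__ == "__main__":
    # Hypothetical example: a 6-bp motif in a genome of about 3 billion bp.
    print(expected_chance_matches(3_000_000_000, 6))
```

Under these assumptions a 6-bp motif is expected on the order of a million times in a genome of about 3 × 10^9 bp, and allowing degenerate positions raises that number further.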
Experimentally, in vivo genome-wide binding of TFs can be determined by
chromatin immunoprecipitation (ChIP)-based techniques (Fig. 2a). As a first step
of the ChIP procedure, formaldehyde is used to covalently cross-link TFs to their
genomic binding sites. Subsequently, the cross-linked DNA is sheared into smaller
fragments of approximately 200–300 base pairs in length and the resulting protein–
DNA complexes are co-precipitated using an antibody specific for the TF of interest.
Finally, either qPCR-based methods or DNA sequencing (for ChIP-seq) identifies
the enrichment of DNA sequences that are occupied by a given TF (Fig. 2b). In the
past decade, the advent of next generation sequencing methods resulted in a wealth
of available genome-wide ChIP-seq data for different TFs from a wide variety of
different cell types, tissues, and model organisms. From these data, the recognition
sequence of a TF of interest can be derived using computational methods. These
methods can uncover sequences that are over-represented in regions bound by a
specific TF and can be used to generate a consensus motif. The consensus motif can
be graphically displayed as a sequence logo to represent the position weight matrix
(PWM) which describes the nucleotide preference at each nucleotide position within
the motif (Fig. 2c) [9].
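As a minimal sketch of how a PWM and the heights in a sequence logo relate to a set of aligned binding sites, the code below counts nucleotides per position and computes the information content of each position in bits. The function names and the toy sites are hypothetical, the sites are not real GR binding sites, and the small-sample correction used by many logo tools is left out.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequency_matrix(sites, pseudocount=0.0):
    """Return one {base: frequency} dict per position of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: (counts.get(b, 0) + pseudocount) / total for b in BASES})
    return matrix


def information_content(column):
    """Information content in bits: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


if __name__ == "__main__":
    toy_sites = ["ACGT", "ACGA", "ACGC", "ACGG"]  # hypothetical aligned sites
    for pos, column in enumerate(position_frequency_matrix(toy_sites), start=1):
        print(pos, round(information_content(column), 2))
```

A fully conserved position reaches the maximum of 2 bits, whereas a position at which all four nucleotides are equally frequent carries 0 bits.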
Conceivably, the PWM could now be used to directly predict TF binding to a
given DNA sequence or even an entire genome. However, prediction of genome-
wide binding based on the PWM typically fails for several reasons. First, not all
DNA sequences that are bound in vivo match the consensus motif. This could
be due to the fact that some TFs can bind to highly degenerate sequences, in
which up to several base pairs can differ from its consensus sequence [10].
[Fig. 2, sequence logo panel: PWM for GR; vertical axis in bits (ticks at 1 and 2), horizontal axis position 1–16, 5’ to 3’]
Cooperative binding with other TFs can turn such degenerate sequences into high
affinity binding sites. For example, GR was reported to bind together with AP1 at
composite regulatory sequences to cooperatively regulate Notch4 gene expression
[11]. Moreover, some TFs can bind without direct contact to the DNA by binding
to other proteins, a mechanism referred to as DNA tethering. Hence, at tethered
regions computational prediction of TFBSs using the PWM would miss indirect
interactions mediated by other proteins. For example, studies using human cell lines
have shown that GR binds at promoter regions of genes involved in mediating
the immune-modulating actions of glucocorticoids that contain no obvious GR
consensus motif. At these regions, GR-tethering to NFkB was shown to be an
important mechanism responsible for GR-mediated gene regulation [12]. A second
reason why PWMs fail to accurately predict genome-wide TF binding patterns is
that not all computationally predicted DNA sequences are bound in vivo. In fact,
only a minor fraction of all possible sequences matching the consensus motif of a TF
are actually bound in vivo. For GR, the vast majority of genomic GR binding sites
are located in the so-called open chromatin [13], arguing that chromatin accessibility
(as assayed by DNase-I hypersensitivity assays) is a key player in specifying which
of the potential binding sites encoded in the genome can be bound. Changes in
chromatin accessibility, which can occur in response to environmental signals and
during cellular differentiation [14, 15], can thus explain why TF occupancy can be
highly cell-type [16, 17] and cell-stage specific [18].
Together, the computational prediction of genomic TFBSs suffers from two
critical issues. First, false-negative predictions arise when TFBSs are missed because
the TF is tethered by other DNA-binding factors or binds to degenerate sequences.
Notably, comparison of different computational models for the prediction of TF
binding specificity showed that most often the best performing motifs were those
with the highest nucleotide degeneracy [19]. Second, computational prediction of
TFBSs may result in false-positive predictions for TFBSs that match the consensus
but are not available for TF binding in vivo, e.g., due to their location in closed
chromatin.
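The two failure modes can be made concrete with a small sketch that scans a sequence with a log-odds PWM and then discards hits outside accessible chromatin. Everything here is an illustrative assumption: the function names, the uniform background of 0.25, the score threshold, the forward-strand-only scan, and the representation of open chromatin as half-open (start, end) intervals. It is not the method of any cited study.

```python
import math

BASES = "ACGT"


def log_odds_matrix(freq_matrix, background=0.25):
    """Convert per-position base frequencies (all > 0) into log2 odds scores."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freq_matrix]


def scan(sequence, lod, threshold):
    """Return (start, score) for every forward-strand window scoring >= threshold."""
    width = len(lod)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(lod[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def keep_accessible(hits, width, open_regions):
    """Keep hits that lie entirely inside one half-open (start, end) open region."""
    return [(s, score) for s, score in hits
            if any(a <= s and s + width <= b for a, b in open_regions)]


if __name__ == "__main__":
    # Hypothetical 3-bp motif "ACG" with some tolerance at every position.
    freqs = [{b: (0.85 if b == best else 0.05) for b in BASES} for best in "ACG"]
    lod = log_odds_matrix(freqs)
    hits = scan("TTACGTTTTACGTT", lod, threshold=5.0)
    print(hits)
    print(keep_accessible(hits, width=3, open_regions=[(0, 6)]))
```

In this toy setting, motif matches outside the accessible interval correspond to the false positives described above, while binding through tethering would never be found by the scan at all, however low the threshold.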
TF binding and the regulation of nearby genes are clearly connected. However,
this link is typically statistical rather than deterministic. For example, scanning a
window of 300 kb around the TSS of genes showed that, of all genes with a GR
binding site in this window, only a fraction actually changes its expression in
response to GR binding (Meijsing lab unpublished results). Although the fraction
of regulated genes is higher when only TFBSs in close proximity to the TSS are
considered, the link between promoter-proximal GR binding and gene regulation
remains far from deterministic. Similarly, ChIP-seq experiments typically uncover
several thousands of peaks for an individual TF, whereas TF perturbations usually
result in only a small number of affected genes [20, 21]. Consequently, TF binding is
a poor predictor of gene regulation and understanding what distinguishes productive
TF binding events (resulting in the regulation of a gene) from non-productive
binding events remains a key challenge.
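The window-based comparison described above can be sketched as follows. The data structures, the function names, and the reading of the window as a total width centered on the TSS are assumptions made for illustration; the toy coordinates are invented and do not reproduce the unpublished analysis.

```python
def genes_with_nearby_peak(tss_by_gene, peaks, window):
    """Return genes that have at least one peak within a window centered on the TSS.

    tss_by_gene: dict mapping gene name -> (chromosome, TSS position)
    peaks:       list of (chromosome, peak position)
    window:      total window width in bp (assumed to be centered on the TSS)
    """
    half = window // 2
    bound = set()
    for gene, (chrom, tss) in tss_by_gene.items():
        if any(c == chrom and abs(pos - tss) <= half for c, pos in peaks):
            bound.add(gene)
    return bound


def fraction_regulated(bound_genes, regulated_genes):
    """Fraction of bound genes that also change expression (0.0 if none are bound)."""
    if not bound_genes:
        return 0.0
    return len(bound_genes & regulated_genes) / len(bound_genes)


if __name__ == "__main__":
    tss = {"geneA": ("chr1", 1_000_000),
           "geneB": ("chr1", 2_000_000),
           "geneC": ("chr2", 500_000)}
    chip_peaks = [("chr1", 1_100_000), ("chr1", 2_140_000), ("chr2", 900_000)]
    responsive = {"geneA", "geneC"}  # hypothetical hormone-responsive genes
    bound = genes_with_nearby_peak(tss, chip_peaks, window=300_000)
    print(sorted(bound), fraction_regulated(bound, responsive))
```

Repeating the calculation with smaller windows is one way to examine how the regulated fraction depends on the distance between binding site and TSS.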
One additional signal that may help distinguish productive from non-productive
binding events is the post-translational modification state of the histones located at
the enhancer regions harboring TFBSs. For example, actively transcribed promoters
and active enhancers show elevated levels of histone H3 lysine 27 acetylation
(H3K27ac) [22, 23]. Thus, one possibility to computationally predict productive TF
binding events is to combine information regarding TF binding with the occurrence
of specific histone modifications. Such computational strategies were shown to
be quite successful, especially when additional information such as sequence
conservation, DNA accessibility, or gene expression data were also taken into
account [24]. Testing if predicted enhancers are indeed capable of regulating the
expression of genes is traditionally done using reporter gene assays. To test their
activity, the predicted regulatory region is cloned in front of a minimal promoter
sequence that drives the expression of a reporter gene, e.g., the expression of
the luciferase gene (Fig. 3b). Next, the regulatory activity of a given TFBS can
be analyzed in a heterologous context by measuring the amount of reporter gene
activity. The presence of specific histone modifications at the enhancer region can
serve as a good indicator of in vivo regulatory activity as detected by reporter gene
assays [23]. However, the accuracy of the prediction is limited. For example, a high-
throughput functional screen of enhancers computationally predicted based on their
pattern of histone modifications showed that only about one-fourth of all tested
sequences were indeed active in reporter gene assays.
Fig. 3 (a) Endogenous regulation of gene expression by enhancers. In vivo, bound TFBSs are
mainly located in open chromatin regions, where TFs either bind directly to DNA or indirectly by
tethering to other DNA-binding TFs. Productive TF binding can be influenced by the presence of
associated chromatin marks or the occurrence of other co-factors. TFBSs can be located several
thousands of kilobases away from their target genes and can regulate gene expression by DNA-
looping. To regulate gene expression, bound TFs influence the recruitment and activity of RNA
polymerase II. (b) Reporter gene assays. To test the regulatory activity of a TFBS in reporter
gene assays, the candidate regulatory region is cloned in front of a minimal promoter that drives
the expression of a reporter gene. Upon transfection of the reporter plasmid into living cells, its
regulatory activity can be analyzed by measuring the amount of generated gene product. In the
depicted example, the regulatory activity of the tested regulatory region correlates with the level
of luciferase activity. (c) Tab. 1. Features influencing the regulation of gene expression in vivo in
comparison to reporter gene assays
In particular, the classification
into strong and weak enhancers based on their level of histone modifications did
not have a great predictive value [25]. This could either mean that the predictions
are wrong or that the reporter setting fails to recapitulate the complexity of gene
regulation in the endogenous context. Regarding the latter, reporter genes differ
from the endogenous genomic setting at which gene regulation takes place in a
number of ways (Fig. 3). These differences include the fact that enhancers and
TFBS are typically tested using a heterologous promoter and that reporters fail
to recapitulate the endogenous sequence context or the chromatin environment of
the investigated TFBS. Therefore, a regulatory sequence that is unable to drive
reporter gene expression is not necessarily inert in its natural genomic context.
Conversely, the ability of an enhancer to activate the reporter gene does not prove
that an enhancer region is capable of doing the same in the endogenous genomic
context.
Notably, even when the function of putative enhancers is tested in their endoge-
nous genomic context the results might be hard to interpret. For example, studies
in Drosophila showed that the deletion of two enhancers linked to the expression
of an important developmental gene resulted in only minor developmental defects
when cultured under standard laboratory conditions. In contrast, at high or low
temperatures the deletion of these seemingly non-functional enhancers resulted
in pronounced developmental defects [26]. This shows that the importance of
enhancers might be context-dependent and only become apparent under specific
environmental conditions. Furthermore, functional redundancy among enhancers
might mask the functional importance of a specific enhancer when enhancers are mutated
individually [27].
In summary, although on a global scale the binding of TFs and the regulation of
genes are clearly connected, if and how TF binding and gene regulation are linked
at individual genes is typically unknown. Thus, unraveling the operating principles
that specify which binding events are productive remains a major challenge.
Furthermore, this study found that the magnitude of gene expression changes
increased with an increasing number of TFBSs that show long-range interactions
[43]. This suggests that data from 3C-based approaches can help identify, or at least
enrich for, productive TF binding events that result in the regulation of associated
target genes. However, although the ability to predict changes in gene expression
improves when taking long-range interactions into account, the connection is still
far from deterministic. This might in part be due to the limited resolution of the
Hi-C experiments (5–10 kb range), which could result in false-positive enhancer–
promoter contacts. Predictions might thus improve further if technological advances
increase the resolution of 3C-based methods.
So far we have discussed the regulatory activity of TFBSs and their associated target
gene expression as an all-or-nothing event where genes are either regulated by a TF
or not. However, in addition to expressing the right genes at the right time, getting
the dosage of individual genes right is important for development and homeostasis.
This fine-tuning of gene expression is a consequence of the integration of several
signaling inputs that impinge on a gene. These inputs include the combinatorial
interactions of TFs at response elements, post-translational modifications of DNA,
RNA, and proteins, and processes that influence the stability of RNA once produced.
Here, we will focus on one signaling input that can influence the level of expression
of genes: the sequence of the TFBS and its sequence environment. One mechanism
by which the sequence of a TFBS can influence transcriptional output is through
differences in TF affinity, with high affinity binding sites resulting in more TF
recruitment and consequently higher expression levels of associated target genes.
However, in addition to affinity-driven differences in activity, TFBS sequence
variants may also modulate transcriptional output by acting as allosteric ligands
that influence the structure and activity of associated TFs towards their target genes.
The sequence of individual TFBSs bound by a TF typically differs between
genomic loci and, depending on the sequence, TFs can have higher or lower
affinities for individual binding sites. In vitro, systematic evolution of ligands by
exponential enrichment (SELEX) can be used to identify DNA sequences with
the highest binding affinity for a specific TF. SELEX starts with a large initial
library of random DNA oligonucleotides. From this library, high affinity binding
sites are enriched by repeated cycles of TF binding followed by isolation and PCR
amplification of bound sequences. The resulting pool of enriched DNA sequences
can then be sequenced to identify sequences bound by the TF of interest [46] and
to calculate relative TF affinities from the level of sequence enrichment [47]. In
vitro approaches, such as SELEX, showed that the intrinsic DNA-binding affinity
for a TF is in part determined by the base readout of the TF binding sequence
as represented by its consensus motif. However, the base readout is not the only
variable that contributes to the overall binding preference of a given TF. Evidence
from structural biology showed that TF binding affinities are also influenced by the
sequence-specific higher order conformation of DNA, resulting in specific bending
of the DNA structure and altered protein-DNA interactions [48]. The consensus
recognition motif derived from SELEX experiments captures which sequences are
bound at high affinity by the TF investigated. In vivo, however, high affinity binding
sites are not necessarily responsible for the biological consequences of TF signaling.
In fact, the biological significance of low-affinity binding sites was confirmed for
several TFs [49, 50]. For instance, it was shown that low-affinity binding sites of a
Hox TF are responsible for the regulation of target genes in vivo. In addition, these
low-affinity binding sites ensure that only specific members of the Hox family
of TFs can bind and activate transcription from these binding sites. Thus, low-
affinity binding sites provide specificity among paralogous Hox TFs that was lost
when these binding sites were changed to high affinity binding sites [50]. In support
of the importance of low-affinity binding sites in gene regulation, computational
modeling of enhancer evolution predicted that regulation by multiple low-affinity
binding sites might be favored by evolutionary selection. A possible reason for
this could be that multiple low-affinity binding sites offer more possibilities for the
regulation of gene expression by changing multiple weak sites rather than one high
affinity TFBS [51]. Furthermore, the usage of multiple low-affinity binding sites
was suggested to enable efficient fine-tuning of gene expression in response to the
integration of several signaling inputs [52]. Finally, enhancers containing multiple
low-affinity binding sites for the same TF could maintain genetic redundancy and
confer regulatory robustness [50].
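The idea mentioned earlier in this section, that relative TF affinities can be estimated from the level of sequence enrichment in SELEX, can be illustrated with a deliberately simplified sketch. It compares the frequency of each sequence in a selected pool with its frequency in the input library and scales the result to the most enriched sequence. The function name, the pseudocount, and the toy reads are assumptions, and this is not the specific procedure of reference [47].

```python
from collections import Counter


def relative_enrichment(input_reads, selected_reads, pseudocount=1.0):
    """Enrichment of each sequence (selected vs. input), scaled so the best is 1.0.

    A pseudocount is added to every count so that sequences missing from one
    pool do not cause a division by zero.
    """
    in_counts = Counter(input_reads)
    sel_counts = Counter(selected_reads)
    sequences = set(in_counts) | set(sel_counts)
    in_total = len(input_reads) + pseudocount * len(sequences)
    sel_total = len(selected_reads) + pseudocount * len(sequences)
    enrichment = {
        seq: ((sel_counts[seq] + pseudocount) / sel_total)
             / ((in_counts[seq] + pseudocount) / in_total)
        for seq in sequences
    }
    best = max(enrichment.values())
    return {seq: value / best for seq, value in enrichment.items()}


if __name__ == "__main__":
    library = ["AAA"] * 10 + ["CCC"] * 10   # hypothetical input library
    bound = ["AAA"] * 30 + ["CCC"] * 10     # hypothetical pool after selection
    print(relative_enrichment(library, bound, pseudocount=0.0))
```

Such a ratio is only a proxy for relative affinity, because PCR bias, the number of selection cycles, and saturation of the strongest sites all distort it.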
If affinity were the major driver of transcriptional output levels, these levels could be
calculated based on TFBS affinity [53, 54]. However, this occupancy hypothesis
has recently been challenged by several studies showing that high affinity binding
sites are not necessarily those with the highest activity [50, 55–57]. For example,
the affinity of GR for different GR binding site variants determined in vitro does not
correlate with in vivo transcriptional output as determined by reporter gene assays
[56]. An alternative explanation for the binding site-specific activities could be that
sequence variants induce distinct subtle structural changes in associated TFs which
in turn influence their activity towards target genes [56, 58]. Although studying the
role of the TFBS sequence on transcriptional output in isolation, where all other
variables are kept the same, simplifies interpretation of the results, in reality, TFBSs
are not an isolated linear stretch of DNA, but are embedded in a binding-site-specific
context. Consequently, in vivo, additional factors contribute to the overall binding
affinity and activity of a TFBS. For instance, the conformation of DNA is not only
influenced by the core TFBS sequence but also by nucleotides flanking these sites
[59]. Further, interactions between TFs binding at regions with multiple TFBSs
modulate their interaction with the genome by direct physical interactions [60].
These interactions between TFs bound at regulatory regions can be additive,
synergistic, or antagonistic, which can influence the level of transcriptional output.
To complicate things even further, depending on the composition of the proteins
binding at a single TFBS, GR can either act synergistically or antagonistically with
these proteins [61].
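Two notions from this section can be written down compactly. The sketch below first expresses the occupancy hypothesis as a simple single-site binding equilibrium and then labels the combined effect of two inputs as additive, synergistic, or antagonistic by comparing it with the sum of the individual effects. The function names, the 5 % tolerance, and the use of the sum as the additive expectation are assumptions chosen for illustration; other definitions of synergy exist.

```python
def fractional_occupancy(tf_concentration, kd):
    """Fraction of time a single site is bound, assuming simple equilibrium binding.

    tf_concentration and kd (the dissociation constant) must share the same unit.
    """
    return tf_concentration / (tf_concentration + kd)


def classify_interaction(effect_a, effect_b, effect_both, tolerance=0.05):
    """Compare the combined effect with the sum of the individual effects."""
    expected = effect_a + effect_b
    if abs(effect_both - expected) <= tolerance * abs(expected):
        return "additive"
    return "synergistic" if effect_both > expected else "antagonistic"


if __name__ == "__main__":
    # A high-affinity site (low Kd) is occupied more than a low-affinity site.
    print(fractional_occupancy(10.0, 1.0), fractional_occupancy(10.0, 100.0))
    # Hypothetical expression changes for two inputs alone and together.
    print(classify_interaction(1.0, 1.0, 5.0))
```

Under the occupancy hypothesis, transcriptional output would simply track `fractional_occupancy`; the studies cited above indicate that this relationship often does not hold.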
Together, the multitude of mechanisms and signaling inputs that influence the
expression level of genes provides the cell with a variety of mechanisms to fine-tune
the expression of genes within individual cells or tissues. The effects of individual
signaling inputs on gene expression may be context-specific and consequently,
predicting expression levels from a limited number of features, for example, the
affinity of a TF for its TFBS, is unlikely to achieve great levels of accuracy.
References
1. Chiang, C., et al.: Manifestation of the limb prepattern: limb development in the absence of
sonic hedgehog function. Dev. Biol. 236(2), 421–435 (2001)
2. Struhl, G.: A homoeotic mutation transforming leg to antenna in Drosophila. Nature 292(5824),
635–638 (1981)
3. Donehower, L.A., et al.: Mice deficient for p53 are developmentally normal but susceptible to
spontaneous tumours. Nature 356(6366), 215–221 (1992)
4. ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature
489(7414), 57–74 (2012)
5. Bulger, M., Groudine, M.: Functional and mechanistic diversity of distal transcription
enhancers. Cell 144(3), 327–339 (2011)
6. de Laat, W., Duboule, D.: Topology of mammalian developmental enhancers and their
regulatory landscapes. Nature 502(7472), 499–506 (2013)
7. Calo, E., Wysocka, J.: Modification of enhancer chromatin: what, how, and why? Mol. Cell
49(5), 825–837 (2013)
8. Meijsing, S.H.: Mechanisms of glucocorticoid-regulated gene transcription. Adv. Exp. Med.
Biol. 872, 59–81 (2015)
9. Zhang, Z., et al.: Evolutionary optimization of transcription factor binding motif detection.
Adv. Exp. Med. Biol. 827, 261–274 (2015)
10. Zhang, C., et al.: A clustering property of highly-degenerate transcription factor binding sites
in the mammalian genome. Nucleic Acids Res. 34(8), 2238–2246 (2006)
11. Wu, J., Bresnick, E.H.: Glucocorticoid and growth factor synergism requirement for Notch4
chromatin domain activation. Mol. Cell Biol. 27(6), 2411–2422 (2007)
12. Rao, N.A., et al.: Coactivation of GR and NFKB alters the repertoire of their binding sites and
target genes. Genome Res. 21(9), 1404–1416 (2011)
13. Biddie, S.C., et al.: Transcription factor AP1 potentiates chromatin accessibility and
glucocorticoid receptor binding. Mol. Cell 43(1), 145–155 (2011)
14. West, J.A., et al.: Nucleosomal occupancy changes locally over key regulatory regions during
cell differentiation and reprogramming. Nat. Commun. 5, 4719 (2014)
15. He, H.H., et al.: Differential DNase I hypersensitivity reveals factor-dependent chromatin
dynamics. Genome Res. 22(6), 1015–1025 (2012)
16. Gertz, J., et al.: Distinct properties of cell-type-specific and shared transcription factor binding
sites. Mol. Cell 52(1), 25–36 (2013)
17. Morikawa, M., et al.: ChIP-seq reveals cell type-specific binding patterns of BMP-specific
Smads and a novel binding motif. Nucleic Acids Res. 39(20), 8712–8727 (2011)
18. Kvon, E.Z., et al.: Genome-scale functional characterization of Drosophila developmental
enhancers in vivo. Nature 512(7512), 91–95 (2014)
19. Weirauch, M.T., et al.: Evaluation of methods for modeling transcription factor sequence
specificity. Nat. Biotechnol. 31(2), 126–134 (2013)
20. Cusanovich, D.A., et al.: The functional consequences of variation in transcription factor
binding. PLoS Genet. 10(3), e1004226 (2014)
21. Gitter, A., et al.: Backup in gene regulatory networks explains differences between binding and
knockout results. Mol. Syst. Biol. 5, 276 (2009)
22. Creyghton, M.P., et al.: Histone H3K27ac separates active from poised enhancers and predicts
developmental state. Proc. Natl. Acad. Sci. U. S. A. 107(50), 21931–21936 (2010)
23. Heintzman, N.D., et al.: Distinct and predictive chromatin signatures of transcriptional
promoters and enhancers in the human genome. Nat. Genet. 39(3), 311–318 (2007)
24. Hardison, R.C., Taylor, J.: Genomic approaches towards finding cis-regulatory modules in
animals. Nat. Rev. Genet. 13(7), 469–483 (2012)
25. Kwasnieski, J.C., et al.: High-throughput functional testing of ENCODE segmentation
predictions. Genome Res. 24(10), 1595–1602 (2014)
26. Frankel, N., et al.: Phenotypic robustness conferred by apparently redundant transcriptional
enhancers. Nature 466(7305), 490–493 (2010)
27. Spivakov, M.: Spurious transcription factor binding: non-functional or genetically redundant?
Bioessays 36(8), 798–806 (2014)
28. So, A.Y., et al.: Determinants of cell- and gene-specific transcriptional regulation by the
glucocorticoid receptor. PLoS Genet. 3(6), e94 (2007)
29. Amano, T., et al.: Chromosomal dynamics at the Shh locus: limb bud-specific differential
regulation of competence and active transcription. Dev. Cell 16(1), 47–57 (2009)
30. Levings, P.P., Bungert, J.: The human beta-globin locus control region. Eur. J. Biochem. 269(6),
1589–1599 (2002)
31. Hilton, I.B., et al.: Epigenome editing by a CRISPR-Cas9-based acetyltransferase activates
genes from promoters and enhancers. Nat. Biotechnol. 33(5), 510–517 (2015)
32. Tolhuis, B., et al.: Looping and interaction between hypersensitive sites in the active beta-
globin locus. Mol. Cell 10(6), 1453–1465 (2002)
33. Dekker, J., et al.: Capturing chromosome conformation. Science 295(5558), 1306–1311 (2002)
34. Dekker, J., Marti-Renom, M.A., Mirny, L.A.: Exploring the three-dimensional organization of
genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14(6), 390–403 (2013)
35. Dixon, J.R., et al.: Topological domains in mammalian genomes identified by analysis of
chromatin interactions. Nature 485(7398), 376–380 (2012)
36. Zuin, J., et al.: Cohesin and CTCF differentially affect chromatin architecture and gene
expression in human cells. Proc. Natl. Acad. Sci. 111(3), 996–1001 (2014)
37. Nagano, T., et al.: Single-cell Hi-C reveals cell-to-cell variability in chromosome structure.
Nature 502(7469), 59–64 (2013)
38. Rao, S.S., et al.: A 3D map of the human genome at kilobase resolution reveals principles of
chromatin looping. Cell 159(7), 1665–1680 (2014)
39. Fang, F., et al.: Coactivators p300 and CBP maintain the identity of mouse embryonic stem
cells by mediating long-range chromatin structure. Stem Cells 32(7), 1805–1816 (2014)
40. Drissen, R., et al.: The active spatial organization of the beta-globin locus requires the
transcription factor EKLF. Genes Dev. 18(20), 2485–2490 (2004)
41. Bouwman, B.A., de Laat, W.: Getting the genome in shape: the formation of loops, domains
and compartments. Genome Biol. 16, 154 (2015)
42. Vakoc, C.R., et al.: Proximity among distant regulatory elements at the beta-globin locus
requires GATA-1 and FOG-1. Mol. Cell 17(3), 453–462 (2005)
43. Jin, F., et al.: A high-resolution map of the three-dimensional chromatin interactome in human
cells. Nature 503(7475), 290–294 (2013)
44. Kilpinen, H., et al.: Coordinated effects of sequence variation on DNA binding, chromatin
structure, and transcription. Science 342(6159), 744–747 (2013)
45. Corradin, O., et al.: Combinatorial effects of multiple enhancer variants in linkage
disequilibrium dictate levels of gene expression to confer susceptibility to common traits.
Genome Res. 24(1), 1–13 (2014)
16 V. Thormann et al.
46. Wang, J., et al.: In vitro DNA-binding profile of transcription factors: methods and new insights.
J. Endocrinol. 210(1), 15–27 (2011)
47. Slattery, M., et al.: Cofactor binding evokes latent differences in DNA binding specificity
between Hox proteins. Cell 147(6), 1270–1282 (2011)
48. Stella, S., Cascio, D., Johnson, R.C.: The shape of the DNA minor groove directs binding by
the DNA-bending protein Fis. Genes Dev. 24(8), 814–826 (2010)
49. Ramos, A.I., Barolo, S.: Low-affinity transcription factor binding sites shape morphogen
responses and enhancer evolution. Philos. Trans. R. Soc. Lond. B Biol. Sci. 368(1632),
20130018 (2013)
50. Crocker, J., et al.: Low affinity binding site clusters confer hox specificity and regulatory
robustness. Cell 160(1-2), 191–203 (2015)
51. He, X., Duque, T.S., Sinha, S.: Evolutionary origins of transcription factor binding site clusters.
Mol. Biol. Evol. 29(3), 1059–1070 (2012)
52. Gao, R., Stock, A.M.: Temporal hierarchy of gene expression mediated by transcription factor
binding affinity and activation dynamics. mBio 6(3), e00686-15 (2015)
53. Bain, D.L., et al.: Glucocorticoid receptor-DNA interactions: binding energetics are the
primary determinant of sequence-specific transcriptional activity. J. Mol. Biol. 422(1), 18–32
(2012)
54. Segal, E., et al.: Predicting expression patterns from regulatory sequence in Drosophila
segmentation. Nature 451(7178), 535–540 (2008)
55. Garcia, H.G., et al.: Operator sequence alters gene expression independently of transcription
factor occupancy in bacteria. Cell Rep. 2(1), 150–161 (2012)
56. Meijsing, S.H., et al.: DNA binding site sequence directs glucocorticoid receptor structure and
activity. Science 324(5925), 407–410 (2009)
57. Hammar, P., et al.: Direct measurement of transcription factor dissociation excludes a simple
operator occupancy model for gene regulation. Nat. Genet. 46(4), 405–408 (2014)
58. Zhang, J., et al.: DNA binding alters coactivator interaction surfaces of the intact VDR-RXR
complex. Nat. Struct. Mol. Biol. 18(5), 556–563 (2011)
59. Rohs, R., et al.: Nuance in the double-helix and its role in protein-DNA recognition. Curr.
Opin. Struct. Biol. 19(2), 171–177 (2009)
60. Meyer, M.B., Benkusky, N.A., Pike, J.W.: Selective distal enhancer control of the Mmp13 gene
identified through clustered regularly interspaced short palindromic repeat (CRISPR) genomic
deletions. J. Biol. Chem. 290(17), 11093–11107 (2015)
61. Diamond, M.I., et al.: Transcription factor interactions: selectors of positive or negative
regulation from a single DNA element. Science 249(4974), 1266–1272 (1990)
62. John, S., et al.: Chromatin accessibility pre-determines glucocorticoid receptor binding
patterns. Nat. Genet. 43(3), 264–268 (2011)
63. Arnold, C.D., et al.: Genome-wide quantitative enhancer activity maps identified by STARR-
seq. Science 339(6123), 1074–1077 (2013)
64. Zabidi, M.A., et al.: Enhancer-core-promoter specificity separates developmental and
housekeeping gene regulation. Nature 518(7540), 556–559 (2015)
65. Dupin, C., et al.: Treatment of head and neck paragangliomas with external beam radiation
therapy. Int. J. Radiat. Oncol. Biol. Phys. 89(2), 353–359 (2014)
66. Korkmaz, G., et al.: Functional genetic screens for enhancer elements in the human genome
using CRISPR-Cas9. Nat. Biotechnol. 34(2), 192–198 (2016)
67. Maeder, M.L., et al.: CRISPR RNA-guided activation of endogenous human genes. Nat.
Methods 10(10), 977–979 (2013)
68. Mendenhall, E.M., et al.: Locus-specific editing of histone modifications at endogenous
enhancers. Nat. Biotechnol. 31(12), 1133–1136 (2013)
69. Lupianez, D.G., et al.: Disruptions of topological chromatin domains cause pathogenic
rewiring of gene-enhancer interactions. Cell 161(5), 1012–1025 (2015)
70. Zhang, X., et al.: Identification of focally amplified lineage-specific super-enhancers in human
epithelial cancers. Nat. Genet. 48(2), 176–182 (2016)
Differential Network Analysis and Graph
Classification: A Glocal Approach
Abstract Based on the glocal HIM metric and its induced graph kernel, we propose
a novel solution in differential network analysis that integrates network comparison
and classification tasks. The HIM distance is defined as the one-parameter family
of product metrics linearly combining the normalised Hamming distance H and
the normalised Ipsen–Mikhailov spectral distance IM. Combining the two components
within a single metric overcomes their individual drawbacks and yields a measure
that is simultaneously global and local. Furthermore, plugging the HIM kernel
into a Support Vector Machine gives a classification algorithm based on the HIM
distance. We first outline the theory underlying the metric construction, and
then introduce two diverse applications of the HIM distance and the HIM kernel
to biological datasets. This versatility supports the adoption of the HIM family
as a general tool for information extraction, quantifying differences among
diverse instances of a complex system. An Open Source implementation of the HIM
metrics is provided by the R package nettools and by its web interface ReNette.
1 Introduction
The paradigm shift towards complex systems science [3], stimulated by its recent
theoretical and computational advances [4, 15], has paved the way for a parallel leap
in computational biology by moving the focus from differential gene expression
analysis to differential network analysis (NetDA) [16, 25]. Due to the heterogeneity
in the NetDA process and potential ill-posedness of some of the involved functional
operations [1, 5, 38], a number of alternative approaches have appeared in the
literature, with different strategies and aims [6, 7, 10, 16, 22, 23, 25, 41, 45, 50, 51].
oncogenomics [20] and oncoimmunology [39]. In all cases, the findings derived
by NetDA have been validated by matching the obtained quantitative outcomes
with the qualitative biological knowledge reported in the literature. Moreover,
the same method has also been applied outside computational biology, e.g., in
socioeconomics [32] and in multiplex network theory [29]. Here we present,
after a brief summary of the main definitions, two novel application examples, in
neurogenomics and in developmental functional genomics. In the first example, we
highlight and quantify weighted network dissimilarities among gene expression of
brain tissues with different phenotypes (location, sex and health status), while in the
latter we describe the trajectory of the binary developmental gene network in fruit
fly across its different life stages.
Finally, we describe the CRAN R package nettools and the web framework
ReNette [19], which are available to implement NetDA projects.
We recap hereafter the main definitions and results about the HIM metric and kernel.
The synthesis is based on the notations of Table 1: a fully detailed description,
including mathematical proofs, goes beyond the scope of the present chapter, and it
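is included in [31]. The (normalised) Hamming distance [18, 24, 28, 40, 48] is the
simplest (local) edit metric, counting the presence/absence of matching links:
\[ H(N_1, N_2) = \frac{1}{N(N-1)} \sum_{1 \le i \ne j \le N} \left| A^{(1)}_{ij} - A^{(2)}_{ij} \right| , \]
where $A^{(1)}$ and $A^{(2)}$ are the adjacency matrices of $N_1$ and $N_2$ on the $N$ shared nodes.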
Note that, for H, all links are equivalent regardless of their position within the
network: for instance, in Fig. 2, both networks $B_1$ and $B_2$ differ from $A$ by just
one link, and thus $H(A, B_1) = H(A, B_2)$, although $B_1$ is connected as $A$ while $B_2$
is not. The Ipsen–Mikhailov distance [27] is the (global) $L^2$ integrated difference of
the Laplacian spectral densities:
\[ IM(N_1, N_2) = \sqrt{ \int_0^{\infty} \left[ \rho_{N_1}(\omega, \gamma) - \rho_{N_2}(\omega, \gamma) \right]^2 d\omega } . \]
Fig. 2 Link equivalence for Hamming metric (panels, left to right: $B_1$, $A$, $B_2$):
$H(A, B_1) = H(A, B_2)$ although $B_1$ is connected while $B_2$ consists of two connected components
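The link-equivalence property of H can be checked numerically. The following is a minimal sketch, assuming the usual normalisation of the Hamming distance by N(N-1) over ordered node pairs; the helper names and the three four-node networks are hypothetical and are not the networks drawn in Fig. 2.

```python
def hamming_distance(a1, a2):
    """Normalised Hamming distance between two adjacency matrices on the same N nodes."""
    n = len(a1)
    if n < 2 or len(a2) != n:
        raise ValueError("both networks must share the same node set (N >= 2)")
    # Sum |A1_ij - A2_ij| over all ordered pairs i != j, then divide by N(N-1).
    diff = sum(abs(a1[i][j] - a2[i][j])
               for i in range(n) for j in range(n) if i != j)
    return diff / (n * (n - 1))


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected network (illustrative helper)."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
B1 = adjacency(4, [(1, 2), (0, 2), (2, 3)])  # link 0-1 removed: still connected
B2 = adjacency(4, [(0, 1), (1, 2), (0, 2)])  # link 2-3 removed: node 3 isolated

h1 = hamming_distance(A, B1)
h2 = hamming_distance(A, B2)
```

Both modifications change exactly one undirected link, so the two distances coincide even though only one of the modified networks remains connected; this is the locality that the spectral component is meant to complement.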
\[ \ddot{x}_i + \sum_{j=1}^{N} A_{ij} (x_i - x_j) = 0 \quad \text{for } i = 0, \dots, N-1 . \]
The vibrational frequencies $\omega_i$ for this network model are given by the square root of
the eigenvalues of the Laplacian matrix of the network: $\lambda_i = \omega_i^2$, with $\lambda_0 = \omega_0 = 0$.
The spectral density for a graph as the sum of Lorentz distributions is defined as
\[ \rho(\omega, \gamma) = K \sum_{i=1}^{N-1} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2} , \]
where $\gamma$ is the common width and $K$ is the normalisation constant defined by the
condition $\int_0^{\infty} \rho(\omega, \gamma) \, d\omega = 1$, and thus
\[ K = \frac{1}{\gamma \displaystyle\sum_{i=1}^{N-1} \int_0^{\infty} \frac{d\omega}{(\omega - \omega_i)^2 + \gamma^2}} . \]
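The normalisation constant has a closed form, because the integral of each Lorentz term over the half line, multiplied by the width, equals pi/2 + arctan(omega_i / width). The sketch below builds the spectral density from a list of Laplacian eigenvalues and checks the unit-area condition numerically. The function names, the choice of width 0.5 and the midpoint integration rule are illustrative assumptions, not part of the nettools implementation.

```python
import math


def frequencies(laplacian_eigenvalues):
    """omega_i = sqrt(lambda_i) for i = 1..N-1; the zero mode omega_0 is dropped."""
    lam = sorted(max(0.0, x) for x in laplacian_eigenvalues)
    return [math.sqrt(x) for x in lam[1:]]


def normalisation(omegas, gamma):
    # gamma * integral_0^inf d omega / ((omega - w)^2 + gamma^2)
    #   = pi/2 + atan(w / gamma), hence K is the reciprocal of the sum.
    return 1.0 / sum(math.pi / 2 + math.atan(w / gamma) for w in omegas)


def spectral_density(omegas, gamma):
    """Return rho(omega) as a sum of Lorentz distributions of common width gamma."""
    k = normalisation(omegas, gamma)

    def rho(omega):
        return k * sum(gamma / ((omega - w) ** 2 + gamma ** 2) for w in omegas)

    return rho


def integrate_half_line(f, n=20000):
    """Midpoint rule on [0, inf) after the substitution omega = t / (1 - t)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += f(t / (1.0 - t)) / (1.0 - t) ** 2
    return total * h


# The Laplacian of the path graph on three nodes has eigenvalues 0, 1, 3.
rho_path = spectral_density(frequencies([0.0, 1.0, 3.0]), gamma=0.5)
area = integrate_half_line(rho_path)
```

For a general network the eigenvalues would come from a numerical eigensolver applied to the Laplacian; they are supplied by hand here to keep the sketch dependency-free.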
The highest value of this distance is reached, for each $N$, when evaluating the distance
between $E_N$ and $F_N$. Denote then by $\bar{\gamma}$ the unique value of the width $\gamma$ solving
\[ IM(E_N, F_N) = 1 . \]
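The normalising width can be located numerically. The sketch below assumes, as in the HIM literature, that E_N denotes the empty network and F_N the fully connected network on N nodes; the excerpt does not restate these definitions, so this reading is an assumption. Under it, all N-1 retained frequencies of E_N are 0 and all those of F_N are sqrt(N), since the Laplacian of the complete graph has eigenvalue N with multiplicity N-1. The bisection bracket and the integration rule are illustrative choices.

```python
import math


def density(omega, w, gamma):
    """Normalised Lorentz density when all N-1 frequencies coincide at w."""
    norm = math.pi / 2 + math.atan(w / gamma)
    return gamma / ((omega - w) ** 2 + gamma ** 2) / norm


def im_empty_vs_full(n_nodes, gamma, steps=5000):
    """IM distance between the empty and the complete network on n_nodes nodes."""
    w_full = math.sqrt(n_nodes)
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        # Midpoint rule on [0, inf) via the substitution omega = t / (1 - t).
        t = (k + 0.5) * h
        omega = t / (1.0 - t)
        d = density(omega, 0.0, gamma) - density(omega, w_full, gamma)
        total += d * d / (1.0 - t) ** 2
    return math.sqrt(total * h)


def gamma_bar(n_nodes, lo=0.05, hi=5.0, tol=1e-8):
    """Bisection for the width at which the distance equals 1.

    Assumes the distance is above 1 at `lo` and below 1 at `hi`.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if im_empty_vs_full(n_nodes, mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


g = gamma_bar(10)
```

With this width, the distance between the two extreme networks is 1 by construction, so every other pair of networks on the same nodes is mapped into the unit interval.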
When $\xi$ is not close to the bounds $\{0, +\infty\}$ (where one of the factors becomes
dominant), the impact of $\xi$ is minimal, and it is in general more relevant when HIM is
used as a kernel [21]. Hereafter $\xi = 1$ will be assumed, and the subscript omitted.
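For concreteness, the combination of the two components can be sketched as follows. The excerpt does not restate the defining formula, so the form used here, sqrt(H^2 + xi * IM^2) / sqrt(1 + xi) as given in the HIM literature [31], is an assumption of this sketch; the function and argument names are illustrative.

```python
import math


def him(h, im, xi=1.0):
    """Combine normalised Hamming (h) and Ipsen-Mikhailov (im) distances.

    Assumed form: sqrt(h^2 + xi * im^2) / sqrt(1 + xi), with xi >= 0.
    """
    if xi < 0:
        raise ValueError("xi must be non-negative")
    return math.sqrt(h * h + xi * im * im) / math.sqrt(1.0 + xi)


# With xi = 1 both components weigh equally.
balanced = him(0.6, 0.8)
# xi = 0 recovers the Hamming component alone.
hamming_only = him(0.6, 0.8, xi=0.0)
# A large xi lets the spectral component dominate.
mostly_spectral = him(0.6, 0.8, xi=1e6)
```

Because both inputs lie in [0, 1], dividing by sqrt(1 + xi) keeps the combined value in the same interval for every admissible xi.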
Again, HIM is bounded between 0 and 1, with
The HIM distance naturally induces a kernel via the Gaussian (Radial Basis
Function) map [9, 13], to be used standalone or in a Multi-Kernel Learning
framework to increase performance and enhance interpretability [33]:
\[ K(N_1, N_2) = e^{-\gamma \, \mathrm{HIM}^2(N_1, N_2)} . \]
Although the HIM kernel is not positive definite in general for all $\gamma \in \mathbb{R}^+_0$,
by results in [44] it can be used in Support Vector Machines or other algorithms
whenever $K$ is positive definite for the given training data, which is the case for
all the examples shown in what follows. In general, the range of suitable values
for $\gamma$ can be computed by requiring all eigenvalues of the matrix
$e^{-\gamma \, \mathrm{HIM}^2(x_i, x_j)}$, for $x_i, x_j$ in the training set, to be positive.
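The positive definiteness check on the training data can be sketched without an eigensolver: a symmetric matrix is positive definite exactly when its Cholesky factorisation succeeds, which is equivalent to all eigenvalues being positive. The distance values below are made up for illustration, and the function names are hypothetical.

```python
import math


def him_kernel(dist, gamma):
    """Gram matrix K_ij = exp(-gamma * d_ij^2) from a matrix of pairwise distances."""
    n = len(dist)
    return [[math.exp(-gamma * dist[i][j] ** 2) for j in range(n)]
            for i in range(n)]


def is_positive_definite(m, eps=1e-12):
    """Try a Cholesky factorisation; it succeeds only for positive definite m."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= eps:  # non-positive pivot: not positive definite
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# Hypothetical pairwise HIM distances among three training networks.
D = [[0.0, 0.2, 0.5],
     [0.2, 0.0, 0.4],
     [0.5, 0.4, 0.0]]

gram = him_kernel(D, gamma=1.0)
usable = is_positive_definite(gram)
```

Scanning a grid of candidate values and keeping those for which the check passes gives the admissible range for the kernel parameter on a given training set.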
1
Available as GEO46706 at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46706.
Table 2 Sample size of the UKBEC human brain dataset stratified by gender and tissue location
(a) and by gender and age group (b)

(a)
Region             Abbr.    M    F
Cerebellar cortex  CB      95   35
Hippocampus        HC      92   30
Occipital cortex   OCC     94   35
Substantia nigra   SN      73   28
Thalamus           Thal    91   33
Frontal cortex     FCX     93   34
Medulla            Med     88   31
Putamen            PUT     96   33
Temporal cortex    TCX     86   33
White matter       WM      97   34

(b)
Age       M    F
< 32     86   39
32–44   130   19
44–48    74   24
48–53   109   27
53–58   101   20
58–62   117   20
62–68    72   29
68–76    82   39
76–83    66   56
≥ 83     68   53

Region: the tissue location. Abbr.: abbreviation as in Fig. 3, M: number of samples from male
individuals, F: number of samples from female individuals. a–b means a < x ≤ b
2
Available at http://software.broadinstitute.org/gsea/msigdb/cards/BRAIN_DEVELOPMENT.
3
The platform has no probes for the 51st gene of the pathway, VCX3A.
24 G. Jurman et al.
Fig. 3 Metric multidimensional scaling projection on two dimensions of all 190 mutual HIM
distances between gene coexpression brain development networks stratified by gender and tissue
locations
e.g., the 32–44. Our results are consistent with findings obtained with different data
and methodology by Berchtold and colleagues in [8], suggesting the existence of a
global pattern of gene expression change associated with brain aging, more evident
from the sixth decade onward, and evolving differently in males and females,
with larger variations in male subjects. Biologically, this is attributed to a wider global
decrease in males of catabolic and anabolic capacity with aging, mainly in genes
linked to energy production and protein synthesis and transport [8].
Fig. 4 Metric multidimensional scaling projection on two dimensions of all 45 mutual HIM
distances between gene coexpression brain development networks stratified by age groups,
separately for the male (a) and female (b) subjects. a~b means a < x ≤ b
In [34], Kolar and colleagues applied the Keller algorithm to infer the gene
regulatory networks of Drosophila melanogaster from a time series of gene expression
data measured during its full life cycle, originally published in [2]. They followed
the dynamics of 588 development genes along 66 time points spanning
four different stages (Embryonic: time points 1–30; Larval: t.p. 31–40; Pupal:
t.p. 41–58; Adult: t.p. 59–66), constructing a time series of inferred networks $N_i$,⁴
where a link between two nodes exists whenever the Keller algorithm detects a
mutual inference between the corresponding genes at the given time point: in Fig. 5a
we show four instances of the $N_i$ networks, at different time points.
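Given such a staged time series of networks, a simple first summary is the distance of every network from the initial one. The sketch below uses the stage boundaries stated above; the function names, the toy three-network series and the edge-count distance standing in for HIM are hypothetical and serve only to show the bookkeeping.

```python
# Stage boundaries (inclusive time points) as stated in the text.
STAGES = [("Embryonic", 1, 30), ("Larval", 31, 40),
          ("Pupal", 41, 58), ("Adult", 59, 66)]


def stage_of(time_point):
    """Developmental stage of a 1-based time point."""
    for name, first, last in STAGES:
        if first <= time_point <= last:
            return name
    raise ValueError("time point outside the 1-66 range")


def distance_to_first(networks, distance):
    """Distance of each network in the series from the initial network."""
    first = networks[0]
    return [distance(first, net) for net in networks]


def edge_difference(net_a, net_b):
    """Toy stand-in for a network distance: number of links present in only one network."""
    return len(net_a ^ net_b)


# Toy series of three networks, each given as a set of undirected links.
toy_series = [{(0, 1), (1, 2)}, {(0, 1)}, {(0, 1), (1, 2), (2, 3)}]
series = distance_to_first(toy_series, edge_difference)
```

Any distance function with the same two-argument signature, including an implementation of HIM, can be passed in place of the toy one.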
As a first step in the quantitative NetDA of this dataset, we measure the HIM
distance between each $N_i$ and the initial network $N_1$: the resulting distance time
series is shown in Fig. 5b. The largest variations, both between consecutive terms
and with respect to the initial network $N_1$, occur in the Embryonic stage (E). In
particular, the HIM distance grows until time point 23; the subsequent networks then
move closer to $N_1$ again, showing that the interactions of the selected 588 genes in
the Adult stage are more similar to the corresponding network of interactions in the
Embryonic stage than to those in the other two stages, consistent with the findings
reported in the original reference [34]. Moreover, while the Hamming component ranges
between 0 and 0.0223, the Ipsen–Mikhailov distance has 0.0851 as its maximum,
4
Publicly available at http://cogito-b.ml.cmu.edu/keller/downloads.html.
Fig. 5 D. melanogaster development network dataset. (a) Keller interaction network $N_i$ for the D.
melanogaster development genes at the time points $i = 1, 20, 35, 66$. (b) Evolution of the H (cyan),
IM (magenta) and HIM (golden red) distance time series (vertical axis: Distance; horizontal axis:
Time step) across 66 time points in the four stages Embryonic (E), Larval (L), Pupal (P) and
Adult (A). (c) Metric multidimensional scaling planar projection of the mutual HIM distances
between the 66 networks $N_i$, coloured according to the developmental stage Embryonic (blue),
Larval (red), Pupal (green) and Adult (orange)
4 Conclusion
The interest of the HIM metric lies in its global/local approach: by combining edit
and spectral distance types, we overcome the drawbacks of the two distance
components. The two presented applications in functional high-throughput -omics
support the effectiveness of the approach. A NetDA strategy based on the HIM
distance offers a reproducible method: the metric gives a completely quantitative
assessment of the differences among networks (on shared nodes) as well as a scalar
product for kernel learning machines.
Operatively, we provide an Open Source implementation of the HIM distance
with the R package nettools, available on CRAN and GitHub,⁵ and in the web
interface ReNette [19].⁶ In particular, ReNette includes a complete pipeline for
NetDA, integrating a comprehensive collection of tools for network inference,
network comparison and network stability analysis [20] (a methodology for assessing
the robustness of an inferred network w.r.t. data subsampling) through a queue-based
submission system and asynchronous task management. The software is already
configured for use on multicore workstations, on high performance computing
clusters and on a cloud-based cluster, to deal with the extraction of the Laplacian
spectrum, which represents the computational bottleneck of the algorithm.
5
https://github.com/MPBA/nettools.git.
6
http://renette.fbk.eu.
References
1. Angulo, M., Moreno, J., Barabási, A.L., Liu, Y.Y.: Fundamental limitations of network
reconstruction (2015). arXiv:1508.03559
2. Arbeitman, M., Furlong, E., Imam, F., Johnson, E., Null, B., Baker, B., Krasnow, M., Scott,
M., Davis, R., White, K.: Gene expression during the life cycle of Drosophila melanogaster.
Science 297(5590), 2270–2275. Erratum in Science 298(5596), 1172 (2002)
3. Barabási, A.L.: The network takeover. Nat. Phys. 8, 14–16 (2012)
4. Barabási, A.L.: Network science. Philos. Trans. R. Soc. A 371(1987), 20120375 (2013)
5. Baralla, A., Mentzen, W., de la Fuente, A.: Inferring gene networks: dream or nightmare? Ann.
N. Y. Acad. Sci. 1158, 246–256 (2009)
6. Barla, A., Jurman, G., Visintainer, R., Squillario, M., Filosi, M., Riccadonna, S., Furlanello,
C.: A machine learning pipeline for discriminant pathways identification. In: Biganzoli,
E., Vellido, A., Ambrogi, F., Tagliaferri, R. (eds.) Computational Intelligence Methods for
Bioinformatics and Biostatistics. Lecture Notes in Computer Science, vol. 7548, pp. 36–48.
Springer, Berlin (2012)
7. Barla, A., Jurman, G., Visintainer, R., Squillario, M., Filosi, M., Riccadonna, S., Furlanello,
C.: A Machine learning pipeline for discriminant pathways identification. In: Kasabov, N. (ed.)
Springer Handbook of Bio-/Neuroinformatics, Chap. 53, p. 1200. Springer, Berlin (2013)
8. Berchtold, N., Cribbs, D., Coleman, P., Rogers, J., Head, E., Kim, R., Beach, T., Miller, C.,
Troncoso, J., Trojanowski, J., Zielke, H., Cotman, C.: Gene expression changes in the course
of normal brain aging are sexually dimorphic. Proc. Natl. Acad. Sci. U. S. A. 105(40), 15605–
15610 (2008)
9. Bolla, M.: Spectral Clustering and Biclustering: Learning Large Graphs and Contingency
Tables. Wiley, New York (2013)
10. Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D., Ideker, T.: Network-based classification of breast
cancer metastasis. Mol. Syst. Biol. 3, 140 (2007)
11. Chung, F.: Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, vol. 92.
American Mathematical Society, Philadelphia (1997)
12. Cootes, A., Muggleton, S., Sternberg, M.: The identification of similarities between biological
networks: application to the metabolome and interactome. J. Mol. Biol. 369, 1126–1139 (2007)
13. Cortes, C., Haffner, P., Mohri, M.: Positive definite rational kernels. In: Learning Theory and
Kernel Machines. Proceedings of COLT 2003. Lecture Notes in Computer Science, vol. 2777,
pp. 41–56. Springer, Berlin (2003)
14. Cox, T., Cox, M.: Multidimensional Scaling. Chapman and Hall, Boca Raton (2001)
15. Csermely, P., Korcsmáros, T., Kiss, H., London, G., Nussinov, R.: Structure and dynamics of
biological networks: a novel paradigm of drug discovery. A comprehensive review. Pharmacol.
Ther. 138, 333–408 (2013)
16. de la Fuente, A.: From ‘differential expression’ to ‘differential networking’ - identification of
dysfunctional regulatory networks in diseases. Trends Genet. 26(7), 326–333 (2010)
17. Dehmer, M., Mowshowitz, A.: The discrimination power of structural superindices. PLoS ONE
8(7), e70551 (2013)
18. Dougherty, E.: Validation of gene regulatory networks: scientific and inferential. Brief.
Bioinform. 12(3), 245–252 (2010)
19. Filosi, M., Droghetti, S., Arbitrio, E., Visintainer, R., Riccadonna, S., Jurman, G., Furlanello,
C.: ReNette: a web-infrastructure for reproducible network analysis (2014).
bioRxiv-doi:10.1101/008433
20. Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., Furlanello, C.: Stability indicators in
network reconstruction. PLoS ONE 9(2), e89815 (2014)
21. Furlanello, T., Cristoforetti, M., Furlanello, C., Jurman, G.: Sparse predictive structure of
deconvolved functional brain networks. High-Dimensional Statistical Inference in the Brain,
NIPS 2013 Workshop (2013). arXiv:1310.6547[q-bio.NC]
22. Gill, R., Datta, S., Datta, S.: A statistical framework for differential network analysis from
microarray data. BMC Bioinf. 11(1), 1–10 (2010)
23. Ha, M., Baladandayuthapani, V., Do, K.A.: DINGO: differential network analysis in genomics.
Bioinformatics 31(21), 3413–3420 (2015)
24. Hamming, R.: Error detecting and error correcting codes. Bell Syst. Tech. J. 29(2), 147–160
(1950)
25. Ideker, T., Krogan, N.: Differential network biology. Mol. Syst. Biol. 8, 565 (2012)
26. Ioannidis, J., Allison, D., Ball, C., Coulibaly, I., Cui, X., Culhane, A.C., Falchi, M., Furlanello,
C., Game, L., Jurman, G., Mehta, T., Mangion, J., Nitzberg, M., Page, G., Petretto, E., van
Noort, V.: Repeatability of published microarray gene expression analyses. Nat. Genet. 41(2),
499–505 (2009)
27. Ipsen, M., Mikhailov, A.: Evolutionary reconstruction of networks. Phys. Rev. E 66, 046109
(2002). Erratum in Phys. Rev. E 67, 039901 (2003)
28. Iwayama, K., Hirata, Y., Takahashi, K., Watanabe, K., Aihara, K., Suzuki, H.: Characterizing
global evolutions of complex systems via intermediate network representations. Sci. Rep. 2,
423 (2012)
29. Jurman, G.: Metric projections for dynamic multiplex networks (2016). arXiv:1601.01940
30. Jurman, G., Visintainer, R., Furlanello, C.: An introduction to spectral distances in networks.
In: Apolloni, B., Bassis, S. (eds.) Proceedings of WIRN10, Frontiers in Artificial Intelligence
and Applications, vol. 226, pp. 227–234. IOS Press, Amsterdam (2011)
31. Jurman, G., Visintainer, R., Riccadonna, S., Filosi, M., Furlanello, C.: The HIM glocal metric
and kernel for network comparison and classification (2014). arXiv:1201.2931v3
32. Jurman, G., Visintainer, R., Filosi, M., Riccadonna, S., Furlanello, C.: The HIM glocal
metric and kernel for network comparison and classification. In: Proceedings IEEE DSAA’15,
vol. 36678, pp. 1–10. IEEE, New York (2015)
33. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: ℓp-norm multiple kernel learning. J. Mach.
Learn. Res. 12, 953–997 (2011)
34. Kolar, M., Song, L., Ahmed, A., Xing, E.: Estimating time-varying networks. Ann. Appl. Stat.
4(1), 94–123 (2010)
35. Koutra, D., Vogelstein, J., Faloutsos, C.: DELTACON: a principled massive-graph similarity
function. In: Proceedings of the 13th SIAM International Conference on Data Mining (SDM),
pp. 162–170. SIAM, New York (2013)
36. Liu, Y.Y., Slotine, J.J., Barabási, A.L.: Controllability of complex networks. Nature 473(7346),
167–173 (2011)
37. Mardia, K.: Some properties of classical multidimensional scaling. Commun. Stat. Theory
Meth. A7, 1233–1241 (1978)
38. Meyer, P., Alexopoulos, L., Bonk, T., Califano, A., Cho, C., de la Fuente, A., de Graaf, D.,
Hartemink, A., Hoeng, J., Ivanov, N., Koeppl, H., Linding, R., Marbach, D., Norel, R., Peitsch,
M., Rice, J., Royyuru, A., Schacherer, F., Sprengel, J., Stolle, K., Vitkup, D., Stolovitzky,
G.: Verification of systems biology research in the age of collaborative competition. Nat.
Biotechnol. 29(9), 811–815 (2011)
39. Mina, M., Boldrini, R., Citti, A., Romania, P., D’Alicandro, V., De ioris, M., Castellano,
A., Furlanello, C., Locatelli, F., Fruci, D.: Tumor-infiltrating T lymphocytes improve clinical
outcome of therapy-resistant neuroblastoma. Oncoimmunology 4(9), e1019981 (2015)
40. Morris, M., Handcock, M., Hunter, D.: Specification of exponential-family random graph
models: terms and computational aspects. J. Stat. Softw. 24(4), 1–24 (2008)
41. Pavlopoulos, G., Secrier, M., Moschopoulos, C., Soldatos, T., Kossida, S., Aerts, J., Schneider,
R., Bagos, P.: Using graph theory to analyze biological networks. BioData Min. 4(1), 10 (2011)
42. Ramasamy, A., Trabzuni, D., Guelfi, S., Varghese, V., Smith, C., Walker, R., De, T.,
United Kingdom Brain Expression Consortium (UKBEC), North American Brain Expression
Consortium, Coin, L., de Silva, R., Cookson, M., Singleton, A., Hardy, J., Ryten, M., Weale,
M.: Genetic variability in the regulation of gene expression in ten regions of the human brain.
Nat. Neurosci. 17(10), 1418–1428 (2014)
43. Ruan, D., Young, A., Montana, G.: Differential analysis of biological networks. BMC Bioinf.
16, 327 (2015)
44. Schölkopf, B.: Support Vector Learning. Oldenbourg, Munchen (1997)
45. Sharan, R., Ideker, T.: Modeling cellular machinery through biological network comparison.
Nat. Biotechnol. 24(4), 427–433 (2006)
46. Trabzuni, D., Ramasamy, A., Imran, S., Walker, R., Smith, C., Weale, M., Hardy, J., Ryten, M.,
North American Brain Expression Consortium: Widespread sex differences in gene expression
and splicing in the adult human brain. Nat. Commun. 4, 2771 (2013)
47. Trabzuni, D., United Kingdom Brain Expression Consortium (UKBEC), Thomson, P.:
Analysis of gene expression data using a linear mixed model/finite mixture model approach:
application to regional differences in the human brain. Bioinformatics 30(11), 1555–1561
(2014)
48. Tun, K., Dhar, P., Palumbo, M., Giuliani, A.: Metabolic pathways variability and
sequence/networks comparisons. BMC Bioinf. 7(1), 24 (2006)
49. Xiao, Y., Dong, H., Wu, W., Xiong, M., Wang, W., Shi, B.: Structure-based graph distance
measures of high degree of precision. Pattern Recogn. 41(12), 3547–3561 (2008)
50. Yang, B., Zhang, J., Yin, Y., Zhang, Y.: Network-based inference framework for identifying
cancer genes from gene expression data. BioMed. Res. Int. 2013, 12pp. (2013). Article ID
401649
51. Yoon, B.J., Qian, X., Sahraeian, S.: Comparative analysis of biological networks. IEEE Signal
Process. Mag. 29(1), 22–34 (2012)
52. Zandoná, A., Chierici, M., Jurman, G., Furlanello, C., Cucchiara, S., Del Chierico, F.,
Putignani, L.: A metagenomic pipeline integrating predictive profiling methods and complex
networks for the analysis of NGS microbiome data. NIPS Workshop - Machine Learning in
Computational Biology (2014)
Structural vs Practical Identifiability
of Nonlinear Differential Equation Models
in Systems Biology
Abstract This paper reappraises two different viewpoints adopted for testing
identifiability of nonlinear differential equation models. The aim is to exploit,
through their joint use, the complementary information they provide. The common
objective is to assess whether model parameters can be estimated from specific
input/output (I/O) experiments. Structural identifiability analysis investigates
whether unknown model parameters can be identified uniquely, or at all, with a
particular I/O configuration. This is investigated using differential algebra, e.g.,
as implemented in the software DAISY (Differential Algebra for Identifiability of
SYstems). In contrast, practical identifiability analysis is a data-based approach
to assess the precision of parameter estimates obtainable from experimental data.
It is based on simulated model outputs and their sensitivities with respect to
parameters. The relevant novelty of using both methodologies together is that
structural identifiability analysis allows a clearer understanding of the practical
identifiability results. This is shown in the identifiability analysis of a
much-quoted biological model describing the erythropoietin (Epo)-induced activation of
the JAK-STAT signaling pathway, which is known to play a role in the regulation
of cell proliferation, differentiation, chemotaxis, and apoptosis and is important for
hematopoiesis and immune development. This study shows that some results of
practical identifiability tests can be proven analytically by a differential
algebra test, and that this test can provide additional information helpful for
experiment design.
1 Introduction
with state $x(t) \in \mathbb{R}^n$, input $u(t) \in \mathbb{R}^q$ ranging on some vector space of differentiable
functions, output $y(t) \in \mathbb{R}^m$, and the constant unknown parameter vector $\theta$
belonging to some open subset $\Theta \subseteq \mathbb{R}^p$. Whenever initial conditions are specified,
the relevant equation $x(0) = x_0$ is added to the system. The functions $f$ and $h$ are
vectors of rational functions in $x$.
Definition 1. The system (1), (2) is (a priori) globally (or uniquely) identifiable
from I/O data if, for at least a generic set of points $\theta^* \in \Theta$, there exists (at least)
one input function of time, $u(t)$, such that the equation
$$\Phi_{x_0}(\theta, u) = \Phi_{x_0}(\theta^*, u) \tag{3}$$
showing that the model (1), (2) is globally identifiable. In any case, the Gröbner
basis provides the unique parameterization of the model and allows one to count the
number of solutions, i.e., the number of distinct values of the unknown parameter
$\theta$ that solve the system of equations implied by (3).
In contrast, (3) has infinitely many solutions if the basis $G(\theta, \theta^*)$ has fewer
components than the number of estimated parameters. This occurs either if one
or more parameters disappear from the Gröbner basis or if the parameters satisfy
fewer than $p$ algebraic relations. This means that the I/O map will be
identical, and thus indistinguishable, for all values of the hidden parameters and/or
for specific, analytically known, combinations of parameters.
purpose the model output equations (2) are normally revised by adding measurement
noise, such as
$$V_N(\theta) := \frac{1}{2} \sum_{k=1}^{N} [y(t_k) - \hat{y}(t_k, \theta)]^\top Q_k \, [y(t_k) - \hat{y}(t_k, \theta)] \tag{6}$$
where $Q_k$ are positive semidefinite weights, usually taken as the inverse of the
measurement noise variance but, without loss of generality, assumed in the following
to be equal to the identity matrix, $Q_k = I$. Finally, with parameter estimates obtained as
The model (1), (2), or parameter $\theta$, can be defined practically identifiable if the
minimum of $V_N(\theta)$ is well characterized in terms of necessary and sufficient
conditions for a local minimum, i.e., a vanishing gradient, $\nabla_\theta V_N(\hat{\theta}) = 0$, and
convexity in the neighborhood of $\hat{\theta}$, i.e., a positive definite Hessian matrix,
$\nabla_\theta^2 V_N(\hat{\theta}) > 0$.
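Straightforward calculations under the simplifying a priori assumption of
expected zero measurement noise yield the simplified Hessian matrix:
$$\nabla_\theta^2 V_N(\hat{\theta}) = \sum_{k=1}^{N} \sum_{j=1}^{m} \nabla_\theta y_j(t_k, \hat{\theta})^\top \, \nabla_\theta y_j(t_k, \hat{\theta}) = S(\hat{\theta})^T S(\hat{\theta}), \tag{8}$$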
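The role of the simplified Hessian in (8) can be illustrated with a small sketch. The toy model $y(t) = a\,b\,e^{-ct}$, the parameter values and the sampling times below are assumptions chosen only for illustration (they are not the model studied in this chapter). Because $a$ and $b$ enter only through their product, the matrix $S^T S$ is singular, and the direction $(a, -b, 0)$ lies in its null space.

```python
import math

def sensitivities(theta, times):
    """Rows of S: partial derivatives of y(t) = a*b*exp(-c*t) w.r.t. (a, b, c)."""
    a, b, c = theta
    rows = []
    for t in times:
        e = math.exp(-c * t)
        rows.append([b * e, a * e, -a * b * t * e])
    return rows

def gauss_newton_hessian(S):
    """Simplified Hessian S^T S, as in Eq. (8) with identity weights."""
    p = len(S[0])
    return [[sum(row[i] * row[j] for row in S) for j in range(p)]
            for i in range(p)]

def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]

# Hypothetical estimate and sampling grid (illustrative values only).
theta_hat = (2.0, 0.5, 0.3)
times = [0.5 * k for k in range(1, 21)]

S = sensitivities(theta_hat, times)
H = gauss_newton_hessian(S)

# Increasing a while decreasing b proportionally leaves a*b, and hence y, unchanged.
a, b, _ = theta_hat
null_direction = [a, -b, 0.0]
residual = matvec(H, null_direction)
print(residual)  # all entries are zero up to roundoff
```

A Hessian with a nontrivial null space is not positive definite, so under the definition above this toy parameterization would not be practically identifiable, whatever the quality of the data.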
In this section we consider a dynamic model published in several journals [10, 14].
The aim of the model is to investigate the Epo-induced activation of the JAK-STAT
signaling pathway that primarily consists of the cytoplasmic tyrosine kinase JAK
and the latent transcription factor STAT. This pathway is known to play a role in
the regulation of cell proliferation, differentiation, chemotaxis, and apoptosis and is
important for hematopoiesis and immune development. The biochemical reactions
of the JAK-STAT pathway are described by the following nonlinear ODE system:
$$
\begin{cases}
\dot{x}_1(t) = -k_1 \, u(t) \, x_1(t) + 2 \, k_4 \, x_4(t) \\
\dot{x}_2(t) = -k_2 \, x_2^2(t) + k_1 \, u(t) \, x_1(t) \\
\dot{x}_3(t) = -k_3 \, x_3(t) + k_2 \, x_2^2(t)/2 \\
\dot{x}_4(t) = -k_4 \, x_4(t) + k_3 \, x_3(t) \\
y_1(t) = s_1 \, (x_2(t) + 2 \, x_3(t)) \\
y_2(t) = s_2 \, (x_1(t) + x_2(t) + 2 \, x_3(t))
\end{cases} \tag{10}
$$
Here we check structural identifiability of model (10) by using the software DAISY
and practical identifiability at nominal parameter values by determining the rank of
the sensitivity matrix. To our knowledge, this comparative analysis has not been carried out before.
DAISY is applied initially to check the uniqueness of parameter estimates in
the entire parameter space. Results, supported by the Gröbner bases computed
analytically by the algorithm, show that parameters k2, s1, s2, ic1 are linked by
algebraic constraints with one degree of freedom. This indicates that these
four parameters are structurally nonidentifiable and that it is sufficient to know
just one of them (not necessarily the scale factor parameter) to make the model
structurally globally identifiable. This "flexibility" in recovering identifiability is
important because it allows for different choices in the design of the
experiment, where many constraints exist, especially in a biological experimental
setup. The structural test also guarantees that the practical nonidentifiability of k3
is due only to data problems and that it is sufficient to include only one
constraint equation on the initial conditions to retrieve the structural identifiability
of the model. Hence the structural identifiability analysis has essentially replicated,
by an analytical approach requiring neither experimental data nor assumptions on the
initial condition values, the results obtained about a nominal point in [10].
In order to integrate the analytical results provided by DAISY with the practical
identifiability approach, the Gröbner basis determined for the JAK-STAT model (10)
38 M.P. Saccomani and K. Thomaseth
is recalculated for the same nominal parameter values used to assess practical
identifiability [14]. Nominal parameter values are reported here as decimal as
well as rational numbers (in parentheses), the latter being used by DAISY to
carry out calculations on a ring with infinite precision: k1 = 1.95 (39/20),
k2 = 0.11 (11/100), k3 = 0.68 (17/25), k4 = 1.49 (149/100), s1 = 1.25 (5/4),
s2 = 0.95 (19/20), ic1 = 1 (1), where k3 was fixed to twice the lower limit.
The Gröbner basis reported in Table 1 (first column), and its Jacobian matrix
formed by the partial derivatives of the Gröbner basis with respect to the parameters,
shown for easier inspection in the right-hand side of Table 1, confirm the
structural results already discussed. In particular, the first, third, and fourth rows
and columns each depend upon one parameter only, and thus uniquely define the
values of the structurally globally identifiable parameters k1, k3, and k4. The second,
fifth, and sixth rows involve the remaining nonidentifiable parameters, namely k2, s1,
s2, and ic1, which are thus linked by algebraic constraints with one degree of freedom,
i.e., three Gröbner basis equations in four unknowns.
In order to check practical identifiability, the model variables y1(t) and y2(t)
are simulated between 0 and 60 min, together with all sensitivity equations, using
the nominal parameter values mentioned previously. The model input, u(t), was
calculated between 0 and 30 min as the positive half-cycle of a sine wave with
total period 60 min. The time courses of y1(t) and y2(t) are shown in Fig. 1.
Virtually identical profiles (not shown) are obtained by changing the
nonidentifiable model parameters according to the above Gröbner basis (Table 1).
In particular, the Gröbner basis polynomials equated to zero provide a globally
identifiable parameterization of the model. The most straightforward
way to reach it is to fix s2 to an arbitrary numerical value and calculate
the remaining parameters from the other equations: k1 = 39/20, k2 = 11 s2/95,
k3 = 17/25, k4 = 149/100, s1 = 25 s2/19, ic1 = 19/(20 s2), where fixed
parameters are reported for completeness. By varying the value of s2 one can
observe that the trajectory of y(t) is not affected.
Alternatively, as considered in [14], one can fix the initial condition ic1 , and
calculate from Table 1, s2 = 19/(20 ic1 ), s1 = 5/(4 ic1 ), k2 = 209/(1900 ic1 ).
Obviously, assignment of any one of the parameters k2 , s1 , s2 , and ic1 and
recalculation of the other parameters is possible.
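The invariance of the outputs along this one-parameter family can be checked numerically. The following is a minimal sketch, not the procedure used by the authors: it integrates model (10) with a fixed-step classical Runge-Kutta scheme (an arbitrary choice), assumes that only x1 is nonzero at t = 0 with x1(0) = ic1, and uses the forcing input given in the caption of Fig. 1. The function names and the step count are hypothetical.

```python
import math

def u(t):
    # Forcing input: positive half-cycle of a sine wave with a 60 min period.
    return max(0.0, math.sin(2.0 * math.pi * t / 60.0))

def rhs(t, x, k1, k2, k3, k4):
    # Right-hand side of the state equations of model (10).
    x1, x2, x3, x4 = x
    return [
        -k1 * u(t) * x1 + 2.0 * k4 * x4,
        -k2 * x2 ** 2 + k1 * u(t) * x1,
        -k3 * x3 + 0.5 * k2 * x2 ** 2,
        -k4 * x4 + k3 * x3,
    ]

def rk4_step(t, x, h, k):
    # One classical fourth-order Runge-Kutta step.
    def shifted(slope, scale):
        return [xi + scale * si for xi, si in zip(x, slope)]
    a = rhs(t, x, *k)
    b = rhs(t + h / 2, shifted(a, h / 2), *k)
    c = rhs(t + h / 2, shifted(b, h / 2), *k)
    d = rhs(t + h, shifted(c, h), *k)
    return [xi + h / 6 * (ai + 2 * bi + 2 * ci + di)
            for xi, ai, bi, ci, di in zip(x, a, b, c, d)]

def simulate(s2, t_end=60.0, steps=6000):
    # Parameterization quoted in the text, with s2 as the free parameter.
    k = (39 / 20, 11 * s2 / 95, 17 / 25, 149 / 100)
    s1, ic1 = 25 * s2 / 19, 19 / (20 * s2)
    h = t_end / steps
    x = [ic1, 0.0, 0.0, 0.0]  # assumption: x2, x3, x4 start at zero
    outputs = [(s1 * (x[1] + 2 * x[2]), s2 * (x[0] + x[1] + 2 * x[2]))]
    for n in range(steps):
        x = rk4_step(n * h, x, h, k)
        outputs.append((s1 * (x[1] + 2 * x[2]), s2 * (x[0] + x[1] + 2 * x[2])))
    return outputs

base = simulate(0.95)  # nominal value of s2
alt = simulate(2.0)    # arbitrary alternative value of s2
max_diff = max(abs(p - q) for yb, ya in zip(base, alt) for p, q in zip(yb, ya))
print(max_diff)  # differences stay at roundoff level
```

Under these assumptions, the two runs differ only by rounding errors, which is the numerical counterpart of the statement that the trajectory of y(t) is unaffected by the choice of s2.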
A final remark regards a geometric interpretation of the relationship between
Gröbner basis and Jacobian matrix reported in Table 1, and the eigenvectors of
Table 2 that form a basis for expressing output variations as functions of parameter
Fig. 1 Time course of model outputs y1 and y2 (vertical axis: y, from 0 to 1; horizontal axis: t in min, from 0 to 60), using as forcing input $u(t) = \max(0, \sin(2\pi t/60))$
Table 2 Singular values $\sigma_i$ of the sensitivity matrix $S(\theta)$ and corresponding right eigenvectors $V_i$ (one per column)
       V1        V2        V3        V4        V5        V6        V7
σ      23.94     11.29     4.734     0.8312    0.5376    0.313     0
k1     0.00103   0.003642  0.05622   0.06273   0.8462    0.5261    0
k2     0.9356    0.2611    0.2281    0.02152   0.02083   0.01021   0.05899
k3     0.0228    0.000688  0.02984   0.2179    0.5047    0.8345    0
k4     0.04448   0.04458   0.01768   0.9721    0.167     0.151     0
s1     0.1928    0.07973   0.7103    0.01419   0.01845   0.04546   0.6704
s2     0.1296    0.7311    0.4328    0.03032   0.02316   0.01742   0.5095
ic1    0.2612    0.6236    0.5018    0.04418   0.003354  0.03915   0.5363
variations: $\delta y(t) = S(\theta)\,\delta\theta$. In particular, the last column of Table 2, $V_7$, defines the
null space for parameter perturbations, i.e., $\delta y(t) = 0$ if $\delta\theta \propto V_7$, because $\sigma_7 = 0$.
Interestingly, it can be verified that $V_7$ also generates the null space of the Jacobian
matrix in Table 1, i.e., $\nabla_\theta G(\theta, \theta^*)\,V_7 = 0$ (up to roundoff errors). This result may
be unexpected but is not surprising, since $G(\theta, \theta^*) = 0$ defines, for a fixed $\theta^*$, the
values of $\theta$ that produce identical output trajectories. This is consistent with the fact
that parameter variations that do not modify output trajectories do not change
$G(\theta^* + \delta\theta, \theta^*) \approx \nabla_\theta G(\theta^*, \theta^*)\,\delta\theta$.
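This null-space relationship can be checked with exact arithmetic. The sketch below is only illustrative: Table 1 is not reproduced here, so the three polynomials are a hypothetical form of the relations quoted in the text (k2 = 11 s2/95, s1 = 25 s2/19, ic1 = 19/(20 s2)), restricted to the four nonidentifiable parameters. The candidate direction is the tangent to this one-parameter family at the nominal point, along which k2, s1 and s2 change by the same relative amount while ic1 changes by the opposite relative amount.

```python
from fractions import Fraction as F
import math

# Nominal values quoted in the text, in rational form.
k2, s1, s2, ic1 = F(11, 100), F(5, 4), F(19, 20), F(1)

# Assumed polynomial form of the three relations; each row holds the
# partial derivatives with respect to (k2, s1, s2, ic1) at the nominal point.
jacobian = [
    [F(95), F(0), F(-11), F(0)],      # 95*k2 - 11*s2
    [F(0), F(19), F(-25), F(0)],      # 19*s1 - 25*s2
    [F(0), F(0), 20 * ic1, 20 * s2],  # 20*s2*ic1 - 19
]

# Tangent direction of the family: equal relative change in k2, s1, s2,
# opposite relative change in ic1.
direction = [k2, s1, s2, -ic1]

products = [sum(a * b for a, b in zip(row, direction)) for row in jacobian]
norm = math.sqrt(sum(float(d) ** 2 for d in direction))
unit = [float(d) / norm for d in direction]

print(products)                        # exact zeros
print([round(abs(c), 4) for c in unit])
```

In magnitude, the normalized components agree with the entries of the last column of Table 2 for k2, s1, s2 and ic1 (0.05899, 0.6704, 0.5095, 0.5363), while the components for k1, k3 and k4 are zero in both cases.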
4 Conclusions
References
1. Audoly, S., Bellu, G., D’Angiò, L., Saccomani, M.P., Cobelli, C.: Global identifiability of
nonlinear models of biological systems. IEEE Trans. Biomed. Eng. 48(1), 55–65 (2001)
2. Bellu, G., Saccomani, M.P., Audoly, S., D’Angiò, L.: DAISY: a new software tool to test global
identifiability of biological and physiological systems. Comput. Methods Prog. Biomed. 88,
52–61 (2007)
3. Buchberger, B.: Ph.D. thesis 1965: An algorithm for finding the basis elements of the residue
class ring of a zero dimensional polynomial ideal. J. Symb. Comput. 41(3), 475–511 (2006)
4. Chapman, M.J., Godfrey, K.R., Chappell, M.J., Evans, N.D.: Structural identifiability of non-
linear systems using linear/non-linear splitting. Int. J. Control 76(3), 209–216 (2003)
5. Chis, O., Banga, J.R., Balsa-Canto, E.: Structural identifiability of systems biology models: a
critical comparison of methods. PLoS ONE 6(11), e27755 (2011)
6. Cobelli, C., Saccomani, M.P.: Unappreciation of a priori identifiability in software packages
causes ambiguities in numerical estimates. Letter to the editor. Am. J. Physiol. 21, E1058–
E1059 (1990)
7. Joly-Blanchard, G., Denis-Vidal, L.: Some remarks about identifiability of controlled and
uncontrolled nonlinear systems. Automatica 34, 1151–1152 (1998)
8. Ljung, L., Glad, S.T.: On global identifiability for arbitrary model parameterizations.
Automatica 30(2), 265–276 (1994)
9. Ollivier, F.: Le problème de l’identifiabilité structurelle globale: étude théorique, méthodes
effectives et bornes de complexité. Thèse de Doctorat en Science, École Polytéchnique, Paris
(1990)
10. Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U., Timmer, J.:
Structural and practical identifiability analysis of partially observed dynamical models by
exploiting the profile likelihood. Bioinformatics 25, 1923–1929 (2009)
11. Raue, A., Karlsson, J., Saccomani, M.P., Jirstrand, M.M., Timmer, J.: Comparison of
approaches for parameter identifiability analysis of biological systems. Bioinformatics 30(10),
1440–1448 (2014)
12. Rodriguez-Fernandez, M., Rehberg, M., Kremling, A., Banga, J.R.: Simultaneous model
discrimination and parameter estimation in dynamic models of cellular systems. BMC Syst.
Biol. 7, 76 (2013)
13. Saccomani, M.P., Audoly, S., D’Angiò, L.: Parameter identifiability of nonlinear systems: the
role of initial conditions. Automatica 39, 619–632 (2004)
14. Schelker, M., Raue, A., Timmer, J., Kreutz, C.: Comprehensive estimation of input signals and
dynamics in biochemical reaction networks. Bioinformatics, ECCB 28, i529–i534 (2012)
15. Seber, G.A., Wild, C.J.: Nonlinear Regression. Wiley, New York (1989)
16. Thomaseth, K., Batzel, J.J., Bachar, M., Furlan, R.: Parameter estimation of a model for
Baroreflex control of unstressed volume. In: Mathematical Modeling and Validation in
Physiology, 215–246. Springer, Berlin (2012)
Boolean Dynamics of Compound Regulatory
circuits
1 Introduction
1.1 Motivations
relating the occurrence of such circuits with the corresponding dynamical properties
have been defined and properly demonstrated in continuous and discrete frameworks
[13, 14, 16, 17]. However, the dynamical properties of more complex regulatory
motifs made of intertwined circuits still need to be clarified [6]. In this article, we
rely on a Boolean modelling framework (introduced in Sect. 1.2) to review recent
achievements associating simple or more complex regulatory motifs with specific
dynamical properties, i.e. in terms of the number and type of attractors (Sect. 1.3).
Next, we report novel results regarding the dynamical properties of chorded circuits,
made of an elementary (positive or negative) circuit with a chord (Sect. 2). For the sake
of brevity, we introduce our main results here, leaving the details (theorems and
proofs) for a forthcoming publication.
1
Classical terms of graph theory can be found in [3]. Moreover, we use here the following
terminology:
Isolated (elementary) circuit: a connected directed graph with every vertex of in-degree and
out-degree equal to 1;
Circuit: a subgraph of a regulatory graph amounting to an isolated circuit;
Flower-graph: group of circuits sharing one single vertex;
Chorded circuit: circuit with a chord, possibly a self-loop;
Cycle: a subgraph of a state transition graph amounting to an isolated circuit.
successor. In contrast, according to the asynchronous policy, only one variable can
be updated at each step and all the possible successors of a state are considered
(non-deterministic, branching dynamics).
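The difference between the two updating policies can be sketched in a few lines. The representation below (states as tuples of 0/1, one Boolean rule per component) and the example network, an isolated positive circuit A → B → C → D → A, are illustrative assumptions rather than the authors' implementation.

```python
def sync_successor(state, rules):
    """Synchronous policy: every component is updated at once (one successor)."""
    return tuple(int(rule(state)) for rule in rules)

def async_successors(state, rules):
    """Asynchronous policy: one successor per component called to update."""
    successors = []
    for i, rule in enumerate(rules):
        target = int(rule(state))
        if target != state[i]:
            successors.append(state[:i] + (target,) + state[i + 1:])
    return successors

# Hypothetical example: isolated positive circuit A -> B -> C -> D -> A,
# with components ordered (A, B, C, D).
rules = [
    lambda s: s[3],  # A follows D
    lambda s: s[0],  # B follows A
    lambda s: s[1],  # C follows B
    lambda s: s[2],  # D follows C
]

state = (1, 0, 0, 0)
sync_next = sync_successor(state, rules)
async_next = async_successors(state, rules)
print(sync_next)   # (0, 1, 0, 0)
print(async_next)  # [(0, 0, 0, 0), (1, 1, 0, 0)]
```

In state 1000, components A and B are both called to update: the synchronous policy applies both changes at once, whereas the asynchronous policy yields one successor per call, which produces the branching.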
Of particular interest are the sets of states forming attractors, i.e., minimal
groups of states from which the system cannot escape, which represent potential
asymptotic behaviours. Attractors can be classified into two main classes: stable
states, corresponding here to fixed states (i.e. without successors), and cyclic
attractors, corresponding here to terminal cycles or to more complex terminal
strongly connected components comprising several intertwined cycles.
Several methods have been proposed to efficiently identify all stable states (see,
e.g., [10]). However, other means are needed to assess the reachability of the stable
states from specific initial states, or to identify cyclic attractors (see, e.g., [8]).
Proper dynamical analyses often rely on the computation of the STG. As the size of
the model increases, the size of the STG increases exponentially. To cope with this
problem, one can reduce the model directly before simulation (model reduction),
and/or compress the resulting STG into a hierarchical transition graph (HTG) [4].
Strongly connected components (SCCs) form a partition of the STG. They are
trivial (constituted by a unique state) or complex (containing at least two states).
The compression of an STG into an HTG is achieved by clustering the states of
the complex SCCs, and gathering the trivial SCCs leading to the same complex
SCC and attractors. The components grouping trivial SCCs are called irreversible
components.
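The partition of an asynchronous STG into trivial and complex SCCs, and the identification of attractors as terminal SCCs, can be sketched as follows. This is a brute-force illustration based on pairwise reachability, which is adequate only for very small networks; it is not the on-the-fly algorithm of [4]. The example network is an isolated positive 4-component circuit, and all function names are hypothetical.

```python
from itertools import product

def async_successors(state, rules):
    """Asynchronous successors: flip one component called to update."""
    out = []
    for i, rule in enumerate(rules):
        target = int(rule(state))
        if target != state[i]:
            out.append(state[:i] + (target,) + state[i + 1:])
    return out

def reachable(start, rules):
    """All states reachable from `start` (including itself) in the asynchronous STG."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in async_successors(stack.pop(), rules):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def scc_decomposition(rules, n):
    """List of (component, is_terminal) pairs; terminal SCCs are the attractors."""
    states = list(product((0, 1), repeat=n))
    reach = {s: reachable(s, rules) for s in states}
    components, assigned = [], set()
    for s in states:
        if s in assigned:
            continue
        comp = frozenset(t for t in reach[s] if s in reach[t])
        assigned |= comp
        components.append((comp, reach[s] <= comp))
    return components

# Isolated positive circuit A -> B -> C -> D -> A, components ordered (A, B, C, D).
rules = [lambda s: s[3], lambda s: s[0], lambda s: s[1], lambda s: s[2]]

components = scc_decomposition(rules, 4)
sizes = sorted(len(comp) for comp, _ in components)
attractors = sorted(sorted(comp) for comp, terminal in components if terminal)
print(sizes)       # [1, 1, 1, 1, 12]
print(attractors)  # [[(0, 0, 0, 0)], [(1, 1, 1, 1)]]
```

For this circuit the decomposition gives one complex SCC of 12 states and four trivial SCCs. Two of the trivial SCCs are the stable states 0000 and 1111; the other two (1010 and 0101) have no predecessors and lead to the complex SCC, so an HTG would gather them into a single irreversible component.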
The HTG displays all the reachable attractors, and the other clusters of states
leading to one single attractor or to specific subsets of attractors. HTG computation
is done on the fly, without having to store the whole STG, which often
substantially reduces memory and CPU usage [4]. Furthermore, this functionality
eases the identification of the key commutations (change of component levels)
underlying irreversible choices between the different reachable attractors. The HTG
representation is very compact and very informative regarding the organisation of
the original STG.
Coherent FFM (components A, B, C): direct and indirect (via B) interactions from input A onto output C, with coherent (positive or negative) effects on output. Dynamical property: filtering of transient signal.
Incoherent FFM (components A, B, C): direct and indirect (via B) interactions from input A onto output C, with incoherent effects on output. Dynamical property: generation of pulses.
Fig. 1 Boolean dynamics of simple regulatory motifs: summary of previous results (notation
FFM: feedforward motif)
Over the last years, a series of results has been obtained regarding the dynamical
properties of compound regulatory circuits, in particular sets of circuits sharing one
single vertex (‘hub vertex’). In such cases, one can infer the dynamics of the whole
system based on that of the hub vertex, as the hub vertex fully determines (directly
or indirectly) the behaviour of the other vertices. Hence, these flower-graphs (as
they are called in [7]) can give rise to 0, 1 or 2 stable states. Figure 2 gives six
examples of such motifs; they together illustrate all possible situations in terms
of attractors, i.e., regarding the potential occurrence of multiple stable states or of
cyclic attractors. The first motif is associated with bistability (coexistence of two
mirroring stable states), in the absence of any cyclic attractor. The second motif has a
unique, cyclic attractor. The fifth and sixth motifs have a unique stable state and no
cyclic attractor. The third and fourth motifs correspond to a variety of dynamical
situations depending on the logical rules associated with the hub vertex (AND, OR
or XOR).
Note that each of these cases represents a large class of networks, encompassing
potentially more vertices and interactions, but which can be formally reduced to
these prototypic motifs without fundamental impact on the dynamics (i.e. regarding
the number and types of attractors, see [11]). This suggests that the association
of specific dynamical behaviours with the motifs listed in Figs. 1 and 2 could be
extended to larger classes of motifs.
Most studies of the regulatory motifs listed in Fig. 2 focus only on their
asymptotic behaviours (attractors, and often only stable states). In line
with our previous study devoted to isolated circuits [13], we describe the whole
synchronous and asynchronous STGs of regulatory motifs made of an isolated
circuit with a unique chord, possibly a self-loop (chorded circuits), and compare
their dynamical properties to those of isolated circuits. Using combinatorics on
specific abacus and analysis of recurrent sequences (not shown here), we emphasise
that, whatever the chosen updating rule, the STG depends on a small number of
parameters.
We recall the structural properties of the synchronous and asynchronous STGs
of isolated circuits in Sect. 2.1. Then, we present an outline of our new results
concerning the synchronous and asynchronous dynamical structures of chorded
circuits (Sect. 2.2).
48 E. Remy et al.
Chorded circuits are made of a long circuit with a chord (additional short-cut
interaction) between two components of the circuit (or amounting to a self-loop),
thereby creating a small circuit (see Fig. 3(II), (III) and (IV) left). The chorded
circuit is coherent if the signs of the two embedded circuits are identical; otherwise,
the chorded circuit is incoherent. We compared the dynamics of chorded circuits
with the dynamics of the long circuit. In every case, some of the states keep the same
updating calls, while other states are sensitive to the presence of the short-cut and
are therefore called sensitive states hereafter. Three cases for the logical rule have been
considered, using the logical operators OR, AND and XOR. Note that using XOR
amounts to defining two dual interactions (i.e. with context-sensitive signs) converging
on a single vertex. The dynamics obtained with OR and AND rules are symmetrical:
one can transform one of the resulting STGs into the other one by switching (ON
or OFF) all component values. The topology of the STG, and thus the dynamical
properties, depend on the sign of the long circuit and on whether the chorded
circuit is coherent. In contrast, the topology of the STG and the dynamics obtained with
the XOR rule depend only on the number of genes involved, not on the signs of the
two circuits.
In the cases of the OR and AND logical rules, the synchronous STG contains terminal
cycles.
• If the long circuit is positive, these terminal cycles are found in the synchronous
STG of the long circuit. If the chorded circuit is further coherent (positive small
circuit), there are two stable states; if it is incoherent (negative small circuit),
there is only one stable state.
Fig. 3 Description of the asynchronous dynamics of: a 4-component isolated circuit over
components A, B, C, D (I); a coherent chorded circuit, with rule "C if A AND B" (II); an
incoherent chorded circuit, with rule "C if Ā AND B" (III); a circuit with a coherent
self-regulation, with rule "A if A AND D" (IV). From left to right: regulatory graph, state
transition graph (STG) and its compression into a hierarchical transition graph (HTG). In the
latter, 'cc' and 'i' stand for cyclic and irreversible components, respectively, while the
number written after '#' corresponds to the number of states encompassed by the component
• If the long circuit is negative, the terminal cycles differ from those obtained for
the long circuit. If the chorded circuit is incoherent (positive small circuit), there
is only one stable state; if it is coherent (negative small circuit), there is no stable
state.
Accordingly, in the cases OR and AND, a coherent chorded circuit and its corre-
sponding long circuit have the same number of stable states.
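The counts of stable states stated above for a positive long circuit can be checked by brute force on a small instance. The sketch below assumes the 4-component circuit A → B → C → D → A of Fig. 3, with a chord from A to C; the rule names and the enumeration approach are illustrative choices, not the combinatorial method used by the authors. Stable states are the fixed points of the update function, so they do not depend on the updating policy.

```python
from itertools import product

def make_rules(c_rule):
    """Positive long circuit A -> B -> C -> D -> A; only C may have two regulators."""
    return [
        lambda s: s[3],  # A follows D
        lambda s: s[0],  # B follows A
        c_rule,          # C: depends on B, and possibly on A through the chord
        lambda s: s[2],  # D follows C
    ]

def sync_successor(state, rules):
    return tuple(int(rule(state)) for rule in rules)

def stable_states(rules, n=4):
    """States that are their own successor (no component is called to update)."""
    return [s for s in product((0, 1), repeat=n) if sync_successor(s, rules) == s]

# Hypothetical labels; components are ordered (A, B, C, D).
variants = {
    "isolated circuit": lambda s: s[1],                  # C if B
    "coherent AND":     lambda s: s[0] and s[1],         # C if A AND B
    "incoherent AND":   lambda s: (1 - s[0]) and s[1],   # C if (NOT A) AND B
    "XOR":              lambda s: s[0] ^ s[1],           # C if A XOR B
}

fixed = {name: stable_states(make_rules(rule)) for name, rule in variants.items()}
for name, states in fixed.items():
    print(name, states)
```

On this instance the isolated circuit and the coherent chorded circuit both have the two stable states 0000 and 1111, while the incoherent chorded circuit and the XOR variant each retain only 0000.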
In the case of the XOR logical rule, the synchronous STG consists of vertex-
disjoint cycles. It contains only one stable state and cycles with pseudo-random
sequences of states, whatever the signs of the circuits.
When the small circuit is not a self-loop, the asynchronous STG of the chorded
circuit is obtained from that of the long circuit by changing the direction of edges
between pairs of sensitive states that differ by the coordinate of the target component
of the short-cut. When the small circuit consists of a self-loop, these edges are
suppressed or created.
In the cases of the OR and AND logical rules, compare Fig. 3(II) and (III)
with Fig. 3(I) if the small circuit is not a self-loop, and Fig. 3(IV) and (I) in the
case of a self-loop. It can be demonstrated that a coherent chorded circuit and its
corresponding long circuit have the same number and type of attractors, and in
particular the same number of stable states. When the chorded circuit is incoherent,
there is a unique attractor: a stable state. Moreover, the STG of an isolated circuit
is always symmetrical under the transformation switching the component values (cf.
Fig. 3(I) centre: the structure of the STG is conserved when switching all 0s to 1s and
vice versa), and encompasses pairs of such symmetrical states at each level (a level
is characterised by a constant number of updating calls) [13]. The introduction of a
short-cut skews the dynamics. For example, in the case where both long and small
circuits are positive, the basin of attraction of one of the stable states is increased at
the expense of the other one (compare Fig. 3(II) with Fig. 3(I), right).
In the case of the XOR logical rule, the asynchronous STG of a chorded circuit
encompasses a unique stable state as sole attractor. As using an XOR rule amounts to
introducing dual regulations, this could be considered as a particular case of incoherent
chorded motif.
Figure 4 summarises our novel results regarding the dynamics of chorded circuits,
focusing on the Boolean framework and the asynchronous updating scheme, and
considering three different rules (AND, OR and XOR) for the vertex targeted by
two regulations. These results can be generalised to a wide range of regulatory
Fig. 4 Boolean asynchronous dynamics of chorded circuits, compared to that of isolated circuits.
Asynchronous STG of isolated circuits: connected level structure; levels form the SCCs (except
perhaps for the two extremal levels) and gather states with the same number of successors.
Asynchronous STG of chorded circuits: deduced from the asynchronous STG of the long circuit by
deleting or creating edges if the small circuit is a self-loop, and by inverting edges otherwise
motifs, e.g., involving longer short-cut paths, with the help of the reduction method
described in [11]. However, simple and compound regulatory motifs are usually
embedded in large, intricate networks. In this respect, it can be shown that motifs
embedded in more complex networks may still display the associated properties in
specific conditions, called 'context of functionality' in [6].
Notably, recent developments in synthetic biology recurrently refer to reg-
ulatory motifs corresponding to the classes considered in this study, thereby
demonstrating the potential practical impact of studies aiming to fully characterise
the dynamics of simple regulatory motifs (for recent reviews on synthetic biological
circuits, see [9, 12, 20]).
When facing a large and complex network, the enumeration and analysis of its
constitutive motifs can lead to interesting insights about the network dynamics. For
example, in the Boolean case, a bound on the number of attractors can be computed
based on the number of positive regulatory circuits, taking into account potential
(indirect) cross-interactions between them [2, 15]. Such results could be refined by
considering recent results on the Boolean dynamics of more complex motifs, such
as the flower-graphs [7] or the chorded circuits reported here.
More prospectively, the results obtained in the Boolean framework could serve
as a guide to extend them to the multilevel logical framework, or even to transpose
them into the differential framework, as it was the case with the delineation
of theorems linking elementary positive and negative regulatory circuits with
multistability and cyclic properties (see, e.g., [17, 19]).
References
1. Alon, U.: Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8(6), 450–
461 (2007)
2. Aracena, J., Demongeot, J., Goles, E.: Positive and negative circuits in discrete neural
networks. IEEE Trans. Neural Netw. 15(1), 77–83 (2004)
3. Bang-Jensen, J., Gutin, G.: Digraphs, Theory, Algorithms, Applications. Springer, Berlin
(2008)
4. Bérenguier, D., Chaouiya, C., Monteiro, P.T., Naldi, A., Remy, E., Thieffry, D., Tichit, L.:
Dynamical modeling and analysis of large cellular regulatory networks. Chaos (Woodbury
N.Y.) 23(2), 025114 (2013)
5. Chaouiya, C., Remy, E., Mossé, B., Thieffry, D.: Qualitative analysis of regulatory graphs :
a computational tool based on a discrete formal framework. In: Lecture Notes in Control and
Information Science, vol. 294, pp. 119–26. Springer, Berlin (2003)
6. Comet, J.-P., Noual, M., Richard, A., Aracena, J., Calzone, L., Demongeot, J., Kaufman, M.,
Naldi, A., Snoussi, E.H., Thieffry, D.: On circuit functionality in boolean networks. Bull.
Math. Biol. 75(6), 906–919 (2013)
7. Didier, G., Remy, E.: Relations between gene regulatory networks and cell dynamics in
Boolean models. Discret. Appl. Math. 160(15), 2147–2157 (2012)
8. Garg, A., Dicara, A., Xenarios, I., Mendoza, L., De Micheli, G.: Synchronous versus
asynchronous modeling of gene regulatory networks. Bioinformatics (Oxford, England) 24(17),
1917–1925 (2008)
9. Khalil, A.S., Collins, J.J.: Synthetic biology: applications come of age. Nat. Rev. Genet. 11(5),
367–379 (2010)
10. Naldi, A., Thieffry, D., Chaouiya, C.: Decision diagrams for the representation and analysis of
logical models of genetic networks. In: Computational Methods in Systems Biology. Lecture
Notes in Computer Science, vol. 4695, pp. 233–47. Springer, Berlin (2007)
11. Naldi, A., Remy, E., Thieffry, D., Chaouiya, C.: Dynamically consistent reduction of logical
regulatory graphs. Theor. Comput. Sci. 412(21), 2207–2218 (2011)
12. Purcell, O., Savery, N.J., Grierson, C.S., di Bernardo, M.: A comparative analysis of
synthetic genetic oscillators. J. R. Soc. Interface/R. Soc. 7(52), 1503–1524 (2010)
13. Remy, E., Mossé, B., Chaouiya, C., Thieffry, D.: A description of dynamical graphs associated
to elementary regulatory circuits. Bioinformatics (Oxford, England) 19(Suppl. 2), 172–178
(2003)
14. Remy, E., Ruet, P., Thieffry, D.: Graphic requirements for multistability and attractive cycles
in a Boolean dynamical framework. Adv. Appl. Math. 41(3), 335–350 (2008)
15. Richard, A.: Positive circuits and maximal number of fixed points in discrete dynamical
systems. Discret. Appl. Math. 157(15), 3281–3288 (2009)
16. Richard, A., Comet, J.-P.: Necessary conditions for multistationarity in discrete dynamical
systems. Discret. Appl. Math. 155(18), 2403–2413 (2007)
17. Soulé, C.: Graphic requirements for multistationarity. Complexus 1, 123–133 (2003)
18. Thomas, R.: On the relation between the logical structure of systems and their ability to
generate multiple steady states or sustained oscillations. In: Numerical Methods in the Study
of Critical Phenomena. Springer Series in Synergetics 9, 180–193 (1981)
19. Thomas, R., D’Ari, R.: Biological Feedback. CRC Press, Boca Raton (1990)
20. Weber, W., Fussenegger, M.: Synthetic gene networks in mammalian cells. Curr. Opin.
Biotechnol. 21(5), 690–696 (2010)
A Differential Transcriptomic Approach
to Compare Target Genes of Homologous
Transcription Factors in Echinoderm Species
1 Introduction
The developmental program of an organism and its phenotypic features are encoded
in its DNA. The set of interactions in which transcription factors bind to specific DNA
sequences, controlling the expression of genes and ultimately the development of the
embryo, is known as a gene regulatory network (GRN). Evolutionary conservation has provided us
with a good tool to study the origins of phenotypic features and their developmental
programs. With the advances in sequencing technology and the continued drop in
prices, it has become more common to sequence an organism's transcriptome. This
has made it possible to examine organisms on a genomic scale, allowing the
study of all genes expressed at a given time point in development, as well as in
wild-type versus experimental conditions. With transcriptomics we are able to better
understand the complexity of evolution, and an increasing number of studies are taking
advantage of this fact [1, 2].
In Bilateria, homeobox-containing genes are important for the patterning of the
anterior–posterior axis and mediate much of embryonic development, with one
of the most studied families being the Hox genes [3, 4]. Another important family
of homeobox genes is the ParaHox family, comprising Gsx, Pdx (Xlox in echinoderms),
and Cdx, which is thought to be the ancient sister group to the Hox genes and to have
emerged from the ProtoHox cluster [5]. The ParaHox genes have been shown to
be involved in gut development in vertebrates [6, 7] and also in echinoderms
[8, 9]. Examination of the sea urchin Strongylocentrotus purpuratus suggested
that echinoderms had lost some chordate-like features in the function
of Xlox and Cdx [10]. However, through the use of another echinoderm, the bat
star Patiria miniata, it was discovered that these features appear to have been
lost only in echinoids, while being retained in asteroids [11]. This shows the necessity
of continuing to study new organisms in order to gain a more complete evolutionary
picture. The embryonic guts of both S. purpuratus and P. miniata first form a
tube-like structure with no sections, known as the archenteron, and later divide into
three sections, the foregut, the midgut, and the hindgut, which in the larva become the
esophagus, the stomach, and the intestine, respectively.
Portions of the GRN for gut development in echinoderms have already been
assembled, but the network downstream of Xlox and Cdx has yet to be built.
In S. purpuratus, Xlox morpholino antisense oligo (MASO) RNA-seq experiments
have been conducted looking at known genes in the network [9], but the data have
not been studied in depth. Here we present the groundwork for reconstructing the GRN
for gut development downstream of Xlox and Cdx in both S. purpuratus and P. miniata.
First, through the analysis of these MASO RNA-seq experiments we will identify direct
and indirect targets of Xlox and Cdx in both species. Second, by looking at the
overlap between these two networks at homologous stages, we will better define the
genes needed for the developing gut to form and properly section.
Fig. 1 Gene ortholog relationship between S. purpuratus (SPU), P. miniata (PMI), and
X. tropicalis (XEN). Each circle represents one of the species and their overlap represents the
orthologous groups that are in common. The numbers in the larger black print are the total number
of orthologous groups and in the smaller blue print are the number of single copy orthologous groups
prism and the sea star early bipinnaria larva already have a tripartite gut. The
pluteus larva is an extra time point we chose for the sea urchin, in which the gut is
now complete, with its cardiac and pyloric sphincters visible.
Differentially expressed transcripts were identified using DESeq2 with a threshold
of |log2 fold change| > 0.5 and an adjusted p-value < 0.05. In S. purpuratus, as time
progressed, the knockdowns affected more transcripts. Only 294 transcripts were
affected by the Sp-Lox MASO at 48 hpf, compared to
723 transcripts affected by Sp-Cdx at 66 h, and 2384 at 72 h. Fifty-seven percent
(167) of the transcripts affected by the Sp-Lox MASO at 48 hpf were also affected at
72 hpf, showing similarities in the GRN as the gut transitions from a tube-like structure
to a trisectioned gut.
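The thresholding described above can be illustrated with a minimal Python sketch. It is not the authors' script: the column names follow the usual DESeq2 result columns (log2FoldChange, padj), the rows are invented placeholders, and treating a missing adjusted p-value as "not significant" is an assumption of this example.

```python
import math


def is_differentially_expressed(log2_fold_change, adjusted_p,
                                lfc_cutoff=0.5, padj_cutoff=0.05):
    """Return True when a transcript passes both thresholds from the text."""
    # Assumption: a missing or NaN adjusted p-value counts as not significant.
    if adjusted_p is None or math.isnan(adjusted_p):
        return False
    return abs(log2_fold_change) > lfc_cutoff and adjusted_p < padj_cutoff


def percent_shared(shared, total):
    """Share of one list that is also present in another, as a rounded percentage."""
    return round(100 * shared / total)


# Hypothetical result rows, for illustration only.
results = [
    {"transcript": "t1", "log2FoldChange": 1.2, "padj": 0.001},
    {"transcript": "t2", "log2FoldChange": 0.3, "padj": 0.0001},
    {"transcript": "t3", "log2FoldChange": -0.8, "padj": 0.04},
    {"transcript": "t4", "log2FoldChange": 2.0, "padj": float("nan")},
    {"transcript": "t5", "log2FoldChange": -1.5, "padj": 0.2},
]

affected = [r["transcript"] for r in results
            if is_differentially_expressed(r["log2FoldChange"], r["padj"])]

# The 57 % quoted in the text corresponds to 167 of 294 transcripts.
share_48_vs_72 = percent_shared(167, 294)
```

Both up- and down-regulated transcripts pass the filter because the cutoff is applied to the absolute value of the fold change.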
When examining the P. miniata Xlox MASO RNA-seq we did not find a large
number of transcripts to be differentially expressed at the late gastrula stage:
only 109 transcripts were differentially expressed. However, when examining the
Pm-Cdx MASO RNA-seq at the early larva stage we observed many more affected
genes: 693, of which 450 (65 %) had a homologous relation to S. purpuratus
and/or X. tropicalis.
Across all species at least 65 % of the transcripts were clustered into homologous
groups, meaning that 30–35 % of the transcripts from each experiment were species
specific or fell below our threshold (Table 1). Further analysis, including phylogenetic
trees, is necessary to better understand the relationship between these two species, but is
currently outside the scope of this paper.
Using our orthology analysis together with our MASO differential expression
analysis, we are able to discover conserved components in the downstream networks
of both Xlox and Cdx in S. purpuratus and P. miniata. The Cdx MASO RNA-seq
showed the largest overlap between the species: 129 transcripts were found
in both networks. Ninety-one of these 129 genes were found in the “core”
orthology group, meaning that at least one gene from each of S. purpuratus, P. miniata,
and X. tropicalis was present in the orthologous group, and 11 (9 %) of those genes
were identified as transcription factors that belong to the bzip, bHLH, C2H2, hmg,
p53, and zf-C4 families. Late gastrula in S. purpuratus and P. miniata occurs at
48 hpf and 66 hpf, respectively, with 15 genes shared between their networks, 10 (67 %)
of which were transcription factors. Although 48 hpf in S. purpuratus and 66 hpf
in P. miniata are more similar morphologically, the overlap in affected genes was
stronger between 72 hpf in S. purpuratus and 66 hpf in P. miniata, with an additional
10 genes (25 in total) compared to the earlier stage, which also included the same
group of transcription factors. Without ChIP or other technologies such as
ATAC-seq we are not able to determine the connectivity of these GRNs. Although
we are not able to distinguish direct from indirect targets in this study, identifying
key components, in particular transcription factors, is essential and will provide a
foundation for future studies.
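The cross-species comparison above amounts to intersecting the sets of orthologous groups hit in each knockdown. A minimal sketch follows, assuming each differentially expressed transcript has already been assigned to at most one orthologous group; all identifiers and group names are hypothetical placeholders, not data from this study.

```python
def groups_hit(de_transcripts, group_of):
    """Map differentially expressed transcripts to their orthologous groups.

    Transcripts without a group assignment (species specific or below the
    clustering threshold) are skipped.
    """
    return {group_of[t] for t in de_transcripts if t in group_of}


def shared_groups(de_a, group_of_a, de_b, group_of_b):
    """Orthologous groups affected in both species."""
    return groups_hit(de_a, group_of_a) & groups_hit(de_b, group_of_b)


# Hypothetical inputs, for illustration only.
sp_groups = {"sp_1": "OG1", "sp_2": "OG2", "sp_3": "OG3"}
pm_groups = {"pm_1": "OG2", "pm_2": "OG3", "pm_3": "OG4"}
sp_de = ["sp_1", "sp_2", "sp_9"]   # sp_9 has no group assignment
pm_de = ["pm_1", "pm_3"]

conserved = shared_groups(sp_de, sp_groups, pm_de, pm_groups)
```

Working at the level of groups rather than individual transcripts avoids having to pick a single ortholog when a group contains several paralogs from one species.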
3 Conclusion
Here we present the foundation for studying the downstream GRN for gut
development in S. purpuratus and P. miniata through MASO RNA-seq
analysis. Because RNA-seq can yield hundreds to thousands of candidate genes,
we used the correlation between S. purpuratus and P. miniata to identify a subset
of genes to be examined in future studies. Moreover, the genes identified in our
study as transcription factors will be the starting points for ATAC-seq and ChIP
analyses. This study provides evidence that a genome-wide approach to studying GRNs
in development and evolution is feasible in echinoderms.
4 Methods
Adult S. purpuratus and P. miniata were obtained from Patrick Leahy
(Kerckhoff Marine Laboratory, California Institute of Technology, Pasadena,
CA, USA), housed in circulating seawater aquaria at the Stazione Zoologica
Anton Dohrn of Naples, and kept in large tanks of seawater at 15–16 °C.
Injected and uninjected sea urchin and sea star fertilized eggs were allowed to
develop until the desired stage at 15 °C in filtered seawater and then collected for
RNA extraction. The embryos were collected in a tube and centrifuged
at 3000 rpm for 2–3 min to remove all the seawater. RNA extraction was
carried out using the RNAqueous-Micro Kit (Ambion). RNA integrity and quantity
were checked before sequencing using the Agilent Bioanalyzer
2100 with the RNA 6000 Pico kit for total eukaryote RNA. cDNA libraries were
prepared from 1 µg of starting total RNA using the Illumina TruSeq RNA
Sample Preparation Kit (Illumina), according to the TruSeq protocol. Each library was
diluted to 2 nM and denatured; 8 pM of each library was loaded onto a
cBot (Illumina) for cluster generation with the cBot Paired End Cluster Generation Kit
(Illumina) and sequenced on the Illumina HiSeq 1500 with 100 bp paired-end
reads in triplicate, obtaining 31–38 million reads per replicate. The sequencing
service was provided by the Laboratory of Molecular Medicine and Genomics
(http://www.labmedmolge.unisa.it) at the University of Salerno, Italy.
Reads were first trimmed using Trimmomatic (v0.33) with the scripts trim_pm.qsub,
trim_spcdx.qsub, and trim_splox.qsub [15]. The parameters for trimming were
chosen to efficiently remove erroneous reads while maximizing the information
within the reads [16]. S. purpuratus reads were mapped to the genome sequence
(V3.1) [17] and P. miniata reads were mapped to the genome sequence (V1.0)
scaffolds [18] using Bowtie2 (2.2.6) and TopHat (2.0.8b) [19, 20]. After mapping,
reads were sorted using SAMtools (v1.2) [21] and counts were extracted using
HTSeq (v0.6.1) [22]. The gff3 from Build 7 was used to generate exon-based
transcript counts for S. purpuratus, which is more informative since DESeq2
does not use length-based count normalization [23, 24]. The scripts
sp_cdx.qsub, sp_lox48.qsub, and sp_lox72.qsub were used for Sp, while pm_cdx.qsub
and pm_lox.qsub were used for Pm.
Differentially expressed genes were identified using DESeq2 [23]; transcripts not
meeting the threshold of 10 counts in at least one of the samples were removed.
DESeq2 provides two methods of hypothesis testing: the Wald test and the likelihood
ratio test (LRT). To account for the batch effect across different animals we
used the LRT, with the full model being batch + condition and the reduced model
being batch. Afterwards, the differentially expressed genes were annotated using
information extracted from Echinobase [18] for both species, which is in the data/
directory, using the annot_sp.py and annot_pm.py scripts.
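The low-count filter mentioned above can be sketched in a few lines of Python. This is only an illustration under one reading of the text, namely that a transcript is kept when it reaches at least 10 counts in at least one sample; the count table below is an invented placeholder, and the authors' own scripts may implement the filter differently.

```python
def filter_low_counts(counts, min_count=10):
    """Keep transcripts that reach `min_count` in at least one sample.

    `counts` maps a transcript id to its list of raw counts, one per sample.
    Assumption: the threshold is inclusive (>= min_count).
    """
    return {transcript: row
            for transcript, row in counts.items()
            if max(row) >= min_count}


# Hypothetical raw counts for three samples, for illustration only.
raw_counts = {
    "t1": [0, 3, 9],     # never reaches 10 -> removed
    "t2": [10, 0, 0],    # reaches 10 in one sample -> kept
    "t3": [250, 310, 275],
}

kept = filter_low_counts(raw_counts)
```

Filtering before testing reduces the number of hypotheses, which in turn makes the multiple-testing adjustment of the p-values less severe.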
Both the S. purpuratus and P. miniata proteomes were searched against the Pfam
database [28] using HMMER/3.1b2 hmmscan [29]. These commands were executed
using the scripts hmmer_pm_tf.qsub and hmmer_spur_tf.qsub. The grep
program was then used to search for the following terms within the --tblout output:
homeobox, Pax, bzip, hmg, sox, hlh, PF00104.25 (nuclear receptor), t-box, mh2 (smad),
b-box, f-box, fork_head, ets, phd-finger, and zf-C2H2. Additionally, Pfam ids
were extracted from the DBD Transcription Factor prediction database [30] and
then searched for with grep in the --tblout output; the combined results were filtered
for redundancy. The list of Pfam ids can be found in the data directory in the GitHub
repository, along with the TFs we identified for S. purpuratus and P. miniata.
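The keyword search described above can be approximated in Python as follows. This is a sketch, not the authors' grep commands: it assumes the standard hmmscan --tblout layout (target name, target accession, query name as the first three whitespace-separated columns), matches case-insensitively on the first two columns only, and uses invented example lines and protein names. As with a plain grep, short terms such as "ets" match as substrings and may need manual review.

```python
# Search terms listed in the text, lower-cased for case-insensitive matching.
TF_TERMS = [
    "homeobox", "pax", "bzip", "hmg", "sox", "hlh", "pf00104.25", "t-box",
    "mh2", "b-box", "f-box", "fork_head", "ets", "phd-finger", "zf-c2h2",
]


def tf_candidates(tblout_lines, terms=TF_TERMS):
    """Return {query protein: set of matching Pfam domain names}."""
    hits = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and tblout comment lines
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete hit line
        target, accession, query = fields[0], fields[1], fields[2]
        searched = (target + " " + accession).lower()
        if any(term in searched for term in terms):
            hits.setdefault(query, set()).add(target)
    return hits


# Hypothetical tblout lines (only the leading columns matter here).
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "Homeobox       PF00046.28 prot_A      -          1e-20    70.1   0.1",
    "Pkinase        PF00069.24 prot_B      -          3e-45    150.2  0.0",
    "Hormone_recep  PF00104.25 prot_C      -          2e-30    101.7  0.2",
]

candidates = tf_candidates(example)
```

Collecting the hits per query protein in a set gives the redundancy filtering for free when the same protein matches several terms.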
Acknowledgements This work was supported in part by Michigan State University through
computational resources provided by the Institute for Cyber-Enabled Research and by MIUR
(premiale PANTRAC to MIA). C.C. has been supported by a SZN PhD fellowship.
References
1. Wang, Z., Dai, M., Wang, Y., Cooper, K.L., et al.: Unique expression patterns of multiple key
genes associated with the evolution of mammalian flight. Proc. Biol. Sci. 281(1783), 20133133
(2014)
2. Lamanna, F., Kirschbaum, F., Waurick, I., Dieterich, C., Tiedemann, R.: Cross-tissue
and cross-species analysis of gene expression in skeletal muscle and electric organ of
African weakly-electric fish (Teleostei; Mormyridae). BMC Genomics 16, 668 (2015).
doi:10.1186/s12864-015-1858-9
3. Finnerty, J.R.: The origins of axial patterning in the metazoa: how old is bilateral symmetry?
Int. J. Dev. Biol. 47(7–8), 523–529 (2003)
4. Mallo, M., Alonso, C.R.: The regulation of Hox gene expression during animal
development. Development 140(19), 3951–3963 (2013)
5. Brooke, N.M., Garcia-Fernandez, J., Holland, P.W.: The ParaHox gene cluster is an evolution-
ary sister of the Hox gene cluster. Nature 392, 920–922 (1998)
6. Wright, C.V., Cho, K.W., Oliver, G., De Robertis, E.M.: Vertebrate homeodomain proteins:
families of region-specific transcription factors. Trends Biochem. Sci. 14, 52–56 (1989)
7. Young, T., Deschamps, J.: Hox, Cdx, and anteroposterior patterning in the mouse embryo.
Curr. Top. Dev. Biol. 88, 235–255 (2009)
8. Cole, A.G., Rizzo, F., Martinez, P., Fernandez-Serra, M., Arnone, M.I.: Two ParaHox genes,
SpLox and SpCdx, interact to partition the posterior endoderm in the formation of a functional
gut. Development 136, 541–549 (2009)
9. Annunziata, R., Arnone, M.I.: A dynamic regulatory network explains ParaHox gene
control of gut patterning in the sea urchin. Development 141(12), 2462–2472 (2014).
doi:10.1242/dev.105775
10. Arnone, M.I., Rizzo, F., Annunziata, R., Cameron, R.A., Peterson, K.J., Martínez, P.: Genetic
organization and embryonic expression of the ParaHox genes in the sea urchin S. purpuratus:
insights into the relationship between clustering and collinearity. Dev. Biol. 300, 63–73 (2006)
11. Annunziata, R., Martinez, P., Arnone, M.I.: Intact cluster and chordate-like expression of
ParaHox genes in a sea star. BMC Biol. 11, 68 (2013).
http://www.biomedcentral.com/1741-7007/11/68
12. Parnell, L.D., Lindenbaum, P., Shameer, K., Dall’Olio, G.M., Swan, D.C., et al.: BioStar: an
online question & answer resource for the bioinformatics community. PLoS Comput. Biol.
7(10), e1002216 (2011)
13. Li, J.W., Schmieder, R., Ward, R.M., Delenick, J., Olivares, E.C., Mittelman, D.: SEQanswers:
an open access community for collaboratively decoding genomes. Bioinformatics 28(9), 1272–
1273 (2012)
14. Cheatle Jarvela, A.M., Hinman, V.: A method for microinjection of Patiria miniata zygotes. J.
Vis. Exp. (91), e51913 (2014). doi:10.3791/51913
15. Bolger, A.M., Lohse, M., Usadel, B.: Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics 30, 2114–2120 (2014)
16. MacManes, M.D.: On the optimal trimming of high-throughput mRNAseq data. bioRxiv
(2014). doi: 10.1101/000422
17. Sodergren, E., Weinstock, G.M., Davidson, E.H., Cameron, R.A., Gibbs, R.A., Angerer,
R.C., Coffman, J.A.: The genome of the sea urchin Strongylocentrotus purpuratus. Science
314(5801), 941–952 (2006)
18. Cameron, R.A., Samanta, M., Yuan, A., He, D., Davidson, E.: SpBase: the sea urchin genome
database and web site. Nucleic Acids Res. 37, D750–D754 (2009)
19. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4),
357–359 (2012)
20. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., Salzberg, S.L.: TopHat2: accurate
alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome
Biol. 14(4), R36 (2013)
21. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,
Durbin, R., 1000 Genome Project Data Processing Subgroup: The Sequence alignment/map
(SAM) format and SAMtools. Bioinformatics 25, 2078–2079 (2009)
22. Anders, S., Pyl, P.T., Huber, W.: HTSeq — a Python framework to work with high-throughput
sequencing data. Bioinformatics 31, 166–169 (2014). doi:10.1093/bioinformatics/btu638
23. Love, M.I., Huber, W., Anders, S.: Moderated estimation of fold change and dispersion for
RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014). doi:10.1186/s13059-014-0550-8
24. Zhao, S., Xi, L., Zhang, B.: Union exon based approach for RNA-seq gene quantification: to
be or not to be? PLOS One (2015). doi:10.1371/journal.pone.0141910
25. Cunningham, F., Amode, M.R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva,
D., Clapham, P., Coates, G., Fitzgerald, S., Gil, L., Girón, C.G., Gordon, L., Hourlier, T.,
Hunt, S.E., Janacek, S.H., Johnson, N., Juettemann, T., Kähäri, A.K., Keenan, S., Martin,
F.J., Maurel, T., McLaren, W., Murphy, D.N., Nag, R., Overduin, B., Parker, A., Patricio,
M., Perry, E., Pignatelli, M., Riat, H.S., Sheppard, D., Taylor, K., Thormann, A., Vullo, A.,
Wilder, S.P., Zadissa, A., Aken, B.L., Birney, E., Harrow, J., Kinsella, R., Muffato, M., Ruffier,
M., Searle, S.M.J., Spudich, G., Trevanion, S.J., Yates, A., Zerbino, D.R., Flicek, P.: Ensembl
2015. Nucleic Acids Res. 43(Database issue), D662–D669 (2015). doi:10.1093/nar/gku1010
26. Fischer, S., Brunk, B.P., Chen, F., Gao, X., Harb, O.S., Iodice, J.B., Shanmugam, D., Roos,
D.S., Stoeckert Jr., C.J.: Using OrthoMCL to assign proteins to OrthoMCL-DB groups or
to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinformatics. Chapter 6:Unit
6.12.1–19 (2011)
27. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.:
BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009)
28. Finn, R.D.: Pfam: the protein families database. Encyclopedia of Genetics, Genomics,
Proteomics and Bioinformatics (2012)
29. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge University Press (1998). ISBN 0-521-
62971-3
30. Wilson, D., Charoensawan, V., Kummerfeld, S.K., Teichmann, S.A.: DBD - taxonomically
broad transcription factor predictions: new content and functionality. Nucleic Acids Res. 36,
D88–D92 (2008). doi:10.1093/nar/gkm964
Reconstructing a Genetic Network from Gene
Perturbations in Secretory Pathway of Cancer
Cell Lines
1 Introduction
The secretory pathway has evolved to facilitate the transfer of cargo molecules
to internal and cell-surface membranes [8]. Its study and characterization are
a challenge that high-throughput experiments and network analysis
tools have helped to address. In this work, we try to reconstruct the regulatory
networks of secretory pathways starting from the 476,251 signatures and 22,268 probes
available on the LINCS website (http://www.lincscloud.org/), selecting the gene
expression profile data related to 33 gene perturbation experiments carried out in
four cancer cell lines (A549, HA1E, HEPG2, and PC3). In the Mitocheck database
(http://www.mitocheck.org/), the perturbed genes are classified as mild or strong
inhibitors of the transport of secretory cargo proteins from the ER to the plasma
membrane, and they are involved in morphological alterations of the COPII and/or COPI
vesicular coat complexes. Then,
for each expression profile we collected two or more technical replicates at different
time points, considering profiles of differentially expressed genes computed as
robust z-scores for each profile relative to the population control. The genes are shown
in Fig. 2, where we see the 33 perturbations with all the functions in which they are
involved.
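As an illustration of what a robust z-score relative to a control population looks like, the sketch below uses one common definition: the deviation from the control median divided by 1.4826 times the median absolute deviation (MAD). The exact procedure used to produce the LINCS signatures is not given in this chapter, so the formula, the function name, and the numbers are assumptions of this example.

```python
from statistics import median


def robust_z_scores(values, control):
    """Robust z-scores of `values` relative to a control population.

    Uses the control median as centre and 1.4826 * MAD as scale, so that the
    scale matches the standard deviation for normally distributed controls.
    """
    centre = median(control)
    mad = median([abs(x - centre) for x in control])
    if mad == 0:
        raise ValueError("MAD of the control population is zero")
    scale = 1.4826 * mad
    return [(v - centre) / scale for v in values]


# Hypothetical expression values, for illustration only.
control_population = [1.0, 2.0, 3.0, 4.0, 5.0]
profile = [3.0, 4.4826, 0.0]

z = robust_z_scores(profile, control_population)
```

Because the median and the MAD are insensitive to a few extreme values, strongly responding samples in the control population distort the scores much less than they would with a mean and standard deviation.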
[Fig. 2 (labels recovered from the figure): matrix of the 33 perturbed genes against phenotype classes.
Phenotype classes (some labels truncated in the source): accumulation of gfp-rnf168 on ...; retention of
sh4(yes)-mcherry; reduction in ir-induced 53bp1 ... (two labels); condensation without mitosis; retention
of sh4(haspb)-gfp; altered gm130 morphology; mild inhibition of secretion; nuclei stay close together;
failure in decondensation; increased proliferation; segregation problems; inhibition of secretion;
prometaphase delay; migration (distance); enhanced secretion; dynamic changes; migration (speed);
metaphase delay; pulsating nuclei; small nucleus; large nucleus; cell migration; mitotic delay; cell death;
polylobed; binuclear; grape; large.
Genes: ACSL1, ACTR3, AR, BUB1B, C1S, CALM3, CIRBP, COPB2, DNM1, EML3, ESYT1, GPR142, GPR35, GSTP1,
HMBOX1, IFIH1, LRP4, MMP14, MXD4, NT5C2, PHKG2, PML, PROCR, RAP1GDS1, RUVBL1, RXRA, SIRT2, SLC25A4,
SLC25A46, SOS2, TBX20, ZFP36, ZFP36L2.]
3. For each cell line and for each perturbation, we determined the biological
replicate $\bar{b}^{l}_{ij}$ among the $b^{l}_{ij}$ with maximum correlation, and then we constructed a
The molecular interaction networks for each cell line were studied using the
network visualization software Cytoscape [5]. To reconstruct the characteristics of
the secretory pathway, we performed an enrichment analysis using the Molecular
Signatures Database (MSigDB) (http://www.broadinstitute.org/gsea/downloads.jsp).¹
We used the Kyoto Encyclopedia of Genes and Genomes (KEGG) and BioCarta
pathway gene sets to study the enriched pathways in the networks. These
tools reduce the complexity of the analysis by grouping long lists of individual genes
into smaller sets of related genes or proteins that are involved in the same biological
processes, components, or structures [4].
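Over-representation of a pathway gene set in a gene list is commonly scored with a hypergeometric tail probability. The sketch below shows that calculation as an illustration; whether the analysis in this chapter used exactly this test is an assumption, and the numbers in the example are invented.

```python
from math import comb


def enrichment_p_value(universe_size, set_size, query_size, overlap):
    """P(X >= overlap) when `query_size` genes are drawn without replacement
    from a universe of `universe_size` genes, `set_size` of which belong to
    the pathway gene set (hypergeometric upper tail)."""
    total = comb(universe_size, query_size)
    upper = min(set_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 genes in total, a pathway of 4 genes,
# a network of 3 genes of which 2 fall in the pathway.
p = enrichment_p_value(universe_size=10, set_size=4, query_size=3, overlap=2)
```

When many gene sets are tested, as with the KEGG and BioCarta collections, the resulting p-values would normally be corrected for multiple testing before pathways are called enriched.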
3 Results
In recent years, new high-throughput imaging-based methods and, more
recently, RNA interference (RNAi)-mediated gene knockdown experiments have revealed a
significant number of regulators associated with the secretory pathway.
In this study, we used a computational approach to try to find these
regulators of the secretory pathway. Comparing our networks with a list of secretion
inhibitors involved in cell death, cell division, and motility, which we selected
from a previous study [8], we found that some perturbations regulate several
of these inhibitors in each network, as shown in Tables 1 and 2. An enrichment
¹ http://software.broadinstitute.org/gsea/msigdb/annotate.jsp
Fig. 3 Regulatory network in A375 cancer cell line. In red are depicted the 33 perturbations
analysis of the genes showed that more than 70 % of them participate in fundamental
cellular processes such as transcription, explaining how their knockdown causes
cell death and therefore transport inhibition. In particular, we focused
on the membrane traffic regulators, such as COPB2, which encodes subunit
beta of the Golgi coatomer complex and whose depletion causes cell death, thus
underlining the importance of the secretory pathway for general cell health. This
perturbation is indirectly down-regulated by other perturbations in the A375,
A549, and HA1E networks. For example, in the A549 cancer cell line, COPB2 is
indirectly connected with the PML and EML3 perturbations (see Fig. 7).
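One simple way to look for such indirect connections in a reconstructed network is to search for perturbations that are not linked directly but share a first neighbor. The sketch below takes that reading of "indirectly connected", which is an assumption of this example rather than a definition given in the chapter; node names and edges are hypothetical.

```python
from collections import defaultdict


def indirect_links(edges, perturbations):
    """Find pairs of perturbations connected only through shared neighbors.

    `edges` is an iterable of (node, node) pairs, treated as undirected.
    Returns {(p, q): set of shared neighbor nodes} for pairs without a
    direct edge that have at least one common neighbor.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    links = {}
    ordered = sorted(perturbations)
    for i, p in enumerate(ordered):
        for q in ordered[i + 1:]:
            if q in neighbours[p]:
                continue  # directly connected, not an indirect link
            shared = neighbours[p] & neighbours[q]
            if shared:
                links[(p, q)] = shared
    return links


# Hypothetical network: P1..P3 are perturbations, g1 and g2 regulated genes.
example_edges = [("P1", "g1"), ("P2", "g1"), ("P2", "g2"),
                 ("P3", "g2"), ("P1", "P3")]
found = indirect_links(example_edges, {"P1", "P2", "P3"})
```

In a signed network, the same search could also record whether the shared neighbor is up- or down-regulated by each perturbation.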
A crucial observation is that in the HA1E network COPB2 is connected directly
with RUVBL1, which is located in the Golgi apparatus and vesicles, and both regulate
some common genes. Membrane traffic pathways are also regulated through the
activities of kinases and phosphatases, and some of these are involved in ER–Golgi
recycling. Overlapping our regulatory interactions with a list of 48 genes that are
scored as secretion inhibitors in the study of Farhan et al. [3], we found that, in each
network, some of these genes are regulated by the perturbations
(see Table 3). In the HEPG2 and A549 networks, two perturbations are
connected indirectly through EPHB2, a receptor tyrosine kinase of the ER class that is
involved in axon guidance.
main result of the proposed procedure is that all these regulatory genes are also
connected with each other through hub nodes. This means that the transcriptional
responses to the individual perturbations are not independent; rather,
the perturbations exert a combinatorial effect on transcriptional regulation, and
there may be a global effect of all these perturbations on all subnetworks present in
an interaction network. Our analyses indicate that the perturbations control some
genes that are involved in several processes of secretory pathway regulation.
For example, depletion of the Golgi coatomer complex subunit COPB2, which is essential
for Golgi budding and vesicular trafficking, caused cell death. Furthermore, we
also found kinases and phosphatases that can regulate membrane traffic pathways.
In particular, we found that the secretion inhibitor EPHB2 indirectly connects two
perturbations in the HEPG2 and A549 networks. Together, these results imply that
mammalian cells have a highly sophisticated signaling and feedback system that
allows them to modulate their secretory activity in response to external signals and
their local environment. Our algorithm may therefore help answer a multitude of questions
about the genetic architecture of organisms. What is the structure of genetic
networks? How do patterns of gene interactions change in different developmental
stages, in different physiological states, in different environmental conditions, or
in different cell types? Are there many genes that do not affect the activity of
other genes? This approach could also be useful in determining potential drug
targets in aggressive human tumors. Furthermore, it is possible to cluster
the mechanisms of action rather than the gene expression patterns; moreover, our
algorithm scales well with the number of experiments used to build a given
network. In future work, taking into account existing algorithms and
inference methods for reverse engineering of gene networks from large-scale gene
expression data, we plan to deepen the study of these techniques and overcome their
limits; indeed, our next goal will be to implement Wagner’s algorithm [10], which is
important for finding short cycles and loops in the network.
Table 1 List of secretion inhibitors involved in cell death, cell division, and motility in A375 and
A549 cells

A375
Gene symbol   Ensembl ID        Cell division phenotypes
ACADVL        ENSG00000072778   Mitosis, cell death
BMPR1A        ENSG00000107779   Mitosis, other phenotypes
BUB1B         ENSG00000156970   Mitosis
CDK4          ENSG00000135446   Other phenotypes
CHN1          ENSG00000128656   Other phenotypes
CLIC4         ENSG00000169504   Other phenotypes
COPB2         ENSG00000184432   Cell death
DNAL4         ENSG00000100246   Other phenotypes
EML3          ENSG00000149499   Mitosis
GJB3          ENSG00000188910   Mitosis
GRWD1         ENSG00000105447   Cell death
GSTP1         ENSG00000084207   Cell death
KIF2C         ENSG00000142945   Mitosis
MXD4          ENSG00000123933   Other phenotypes
NFKBIE        ENSG00000146232   Mitosis
NUP93         ENSG00000102900   Cell death
PLCB2         ENSG00000137841   Mitosis
PML           ENSG00000140464   Mitosis
PPOX          ENSG00000143224   Other phenotypes
PRSS23        ENSG00000150687   Other phenotypes
RPA1          ENSG00000132383   Mitosis, cell death
SCYL3         ENSG00000000457   Other phenotypes
SIRT2         ENSG00000068903   Mitosis
SLC16A3       ENSG00000141526   Other phenotypes
SLC25A16      ENSG00000122912   Other phenotypes
ST6GALNAC2    ENSG00000122912   Other phenotypes
TXNDC9        ENSG00000115514   Mitosis

A549
Gene symbol   Ensembl ID        Cell division phenotypes
AKT1          ENSG00000142208   Other phenotypes
BUB1B         ENSG00000156970   Mitosis
CDC23         ENSG00000094880   Mitosis
CDK4          ENSG00000135446   Other phenotypes
COPB2         ENSG00000184432   Cell death
EML3          ENSG00000149499   Mitosis
GSTP1         ENSG00000084207   Cell death
IDH1          ENSG00000138413   Mitosis
MXD4          ENSG00000123933   Other phenotypes
NOL3          ENSG00000140939   Mitosis
PML           ENSG00000140464   Mitosis
SAMD4A        ENSG00000020577   Other phenotypes
SIRT2         ENSG00000068903   Mitosis
ST6GALNAC2    ENSG00000122912   Other phenotypes
TBCA          ENSG00000171530   Cell death
TRIB3         ENSG00000101255   Migration
TXNDC9        ENSG00000115514   Mitosis
TYMS          ENSG00000176890   Cell death
Table 2 List of secretion inhibitors involved in cell death, cell division, and motility in HA1E and
HEPG2 cells

HA1E
Gene symbol   Ensembl ID        Cell division phenotypes
AMHR2         ENSG00000135409   Other phenotypes
BUB1B         ENSG00000156970   Mitosis
C10orf68      ENSG00000150076   Other phenotypes
CDK5R1        ENSG00000176749   Mitosis, migration, other phenotypes
CER1          ENSG00000147869   Cell death
CLCNKB        ENSG00000184908   Mitosis
CLIC4         ENSG00000169504   Other phenotypes
COPB2         ENSG00000184432   Cell death
ECD           ENSG00000122882   Mitosis
EEF1E1        ENSG00000124802   Migration
EML3          ENSG00000149499   Mitosis
GRWD1         ENSG00000105447   Cell death
GSTP1         ENSG00000084207   Cell death
KCNQ4         ENSG00000117013   Other phenotypes
MXD4          ENSG00000123933   Other phenotypes
NBR1          ENSG00000188554   Cell death
OGG1          ENSG00000114026   Mitosis
PLA2G3        ENSG00000138308   Cell death
PML           ENSG00000140464   Mitosis
PPP2R1A       ENSG00000105568   Mitosis
ROS1          ENSG00000047936   Other phenotypes
RRM1          ENSG00000167325   Other phenotypes
SAMD4A        ENSG00000020577   Other phenotypes
SCN5A         ENSG00000183873   Mitosis
SIRT2         ENSG00000068903   Mitosis
ST6GALNAC2    ENSG00000122912
TAGLN         ENSG00000149591   Cell death
TRIB3         ENSG00000101255   Migration
TXNDC9        ENSG00000115514   Mitosis
TYMS          ENSG00000176890   Cell death

HEPG2
Gene symbol   Ensembl ID        Cell division phenotypes
ACADVL        ENSG00000072778   Mitosis, cell death
ALDOA         ENSG00000149925   Mitosis
BUB1B         ENSG00000156970   Mitosis
CLIC4         ENSG00000169504   Other phenotypes
COPB2         ENSG00000184432   Cell death
EML3          ENSG00000149499   Mitosis
GSTP1         ENSG00000084207   Cell death
IDH1          ENSG00000138413   Mitosis
ITGB5         ENSG00000082781   Migration, other phenotypes
MXD4          ENSG00000123933   Other phenotypes
MYL6B         ENSG00000196465   Mitosis
PML           ENSG00000140464   Mitosis
SAMD4A        ENSG00000020577   Other phenotypes
SIRT2         ENSG00000068903   Mitosis
ST6GALNAC2    ENSG00000122912   Other phenotypes
TAGLN         ENSG00000149591   Cell death
TYMS          ENSG00000176890   Cell death
USP1          ENSG00000162607   Mitosis, migration
[Fig. 7 (labels recovered from the figure): nodes IGKC, FABPS, GOLT1B, PML, COPB2, LGMN, RPS4Y1,
TMSB4X, LGALS1, IKBKE, EML3, NPRL2; edges labeled up or dw.]
Fig. 7 Sub-network of A549 cancer cell line, in which we can see all the first neighbors nodes of
COPB2 and its indirect interactions with other perturbations depicted in red
Table 3 List of kinases and phosphatases which are regulated from our perturbations in each
network

Cell line   Farhan et al. class   Gene symbol   GenBank     Description
A375        Golgi                 ABL1          NM_005157   v-Abl Abelson murine leukemia viral oncogene homolog 1
A375        Golgi                 AURKB         NM_004217   Aurora kinase B
A375        Golgi                 CDK4          NM_000075   Cyclin-dependent kinase 4
A549        Golgi                 ABL1          NM_005157   v-Abl Abelson murine leukemia viral oncogene homolog 1
A549        Golgi                 AURKB         NM_004217   Aurora kinase B
A549        Golgi                 CDK4          NM_000075   Cyclin-dependent kinase 4
A549        ER                    EPHB2         NM_017449   EPH receptor B2
A549        Golgi                 KIT           NM_000222   v-Kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog
HA1E        Golgi                 ABL1          NM_005157   v-Abl Abelson murine leukemia viral oncogene homolog 1
HA1E        ER                    EGFR          NG_007726   Epidermal growth factor receptor
HA1E        ER                    IKBKB         NM_001556   Inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase beta
HA1E        Golgi                 KIT           NM_000222   v-Kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog
HEPG2       ER                    EPHB2         NM_017449   EPH receptor B2
HEPG2       Golgi                 KIT           NM_000222   v-Kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog
References
1. Crombach, A., Hogeweg, P.: Evolution of evolvability in gene regulatory networks. PLoS
Comput. Biol. 4, e1000112 (2008)
2. Farhan, H., Rabouille, C.: Signalling to and from the secretory pathway. J. Cell Sci. 124,
171–180 (2011)
3. Farhan, H., et al.: MAPK signaling to the early secretory pathway revealed by
kinase/phosphatase functional screening. J. Cell Biol. 189, 997–1011 (2010)
4. Khatri, P., Sirota, M., Butte, A.J.: Ten years of pathway analysis: current approaches and
outstanding challenges. PLoS Comput. Biol. 8, e1002375 (2012)
5. Kohl, M., Wiese, S., Warscheid, B.: Cytoscape: software for visualization and analysis of
biological networks. Methods Mol. Biol. 696, 291–303 (2011)
6. Liu, L.-Z., Wu, F.-X., Zhang, W.-J.: Reverse engineering of gene regulatory networks from
biological data. WIREs Data Min. Knowl. Discovery 2, 365–385 (2012)
7. Marbach, D., Costello, J.C., Küffner, R., et al.: Wisdom of crowds for robust gene network
inference. Nat. Methods 9, 796–804 (2012)
8. Simpson, J.C., Joggerst, B., Laketa, V., et al.: Genome-wide RNAi screening identifies human
proteins with a regulatory function in the early secretory pathway. Nat. Cell Biol. 14, 764–774
(2012)
9. Tegner, J., Yeung, M.K., Hasty, J., Collins, J.J.: Reverse engineering gene networks: integrating
genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. U S A 100, 5944–5949
(2003)
10. Wagner, A.: How to reconstruct a large genetic network from n gene perturbations in fewer
than n² easy steps. Bioinformatics 17, 1183–1197 (2001)
11. Wlodkowic, D., Skommer, J., McGuinness, D., Hillier, C., Darzynkiewicz, Z.: ER-Golgi
network–a future target for anti-cancer therapy. Leuk. Res. 33, 1440–1447 (2009)
Dissecting the Functions of the Secretory
Pathway by Transcriptional Profiling
1 Introduction
2.1 Strategy
To identify the modules that interact with the secretory pathway, we have per-
turbed its functioning by knocking down secretory pathway localized genes (using
shRNAs), followed by an analysis of the changes in gene expression. Pathways or
functions that were modulated under these conditions were identified using gene
set enrichment analysis (GSEA). Following this, the transcription factors (TF) that
might potentially regulate these pathways or functions were identified. The TFs can
then be used to predict upstream signaling pathways that respond to the original
perturbation (knockdown of secretory pathway localized genes). This analysis
will help to map the molecular pathway connecting the perturbation of secretory
pathway function to the modulation of other specific functions of the cells. This
connection between cause and effect would help reveal the underlying molecular
circuit regulating the interacting module. This general strategy is represented in
Fig. 1.
Fig. 1 Strategy to identify interacting modules of the secretory pathway and the underlying
molecular circuit. Gene expression profiles obtained from cells in which the secretory pathway is
perturbed by shRNA-mediated silencing of secretory pathway localized genes will be analyzed
using GSEA to obtain pathways that are modulated by the perturbation. Then putative upstream
transcription factors that can regulate the genes associated with these pathways will be predicted
and validated. Then, literature mining coupled to experimental validation will be used to dissect
the signaling pathways that modulate the TF activity under these conditions, so as to build the
molecular circuit connecting the perturbation to the gene expression changes
[Figure 2 labels. Compartments: plasma membrane, Golgi apparatus. Perturbed genes: M6PR, ARF1,
COPB2, COPZ1, COPA, COG2, COG4, COG7, AKAP9, BLZF1/Golgin-45, GOLGA5/Golgin-84, PLA2G4A,
Ykt6, Rab1B, TMED7, TMED9, TMED10, Sar1B, Sec24B, Sec24C, Sec24D, BNIP1]
Fig. 2 Localization of the genes whose expression was perturbed to the compartments of the
secretory pathway
The PRLs generated from gene expression profiles were subjected to GSEA using
a Java desktop application available from the Molecular Signatures Database (MSigDB;
http://www.broadinstitute.org/gsea/downloads.jsp). Given an a priori annotated
set of genes (based on Gene Ontology classifications, KEGG (Kyoto Encyclopedia
of Genes and Genomes) pathways, or others), GSEA determines whether this
set of genes shows statistically significant differences between the two biological
states being analyzed, viz. perturbation vs. control [12]. MSigDB provides a collection
of annotated gene sets (curated gene sets, motif gene sets, GO gene sets, oncogenic
signatures, immunologic signatures, etc.) for use with the GSEA software. To study
pathway enrichment, we used the KEGG pathway gene sets. Using
GSEA, a number of enriched pathways were predicted across all 22 PRLs. Only
pathways that passed a false discovery rate (FDR) cutoff of 0.25 were taken into
account. It has been suggested that, given the lack of coherence in most expression
datasets and the relatively small number of gene sets being analyzed, an FDR cutoff
of 0.25 is appropriate for the purposes of hypothesis generation [12]. We noted that
many predicted pathways had substantially overlapping sets of genes. To
streamline the results, enriched pathways were therefore consolidated into one group if
more than 50 % of their genes overlapped.
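The consolidation step can be illustrated with a short sketch. It is not the authors' implementation: the text does not say how the 50 % overlap is measured, so the code assumes it is taken relative to the smaller of the two gene sets and that groups are formed by single linkage. The function names and the toy gene sets are hypothetical.

```python
from itertools import combinations


def overlap_fraction(a, b):
    """Fraction of the smaller gene set that is shared with the other set.

    Assumption: overlap is measured relative to the smaller set.
    """
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def consolidate_pathways(pathways, threshold=0.5):
    """Group pathways whose pairwise overlap exceeds `threshold` (single linkage)."""
    names = list(pathways)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if overlap_fraction(pathways[a], pathways[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy gene sets, for illustration only (not the KEGG memberships).
example = {
    "DNA_REPLICATION": {"MCM2", "MCM3", "PCNA", "RFC1"},
    "MISMATCH_REPAIR": {"PCNA", "RFC1", "MSH2"},
    "RIBOSOME": {"RPL3", "RPS6", "RPL10"},
}
groups = consolidate_pathways(example)
print(groups)
```

In the toy input the first two sets share two of the three genes of the smaller set, so they fall into one group while the third set stays on its own. A different overlap definition (for example the Jaccard index) would only require changing `overlap_fraction`.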
The upstream transcription factors that can potentially regulate the expression of the
genes belonging to the enriched pathways were predicted using the online resources
TransFind (http://transfind.sys-bio.net/), Locamo Finder
(https://sysimm.ifrec.osaka-u.ac.jp/tfbs/locamo/), and HTRIDB
(http://www.lbbc.ibb.unesp.br/htri). TransFind and Locamo Finder predict the TFs
based on their affinity towards the putative promoters of the genes of interest.
These affinities have been pre-calculated from the available positional frequency
matrices for the transcription factors [13]. In contrast, prediction by HTRIDB is
based on experimentally verified human transcriptional regulation interactions.
Among the TFs obtained, only those predicted by all three tools were selected for
further analysis.
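Keeping only the TFs that every tool predicts amounts to a set intersection. The sketch below assumes each tool's output has already been reduced to a list of TF names; the lists shown are hypothetical placeholders, not actual output of TransFind, Locamo Finder, or HTRIDB.

```python
def consensus_tfs(predictions):
    """Return the TFs predicted by every tool, sorted alphabetically."""
    tf_sets = [set(tfs) for tfs in predictions.values()]
    if not tf_sets:
        return []
    return sorted(set.intersection(*tf_sets))


# Hypothetical per-tool predictions, for illustration only.
predictions = {
    "TransFind": ["E2F1", "E2F4", "NRF1", "SP1"],
    "LocamoFinder": ["E2F1", "NRF1", "E2F4", "YY1"],
    "HTRIDB": ["E2F4", "E2F1", "NRF1", "MYC"],
}
common = consensus_tfs(predictions)
print(common)
```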
Fig. 3 The perturbed genes were grouped based on the common enriched pathways. The color
code refers to downregulated pathways (in red) and upregulated pathways (in blue). Module
unrelated to secretory pathway is marked by asterisk (orange)
the perturbation conditions. The perturbations were then grouped on the basis of
the pathways that were modulated in common (Fig. 3). Most of these groups were
related to secretory pathway module viz. glycosaminoglycan (GAG) biosynthesis,
protein export, ribosome, Pantothenate and CoA biosynthesis, aminoacyl tRNA
biosynthesis, and Phe, Tyr, His biosynthesis pathway, as expected. However, the
group of COPZ1, COG4, and COG7 gene KDs was associated with the downregu-
lation of DNA repair and replication pathway, which is a function not known to be
related to the secretory pathway module. Thus GSEA analysis reveals both expected
and unexpected modules that are modulated in response to a perturbation of the
secretory pathway. COPZ1, COG4, and COG7 share a common known function
of retrograde transport from Golgi to ER. This association suggests a possible
interaction between the Golgi retrograde transport and the DNA repair response.
We then experimentally tested whether the DNA repair pathway is indeed regulated
by the secretory pathway localized genes. To this end, we downregulated
COPZ1, COG4, or COG7 using siRNAs in HeLa cells and then measured the
increase in DNA damage by monitoring the levels of phospho
histone H3, a marker of the sites of DNA double strand breaks. Only the
downregulation of COPZ1 increased DNA damage, as shown by an increase
in the levels of phospho histone H3. Moreover, knockdown of coatomer proteins
(COPA, COPZ1) has already been shown to increase DNA damage [14]. These
findings suggest that the interaction between the modules of the secretory pathway
and DNA repair that we identified is probably a true interaction and moreover
validates our strategy for identification of modules interacting with the secretory
pathway.
We then analyzed the genes belonging to the DNA repair pathway that is
modulated by the perturbation of Golgi retrograde transport (COPZ1, COG4, or
COG7 KD), to identify the putative TFs that can regulate their expression. The
enriched TFs obtained for this DNA repair group are listed in Table 1. Among these,
E2F1, E2F4, NRF1, and RFX1 are known to be involved in regulation of DNA repair
pathway genes of which E2F4 and RFX1 act as repressors [15, 16].
Since transcription factors are usually co-expressed with their target genes,
their positions in the PRL (rank in the PRL) associated with COPZ1 KD were
analyzed (Table 2) in order to restrict the TFs to those that are more likely to
be the true effectors under our perturbation conditions. This analysis revealed
that the transcription factors E2F1, NRF1, and E2F4 are probably downregulated
and that HIF1A and RFX1 are probably upregulated. However, only the behavior
of E2F1 (activator), NRF1 (activator), and RFX1 (repressor) is concordant
with the observed effect on the target genes, i.e., downregulation of DNA repair
pathways (Fig. 4): a downregulated activator and an upregulated repressor are both
expected to lower target expression. This TF information can be used to map the
upstream signaling pathways that connect the DNA repair pathway to the perturbation
of COPZ1 expression (or impaired retrograde transport) through further analysis
using online resources as well as experimental validation studies.
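The concordance argument can be written as a small sign rule: the expected effect on the targets is the product of the direction of the TF change and the TF's role. The sketch below uses the directions and roles stated in the text; the role of HIF1A is not given there and is assumed to be an activator for the purpose of the example.

```python
def expected_target_effect(tf_change, tf_role):
    """Expected direction of target expression given a TF change and its role."""
    change_sign = {"up": 1, "down": -1}[tf_change]
    role_sign = {"activator": 1, "repressor": -1}[tf_role]
    return "up" if change_sign * role_sign > 0 else "down"


# (direction under COPZ1 KD, role) as described in the text.
tfs = {
    "E2F1": ("down", "activator"),
    "NRF1": ("down", "activator"),
    "E2F4": ("down", "repressor"),
    "RFX1": ("up", "repressor"),
    "HIF1A": ("up", "activator"),  # role is an assumption, not stated in the text
}

observed_target_effect = "down"  # DNA repair pathways are downregulated
concordant = sorted(
    name
    for name, (change, role) in tfs.items()
    if expected_target_effect(change, role) == observed_target_effect
)
print(concordant)
```

Under these inputs the rule retains E2F1, NRF1, and RFX1, the same three TFs singled out above.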
86 S.G. Chavan et al.
Fig. 4 Hypothetical model of Golgi retrograde transport mediated by COPZ1 possibly regulating
the DNA repair pathway. The predicted TFs that might be involved in this regulation are indicated
and the direction of their modulation (up- or downregulation) under conditions of COPZ1 KD is
indicated by colored arrows. The known activity of the TF as a transcriptional activator or repressor
is indicated by the color coding of the text. (Refer to the key for details)
3.1 Conclusion
References
1. Zhong, W.: Golgi during development. Cold Spring Harb. Perspect. Biol. 3(9), a005363 (2011)
2. Kelly, R.B.: Pathways of protein secretion in eukaryotes. Science 230(4721), 25–32 (1985)
3. Luini, A., Mavelli, G., Jung, J., Cancino, J.: Control systems and coordination protocols of the
secretory pathway. F1000prime reports 6 (2014)
4. Costanzo, M., Baryshnikova, A., Bellay, J., Kim, Y., Spear, E.D., Sevier, C.S., Ding, H., Koh,
J.L., Toufighi, K., Mostafavi, S., Prinz, J.: The genetic landscape of a cell. Science 327(5964),
425–431 (2010)
5. Bard, F., Casano, L., Mallabiabarrena, A., Wallace, E., Saito, K., Kitayama, H., Guizzunti, G.,
Hu, Y., Wendler, F., DasGupta, R., Perrimon, N.: Functional genomics reveals genes involved
in protein secretion and Golgi organization. Nature 439(7076), 604–607 (2006)
6. Simpson, J.C., Joggerst, B., Laketa, V., Verissimo, F., Cetin, C., Erfle, H., Bexiga, M.G.,
Singan, V.R., Hrich, J.K., Neumann, B., Mateos, A.: Genome-wide RNAi screening identifies
human proteins with a regulatory function in the early secretory pathway. Nat. Cell Biol. 14(7),
764–774 (2012)
7. Jonikas, M.C., Collins, S.R., Denic, V., Oh, E., Quan, E.M., Schmid, V., Weibezahn, J.,
Schwappach, B., Walter, P., Weissman, J.S., Schuldiner, M.: Comprehensive characterization
of genes required for protein folding in the endoplasmic reticulum. Science 323(5922), 1693–
1697 (2009)
8. Scott, K.L., Kabbarah, O., Liang, M.C., Ivanova, E., Anagnostou, V., Wu, J., Dhakal, S., Wu,
M., Chen, S., Feinberg, T., Huang, J.: GOLPH3 modulates mTOR signalling and rapamycin
sensitivity in cancer. Nature 459(7250), 1085–1090 (2009)
9. De Matteis, M.A., Luini, A.: Mendelian disorders of membrane trafficking. N. Engl. J. Med.
365(10), 927–938 (2011)
10. Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R.,
Goodlett, D.R., Aebersold, R., Hood, L.: Integrated genomic and proteomic analyses of a
systematically perturbed metabolic network. Science 292(5518), 929–934 (2001)
11. Li, F., Cao, Y., Han, L., Cui, X., Xie, D., Wang, S., Bo, X.: GeneExpressionSignature: an
R package for discovering functional connections using gene expression signatures. OMICS
17(2), 116–118 (2013)
12. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A.,
Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene set enrichment
analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc.
Natl. Acad. Sci. U. S. A. 102(43), 15545–15550 (2005)
13. Roider, H.G., Kanhere, A., Manke, T., Vingron, M.: Predicting transcription factor affinities to
DNA from a biophysical model. Bioinformatics 23(2), 134–141 (2007)
14. Paulsen, R.D., Soni, D.V., Wollman, R., Hahn, A.T., Yee, M.C., Guan, A., Hesley, J.A., Miller,
S.C., Cromwell, E.F., Solow-Cordero, D.E., Meyer, T.: A genome-wide siRNA screen reveals
diverse cellular processes and pathways that mediate genome stability. Mol. Cell 35(2), 228–
239 (2009)
15. Verona, R., Moberg, K., Estes, S., Starz, M., Vernon, J.P., Lees, J.A.: E2F activity is regulated
by cell cycle-dependent changes in subcellular localization. Mol. Cell. Biol. 17(12), 7268–7282
(1997)
16. Lubelsky, Y., Reuven, N., Shaul, Y.: Autorepression of rfx1 gene expression: functional
conservation from yeast to humans in response to DNA replication arrest. Mol. Cell. Biol.
25(23), 10665–10673 (2005)
Detection of Rare Mutations Using
Beta-Binomial and Empirical Quantile Models
in Next-Generation Sequencing Experiments
1 Introduction
S. Germanas
Institute of Mathematics and Informatics, Vilnius University, LT-08663 Vilnius, Lithuania
A. Jakaitiene (corresponding author)
Department of Human and Medical Genetics, Faculty of Medicine, Vilnius University,
LT-01513 Vilnius, Lithuania
M. Guarracino
Laboratory for Genomics, Transcriptomics and Proteomics (Lab-GTP), High Performance
Computing and Networking Institute (ICAR), National Research Council (CNR),
Via Pietro Castellino 111, Naples, Italy
for variant calling, and for diagnostics [4, 8]. NGS suffers from high error rates,
which arise from several sources, including base calling and alignment.
During single nucleotide polymorphism (SNP) calling with NGS, the variable sites
of the genome are identified. Sophisticated mathematical SNP calling models can
be used to reduce and quantify the uncertainty that high error rates introduce into
variant calling. Another option is targeted sequencing of a particular genetic region
at higher sequencing depth (20x and more). However, the growing demand for large
samples means that high-depth sequencing is too expensive in both time and cost.
For such large samples, an alternative is to group patients into pools and sequence
each pool together. This strategy makes sequencing more efficient in terms of time
and money. It also has a drawback, however: the allele frequency of a given
individual in the pool cannot be estimated directly. The same applies to SNP and
genotype calling. Therefore, even more sophisticated mathematical methods and
pooling strategies must be used to take advantage of pooled NGS data.
There are many SNP calling methods for pooled NGS data [1–3, 5, 6, 11, 13, 15].
One common property of these methods is that they model the non-referent allele
frequency $r_{iv}$, where $i = 1, \ldots, N$ is the genetic position and
$v = 1, \ldots, J$ is the index of a pool. In [5, 6] a hierarchical Beta-binomial
model is proposed and applied to synthetic pooled genetic data of a virus; the
quantity $r_{iv}$ is assumed to have a Beta-binomial distribution. In [3, 11, 15]
the random variable $r_{iv}$ is assumed to have a hierarchical binomial–binomial
distribution. The models of [1, 2] search for statistically significant differences
in $r_{iv}$ between pools to identify genetic variants. In [9, 14] the sequencing
error $r_{iv}$ is modeled across genetic positions, which gives a benefit that grows
with the number of genetic positions, often very large. However, the sequencing
error rate, which these models assume to be constant, could be very different at
different genetic loci. Applications of the methods mentioned above to real data
with high minor allele frequency (>5 %) show that the methods are quite sensitive
and specific, but their performance for rare variants is unclear or insufficient [10, 12].
We offer two novel approaches for detecting SNPs in pooled NGS data. The first
model is a modification of the Beta-binomial model [5] in which the posterior
distribution of $r_{iv}$ is considered. In the second method we model a function of
the random variables $r_{iv}/n_{iv}$ with different $v$ and use an empirical quantile
to determine the SNP calling threshold. We also use a binomial approximation of
$r_{iv}$ to model the data in order to choose the significance level of the empirical
quantile.
We use pooled NGS data from 128 patients diagnosed with neuromuscular
disease for SNP identification. The results show that the modified Beta-binomial
model detects variants with almost 100 % specificity. However, the empirical
quantile model has better sensitivity and is much faster than the Beta-binomial
models.
$$r_{i,k_j(s)} \mid \theta_{i,k_j(s)} \sim \mathrm{Binomial}(\theta_{i,k_j(s)}, n_{i,k_j(s)}), \qquad (1)$$
$$\theta_{i,k_j(s)} \sim \mathrm{Beta}(\mu_i, \nu), \qquad (2)$$
where $r_{i,k_j(s)}$ is the non-referent allele frequency (observed), $n_{i,k_j(s)}$ is the read
depth (observed), $\theta_{i,k_j(s)}$ is the error rate parameter at position $i = 1, \ldots, N$ in
pool $k(s) = k_1(s), \ldots, k_J(s)$, and $\mu_i$ (expected value of the Beta distribution) and
$\nu$ (precision of the Beta distribution) are hyperparameters of the model; $k_j = k_j(s)$ is a
function that maps the model set-up $s$ to the $k_j$-th reference pool, $j = 1, \ldots, J$. The
parameter $\theta_{i,k_j(s)}$ and the hyperparameters $\mu_i, \nu$ are estimated using the
Expectation-Maximization algorithm for the whole likelihood function:
$$L(r_{i,k_j(s)}, \theta_{i,k_j(s)} \mid \mu_i, \nu) = \prod_{i=1}^{N} \prod_{j=1}^{J} \Pr(r_{i,k_j(s)} \mid \theta_{i,k_j(s)}, n_{i,k_j(s)}) \Pr(\theta_{i,k_j(s)} \mid \mu_i, \nu), \qquad (3)$$
$$l(r_{i,k_j(s)}, \theta_{i,k_j(s)} \mid \mu_i, \nu) = \ln L(r_{i,k_j(s)}, \theta_{i,k_j(s)} \mid \mu_i, \nu). \qquad (4)$$
[...] $n_{i0} = \frac{1}{J} \sum_{j=1}^{J} n_{i,k_j(s)}$, and the distribution of the reference data is
modeled. A Z-test is applied to the main pool data.
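We propose a modification of the Beta-binomial model. We use the posterior
expectation of $\theta_{i,k_j(s)}$ instead of the prior one:
$$E(\theta_{i,k_j(s)} \mid r_{i,k_j(s)}) = \mu_{\mathrm{post},i} = \frac{\nu \mu_i + \sum_{j=1}^{J} r_{i,k_j(s)}}{\nu + \sum_{j=1}^{J} n_{i,k_j(s)}}, \qquad (5)$$
where $\mu_i$ and $\nu$ are the mean and precision of the Beta prior. We therefore use more
information from the data and expect to obtain more accurate estimates of $\mu_i$ and
$\theta_{i,k_j(s)}$. The estimate of the standard deviation $\sigma_i$ remains the same, although
the posterior standard deviation could also be used. A Z-test is applied to the case
data.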
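As a numerical illustration of the posterior expectation, the sketch below assumes the Beta prior is parameterized by its mean and precision, so that it is equivalent to Beta(precision * mean, precision * (1 - mean)), and that the observations are non-referent read counts with their read depths in the reference pools. The function name and the numbers are hypothetical.

```python
def posterior_mean(prior_mean, precision, nonref_counts, depths):
    """Posterior mean of a binomial rate under a Beta(mean, precision) prior.

    nonref_counts[j] and depths[j] are the non-referent read count and the
    read depth in reference pool j at a single position.
    """
    if len(nonref_counts) != len(depths):
        raise ValueError("counts and depths must have the same length")
    if precision <= 0:
        raise ValueError("precision must be positive")
    return (precision * prior_mean + sum(nonref_counts)) / (precision + sum(depths))


# Hypothetical values: prior mean 1 %, precision 100, three reference pools.
mu_post = posterior_mean(0.01, 100.0, nonref_counts=[4, 2, 6], depths=[200, 150, 250])
print(round(mu_post, 5))
```

With no observations the function returns the prior mean, and as the total depth grows the result approaches the pooled observed frequency, which is the sense in which the posterior uses more information from the data.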
We apply the original Beta-binomial model and its modification to the set-ups of
pooled data described in Sect. 2.3. The significance level for the Z-test is set to
$\alpha = 10^{-6}$.
In this section we present another SNP calling method, the empirical quantile method.
The idea of this method is to avoid assuming any theoretical distribution when predicting
mutated positions and instead to use the data across all positions, as was done in [9, 14].
We introduce the function $f: [0, 1]^{J+1} \to [-J, 1]$ ($J$ is the number of reference
pools):
$$y_{si} = f(M_{i,l(s)}, R_{i,k_j(s)}) = M_{i,l(s)} - \sum_{j=1}^{J} R_{i,k_j(s)}, \qquad (6)$$
where $M_{i,l(s)} = r_{i,l(s)} / n_{i,l(s)}$ is the relative frequency of the non-referent allele in the
main pool and $R_{i,k_j(s)} = r_{i,k_j(s)} / n_{i,k_j(s)}$ is the relative frequency of the non-referent
allele in the reference pool at the $i$-th position, $k_j(s)$-th pool, and $s$-th data set-up for a
model; the function $l = l(s)$ maps the model set-up $s$ to the $l$-th main pool.
We expect the value $y_{si}$ to be higher for mutated positions and lower for non-
mutated positions. Therefore we use the empirical quantile $q_\alpha$:
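The statistic for one position can be computed directly from read counts. The sketch below assumes that the statistic is the main-pool relative frequency minus the sum of the reference-pool relative frequencies, which is consistent with a range from -J to 1; the function names and the counts are hypothetical.

```python
def relative_frequency(nonref_count, depth):
    """Relative frequency of the non-referent allele, r / n."""
    if depth <= 0:
        raise ValueError("read depth must be positive")
    return nonref_count / depth


def y_statistic(main, references):
    """Main-pool frequency minus the summed reference-pool frequencies.

    main is an (r, n) pair; references is a list of (r, n) pairs,
    one per reference pool.
    """
    main_freq = relative_frequency(*main)
    return main_freq - sum(relative_frequency(r, n) for r, n in references)


# Hypothetical position: 12 non-referent reads out of 200 in the main pool,
# 1 out of 1000 in each of 7 reference pools.
y = y_statistic((12, 200), [(1, 1000)] * 7)
print(round(y, 3))
```

A variant carried only by patients of the main pool raises the first term while leaving the reference terms at the background error level, so the statistic is expected to be larger at mutated positions.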
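$$M_{i,l(s)} = \sum_{c=1}^{Q} B_c^{M} = W_{p_M} \sim \mathrm{Binomial}(Q, p_M), \qquad (8)$$
$$\sum_{j=1}^{J} R_{i,k_j(s)} = \sum_{j=1}^{J} \sum_{c=1}^{Q} B_c^{R} = W_{p_R} \sim \mathrm{Binomial}(QJ, p_R), \qquad (9)$$
where $B_c^{M} \sim \mathrm{Bernoulli}(p_M)$, $B_c^{R} \sim \mathrm{Bernoulli}(p_R)$,
$p_M = \frac{1}{SN} \sum_{s=1}^{S} \sum_{i=1}^{N} \frac{r_{i,l(s)}}{n_{i,l(s)}}$ and
$p_R = \frac{1}{SJN} \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{j=1}^{J} \frac{r_{i,k_j(s)}}{n_{i,k_j(s)}}$ are,
respectively, the estimated error rates of the main and reference pools, $c = 1, \ldots, Q$ is the
index of a patient in a pool, $Q$ is assumed to be constant, and $S$ is the number of data set-ups.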
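The two error rate estimates are plain averages of relative frequencies over set-ups and positions (and, for the reference pools, over pools). A minimal sketch, assuming the counts are stored as nested lists of (non-referent count, depth) pairs; this data layout and the function names are hypothetical.

```python
def estimate_p_main(main_counts):
    """Average r/n over set-ups s and positions i.

    main_counts[s][i] = (r, n) for the main pool of set-up s at position i.
    """
    ratios = [r / n for setup in main_counts for (r, n) in setup]
    if not ratios:
        raise ValueError("no observations")
    return sum(ratios) / len(ratios)


def estimate_p_ref(ref_counts):
    """Average r/n over set-ups s, positions i and reference pools j.

    ref_counts[s][i][j] = (r, n) for reference pool j of set-up s at position i.
    """
    ratios = [
        r / n
        for setup in ref_counts
        for position in setup
        for (r, n) in position
    ]
    if not ratios:
        raise ValueError("no observations")
    return sum(ratios) / len(ratios)


# Toy data: 2 set-ups x 2 positions (main); 1 set-up x 2 positions x 2 pools (reference).
p_main = estimate_p_main([[(2, 200), (4, 200)], [(6, 200), (0, 200)]])
p_ref = estimate_p_ref([[[(1, 100), (3, 100)], [(2, 100), (2, 100)]]])
print(round(p_main, 4), round(p_ref, 4))
```

Averaging all ratios with equal weight matches the 1/(SN) and 1/(SJN) normalizations only when every set-up has the same number of positions and pools, which is assumed here.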
We do not use any prior information about the mutation status of the positions and
model the distribution of $y_{si}$ in three cases:
1. There is no information about position $i$ (neither confirmed as mutated nor
confirmed as non-mutated). Using the convolution formula, the distribution of $y_{si}$ is
$$P(y_{si} = t \mid i \text{ is general}) = \sum_{h=1}^{Q} P_{W_R^{\mathrm{gen}}}(h - t) \, P_{W_M^{\mathrm{gen}}}(h). \qquad (10)$$
In the second and the third case we use the quantity $QJ - Q + 2$ instead of $QJ$ because we
assume that the part of the main pool that is present in the reference pools is canceled out
(see Table 1). Modeled sensitivity and specificity are computed using (11) and (12),
respectively. Having calculated the modeled sensitivity and specificity for different values
of $y_{si}$, we determine the value of $y_{si}$ and compute $\alpha$ from the general distribution
expressed in (10). Finally, to assess the model, we calculate sensitivity and
specificity using positions checked with Sanger sequencing.
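The general-case distribution is the distribution of a difference of two independent binomial variables, which can be evaluated by direct convolution. The sketch below is an illustration under stated assumptions rather than the authors' code: it sums over h from 0 to Q so that the probabilities add up to 1 (the printed formula starts the sum at h = 1), it takes the significance level to be the upper tail probability at a chosen threshold, and the error rates are made-up values.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability P(X = k) for X ~ Binomial(n, p)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def y_pmf(t, q, j, p_main, p_ref):
    """P(W_M - W_R = t), W_M ~ Binomial(q, p_main), W_R ~ Binomial(q*j, p_ref)."""
    return sum(
        binom_pmf(h, q, p_main) * binom_pmf(h - t, q * j, p_ref)
        for h in range(q + 1)
    )


def upper_tail(pmf, threshold):
    """P(y >= threshold) for a pmf given as a dict {t: probability}."""
    return sum(prob for t, prob in pmf.items() if t >= threshold)


Q, J = 16, 7  # 16 patients per pool, 7 reference pools, as in the data description
p_main, p_ref = 0.03, 0.005  # hypothetical error rates
pmf = {t: y_pmf(t, Q, J, p_main, p_ref) for t in range(-Q * J, Q + 1)}
total = sum(pmf.values())
alpha = upper_tail(pmf, 1)
print(round(total, 6), round(alpha, 4))
```

For the mutated and non-mutated cases the same functions could be reused with Q*J - Q + 2 trials in the reference binomial, but the corresponding formulas (11) and (12) are not reproduced here.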
We use pooled data from 128 patients with neuromuscular disease to identify
mutated variants. Target exome regions were sequenced using the Illumina sequencing
platform. The target region consists of approximately 13,000 positions with the relative
frequency of the non-referent allele $M_{i,l(s)}$ varying from 0.01 to 0.06. The data consist
of 8 original pools, each containing 16 patients, and 8 replicated pools, which consist
of the same 128 patients but with a different pool composition (Table 1). For the
models described above we used different arrangements of main and replicated pools,
as presented in Table 2.
Every pool from the original pool group was taken as the main pool together with
7 pools from the replicated pool group as reference pools, in such a way that, in each
combination, a pair of patients from the main pool was not present in the reference
pools (the pair listed in the Patient column of Table 2). We denote every
such combination of one main pool and seven reference pools by $s$, where
$s = 1, \ldots, 64$, and call it a data set-up for the model. Positions in every data
set-up were filtered according to the main pool: positions with $M_{i,l(s)} < 0.011$ were
not considered, following the reasoning in [4] that when $M_{i,l(s)} < 0.011$ there cannot
be any mutation because of the finite number (16) of individuals in a pool.
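The 64 set-ups can be enumerated programmatically. The sketch assumes the pattern visible in Table 2: original pools are numbered 1 to 8, replicated pools 9 to 16, each set-up leaves out exactly one replicated pool, and set-up s is associated with patients 2s - 1 and 2s. The dictionary keys are hypothetical names.

```python
def build_setups(main_pools=range(1, 9), replicated_pools=range(9, 17)):
    """Enumerate data set-ups: one main pool plus all but one replicated pool."""
    setups = {}
    s = 1
    for main in main_pools:
        for left_out in replicated_pools:
            setups[s] = {
                "main": main,
                "references": [p for p in replicated_pools if p != left_out],
                # Pair of main-pool patients absent from the reference pools.
                "held_out_patients": (2 * s - 1, 2 * s),
            }
            s += 1
    return setups


setups = build_setups()
print(len(setups), setups[1], setups[64])
```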
We have a list of mutated and non-mutated positions confirmed using Sanger
sequencing, which serves as a gold standard for the calculation of
sensitivity and specificity.
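Given the Sanger-confirmed positions, sensitivity and specificity follow from the usual confusion counts. A minimal sketch with hypothetical position identifiers and calls:

```python
def sensitivity_specificity(called, truth):
    """Sensitivity and specificity of SNP calls against confirmed positions.

    called: set of positions called as mutated.
    truth:  dict mapping each Sanger-checked position to True (mutated)
            or False (non-mutated).
    """
    tp = sum(1 for pos, mutated in truth.items() if mutated and pos in called)
    fn = sum(1 for pos, mutated in truth.items() if mutated and pos not in called)
    tn = sum(1 for pos, mutated in truth.items() if not mutated and pos not in called)
    fp = sum(1 for pos, mutated in truth.items() if not mutated and pos in called)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one mutated and one non-mutated position")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical Sanger results and calls, for illustration only.
truth = {101: True, 102: True, 103: False, 104: False, 105: False, 106: True}
called = {101, 103, 106}
sens, spec = sensitivity_specificity(called, truth)
print(round(sens, 3), round(spec, 3))
```

Positions that were called but never checked with Sanger sequencing do not enter either measure.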
3 Results
Table 2 Organization of pools: for every original pool, eight combinations of replicated pools
Model set-up s | Main pool l(s) | Reference pools k(s) | Patients | Model set-up s | Main pool l(s) | Reference pools k(s) | Patients
1 | 1 | 10, 11, 12, 13, 14, 15, 16 | 1, 2 | 33 | 5 | 10, 11, 12, 13, 14, 15, 16 | 65, 66
2 | 1 | 9, 11, 12, 13, 14, 15, 16 | 3, 4 | 34 | 5 | 9, 11, 12, 13, 14, 15, 16 | 67, 68
3 | 1 | 9, 10, 12, 13, 14, 15, 16 | 5, 6 | 35 | 5 | 9, 10, 12, 13, 14, 15, 16 | 69, 70
4 | 1 | 9, 10, 11, 13, 14, 15, 16 | 7, 8 | 36 | 5 | 9, 10, 11, 13, 14, 15, 16 | 71, 72
5 | 1 | 9, 10, 11, 12, 14, 15, 16 | 9, 10 | 37 | 5 | 9, 10, 11, 12, 14, 15, 16 | 73, 74
6 | 1 | 9, 10, 11, 12, 13, 15, 16 | 11, 12 | 38 | 5 | 9, 10, 11, 12, 13, 15, 16 | 75, 76
7 | 1 | 9, 10, 11, 12, 13, 14, 16 | 13, 14 | 39 | 5 | 9, 10, 11, 12, 13, 14, 16 | 77, 78
8 | 1 | 9, 10, 11, 12, 13, 14, 15 | 15, 16 | 40 | 5 | 9, 10, 11, 12, 13, 14, 15 | 79, 80
9 | 2 | 10, 11, 12, 13, 14, 15, 16 | 17, 18 | 41 | 6 | 10, 11, 12, 13, 14, 15, 16 | 81, 82
10 | 2 | 9, 11, 12, 13, 14, 15, 16 | 19, 20 | 42 | 6 | 9, 11, 12, 13, 14, 15, 16 | 83, 84
11 | 2 | 9, 10, 12, 13, 14, 15, 16 | 21, 22 | 43 | 6 | 9, 10, 12, 13, 14, 15, 16 | 85, 86
12 | 2 | 9, 10, 11, 13, 14, 15, 16 | 23, 24 | 44 | 6 | 9, 10, 11, 13, 14, 15, 16 | 87, 88
13 | 2 | 9, 10, 11, 12, 14, 15, 16 | 25, 26 | 45 | 6 | 9, 10, 11, 12, 14, 15, 16 | 89, 90
14 | 2 | 9, 10, 11, 12, 13, 15, 16 | 27, 28 | 46 | 6 | 9, 10, 11, 12, 13, 15, 16 | 91, 92
15 | 2 | 9, 10, 11, 12, 13, 14, 16 | 29, 30 | 47 | 6 | 9, 10, 11, 12, 13, 14, 16 | 93, 94
16 | 2 | 9, 10, 11, 12, 13, 14, 15 | 31, 32 | 48 | 6 | 9, 10, 11, 12, 13, 14, 15 | 95, 96
17 | 3 | 10, 11, 12, 13, 14, 15, 16 | 33, 34 | 49 | 7 | 10, 11, 12, 13, 14, 15, 16 | 97, 98
18 | 3 | 9, 11, 12, 13, 14, 15, 16 | 35, 36 | 50 | 7 | 9, 11, 12, 13, 14, 15, 16 | 99, 100
19 | 3 | 9, 10, 12, 13, 14, 15, 16 | 37, 38 | 51 | 7 | 9, 10, 12, 13, 14, 15, 16 | 101, 102
20 | 3 | 9, 10, 11, 13, 14, 15, 16 | 39, 40 | 52 | 7 | 9, 10, 11, 13, 14, 15, 16 | 103, 104
21 | 3 | 9, 10, 11, 12, 14, 15, 16 | 41, 42 | 53 | 7 | 9, 10, 11, 12, 14, 15, 16 | 105, 106
22 | 3 | 9, 10, 11, 12, 13, 15, 16 | 43, 44 | 54 | 7 | 9, 10, 11, 12, 13, 15, 16 | 107, 108
23 | 3 | 9, 10, 11, 12, 13, 14, 16 | 45, 46 | 55 | 7 | 9, 10, 11, 12, 13, 14, 16 | 109, 110
24 | 3 | 9, 10, 11, 12, 13, 14, 15 | 47, 48 | 56 | 7 | 9, 10, 11, 12, 13, 14, 15 | 111, 112
25 | 4 | 10, 11, 12, 13, 14, 15, 16 | 49, 50 | 57 | 8 | 10, 11, 12, 13, 14, 15, 16 | 113, 114
26 | 4 | 9, 11, 12, 13, 14, 15, 16 | 51, 52 | 58 | 8 | 9, 11, 12, 13, 14, 15, 16 | 115, 116
27 | 4 | 9, 10, 12, 13, 14, 15, 16 | 53, 54 | 59 | 8 | 9, 10, 12, 13, 14, 15, 16 | 117, 118
28 | 4 | 9, 10, 11, 13, 14, 15, 16 | 55, 56 | 60 | 8 | 9, 10, 11, 13, 14, 15, 16 | 119, 120
29 | 4 | 9, 10, 11, 12, 14, 15, 16 | 57, 58 | 61 | 8 | 9, 10, 11, 12, 14, 15, 16 | 121, 122
30 | 4 | 9, 10, 11, 12, 13, 15, 16 | 59, 60 | 62 | 8 | 9, 10, 11, 12, 13, 15, 16 | 123, 124
31 | 4 | 9, 10, 11, 12, 13, 14, 16 | 61, 62 | 63 | 8 | 9, 10, 11, 12, 13, 14, 16 | 125, 126
32 | 4 | 9, 10, 11, 12, 13, 14, 15 | 63, 64 | 64 | 8 | 9, 10, 11, 12, 13, 14, 15 | 127, 128
[Figure 1 legend: general positions, mutated positions, non-mutated positions, SNP calling threshold]
Fig. 1 The X axis shows all sequenced positions; the Y axis shows the values of $y_{si}$ for
model set-up $s = 1$
Fig. 2 The X axis shows the values of $y_{si}$; the Y axis shows the modeled probabilities of
the $y_{si}$ values for general, mutated, and non-mutated positions
sensitivity $= P(y_{si} \ge y_{si}^{0} \mid i \text{ is mutated})$ and modeled specificity
$= P(y_{si} < y_{si}^{0} \mid i \text{ is non-mutated})$ for every value $y_{si}^{0}$. Our objective is
to find the $y_{si}^{0}$ for which both modeled specificity and sensitivity are as close to 1 as
possible. We found that only at $y_{si}^{0} = 1$ were both modeled sensitivity and specificity
larger than 0.7 (0.73 and 1, respectively), with $\alpha \approx 0.03$. We use the computed value
$\alpha = 0.03$ in (7) to compute the empirical quantile $q_\alpha$, which is the
threshold used to distinguish between mutated and non-mutated positions.
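Thresholding by an empirical quantile can be sketched as follows. The exact definition of the quantile used by the method is not reproduced in the text, so the code assumes that the threshold is the empirical (1 - alpha) quantile of the statistic over all positions and that positions strictly above it are called as mutated. The toy values and alpha = 0.25 are chosen only to keep the example small; the text reports alpha = 0.03 for the real data.

```python
import math


def empirical_quantile(values, prob):
    """Smallest observed value v with at least a fraction `prob` of values <= v."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < prob <= 1:
        raise ValueError("prob must be in (0, 1]")
    ordered = sorted(values)
    k = math.ceil(prob * len(ordered))
    return ordered[max(k, 1) - 1]


def call_positions(y_values, alpha):
    """Return the threshold and the indices of positions above it."""
    threshold = empirical_quantile(y_values, 1 - alpha)
    called = [i for i, y in enumerate(y_values) if y > threshold]
    return threshold, called


# Toy statistic values for 8 positions.
y_values = [-0.5, -0.25, 0.0, -0.125, 0.5, -0.375, 0.25, -0.75]
threshold, called = call_positions(y_values, alpha=0.25)
print(threshold, called)
```

Because the threshold is read off the observed values themselves, no distributional form has to be assumed at the calling step; the binomial model enters only through the choice of alpha.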
For model performance evaluation and comparison, we calculated the sensitivity
and specificity of the Beta-binomial model, the modified Beta-binomial model, and
the empirical quantile method using positions checked with Sanger sequencing. As
there are some methodological differences in how the significance level is chosen
($\alpha$ is selected for the Beta-binomial models and estimated for the empirical
quantile method), we present sensitivity and specificity results at three different
levels (see Table 3). $\alpha = 10^{-6}$ was used in [5] to account for multiple testing,
and $\alpha = 0.03$ was estimated in the empirical quantile method. The empirical
quantile method gives better sensitivity for $\alpha$ equal to $10^{-3}$ and 0.03. An
additional advantage of the empirical quantile method is speed: it takes approximately
4–5 s to estimate the mutated positions of all individuals, whereas fitting the
Beta-binomial model takes approximately 1 week. We can therefore conclude that the
empirical quantile method is applicable to the detection of mutated positions in
pooled NGS experiments.
However, the empirical quantile method might be extended in several ways, as
it relies on some strong assumptions and has some limitations: (1) model parameters
are not position dependent; (2) contributions of individuals to pools are equal;
(3) sequencing error is not modeled; (4) read errors are independent between pools
and positions; (5) the selection of $\alpha$ must be done by the researcher; (6) the
method depends on the experiment structure; and (7) the model does not take into
account that observed frequencies in pools that have at least one individual in common
are statistically dependent. These assumptions could be relaxed if, for example, a
Poisson-Binomial distribution were considered instead of the Binomial, dependence
between pools and positions were taken into account, weighted sums were calculated
instead of sums, and the selection of $\alpha$ were automated.
There are several articles in which the error rate across positions of the target region
is modeled. In [9] an empirical quantile is computed with a prescribed $\alpha$ and a Poisson
distribution is used to compute the probability of a SNP. In [14] a Poisson distribution
is also assumed as the read error distribution, the parameter of the Poisson distribution
is calculated from the average error rate across positions, and a threshold equal to 0.001
is used for SNP calling. These methods have the limitations that the distribution of read
errors is assumed, the parameter of the Poisson distribution is assumed constant across
positions, and the SNP calling threshold is chosen without knowledge of the analyzed data.
Therefore, the ability to select the SNP calling threshold from the data is an advantage of
the proposed empirical quantile method.
4 Concluding Remarks
References
1. Altmann, A., Weber, P., Quast, C., Rex-Haffner, M., Binder, E.B., Müller-Myhsok, B.: vipR:
variant identification in pooled DNA using R. Bioinformatics 27, 77–84 (2011)
2. Bansal, V.: A statistical method for the detection of variants from next-generation resequencing
of DNA pools. Bioinformatics 26, 318–324 (2010)
3. Chen, Q., Sun, F.: A unified approach for allele frequency estimation, SNP detection and
association studies based on pooled sequencing data using EM algorithms. BMC Genomics
14, 1–14 (2013)
4. Ferraro, M.B., Savarese, M., di Fruscio, G., Nigro, V., Guarracino, M.R.: Prediction of
rare single-nucleotide causative mutations for muscular diseases in pooled next-generation
sequencing experiments. J. Comput. Biol. 21, 665–675 (2014)
5. Flaherty, P., Natsoulis, G., Muralidharan, O., Winters, M., Buenrostro, J., Bell, J., Brown,
S., Holodniy, M., Zhang, N., Ji, H.P.: Ultrasensitive detection of rare mutations using next-
generation targeted resequencing. Nucleic Acids Res. 40, 861–872 (2011)
6. He, Y., Zhang, F., Flaherty, P.: RVD2: an ultra-sensitive variant detection model for low-depth
heterogeneous next-generation sequencing data. Bioinformatics 31(17), 2785–2793 (2015)
7. Mardis, E.R.: A decade’s perspective on DNA sequencing technology. Nature 470, 198–203
(2011)
8. Nielsen, R., Paul, J.S., Albrechtsen, A., Song, Y.S.: Genotype and SNP calling from next-
generation sequencing data. Nat. Rev. Genet. 12, 443–451 (2011)
9. Out, A.A., van Minderhout, I.J.H.M., Goeman, J.J., Ariyurek, Y., Ossowski, S., Schneeberger,
K., Weigel, D., van Galen, M., Taschner, P.E.M., Tops, C.M.J., Breuning, M.H., van Ommen,
G.-J.B., den Dunnen, J.T., Devilee, P., Hes, F.J.: Deep sequencing to reveal new variants in
pooled DNA samples. Hum. Mutat. 9, 1703–1712 (2009)
10. Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., Krabichler, B.,
Speicher, M.R., Zschocke, J., Trajanoski, Z.: A survey of tools for variant analysis of next-
generation genome sequencing data. Brief. Bioinform. 15(2), 256–278 (2012)
11. Raineri, E., Ferretti, L., Esteve-Codina, A., Nevado, B., Heath, S.: SNP calling by sequencing
pooled samples. BMC Bioinf. 13, 239–246 (2012)
12. Spencer, D.H., Tyagi, M.M., Vallania, F., Bredemeyer, A.J., Pfeifer, J.D., Mitra, R.D.,
Duncavage, E.J.: Performance of common analysis methods for detecting low-frequency single
nucleotide variants in targeted next-generation sequence data. J. Mol. Diagn. 16, 75–88 (2014)
13. Vallania, F.L.M., Druley, T.E., Ramos, E., Wang, J., Borecki, I., Province, M., Mitra, R.D.:
Quantification of rare allelic variants from pooled genomic DNA. Nat. Methods 6, 263–265
(2009)
14. Wang, C., Mitsuya, Y., Gharizadeh, B., Ronaghi, M., Shafer, R.W.: Characterization of muta-
tion spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome
Res. 17, 1195–1201 (2007)
15. Wei, Z., Wang, W., Hu, P., Lyon, G.J., Hakonarson, H.: SNVer: a statistical tool for variant
calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids
Res. 39, 132–144 (2011)
An Overview of Genotyping by Sequencing
in Crop Species and Its Application in Pepper
1 Introduction
The accessibility and use of natural genetic variation in plant breeding is currently
restricted due to gaps in the genetic information that limit the comparison of
germplasm accessions of different crops. The generation of novel varieties and
the establishment of innovative breeding programs play a crucial role in food
security and nutrition. In the last century, breeding programs have led to the
selection of a small number of cultivars carrying genes for resistance to diseases
and pests, with higher and more uniform yield [1], implying a reduction of
genetic diversity. To counter this trend, international efforts are focusing
on the recovery, protection and assessment of biodiversity and on the promotion
of the sustainable use of plant genetic resources. Plant collections are assembled
over time using both local ecotypes, selected on the basis of a recognizable
phenotype, and well-adapted crops selected for their fitness in climate change-
affected production systems. The use of wild relatives and under-utilized varieties
is instead challenging because their genetic potential is unexplored.
Crop improvement programs have reaped the benefits of cutting-edge technologies
in biological science, particularly in the form of molecular markers, which, in
combination with conventional phenotype-based selection, define modern plant
breeding practices. Molecular markers are extremely useful in plants to characterize
germplasm collections and improve conventional plant breeding schemes
through marker-assisted selection (MAS). Different molecular markers have been
successfully applied for genetic mapping [2], to infer phylogenetic relationships
[3, 4], and for the development of mapped genetic resources [5, 6] and comparative
studies [7, 8].
Among various types of markers in use, single nucleotide polymorphisms (SNPs)
are abundant in plant genomes; however, before the advent of next generation
sequencing (NGS) technologies, they were considered costly for application in plant
breeding [9, 10]. NGS is used for both whole genome sequencing and re-sequencing
projects, leading to the discovery of a large number of SNPs useful to explore inter-
and intra-species nucleotide diversity. As a consequence, SNPs have become the
primary choice for many genetic studies thanks to their flexibility, speed and
cost-effectiveness [11], inducing plant breeders to use them in their programs.
Genotyping by sequencing (GBS) has recently emerged as an innovative genomic
approach for exploring plant genetic diversity on a genome-wide scale [12, 13].
GBS is based on genome reduction with restriction enzymes; it does not require a
reference genome for SNP discovery and provides a rapid, high-throughput and
cost-effective tool for the investigation of genetic variability in model and
non-model species. Here we provide an overview of the GBS method through the
description of the main protocols in use and their applications in plants. In addition,
research activity investigating the genetic diversity of cultivated pepper
(Capsicum annuum) is illustrated.
GBS was first introduced in plant science by Elshire et al. [13], and to date it is one
of the most powerful applications in the field of plant breeding. The information
derived from GBS experiments has been widely used in genomic diversity studies
and molecular marker discovery, genome-wide association studies (GWAS), genetic
linkage analysis and genomic selection [9]. GBS can be performed through either a
reduced-representation or a whole genome re-sequencing approach [12], generating
a large amount of genome-wide SNP data. It does not require any a priori knowledge
of the genome of the species of interest, though studies have mainly been
carried out in species with reference genomes because SNP genotyping is much
easier when a reference genome is available. Furthermore, GBS typically shows
good results when it is applied to an inbred diploid species with a well-established
reference genome, as in the case of barley, maize, sorghum and brassica [9, 13].
Some studies have also made progress towards GBS of out-crossing species
lacking reference genomes and of many agriculturally important polyploid crops
such as wheat, cotton and potato [9, 14, 15]. Despite its benefits, GBS has some
limitations, such as the presence of a large amount of missing data, largely due to the
use of low coverage sequencing and uneven genome coverage [16].
The GBS protocol includes four major steps: (1) sample preparation, (2) NGS
library construction, (3) SNP discovery and (4) genetic analysis (Fig. 1).
Peterson et al. [16], the GBS protocol uses the Illumina ‘Generate FASTQ’
workflow, the ‘FASTQ Only’ application and ‘TruSeq HT’ assay to generate a
de-multiplexed set of FASTQ files with the adapter sequences removed upon
completion of the sequencing run. After the sequencing run, the raw data are downloaded.
Each sample has two FASTQ files representing the forward and reverse sequenced
reads. FASTQ files are text-based files that store biological sequences (as in FASTA)
together with embedded per-base quality scores.
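As a rough illustration of the FASTQ layout described above, the following Python sketch reads four-line records and decodes the quality string into numeric scores. It assumes the common Phred+33 encoding and unwrapped records; the names parse_fastq and FastqRecord are hypothetical and are not part of any pipeline cited in this chapter.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality_line):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        qualities = [ord(ch) - phred_offset for ch in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# Toy input: one 7-base read; 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(parse_fastq(example))
```

In practice a dedicated parser would normally be preferred; the sketch is only meant to show how sequence and quality information sit side by side in each record.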
Different computational pipelines have been specifically developed for SNP discovery
and genotyping from FASTQ files. The TASSEL-GBS Discovery Pipeline
is the most widely used in diploid plants with a reference genome [19]. It uses the first
64 nucleotides (nts) of the reads to minimize the effects of sequencing errors.
As mentioned above, the sequencing produces millions of reads, split across multiple
FASTQ files. All unique sequence tags from each sequence file are captured and
then collapsed to generate a master tag file. The alignment of the unique 64-nt
reads (tags) to the reference genome is carried out using Bowtie2 [20] or BWA.
A ‘TagsOnPhysicalMap’ (TOPM) file is returned as output and can be used for
SNP calling. SNP calling is carried out for each set of tags originating from the same
restriction enzyme cut site. All tags in a set align to exactly the same starting genomic
position and strand, where the starting genomic position of a tag is identified by
the cut site residue at the beginning of the tag. The raw SNP output produced by
the TASSEL-GBS pipeline is further filtered according to the aims of the study. Usually, the
parameters considered are the inbreeding coefficient (F_IT) and the minimum minor allele
frequency (mnMAF). F_IT is largely used to filter SNPs from NGS data in inbred lines
[21] and it is calculated based on the expectation–maximization (EM) algorithm
[22]. In GBS analysis, spurious SNPs will appear to be excessively heterozygous,
so it is necessary to calculate F_IT and apply a minimum F_IT filter, generally 0.8
[19]. To detect and filter out error-prone SNPs, the TASSEL-GBS pipeline relies on
population-genetic parameters such as MAF. The minimum filter is generally
set to MAF > 0.01. Minimum minor allele count (mnMAC) and minimum locus
coverage (mnLCov) are two additional parameters used in GBS analysis to count
the number of minor alleles for each marker and to evaluate the proportion of taxa
with a genotype, respectively [19].
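The filtering logic described above can be sketched in a few lines of Python. This is not the TASSEL-GBS implementation: genotypes are assumed to be coded as 0, 1 or 2 copies of the alternate allele (None for missing), and the inbreeding coefficient is approximated with the simple moment estimate F = 1 - H_obs / H_exp rather than the EM-based estimate cited in the text. The function names and the default min_coverage value are hypothetical; the 0.01 and 0.8 thresholds are those mentioned above.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1 or 2 copies of the alternate allele; None = missing


def snp_summary(genotypes: List[Genotype]) -> Optional[Dict[str, Optional[float]]]:
    """Return MAF, a moment estimate of F, and locus coverage for one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    n = len(called)
    alt_freq = sum(called) / (2 * n)
    maf = min(alt_freq, 1.0 - alt_freq)
    h_obs = sum(1 for g in called if g == 1) / n   # observed heterozygosity
    h_exp = 2.0 * alt_freq * (1.0 - alt_freq)      # expected under HWE
    # Assumption: simple moment estimator, not the EM estimate used in [21, 22].
    f_coef = 1.0 - h_obs / h_exp if h_exp > 0 else None
    return {"maf": maf, "f": f_coef, "coverage": n / len(genotypes)}


def passes_filters(genotypes: List[Genotype], min_maf: float = 0.01,
                   min_f: float = 0.8, min_coverage: float = 0.0) -> bool:
    """Keep a SNP only if MAF, inbreeding coefficient and coverage are acceptable."""
    stats = snp_summary(genotypes)
    if stats is None or stats["f"] is None:  # all missing or monomorphic
        return False
    return (stats["maf"] > min_maf
            and stats["f"] >= min_f
            and stats["coverage"] >= min_coverage)


# Toy SNPs: a fully homozygous marker in inbred lines vs. an excessively heterozygous one.
inbred_snp = [0, 0, 0, 0, 2, 2, 2, 2, None, 0]
spurious_snp = [1, 1, 1, 1, 1, 1, 0, 2]
kept = [passes_filters(inbred_snp), passes_filters(spurious_snp)]
```

The excessively heterozygous marker yields a negative F and is discarded, which mirrors the reasoning given for the minimum F_IT filter.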
The TASSEL-GBS pipeline provides SNP calls in both HapMap and VCF
formats. The pipeline provides two sets of HapMap files: (1) a set without post
SNP calling filtering; (2) a set with additional filtering on missingness and allele
frequency. VCF format is an alternative format for holding SNP information that
retains information on depth of coverage for each allele, and the genotype likelihood
scores are calculated according to Etter et al. [23]. Specific software packages, such
as VCFtools and VCFlib, have been developed for working with and manipulating
VCF files [24]. For species with no reference genomes, a network-based algorithm
106 F. Taranto et al.
The output data files generated from the bioinformatic pipelines are widely used in
different genetic studies including conventional analysis to evaluate heterozygosity
and genetic relationships among individuals, genetic diversity and population
structure in large germplasm collections, high-density linkage map development,
phylogenetic and association mapping studies. Each of these aspects requires
complex analysis. In the next paragraph, a brief overview of the main methods used
for genetic diversity studies is given.
The study of genetic variation is of great interest for trait association analysis and
evolutionary research. Given the large amount of SNP data, the first step is to
investigate the population structure (the presence of genetic differences among groups
of individuals and their assignment to different clusters based on allele frequency).
So far, several algorithms have been proposed, which can be divided into two major
computational paradigms: parametric and non-parametric. Parametric approaches
assume a model in which there are K populations, each of which is characterized by
a set of allele frequencies at each locus. The assignment of individuals to a specific
cluster is based on a statistical likelihood method, using assumptions such as
Hardy–Weinberg equilibrium (HWE) for each marker and linkage equilibrium (LE) among
markers [25]. The structure paradigm consists of a model-based clustering approach
to infer the presence of distinct populations, assign each individual to a population
and estimate ancestral population allele frequencies based on a statistical method
known as the allele-frequency admixture model [26]. The most popular software
to investigate the genetic structure in plants is STRUCTURE [26], although, in
recent years, the use of the ADMIXTURE program [27, 28] has been growing. Both programs
use the same statistical model and input files (i.e. HapMap from the TASSEL-GBS
pipeline), although ADMIXTURE runs much faster since it employs
a fast numerical optimization. STRUCTURE uses a Markov Chain Monte Carlo
(MCMC) stochastic algorithm to produce sample-based estimates of a target
distribution of choice and a Bayesian approach based on the posterior distribution of
defined population quantities. ADMIXTURE employs the same likelihood model
but focuses on maximizing the likelihood rather than sampling the posterior distribution.
ADMIXTURE makes the further assumption of linkage equilibrium among the
markers, so dense marker sets should be pruned to mitigate background linkage
disequilibrium (LD).
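To make the shared likelihood model more concrete, the sketch below evaluates the admixture log-likelihood for given ancestry fractions Q and population allele frequencies P. Under HWE and LE, the allele frequency expected for individual i at SNP j is sum_k q[i][k] * p[k][j], and each genotype contributes a binomial term. The function name, the 0/1/2 genotype coding and the clamping constant are assumptions made for illustration; neither STRUCTURE nor ADMIXTURE is called here, and the constant binomial coefficient is omitted.

```python
import math
from typing import List, Optional


def admixture_log_likelihood(genotypes: List[List[Optional[int]]],
                             q: List[List[float]],
                             p: List[List[float]],
                             eps: float = 1e-9) -> float:
    """Log-likelihood of genotypes (0/1/2 allele copies) under the admixture model.

    q[i][k]: fraction of individual i's genome from population k (rows sum to 1).
    p[k][j]: frequency of the counted allele at SNP j in population k.
    """
    n_pops = len(p)
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            if g is None:  # missing genotypes contribute nothing
                continue
            freq = sum(q[i][k] * p[k][j] for k in range(n_pops))
            freq = min(max(freq, eps), 1.0 - eps)  # guard against log(0)
            total += g * math.log(freq) + (2 - g) * math.log(1.0 - freq)
    return total


# Toy data: two individuals, two SNPs, K = 2 populations with contrasting frequencies.
toy_genotypes = [[2, 2], [0, 0]]
toy_p = [[0.9, 0.9], [0.1, 0.1]]
ll_good = admixture_log_likelihood(toy_genotypes, [[1.0, 0.0], [0.0, 1.0]], toy_p)
ll_bad = admixture_log_likelihood(toy_genotypes, [[0.0, 1.0], [1.0, 0.0]], toy_p)
```

A maximum-likelihood program searches for the Q and P that maximize this quantity, whereas an MCMC-based program samples Q and P from the posterior built on the same likelihood; in the toy data, the assignment that matches the genotypes gives the higher value.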
run ( 25 million reads in total) with 90 mapping individuals plus parents (three
redundant samples each), resulting in a total of 576 SNPs genetically mapped with
the aid of the reference genome [9]. In rice, 30,894 SNPs were identified on 176
RILs and used to map the recombined hot and cold spots and QTLs for leaf width
and aluminium tolerance [34].
Few studies have been performed on germplasm collections to characterize the
genetic structure and to provide a tool for association mapping analysis for complex
traits. As an example we report the work by Nimmakayala et al. [32], where the
genetic structure of 183 domesticated watermelon accessions is investigated using a
data set of 11,485 SNPs. Based on 5,254 filtered SNPs, linkage disequilibrium and
population structure were estimated in order to identify agronomically important
candidate genes. GBS has also been used for marker development in cassava
(Manihot esculenta Crantz). Using a set of 917 accessions, 56,489 SNP loci were
genotyped to assess population structure and perform varietal identification [39].
GBS has also been applied in polyploid species such as potato, wheat and cotton. In
potato, 12.4 gigabases of high-quality sequence data were obtained and 129,156 sequence
variants were identified [15]. In bread wheat, GBS was used to develop a high density
map of 20,000 SNPs. To further evaluate GBS in wheat, a de novo genetic map was
also constructed using only SNP markers from the GBS experiment. The GBS approach
thus provides a powerful method of developing high density markers in
species without a sequenced genome while providing valuable tools for anchoring
and ordering physical maps and whole genome shotgun sequences [17].
Successful results in marker-assisted breeding programs have also been reported. In cotton,
GBS was used to genotype two BC4F1 populations and design strategies to obtain
near isogenic lines (NILs) [9]. Two reciprocal sets of NILs were developed by introgression
between two tetraploid species. In the first, 956 SNPs were used to genotype
39 individuals, which resulted in a total finding of 106 introgressions on average.
The second set consisted of 39 individuals genotyped with 914 SNPs for a total
of 114 introgressions. In pepper, GBS technology was used to develop a marker-assisted
backcrossing (MABC) program for the constitution of new pepper varieties
containing capsinoids, starting from BC1F1 and BC2F1 populations [44].
Despite the economic and nutritional importance of Solanaceae and the
huge variability within the family, analytical studies on the genetic variability in germplasm
collections using GBS are lacking. In the next paragraph we illustrate our research
activity aiming to investigate genetic diversity in a population of cultivated pepper
(Capsicum annuum) accessions.
[Figure: values plotted against K (horizontal axis, K = 2 to 12; vertical axis ticks 5000 to 15000)]
Fig. 3 Estimate of genetic diversity of C. annuum accessions using GBS-SNP markers. Bar plot
describing the population structure estimated by Bayesian clustering. Each individual is
represented by a thin vertical line, which is partitioned into K coloured segments whose length
is proportional to the estimated membership coefficient (q). Population structure at (a) K = 3, (b)
K = 6, (c) K = 10 is reported. Three, six and ten groups are identified, respectively. The asterisk
shows the most informative K value (K = 3). Genotypes retrieved from the same geographical
areas are represented by yellow and brown lines at K = 6 and K = 10, respectively
possible to distinguish the accessions considering both geographical origin and fruit
characteristics. Detailed assessment of morphological fruit-related characteristics
was carried out using automated analysis tools (i.e. chroma meter, 2D
scanner). In total, over 300 thousand data points for 38 fruit size and shape
attributes were obtained. The main phenotypic variation was due to fruit size traits
(i.e. perimeter, area, fruit height and fruit width), which could be considered the
most relevant attributes for breeding new varieties. In order to identify genomic
regions responsible for the phenotypic variation, high-quality SNPs (mnMAF 0.01,
coverage 90 %) were further selected. A first attempt to associate SNP alleles with
morphological traits was carried out using a General Linear Model. Several
SNPs highly correlated with the phenotypic variation were identified. For the main
traits responsible for fruit size variation, as well as for shape traits of high interest
in breeding, highly correlated SNPs were detected on chromosomes 2, 3, 6 and 9.
The next step will involve the integration of a parametric (STRUCTURE) with a
non-parametric approach (AWclust) in order to better refine the population structure with
the aim of selecting a core set of accessions. Moreover, a Mixed Linear Model will be
used for future association mapping analysis.
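The idea behind a single-marker association test can be sketched as an ordinary least-squares regression of a trait on allele dosage. This is a deliberately simplified stand-in for the General Linear Model mentioned above: it includes no population-structure covariates and computes no p-value, and the function name and toy values are hypothetical.

```python
from typing import List, Tuple


def marker_trait_regression(dosages: List[float],
                            phenotypes: List[float]) -> Tuple[float, float, float]:
    """Regress a trait on allele dosage (0/1/2); return slope, intercept and R^2."""
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic marker or constant trait")
    slope = sxy / sxx                     # estimated effect of one allele copy
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # share of trait variance explained
    return slope, intercept, r_squared


# Toy data: a made-up fruit-size trait that increases by 2 units per allele copy.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_trait = [10.0, 10.0, 12.0, 12.0, 14.0, 14.0]
slope, intercept, r_squared = marker_trait_regression(toy_dosages, toy_trait)
```

Repeating such a fit for every filtered SNP and ranking markers by the variance explained gives a first, uncorrected picture of marker-trait correlation; a Mixed Linear Model additionally accounts for relatedness among accessions.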
7 Conclusion
Acknowledgements This project was supported by the ‘GenHort’ project funded by the Italian
Ministry of University and Research (MIUR, PON02_00395_3215002) and the ‘PEPIC’ project
funded by the Italian Ministry of Agricultural, Food and Forestry.
References
1. Hammer, K., Arrowsmith, N., Gladis, T.: Agrobiodiversity with emphasis on plant genetic
resources. Naturwissenschaften 90, 241–250 (2003)
2. Pei, C., Wang, H., Zhang, J., Wang, Y., Francis, D.M., Yang, W.: Fine mapping and analysis of a
candidate gene in tomato accession PI128216 conferring hypersensitive resistance to bacterial
spot race T3. Theor. Appl. Genet. 124(3), 533–542 (2012)
3. Di Dato, F., Parisi, M., Cardi, T., Tripodi, P.: Genetic diversity and assessment of markers linked
to resistance and pungency genes in Capsicum germplasm. Euphytica 1, 103–119 (2015)
4. Xu, Y., Ma, R.C., Xie, H., Liu, J.T., Cao, M.Q.: Development of SSR markers for the
phylogenetic analysis of almond trees from China and the Mediterranean region. Genome
47(6), 1091–1104 (2004)
5. Alseekh, S., Ofner, I., Pleban, T., Tripodi, P., Di Dato, F., Cammareri, M., Mohammad, A.,
Grandillo, S., Fernie, A.R., Zamir, D.: Resolution by recombination: breaking up Solanum
pennellii introgressions. Trends Plant Sci. 18(10), 536–538 (2013)
6. Laidò, G., Mangini, G., Taranto, F., Gadaleta, A., Blanco, A., Cattivelli, L., Marone, D.,
Mastrangelo, A.M., Papa, R., De Vita, P.: Genetic diversity and population structure of
tetraploid wheats (Triticum turgidum L.) estimated by SSR, DArT and Pedigree Data. PLoS
One 8(6), e67280 (2013)
7. Wu, F., Eannetta, N.T., Durrett, Y.X.R., Mazourek, M., Jahn, M.M., Tanksley, S.D.: A COSII
genetic map of the pepper genome provides a detailed picture of synteny with tomato and new
insights into recent chromosome evolution in the genus Capsicum. Theor. Appl. Genet. 118,
1279–1293 (2009)
8. Wu, F., Eannetta, N.T., Xu, Y., Plieske, J., Ganal, M., Pozzi, C., Bakaher, N., Tanksley, S.D.:
COSII genetic maps of two diploid Nicotiana species provide a detailed picture of synteny with
tomato and insights into chromosome evolution in tetraploid N. tabacum. Theor. Appl. Genet.
120(4), 809–827 (2010)
9. Kim, C., Guo, H., Kong, W., Chandnani, R., Shuang, L.S., Paterson, A.H.: Application of
genotyping-by-sequencing technology to a variety of crop breeding programs. Plant Sci. 242,
12–14 (2016)
10. Rafalski, A.: Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin.
Plant Biol. 5, 94–100 (2002)
11. He, J., Zhao, X., Laroche, A., Lu, Z.X., Liu, H., Li, Z.: Genotyping-by-sequencing (GBS), an
ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front. Plant Sci. 5,
484 (2014)
12. Deschamps, S., Llaca, V., May, G.D.: Genotyping-by-sequencing in plants. Biology 1, 460–483
(2012)
13. Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K., Buckler, E.S., Mitchell, S.E.:
A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS
One 6(5), e19379 (2011)
14. Poland, J.A., Brown, P.J., Sorrells, M.E., Jannink, J.L.: Development of high-density genetic
maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach.
PLoS One 7, e32253 (2012)
15. Uitdewilligen, J.G., Wolters, A.A., D’hoop, B.B., Borm, T.J., Visser, R.G., van Eck, H.J.:
A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous
autotetraploid potato. PLoS One 8(5), e62355 (2013)
16. Peterson, G.W., Dong, Y., Horbach, C., Fu, Y.B.: Genotyping-by-sequencing for plant genetic
diversity analysis: a lab guide for SNP genotyping. Diversity 6, 665–680 (2014)
17. Poland, J., Endelman, J., Dawson, J., Rutkoski, J., Wu, S., et al.: Genomic selection in wheat
breeding using genotyping-by-sequencing. Plant Genome 5, 103–113 (2012)
18. Sonah, H., Bastien, M., Iquira, E., Tardivel, A., Legare, G., Boyle, B., Normandeau, E.,
Laroche, J., Larose, S., Jean, M., Belzile, F.: An improved genotyping-by-sequencing (GBS)
approach offering increased versatility and efficiency of SNP discovery and genotyping. PLoS
One 8(1), e54603 (2013)
19. Glaubitz, J.C., Casstevens, T.M., Lu, F., Harriman, J., Elshire, R.J., Sun, Q., Buckler, E.S.:
TASSEL-GBS: a high capacity genotyping-by-sequencing analysis pipeline. PLoS One 9(2),
e90346 (2014)
20. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9,
357–359 (2012)
21. Vieira, F.G., Fumagalli, M., Albrechtsen, A., Nielsen, R.: Estimating inbreeding coefficients
from NGS data: impact on genotype calling and allele frequency estimation. Genome Res. 23,
1852–1861 (2013)
22. Smith, C.A.B., Thomson, R.: Estimation of inbreeding from population samples. J. Appl.
Probab. 25, 127–135 (1988)
23. Etter, P.D., Bassham, S., Hohenlohe, P.A., Johnson, E.A., Cresko, W.A.: SNP discovery and
genotyping for evolutionary genetics using RAD sequencing. Methods Mol. Biol. 772, 157–
178 (2011)
24. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., Depristo, M.A., Handsaker, R.,
Lunter, G., Marth, G., Sherry, S.T., McVean, G., Durbin, R., 1000 Genomes Project Analysis
Group: The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011)
25. Deejai, P., Assawamakin, A., Wangkumhang, P., Poomputsa, K., Tongsima, S.: On assigning
individuals from cryptic population structures to optimal predicted subpopulations: an empiri-
cal evaluation of non-parametric population structure analysis techniques. Comput. Syst. Biol.
Bioinform. 115, 58–70 (2010)
26. Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of population structure using multilocus
genotype data. Genetics 155, 945–959 (2000)
27. Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in
unrelated individuals. Genome Res. 19(9), 1655–1664 (2009)
28. Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., Francois, O.: Fast and efficient estimation
of individual ancestry coefficients. Genetics 196(4), 973–983 (2014)
29. Alexander, D.H., Lange, K.: Enhancements to the ADMIXTURE algorithm for individual
ancestry estimation. BMC Bioinf. 12, 246 (2011)
30. Whitley, E., Ball, J.: Statistics review 6: nonparametric methods. Crit. Care 6, 509–513 (2002)
31. Gao, X., Martin, E.R.: Using allele sharing distance for detecting human population stratifica-
tion. Hum. Hered. 68, 182–191 (2009)
32. Nimmakayala, P., Levi, A., Abburi, L., Abburi, V.L., Tomason, Y.R., Saminathan, T., Vajja,
V.G., Malkaram, G., Reddy, R., Wehner, T.C., Mitchell, S.E., Reddy, U.K.: Single nucleotide
polymorphisms generated by genotyping-by-sequencing to characterize genome-wide diver-
sity, linkage disequilibrium, and selective sweeps in cultivated watermelon. BMC Genomics
15, 767 (2014)
33. Romay, M.C., Millard, M.J., Glaubitz, J.C., Peiffer, J.A., Swarts, K.L., Casstevens, T.M.,
Elshire, R.J., Acharya, C.B., Mitchell, S.E., Flint-Garcia, S.A., McMullen, M.D., Holland, J.B.,
Buckler, E.S., Gardner, C.A.: Comprehensive genotyping of the USA national maize inbred
seed bank. Genome Biol. 14, R55 (2013)
34. Spindel, J., Wright, M., Chen, C., Cobb, J., Gage, J., Harrington, S.: Bridging the genotyping
gap: using genotyping-by-sequencing (GBS) to add high-density SNP markers and new value
to traditional bi-parental mapping and breeding populations. Theor. Appl. Genet. 126, 2699–
2716 (2013)
35. Jarquín, D., Kocak, K., Posadas, L., Hyma, K., Jedlicka, J., Graef, G., Lorenz, A.: Genotyping
by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15,
740 (2014)
36. Rocher, S., Jean, M., Castonguay, Y., Belzile, F.: Validation of genotyping-by-sequencing
analysis in populations of tetraploid alfalfa by 454 sequencing. PLoS One (2015).
doi:10.1371/journal.pone.0131918
37. Huang, Y.F., Poland, J.A., Wight, C.P., Jackson, E.W., Tinker, N.A.: Using genotyping-by-
sequencing (GBS) for genomic discovery in cultivated oat. PLoS One 9(7), e102448 (2014)
38. Pootakham, W., Jomchai, N., Ruang-areerate, P., Shearman, J.R., Sonthirod, C., Sangsrakru,
D., Tragoonrung, S., Tangphatsornruang, S.: Genome-wide SNP discovery and identification
of QTL associated with agronomic traits in oil palm using genotyping-by-sequencing (GBS).
Genomics 105, 288–295 (2015)
39. Rabbi, Y.I., Kulakow, P.A., Manu-Aduening, J.A., Dankyi, A.A., Asibuo, J.Y., Parkes, E.Y.,
Abdoulaye, T., Girma, G., Gedil, M.A., Ramu, P., Reyes, B., Maredia, M.K.: Tracking crop
varieties using genotyping-by-sequencing markers: a case study using cassava (Manihot
esculenta Crantz). BMC Genet. 16, 115 (2015)
40. Girma, G., Hyma, K.E., Asiedu, R., Mitchell, S.E., Gedil, M., Spillane, C.: Next generation
sequencing based genotyping, cytometry and phenotyping for understanding diversity and
evolution of guinea yams. Theor. Appl. Genet. 127, 1783–1794 (2014)
41. Bielenberg, D.G., Rauh, B., Fan, S., Gasic, K., Abbott, A.G., Reighard, G.L., Okie, W.R.,
Wells, C.E.: Genotyping by sequencing for SNP-based linkage map construction and QTL
analysis of chilling requirement and bloom date in peach [Prunus persica (L.) Batsch]. PLoS
One (2015). doi:10.1371/journal.pone.0139406
42. Ma, X.F., Jensen, E., Alexandrov, N., Troukhan, M., Zhang, L., Thomas-Jones, S., Farra, K.,
Clifton-Brown, J., Donnison, I., Swaller, T., Flavell, R.: High resolution genetic mapping by
genome sequencing reveals genome duplication and tetraploid genetic structure of the diploid
Miscanthus sinensis. PLoS One 7(3), e33821 (2012)
43. Pan, J., Wang, B., Pei, Z.Y., Zhao, W., Gao, J., Mao, J.F., Wang, X.R.: Optimization of the
genotyping-by-sequencing strategy for population genomic analysis in conifers. Mol. Ecol.
Resour. 15, 711–722 (2015)
44. Jeong, H.S., Jang, S., Han, K., Kwon, J.K., Kang, B.C.: Marker-assisted backcross breeding
for development of pepper varieties (Capsicum annuum) containing capsinoids. Mol. Breed.
35, 226 (2015)
45. Moscone, E.A., Scaldaferro, M.A., Grabiele, M., Cecchini, N.M., Sanchez García, Y., Jarret,
R., Davina, J.R., Ducasse, D.A., Barboza, G.E., Ehrendorfer, F.: The evolution of chili peppers
(Capsicum-Solanaceae): a cytogenetic perspective. Acta Hortic. 745, 137–170 (2007)
46. Hernández-Verdugo, S., Luna-Reyes, R., Oyama, K.: Genetic structure and differentiation of
wild and domesticated populations of Capsicum annuum (Solanaceae) from Mexico. Plant
Syst. Evol. 226(3–4), 129–142 (2001)
47. Cericola, F., Portis, E., Toppino, L., Barchi, L., Acciarri, N., Ciriaci, T., Sala, T., Rotino, G.L.,
Lanteri, S.: The population structure and diversity of eggplant from Asia and the Mediterranean
Basin. PLoS One 8(9), e73702 (2013)
48. Rodriguez, M., Rau, D., Bitocchi, E., Bellucci, E., Biagetti, E., Carboni, A.,
Gepts, P., Nanni, L., Papa, R., Attene, G.: Landscape genetics, adaptive diversity
and population structure in Phaseolus vulgaris. New Phytol. 209(4), 1781–1794 (2015)
49. Kim, S., Park, M., Yeom, S.I., Kim, Y.M., Lee, J.M., Lee, H.A., Seo, E., Choi, J., Cheong, K.,
Kim, K.T., Jung, K., Lee, G.W., Oh, S.K., Bae, C., Kim, S.B., Lee, H.Y., Kim, S.Y., Kim, M.S.,
Kang, B.C., Jo, Y.D., Yang, H.B., Jeong, H.J., Kang, W.H., Kwon, J.K., Shin, C., Lim, J.Y.,
Park, J.H., Huh, J.H., Kim, J.S., Kim, B.D., Cohen, O., Paran, I., Suh, M.C., Lee, S.B., Kim,
Y.K., Shin, Y., Noh, S.J., Park, J., Seo, Y.S., Kwon, S.Y., Kim, H.A., Park, J.M., Kim, H.J.,
Choi, S.B., Bosland, P.W., Reeves, G., Jo, S.H., Lee, B.W., Cho, H.T., Choi, H.S., Lee, M.S.,
Yu, Y., Do Choi, Y., Park, B.S., van Deynze, A., Ashrafi, H., Hill, T., Kim, W.T., Pai, H.S., Ahn,
H.K., Yeam, I., Giovannoni, J.J., Rose, J.K., Sørensen, I., Lee, S.J., Kim, R.W., Choi, I.Y., Choi,
B.S., Lim, J.S., Lee, Y.H., Choi, D.: Genome sequence of the hot pepper provides insights into
the evolution of pungency in Capsicum species. Nat. Genet. 46(3), 270–278 (2014)
50. Qin, C., Yu, C., Shen, Y., Fang, X., Chen, L., Min, J., Cheng, J., Zhao, S., Xu, M., Luo, Y.,
Yang, Y., Wu, Z., Mao, L., Wu, H., Ling-Hu, C., Zhou, H., Lin, H., González-Morales, S.,
Trejo-Saavedra, D.L., Tian, H., Tang, X., Zhao, M., Huang, Z., Zhou, A., Yao, X., Cui, J., Li,
W., Chen, Z., Feng, Y., Niu, Y., Bi, S., Yang, X., Li, W., Cai, H., Luo, X., Montes-Hernández,
S., Leyva-González, M.A., Xiong, Z., He, X., Bai, L., Tan, S., Tang, X., Liu, D., Liu, J., Zhang,
S., Chen, M., Zhang, L., Zhang, L., Zhang, Y., Liao, W., Zhang, Y., Wang, M., Lv, X., Wen, B.,
Liu, H., Luan, H., Zhang, Y., Yang, S., Wang, X., Xu, J., Li, X., Li, S., Wang, J., Palloix, A.,
Bosland, P.W., Li, Y., Krogh, A., Rivera-Bustamante, R.F., Herrera-Estrella, L., Yin, Y., Yu, J.,
Hu, K., Zhang, Z.: Whole-genome sequencing of cultivated and wild peppers provides insights
into Capsicum domestication and specialization. Proc. Natl. Acad. Sci. U. S. A. 111(14), 5135–
5140 (2014)
51. Evanno, G., Regnaut, S., Goudet, J.: Detecting the number of clusters of individuals using the
software STRUCTURE: a simulation study. Mol. Ecol. 14, 2611–2620 (2005)
Hybridization-Based Enrichment and Next
Generation Sequencing to Explore Genetic
Diversity in Plants
1 Introduction
space and cost [3–6]. The hybridization-based method is one of the most efficient
and widely adopted among the available target enrichment techniques [7, 8]. It
has proved powerful independently of the DNA capture protocol and
the sequencing platform used, and it is often replacing PCR as the main target
enrichment method in plant sciences [3, 5].
Plant genomes can be extremely complex and repetitive, and are often polyploid;
as a consequence, some species are not well suited for whole genome sequencing
(WGS) approaches. By contrast, sequence capture and targeted re-sequencing have
the advantage of providing higher read depth for individual loci and support
the accurate identification of nucleotide polymorphisms also in plants with large
genomes and higher ploidy levels [9, 10].
In this manuscript, we provide a brief overview of the available strategies to
reduce genome complexity in plants with a special focus on hybridization-based
enrichment methods currently used for the characterization of natural/induced
genetic variation in plant species. Then, we highlight possible applications of these
technologies to plant research and describe a typical bioinformatic workflow for the
analysis of NGS data and the identification of sequence polymorphisms. Finally, we
discuss our experience in a project aimed at the identification of naturally occurring
sequence variation at candidate genes controlling carotenoid biosynthesis in tomato.
For plants with large or polyploid genomes, for which whole genomes
cannot be readily assembled and the analysis of a large number of individuals
is still very expensive, an alternative strategy to WGS is to generate a
reduced representation of the genome. Genome reduction can be obtained using
target enrichment strategies. Target enrichment consists of the isolation of specific
genomic loci (e.g., genes, molecular markers, larger genomic regions, and organelle
genomes) coupled with NGS. Compared to WGS, the reduction in sequencing
space entails three main advantages: (1) sample multiplexing, which implies an
overall reduction of the sequencing cost per sample; (2) significant reduction in the
complexity of the analysis; and (3) the possibility of identifying the precise region
of interest given the depth of sequencing provided by NGS.
At present, transcriptome-based, restriction enzyme-based, PCR-based, and
hybridization-based methods, all compatible with the most popular NGS platforms,
have been developed to enrich specific targets [3].
Transcriptome-Based Enrichment is one of the most widely used strategies to
reduce genome complexity, since it focuses only on the transcribed portion of the
genome. The key aim of transcriptome sequencing, also known as RNA-seq, is to
determine gene expression profiles of each transcript during development and under
different conditions [11]. SNP discovery and molecular marker development via
RNA-seq are often performed, especially in organisms with large genomes [12].
Notably, since RNA-seq is independent of any a priori knowledge of the
genome sequence of the species under investigation, it allows the analysis of poorly
characterized species.
Restriction Enzyme-Based Enrichment makes use of the discriminatory power
of the restriction endonucleases to produce restriction fragments among individuals
in a population. Three main techniques have been developed so far: RAD-seq
(restriction-site associated DNA sequencing) [13, 14], GR-RSC (genomic reduction
based on restriction site conservation) [15], and GBS (genotyping-by-sequencing)
[16]. All these methods, reviewed by Cronn et al. [3], are flexible and quite
inexpensive and have been used to identify and score, in a group of individuals,
thousands of genetic markers randomly distributed along the genome, enabling SNP
discovery and genotyping as well as quantitative genetic and phylogeographic studies.
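The genome reduction achieved by restriction enzymes can be illustrated with an in silico digestion. The sketch below assumes the enzyme ApeKI (recognition site GCWGC, where W is A or T, cutting after the first G), which is not named in this chapter and is used here only as an example; it treats the sequence as a single strand and ignores overhangs and methylation sensitivity. The function names and the size window are hypothetical.

```python
import re
from typing import List, Pattern

# Assumption: ApeKI-like site G^CWGC; the lookahead lets overlapping sites be found.
APEKI_SITE = re.compile(r"G(?=C[AT]GC)")


def digest(sequence: str, site: Pattern[str] = APEKI_SITE, cut_offset: int = 1) -> List[str]:
    """Cut a sequence at every recognition site and return the fragments in order."""
    seq = sequence.upper()
    cut_positions = [match.start() + cut_offset for match in site.finditer(seq)]
    fragments = []
    start = 0
    for cut in cut_positions:
        fragments.append(seq[start:cut])
        start = cut
    fragments.append(seq[start:])
    return fragments


def size_select(fragments: List[str], min_len: int, max_len: int) -> List[str]:
    """Keep only fragments within a length window, mimicking library size selection."""
    return [frag for frag in fragments if min_len <= len(frag) <= max_len]


# Toy sequence with two recognition sites (GCAGC and GCTGC).
toy_fragments = digest("AAAGCAGCTTTTGCTGCAA")
selected = size_select(toy_fragments, 5, 10)
```

Only the fragments that fall within the sequenced size range are sampled, which is why these methods score a reproducible subset of the genome across individuals.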
PCR-Based Target Enrichment includes the direct sequencing of small and long
PCR products. NGS of PCR fragments has been preferentially applied to chloroplast
genomes in systematic studies [17] and in some cases also to nuclear genomic
regions despite their complexity [18]. The main disadvantages associated with
this method are the high level of failed target amplifications and/or non-specific
amplifications as well as the difficulty in obtaining an accurate pooling of samples
for NGS multiplexing [5]. Nevertheless, PCR-based enrichment remains feasible for
targeting small to medium-sized regions of the genome, but for high-throughput
sequencing of tens of thousands of PCR amplicons its efficiency falls off, given the
initial cost per sample and challenges in sample multiplexing. Microfluidic-based
multiplexing PCR can reduce costs but continues to be more expensive than other
enrichment methods [3].
Hybridization-Based Enrichment or sequence capture methods exploit the high
specificity of DNA or RNA probes (also called baits) which are designed to be
complementary to target genomic regions. RNA baits have significant advantages
over DNA probes because RNA–DNA hybrids have a higher affinity and melting
temperature than DNA–DNA hybrids. Two main technologies have been developed
for hybrid-capture applications: (1) on-array or solid-based hybridization, in which
the sample is hybridized on a solid support (e.g., glass slide, microarray)
[8], and (2) in-solution or liquid-based hybridization, in which pooled baits are
used in reaction tubes [7]. Owing to the moderate costs, high specificity, low
amount of DNA required per sample, and ability to target large numbers of markers
simultaneously, several protocols and commercial kits have been developed.
The most widespread and reliable ones in studies on plant species are provided by
Agilent Technologies (SureSelect), Roche NimbleGen (SeqCap EZ), MYcroarray
(MYbaits), and Ion Torrent (TargetSeq). Distinguishing features of these sequence
capture platforms are reported in Table 1.
Table 1 List of the most important features of the commercially available target enrichment kits
Capture type | Provider | Kit | Bait type | Bait length | Target size
On-array hybridization-based capture | NimbleGen | Sequence Capture Array | DNA | 60 bp | Up to 30 Mb
On-array hybridization-based capture | Agilent | Microarray | DNA | 60 bp | N.D.
In-solution hybridization-based capture | NimbleGen | SeqCap EZ | DNA | 55–105 bp | Up to 200 Mb
In-solution hybridization-based capture | Agilent | SureSelect | RNA | 114–126 bp | From 1 kb to 24 Mb
In-solution hybridization-based capture | MYcroarray | MYbaits | RNA | 80–120 bp | Up to 200,000 baits
In-solution hybridization-based capture | Ion Torrent | TargetSeq | DNA | 50–120 bp | From 100 kb up to 10 Mb
N.D. not determined
All these providers offer the opportunity to design custom kits for the species of
interest and make available services and tools to support probe design. Evidently,
it is necessary to have a reference sequence (complete or draft genome sequence,
transcripts, Expressed Sequence Tag database, etc.) to accomplish this task.
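As a rough illustration of what probe design from a reference sequence involves, the sketch below tiles overlapping baits across a target and returns their reverse complements, so that each bait is complementary to the target strand. It is only a sketch under stated assumptions: the function names, the 120-nt bait length, and the 60-nt step are illustrative choices, not the design rules of any of the commercial providers listed in Table 1, which also filter baits for repeats, GC content, and uniqueness.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tile_baits(target, bait_length=120, step=60):
    """Tile baits across `target` and return (start, bait_sequence) pairs.

    Baits start every `step` bases; a final bait is anchored to the end of the
    target so that its last bases are covered. Targets shorter than
    `bait_length` yield a single, shorter bait.
    """
    target = target.upper()
    if len(target) <= bait_length:
        starts = [0]
    else:
        last = len(target) - bait_length
        starts = list(range(0, last + 1, step))
        if starts[-1] != last:
            starts.append(last)
    return [(s, reverse_complement(target[s:s + bait_length])) for s in starts]


if __name__ == "__main__":
    toy_target = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt target
    for start, bait in tile_baits(toy_target, bait_length=8, step=5):
        print(start, bait)
```

A step equal to half the bait length gives roughly twofold tiling; a smaller step increases redundancy at the price of more baits.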
in promoter regions, 5′ UTR regions, and in the first exon of genes because of high
GC content of these regions [21]. High or low GC content reduces the efficiency of
PCR amplifications [22], bait synthesis, and hybridization. Since this latter aspect
is related to the nucleotide composition of the probes, it can be partly
corrected by probe design. The GC bias effect on sequencing coverage has
been studied by different authors, who plot GC content distribution against the
normalized mean read depth [19, 23]. Enrichment efficiency depends also on the
sequence capture protocol of choice as well as on the sequencing technology used.
The percentage of sequences that map to the selected targets (probe specificity)
can be influenced by the presence of closely related sequences (orthologs/paralogs)
of duplicated regions and/or interspersed repetitive elements in the genome [3].
Minimizing the number of off-target reads is desirable and it can be achieved by
selecting probes with high specificity.
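The relationship between GC content and normalized mean read depth mentioned above can be tabulated with a few lines of code before plotting. The following is a minimal sketch, assuming that each target is available as a (sequence, mean depth) pair; the function names and the toy data are hypothetical and do not reproduce the analyses in [19, 23].

```python
from collections import defaultdict


def gc_bin(seq, n_bins=10):
    """Return the index of the GC-content bin (0 .. n_bins - 1) of a non-empty sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    # Integer arithmetic avoids floating-point surprises at bin boundaries.
    return min(gc * n_bins // len(seq), n_bins - 1)


def normalized_depth_by_gc(targets, n_bins=10):
    """Average the normalized read depth of targets within each GC bin.

    `targets` is a list of (sequence, mean_depth) pairs. Each depth is divided
    by the overall mean depth, so a value of 1.0 means average coverage.
    Keys of the result are the lower bounds of the GC bins, in percent.
    """
    overall = sum(depth for _, depth in targets) / len(targets)
    bins = defaultdict(list)
    for seq, depth in targets:
        bins[gc_bin(seq, n_bins)].append(depth / overall)
    return {idx * 100 // n_bins: sum(v) / len(v) for idx, v in sorted(bins.items())}


if __name__ == "__main__":
    toy_targets = [
        ("ATATATATAT", 30),   # 0% GC, poorly covered
        ("ATGCATGCAT", 120),  # 40% GC, well covered
        ("GCGCGCGCGC", 30),   # 100% GC, poorly covered
    ]
    print(normalized_depth_by_gc(toy_targets))  # {0: 0.5, 40: 2.0, 90: 0.5}
```

Values well below 1.0 in the extreme bins are the signature of the GC bias discussed in the text.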
A crucial parameter of a sequence capture experiment is the sensitivity, which
is the percentage of the target bases that are represented by one or more sequenced
reads. Sequencing depth matters as well: the higher the depth at a position, the
higher the confidence that the base called there is correct and the better the
estimate of the frequency of any particular SNP/InDel. The experimental design also
has a great impact on enrichment efficiency: the right balance between the number
of targets to be sequenced and the expected sequencing depth must be found.
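Both quantities discussed in this paragraph are easy to compute once per-base depths are available. The sketch below assumes a plain list of depths, one value per target base; the function names are hypothetical. The expected-depth formula is a back-of-the-envelope simplification that assumes uniform coverage and is not taken from the chapter, but it makes the trade-off between target size and depth explicit.

```python
def sensitivity(per_base_depth, min_depth=1):
    """Fraction of target bases covered by at least `min_depth` reads."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


def expected_mean_depth(n_reads, read_length, on_target_fraction, target_size):
    """Rough expected mean depth: sequenced on-target bases divided by target size."""
    return n_reads * read_length * on_target_fraction / target_size


if __name__ == "__main__":
    depths = [0, 3, 12, 25, 0, 8, 40, 15, 1, 30]  # toy per-base depths
    print(sensitivity(depths))                # 0.8
    print(sensitivity(depths, min_depth=10))  # 0.5
    # 2 million 100-nt reads, half of them on target, 1-Mb target
    print(expected_mean_depth(2_000_000, 100, 0.5, 1_000_000))  # 100.0
```

Doubling the target size at a fixed number of reads halves the expected depth, which is the balance the experimental design has to strike.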
In the last few years, sequence capture and target enrichment followed by NGS
have been used to identify a high number of mutations in whole exomes, selected
gene families, and target genes or genomic regions of many plant species, allowing
(1) generation of useful polymorphism resources in a quick and rather inexpensive
way; (2) biodiversity exploration and mining; (3) SNP marker development and
generation of genetic maps; (4) definition of population structure and tracking of
evolutionary history in phylogenetic and phylogeographic studies; (5) QTL
mapping and candidate gene identification; and (6) genomic selection.
All these applications are intended to accelerate plant breeding activities for crop
improvement.
In this section we review recent literature on targeted re-sequencing of enriched
genomic DNA regions in crops and other economically important plant species and
briefly describe objectives and applications of each study (Table 2).
The first application of hybridization-based sequence capture in plants was
published by Fu et al. [9], who demonstrated the effectiveness of the enrichment
protocol in the identification of polymorphisms in divergent maize (Zea mays)
lines.
Table 2 List of several examples from the recent literature on the applications of hybridization-based enrichment protocols to plant genetics
Species | N° | Plant material | Target | Target size | Assay type | Hybridization-based technology | NGS platform | References
Maize | 1 | 2 inbred lines | Non-repetitive portion of 2.2 Mb and 43 genes | 4.3 Mb | Roche NimbleGen array | SB | Roche 454 | Fu et al. [9]
Maize | 2 | 21 inbred lines | Genomic regions including genes for biomass production | 29 Mb | Roche NimbleGen array | SB | Roche 454 | Muraya et al. [24]
Rapeseed canola | 3 | 10 genotypes | 47 QTL-associated genomic regions | 51.2 Mb | Roche NimbleGen array and undefined liquid platform | SB and LB | Roche 454 and Illumina | Clarke et al. [25]
Rapeseed canola | 4 | 4 accessions | 29 regulatory flowering-time genes | 614 kb | Agilent SureSelect | LB | Illumina | Schiessl et al. [26]
Cotton | 5 | 2 accessions | 1000 genes (500 pairs of homeologs) | 550 kb | Roche NimbleGen array | SB | Roche 454 | Salmon et al. [27]
Sugarcane | 6 | 2 genotypes (one Saccharum officinarum and one Saccharum hybrid) | Gene-rich regions from the close relative Sorghum bicolor | 5.8 Mb | Agilent SureSelect | LB | Illumina | Bundock et al. [28]
Switchgrass | 7 | 4 tetraploid lowland and 4 octoploid upland cultivars | Whole exome | 50 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Evans et al. [29]
Cassava | 8 | 100 F1 progeny and 2 parental strains | 27,469 biallelic SNPs from 10,105 regions | 2.49 Mb | Ion TargetSeq Life Technologies | LB | Ion Torrent Proton System | Pootakham et al. [30]
Loblolly pine | 9 | 24 haploid samples | Whole exome | 6.57 Mb | Agilent SureSelect | LB | Illumina | Neves et al. [31]
Loblolly pine | 10 | 72 haploid samples from a mapping population | Whole exome | 6.57 Mb | Agilent SureSelect | LB | Illumina | Neves et al. [32]
Black cottonwood | 11 | 48 genotypes | Predicted exons and upstream regulatory regions and random genomic intervals | 20.76 Mb | Agilent SureSelect | LB | Illumina | Zhou and Holliday [23]
Eucalyptus | 12 | 3 genotypes | 94 genes involved in xylogenesis | N.D. | Agilent custom array | SB | Illumina | Dasgupta et al. [33]
Wheat | 13 | The wild emmer accession (Triticum dicoccoides) and durum wheat cultivar Langdon (Triticum turgidum var. durum) | Exon regions of 3497 genes | 3.5 Mb | Agilent SureSelect | LB | Illumina | Saintenac et al. [10]
Wheat | 14 | 8 UK allohexaploid varieties | Significant proportion of the wheat exome | 56.5 Mb | Roche NimbleGen array | SB | Illumina | Allen et al. [34]
Wheat | 15 | 8 UK allohexaploid varieties | Significant proportion of the wheat exome | 56.5 Mb | Roche NimbleGen array | SB | Illumina | Winfield et al. [35]
Wheat | 16 | 2 RILs chosen as parents for an F2 mapping population | Gene-rich regions of the genome | 110 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Gardiner et al. [36]
Rice and wheat | 17 | EMS-mutagenized rice (72) and wheat (6) individuals | Whole exome | 42 Mb (rice), 107 Mb (wheat) | Roche NimbleGen SeqCap EZ | LB | Illumina | Henry et al. [37]
Soybean | 18 | 4 fast-neutron mutants | Whole exome | 52.3 Mb | Roche NimbleGen array | SB | Illumina | Bolon et al. [38]
Soybean | 19 | 2 individuals of the cultivar Williams 82 | Whole exome | 52.3 Mb | Roche NimbleGen array | SB | Illumina | Haun et al. [39]
Barley | 20 | 36 samples from 13 barley cultivars and 7 samples from 3 wild relatives | Whole exome | 61.6 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Mascher et al. [40]
Barley | 21 | Parental lines and BC1 F2 lines enriched for early flowering genotypes | Whole exome | 61.6 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Pankin et al. [41]
Barley | 22 | 18 mutant plants and 30 randomly selected wild type plants | Whole exome | 61.6 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Mascher et al. [40]
Barley | 23 | 3 Hv/Hb ILs and the respective donor lines | Whole exome | 61.6 Mb | Roche NimbleGen SeqCap EZ | LB | Illumina | Wendler et al. [42]
Tribe Trifolieae | 24 | 6 individuals representing major lineages within Medicago and Melilotus | 62 low-copy nuclear genes and 257 short loci (exon sequences) distributed across all Medicago chromosomes | 185 kb | MYbaits MYcroarray | LB | Illumina | de Sousa et al. [43]
Compositae | 25 | 15 species | 763 conserved ortholog set (COS) loci | N.D. | MYbaits MYcroarray | LB | Illumina | Mandel et al. [44]
Rosidae, Asteridae, Caryophyllales, Asparagaceae, and Poaceae | 26 | 24 species (22 eudicots and 2 monocots) | 300 plastidial genomes | N.D. | Agilent SureSelect | LB | Illumina | Stull et al. [45]
Strawberry | 27 | 48 F1 individuals from MRD30xMDR60 and their parental lines | 200 bp surrounding each of 6575 previously identified polymorphisms | 149 Mb | MYbaits MYcroarray | LB | Illumina | Tennessen et al. [46]
Potato and tomato | 28 | Rpi-ber2 and Rpi-rzc1 F1 populations and Solanum tuberosum group Phureja clone DM1-3 516 R44 | 580 NB-LRR coding sequences from Solanaceae | N.D. | Agilent SureSelect | LB | Illumina | Jupe et al. [47]
Tomato | 29 | Solanum pimpinellifolium LA1589 and Solanum lycopersicum Heinz 1706 | 743 NB-LRR-like sequences | N.D. | Agilent SureSelect | LB | Illumina | Andolfo et al. [48]
Potato | 30 | 83 tetraploid cultivars and 1 monoploid clone (DM 1-3 511) | 807 genes | 1.44 Mb | Agilent SureSelect | LB | Illumina | Uitdewilligen et al. [49]
LB liquid-based, SB solid-based, N.D. not determined
In recent years, different authors have proved the efficacy of exome
capture for the investigation of nucleotide diversity in polyploid species with large,
repetitive, and heterozygous genomes [10, 29, 34, 35] and of intra-cultivar genomic
heterogeneity in diploid species [39, 40].
Sequence capture assays have also been designed to target genomic regions
associated with agronomically important traits and capture DNA sequence diversity
in maize [24], rapeseed (Brassica napus) [25, 26], cotton (Gossypium hirsutum)
[27], and cassava (Manihot esculenta) [30] to generate novel data for both research
and breeding activities.
In addition, sequence capture and re-sequencing have been applied to several
tree species, namely loblolly pine (Pinus taeda), black cottonwood (Populus
trichocarpa), and eucalyptus (Eucalyptus globulus), in order to identify sequence
polymorphisms to be used for the generation of a dense reference gene-based
genetic map [31, 32], perform genotyping [23], and develop markers for
xylogenesis-associated traits [33].
In order to reconstruct phylogenetic relationships across the Trifolieae tribe [43]
and the Compositae family [44], hybridization-based enrichment has been used to
capture sequence variability within low-copy nuclear (LCN) and conserved ortholog
set (COS) markers, respectively.
Mapping-by-sequencing, which combines genetic mapping with targeted
re-sequencing, has been exploited (1) to identify useful polymorphisms to map
candidate genes in barley (Hordeum vulgare) [41], wild strawberry (Fragaria vesca ssp.
bracteata) [46], and einkorn wheat (Triticum monococcum) [36] and (2) to detect the
precise location of Hordeum bulbosum introgression regions in the cultivated H.
vulgare genetic background [42].
WES has been used to re-sequence ethyl methanesulfonate (EMS)- and fast
neutron (FN)-mutagenized plant populations to discover induced mutations in
rice (Oryza sativa), bread wheat (Triticum aestivum) [37], and soybean (Glycine
max) [38].
The use of a closely related reference genome (i.e., Sorghum bicolor) for probe
design has been applied to capture genomic regions of two sugarcane (Saccharum
officinarum) genotypes [28] proving useful for polymorphism discovery in poorly
described species.
Recently, even chloroplast genomes were subjected to target enrichment and
massively parallel sequencing [45]. The authors designed a custom RNA probe set
from the complete sequences of 22 previously sequenced eudicot chloroplast
genomes. Using this probe set, an enrichment experiment was performed on 24
angiosperms (22 eudicots, 2 monocots), which were subsequently sequenced, leading
to the generation of complete plastid genomes with exceptionally high coverage
(717× on average).
At present, very few studies have addressed solanaceous crops. In 2013, Uitdewilligen
et al. [49] described a liquid-phase capture method to identify sequence variants
within and across 84 potato (Solanum tuberosum) cultivars, which they afterwards
used to genotype the same plant material by genotyping-by-sequencing.
Fig. 1 Typical variant calling workflow. Different analysis steps (each object in the figure)
are concatenated to identify reliable sequence polymorphisms and derive a meaningful biological
interpretation of the results. The “RR” step is optional in an NGS variant calling analysis
A BAM file can include reads with the same start and end coordinates. These
might represent PCR duplicates, which should be removed or flagged in the BAM file
since they are not informative and should not be counted as evidence of a putative
variant. Picard MarkDuplicates (http://picard.sourceforge.net) is the preferred
tool for this task, although it only considers the starting position of the read
to identify a putative duplicate. As an alternative, the SAMtools “rmdup”
command can be used [52].
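The idea of coordinate-based duplicate flagging can be sketched on toy data as follows. This is a deliberately simplified illustration and not the algorithm implemented in Picard or SAMtools: reads are represented as plain dictionaries with hypothetical keys, read names are assumed to be unique, and the read kept in each group is simply the one with the highest mapping quality.

```python
def flag_duplicates(reads):
    """Return the names of reads flagged as putative PCR duplicates.

    Reads sharing chromosome, strand, start, and end are grouped. Within each
    group the read with the highest mapping quality is kept (the first one
    wins ties) and all the others are flagged.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["strand"], read["start"], read["end"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    kept = {read["name"] for read in best.values()}
    return [read["name"] for read in reads if read["name"] not in kept]


if __name__ == "__main__":
    toy_reads = [
        {"name": "r1", "chrom": "chr1", "strand": "+", "start": 100, "end": 200, "mapq": 60},
        {"name": "r2", "chrom": "chr1", "strand": "+", "start": 100, "end": 200, "mapq": 30},
        {"name": "r3", "chrom": "chr1", "strand": "+", "start": 150, "end": 250, "mapq": 60},
        {"name": "r4", "chrom": "chr1", "strand": "+", "start": 100, "end": 200, "mapq": 60},
    ]
    print(flag_duplicates(toy_reads))  # ['r2', 'r4']
```

Flagging rather than deleting keeps the information in the file, so that downstream tools can decide whether to ignore the marked reads.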
Reads mapping to the edges of InDels often lead to misalignments and
produce artifactual mismatches. Therefore, the local re-alignment of the reads
around InDels is necessary because it improves the accuracy of downstream
processing steps. The strategy developed to accomplish this task combines
short-read mapping with an assembly-inspired approach to identify a local consensus
sequence. Programs that implement this approach include SRMA [59] and
IndelRealigner from the Genome Analysis Toolkit (GATK) [60].
A further improvement may be achieved by running base quality score
recalibration (BQSR) on re-aligned BAM files. One of the most commonly used programs
for BQSR is BaseRecalibrator from the GATK suite [60].
Variant calling may, at first glance, seem simple, as it involves the
identification of sites where one or more samples display possible genomic variations. All
the available tools allow a minimum coverage and a minimum variant frequency
threshold to be set in order to extract significant variants. Additional
parameters can be configured to make the analysis more stringent. The GATK
HaplotypeCaller or UnifiedGenotyper [61], SAMtools mpileup [52], and Freebayes
(https://github.com/ekg/freebayes) are the most widely used programs for sequence
variant calling. The Variant Call Format (VCF) allows the most prevalent types
of sequence variations to be stored. In order to provide easily accessible methods for
working with VCF files, the VCFtools program package has been developed [62].
A binary representation of the variant call format (BCF), which is more compact
and much faster to process than VCF, has also been implemented [63]. A
limitation of VCFtools is that it does not support filtering of polyploid data, but this
can be accomplished with VCFlib (https://github.com/vcflib/vcflib).
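The two thresholds mentioned above, minimum coverage and minimum variant frequency, can be applied to VCF records with a short script. The sketch below is an illustration only and is not a replacement for VCFtools or VCFlib. It assumes tab-separated VCF data lines with a single ALT allele, the standard DP key in the INFO column for total depth, and an AO key holding the alternate observation count, as written by Freebayes; other callers store allele counts differently. The default thresholds are arbitrary.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=40;AO=18' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def filter_variants(vcf_lines, min_depth=10, min_alt_fraction=0.2):
    """Return (chrom, pos, ref, alt, alt_fraction) for records passing both thresholds."""
    passed = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info_field = fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        info = parse_info(info_field)
        try:
            depth = int(info["DP"])
            alt_count = int(info["AO"])
        except (KeyError, ValueError):
            continue  # counts missing or multi-allelic (e.g. 'AO=3,5'): not handled here
        if depth == 0 or depth < min_depth:
            continue
        alt_fraction = alt_count / depth
        if alt_fraction >= min_alt_fraction:
            passed.append((chrom, pos, ref, alt, alt_fraction))
    return passed


if __name__ == "__main__":
    toy_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t1200\t.\tA\tG\t50\t.\tDP=40;AO=18",  # passes both thresholds
        "chr1\t3400\t.\tC\tT\t12\t.\tDP=6;AO=5",    # coverage too low
        "chr2\t880\t.\tG\tA\t9\t.\tDP=50;AO=3",     # variant frequency too low
    ]
    for record in filter_variants(toy_vcf):
        print(record)
```

For polyploid samples the frequency threshold would have to be lowered in line with the expected allele dosage, which is one reason dedicated tools are preferable for real data.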
Identifying functionally relevant polymorphisms in a vast sea of genetic
variations is the major challenge. Annotation of sequence variants includes the
classification of the effects of single nucleotide polymorphisms and
insertion–deletions (e.g., synonymous or non-synonymous SNPs, start-codon gain/loss,
stop-codon gain/loss, frame-shift, etc.) on annotated genes. Annotations can also be
based on the coordinate system used to describe the genomic position of each
polymorphism (e.g., intronic, 5′ or 3′ untranslated region, upstream, downstream,
intergenic regions, etc.). In this regard, it is crucial to have an accurate, preferably
gold standard, structural annotation of the reference genome. ANNOVAR [64] and
SnpEff [65] are two of the most used tools in the variant annotation process.
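To make the effect classes listed above concrete, the following minimal sketch classifies a single-base substitution within a coding sequence using the standard genetic code. It does not reproduce how ANNOVAR or SnpEff work: the function name is hypothetical, and the code assumes a forward-strand, intron-free coding sequence that starts at the first base of the start codon and has a length that is a multiple of three.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(cds, position, alt_base):
    """Classify a substitution at 0-based `position` of the coding sequence `cds`."""
    cds = cds.upper()
    codon_start = position - position % 3
    offset = position - codon_start
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    if codon_start == 0 and ref_codon == "ATG":
        return "start lost"
    return "missense"


if __name__ == "__main__":
    toy_cds = "ATGGCTTGGTAA"  # Met-Ala-Trp-stop
    print(classify_snp(toy_cds, 5, "C"))  # GCT -> GCC: synonymous
    print(classify_snp(toy_cds, 3, "A"))  # GCT -> ACT: missense
    print(classify_snp(toy_cds, 8, "A"))  # TGG -> TGA: stop gained
    print(classify_snp(toy_cds, 9, "C"))  # TAA -> CAA: stop lost
    print(classify_snp(toy_cds, 0, "C"))  # ATG -> CTG: start lost
```

Production annotators additionally handle strand, exon structure, alternative transcripts, and InDels, which is why an accurate structural annotation of the reference genome is essential.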
Sequence polymorphisms in the coding regions are frequently associated with
aberrant protein modifications. The interpretation of novel missense mutations
(a type of non-synonymous substitution) is challenging. Nevertheless, several
computational tools have been developed to predict the possible impact of an
amino acid substitution on the structure and function of proteins [66, 67]. More
recently, the six best performing tools were combined into a consensus classifier,
called PredictSNP [68], whose predictions on protein-related mutations represent a
robust alternative to the predictions delivered by individual tools. More complicated
is the study of splice-site polymorphisms as well as of sequence variations within
intronic regions. It is known that nucleotide variants very close to splice junctions
might alter the splicing pattern of a gene and/or affect splicing efficiency, and
that introns can harbor functional polymorphisms that influence the expression
of the genes that host them [69]. However, to the best of our knowledge, no tools
are available for the automatic classification of the effects of such polymorphisms
in plants. By contrast, strategies and tools to support investigations on promoters
are well defined. Indeed, regulatory regions are generally scanned to identify
transcription factor binding sites (TFBSs). SNPs and InDels within these regions
might modify the TFBS pattern and alter gene expression. A variety of databases
have been established over time to collect cis-acting regulatory DNA elements
found in plant promoters. Most of them are now integrated into the most recent
PlantPAN resource [70]. Of course, in silico predictions must always be interpreted
with caution, and additional experimental evidence is needed to confirm sequence
variations within the identified alleles.
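How a SNP can modify the TFBS pattern of a promoter can be shown with a simple exact-match scan. The sketch below is only illustrative and is unrelated to the PlantPAN implementation: function names and sequences are hypothetical, the G-box core CACGTG is used as an example motif, only the forward strand is scanned, and the position-wise comparison is valid for substitutions only, since InDels shift the coordinates of downstream sites.

```python
def motif_positions(seq, motif):
    """Return the 0-based start positions of all (possibly overlapping) exact matches."""
    seq, motif = seq.upper(), motif.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def compare_alleles(ref_promoter, alt_promoter, motif):
    """Report motif sites lost or gained in the alternative allele (same-length alleles)."""
    ref_hits = set(motif_positions(ref_promoter, motif))
    alt_hits = set(motif_positions(alt_promoter, motif))
    return {"lost": sorted(ref_hits - alt_hits), "gained": sorted(alt_hits - ref_hits)}


if __name__ == "__main__":
    reference = "TTACACGTGAAT"    # toy promoter fragment carrying one G-box
    alternative = "TTACACTTGAAT"  # G -> T substitution inside the motif
    print(compare_alleles(reference, alternative, "CACGTG"))  # {'lost': [3], 'gained': []}
```

Real binding sites are degenerate and are usually modeled with position weight matrices, so an exact-match scan such as this one should be read as a first approximation.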
The last step of the workflow consists of the visual representation of NGS data.
This can be very useful when interpreting the obtained results. The Integrative
Genomics Viewer (IGV) supports users by displaying, along a reference genome,
aligned reads (BAM files) and predicted genetic variants (VCF files) combined with
annotations from the reference [71]. Aggregating data on a single platform
greatly aids the meaningful interpretation of sequencing data and
is essential to facilitate knowledge discovery.
7 Conclusions
Liquid- or solid-phase sequence capture and target enrichment coupled with NGS
have proven reliable in the identification of sequence polymorphisms in whole
exomes, target genes, or genomic regions of many plant species. The demand for
targeted re-sequencing is constantly growing and requires significant effort in data
analysis and management. Bioinformatic strategies are essential to extract
meaningful results from raw sequence data. Although all the steps of the complex workflow
are well defined, the tools developed to accomplish basic tasks are constantly being
improved and updated. Nevertheless, challenges associated with data analysis can be
taken on with confidence. Indeed, several applications intended to accelerate
plant-breeding activities for crop improvement can benefit from using this technology:
these include genotyping, SNP marker development and biodiversity exploration,
mapping-by-sequencing, etc. An additional application of sequence capture is the
identification and characterization of novel alleles from non-reference genomes. By
describing our research activity on the identification of sequence variation across a
panel of tomato genotypes at candidate genes controlling carotenoid biosynthesis,
we demonstrated that the in-solution hybridization method can be successfully
applied to detect and study the effect of novel alleles in economically important
crops.
Acknowledgements This work was carried out in the frame of the “GenoPom-pro -
Integrating post-genomic platforms to enhance the tomato production chain” project
(PON02_00395_3082360) and is supported by the PON R&C 2007–2013 grant funded by the
Italian Ministry of Education, University and Research in cooperation with the European Funds
for the Regional Development (FESR).
References
1. Hashmi, U., Shafqat, S., Khan, F., Majid, M., Hussain, H., Kazi, A.G., John, R., Ahmad, P.:
Plant exomics: concepts, applications and methodologies in crop improvement. Plant Signal.
Behav. 10(1), e976152 (2015). doi:10.4161/15592324.2014.976152
2. Warr, A., Robert, C., Hume, D., Archibald, A., Deeb, N., Watson, M.: Exome sequencing:
current and future perspectives. G3 5(8), 1543–1550 (2015). doi:10.1534/g3.115.018564
3. Cronn, R., Knaus, B.J., Liston, A., Maughan, P.J., Parks, M., Syring, J.V., Udall, J.: Targeted
enrichment strategies for next-generation plant biology. Am. J. Bot. 99(2), 291–311 (2012).
doi:10.3732/ajb.1100356
4. Hodges, E., Xuan, Z., Balija, V., Kramer, M., Molla, M.N., Smith, S.W., Middle, C.M.,
Rodesch, M.J., Albert, T.J., Hannon, G.J., McCombie, W.R.: Genome-wide in situ exon capture
for selective resequencing. Nat. Genet. 39(12), 1522–1527 (2007). doi:10.1038/ng.2007.42
5. Mamanova, L., Coffey, A.J., Scott, C.E., Kozarewa, I., Turner, E.H., Kumar, A., Howard, E.,
Shendure, J., Turner, D.J.: Target-enrichment strategies for next-generation sequencing. Nat.
Methods 7(2), 111–118 (2010). doi:10.1038/nmeth.1419
6. Mertes, F., Elsharawy, A., Sauer, S., van Helvoort, J.M., van der Zaag, P.J., Franke,
A., Nilsson, M., Lehrach, H., Brookes, A.J.: Targeted enrichment of genomic DNA
regions for next-generation sequencing. Brief. Funct. Genomics 10(6), 374–386 (2011).
doi:10.1093/bfgp/elr033
7. Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E.M., Brockman, W., Fennell,
T., Giannoukos, G., Fisher, S., Russ, C., Gabriel, S., Jaffe, D.B., Lander, E.S., Nusbaum,
C.: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted
sequencing. Nat. Biotechnol. 27(2), 182–189 (2009). doi:10.1038/nbt.1523
8. Okou, D.T., Steinberg, K.M., Middle, C., Cutler, D.J., Albert, T.J., Zwick, M.E.: Microarray-
based genomic selection for high-throughput resequencing. Nat. Methods 4(11), 907–909
(2007). doi:10.1038/nmeth1109
9. Fu, Y., Springer, N.M., Gerhardt, D.J., Ying, K., Yeh, C.T., Wu, W., Swanson-Wagner,
R., D’Ascenzo, M., Millard, T., Freeberg, L., Aoyama, N., Kitzman, J., Burgess, D.,
Richmond, T., Albert, T.J., Barbazuk, W.B., Jeddeloh, J.A., Schnable, P.S.: Repeat subtraction-
mediated sequence capture from a complex genome. Plant J. 62(5), 898–909 (2010).
doi:10.1111/j.1365-313X.2010.04196.x
10. Saintenac, C., Jiang, D., Akhunov, E.D.: Targeted analysis of nucleotide and copy number
variation by exon capture in allotetraploid wheat genome. Genome Biol. 12(9), R88 (2011).
doi:10.1186/gb-2011-12-9-r88
11. Martin, L.B., Fei, Z., Giovannoni, J.J., Rose, J.K.: Catalyzing plant science research with RNA-
seq. Front. Plant Sci. 4, 66 (2013). doi:10.3389/fpls.2013.00066
12. Egan, A.N., Schlueter, J., Spooner, D.M.: Applications of next-generation sequencing in plant
biology. Am. J. Bot. 99(2), 175–185 (2012). doi:10.3732/ajb.1200020
13. Baird, N.A., Etter, P.D., Atwood, T.S., Currey, M.C., Shiver, A.L., Lewis, Z.A., Selker, E.U.,
Cresko, W.A., Johnson, E.A.: Rapid SNP discovery and genetic mapping using sequenced
RAD markers. PLoS One 3(10), e3376 (2008). doi:10.1371/journal.pone.0003376
14. Rowe, H.C., Renaut, S., Guggisberg, A.: RAD in the realm of next-generation sequencing
technologies. Mol. Ecol. 20(17), 3499–3502 (2011)
15. Maughan, P.J., Yourstone, S.M., Jellen, E.N., Udall, J.A.: SNP discovery via genomic
reduction, barcoding, and 454-pyrosequencing in Amaranth. Plant Genome J. 2(3), 260 (2009).
doi:10.3835/plantgenome2009.08.0022
16. He, J., Zhao, X., Laroche, A., Lu, Z.X., Liu, H., Li, Z.: Genotyping-by-sequencing (GBS), an
ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front. Plant Sci. 5,
484 (2014). doi:10.3389/fpls.2014.00484
17. Uribe-Convers, S., Settles, M.L., Tank, D.C.: A phylogenomic approach based on PCR target
enrichment and high throughput sequencing: resolving the diversity within the South American
species of Bartsia L. (Orobanchaceae). PLoS One 11(2), e0148203 (2016).
doi:10.1371/journal.pone.0148203
18. Durstewitz, G., Polley, A., Plieske, J., Luerssen, H., Graner, E.M., Wieseke, R., Ganal, M.W.:
SNP discovery by amplicon sequencing and multiplex SNP genotyping in the allopolyploid
species Brassica napus. Genome 3(11), 948–956 (2010). doi:10.1139/G10-079
19. Chilamakuri, C.S., Lorenz, S., Madoui, M.A., Vodak, D., Sun, J., Hovig, E., Myklebost,
O., Meza-Zepeda, L.A.: Performance comparison of four exome capture systems for deep
sequencing. BMC Genomics 15, 449 (2014). doi:10.1186/1471-2164-15-449
20. Parla, J.S., Iossifov, I., Grabill, I., Spector, M.S., Kramer, M., McCombie, W.R.: A comparative
analysis of exome capture. Genome Biol. 12(9), R97 (2011). doi:10.1186/gb-2011-12-9-r97
21. Dohm, J.C., Lottaz, C., Borodina, T., Himmelbauer, H.: Substantial biases in ultra-short read
data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36(16), e105 (2008).
doi:10.1093/nar/gkn425
22. Aird, D., Ross, M.G., Chen, W.S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D.B., Nusbaum,
C., Gnirke, A.: Analyzing and minimizing PCR amplification bias in Illumina sequencing
libraries. Genome Biol. 12(2), R18 (2011). doi:10.1186/gb-2011-12-2-r18
23. Zhou, L., Holliday, J.A.: Targeted enrichment of the black cottonwood (Populus
trichocarpa) gene space using sequence capture. BMC Genomics 13, 703 (2012).
doi:10.1186/1471-2164-13-703
24. Muraya, M.M., Schmutzer, T., Ulpinnis, C., Scholz, U., Altmann, T.: Targeted sequencing
reveals large-scale sequence polymorphism in Maize candidate genes for biomass production
and composition. PLoS One 10(7), e0132120 (2015). doi:10.1371/journal.pone.0132120
25. Clarke, W.E., Parkin, I.A., Gajardo, H.A., Gerhardt, D.J., Higgins, E., Sidebottom, C.,
Sharpe, A.G., Snowdon, R.J., Federico, M.L., Iniguez-Luy, F.L.: Genomic DNA enrichment
using sequence capture microarrays: a novel approach to discover sequence nucleotide
polymorphisms (SNP) in Brassica napus L. PLoS One 8(12), e81992 (2013).
doi:10.1371/journal.pone.0081992
26. Schiessl, S., Samans, B., Huttel, B., Reinhard, R., Snowdon, R.J.: Capturing sequence variation
among flowering-time regulatory gene homologs in the allopolyploid crop species Brassica
napus. Front. Plant Sci. 5, 404 (2014). doi:10.3389/fpls.2014.00404
27. Salmon, A., Udall, J.A., Jeddeloh, J.A., Wendel, J.: Targeted capture of homoeologous
coding and noncoding sequence in polyploid cotton. G3 2(8), 921–930 (2012).
doi:10.1534/g3.112.003392
28. Bundock, P.C., Casu, R.E., Henry, R.J.: Enrichment of genomic DNA for polymorphism
detection in a non-model highly polyploid crop plant. Plant Biotechnol. J. 10(6), 657–667
(2012). doi:10.1111/j.1467-7652.2012.00707.x
29. Evans, J., Kim, J., Childs, K.L., Vaillancourt, B., Crisovan, E., Nandety, A., Gerhardt,
D.J., Richmond, T.A., Jeddeloh, J.A., Kaeppler, S.M., Casler, M.D., Buell, C.R.: Nucleotide
polymorphism and copy number variant detection using exome capture and next-generation
sequencing in the polyploid grass Panicum virgatum. Plant J. 79(6), 993–1008 (2014).
doi:10.1111/tpj.12601
30. Pootakham, W., Shearman, J.R., Ruang-Areerate, P., Sonthirod, C., Sangsrakru, D., Jomchai,
N., Yoocha, T., Triwitayakorn, K., Tragoonrung, S., Tangphatsornruang, S.: Large-scale SNP
discovery through RNA sequencing and SNP genotyping by targeted enrichment sequencing
in cassava (Manihot esculenta Crantz). PLoS One 9(12), e116028 (2014).
doi:10.1371/journal.pone.0116028
31. Neves, L.G., Davis, J.M., Barbazuk, W.B., Kirst, M.: Whole-exome targeted sequencing of the
uncharacterized pine genome. Plant J. 75(1), 146–156 (2013). doi:10.1111/tpj.12193
32. Neves, L.G., Davis, J.M., Barbazuk, W.B., Kirst, M.: A high-density gene map of loblolly
pine (Pinus taeda L.) based on exome sequence capture genotyping. G3 4(1), 29–37 (2014).
doi:10.1534/g3.113.008714
33. Dasgupta, M.G., Dharanishanthi, V., Agarwal, I., Krutovsky, K.V.: Development of genetic
markers in Eucalyptus species by target enrichment and exome sequencing. PLoS One 10(1),
e0116528 (2015). doi:10.1371/journal.pone.0116528
34. Allen, A.M., Barker, G.L., Wilkinson, P., Burridge, A., Winfield, M., Coghill, J., Uauy,
C., Griffiths, S., Jack, P., Berry, S., Werner, P., Melichar, J.P., McDougall, J., Gwilliam,
R., Robinson, P., Edwards, K.J.: Discovery and development of exome-based, co-dominant
single nucleotide polymorphism markers in hexaploid wheat (Triticum aestivum L.). Plant
Biotechnol. J. 11(3), 279–295 (2013). doi:10.1111/pbi.12009
35. Winfield, M.O., Wilkinson, P.A., Allen, A.M., Barker, G.L., Coghill, J.A., Burridge, A., Hall,
A., Brenchley, R.C., D’Amore, R., Hall, N., Bevan, M.W., Richmond, T., Gerhardt, D.J.,
Jeddeloh, J.A., Edwards, K.J.: Targeted re-sequencing of the allohexaploid wheat exome. Plant
Biotechnol. J. 10(6), 733–742 (2012). doi:10.1111/j.1467-7652.2012.00713.x
36. Gardiner, L.J., Gawronski, P., Olohan, L., Schnurbusch, T., Hall, N., Hall, A.: Using genic
sequence capture in combination with a syntenic pseudo genome to map a deletion mutant in
a wheat species. Plant J. 80(5), 895–904 (2014). doi:10.1111/tpj.12660
37. Henry, I.M., Nagalakshmi, U., Lieberman, M.C., Ngo, K.J., Krasileva, K.V., Vasquez-Gross,
H., Akhunova, A., Akhunov, E., Dubcovsky, J., Tai, T.H., Comai, L.: Efficient genome-wide
detection and cataloging of EMS-induced mutations using exome capture and next-generation
sequencing. Plant Cell 26(4), 1382–1397 (2014). doi:10.1105/tpc.113.121590
38. Bolon, Y.T., Haun, W.J., Xu, W.W., Grant, D., Stacey, M.G., Nelson, R.T., Gerhardt, D.J.,
Jeddeloh, J.A., Stacey, G., Muehlbauer, G.J., Orf, J.H., Naeve, S.L., Stupar, R.M., Vance, C.P.:
Phenotypic and genomic analyses of a fast neutron mutant population resource in soybean.
Plant Physiol. 156(1), 240–253 (2011). doi:10.1104/pp.110.170811
39. Haun, W.J., Hyten, D.L., Xu, W.W., Gerhardt, D.J., Albert, T.J., Richmond, T., Jeddeloh, J.A.,
Jia, G., Springer, N.M., Vance, C.P., Stupar, R.M.: The composition and origins of genomic
variation among individuals of the soybean reference cultivar Williams 82. Plant Physiol.
155(2), 645–655 (2011). doi:10.1104/pp.110.166736
40. Mascher, M., Richmond, T.A., Gerhardt, D.J., Himmelbach, A., Clissold, L., Sampath,
D., Ayling, S., Steuernagel, B., Pfeifer, M., D’Ascenzo, M., Akhunov, E.D., Hedley, P.E.,
Gonzales, A.M., Morrell, P.L., Kilian, B., Blattner, F.R., Scholz, U., Mayer, K.F., Flavell,
A.J., Muehlbauer, G.J., Waugh, R., Jeddeloh, J.A., Stein, N.: Barley whole exome capture:
a tool for genomic research in the genus Hordeum and beyond. Plant J. 76(3), 494–505 (2013).
doi:10.1111/tpj.12294
41. Pankin, A., Campoli, C., Dong, X., Kilian, B., Sharma, R., Himmelbach, A., Saini, R.,
Davis, S.J., Stein, N., Schneeberger, K., von Korff, M.: Mapping-by-sequencing identifies
HvPHYTOCHROME C as a candidate gene for the early maturity 5 locus modulating the
circadian clock and photoperiodic flowering in barley. Genetics 198(1), 383–396 (2014).
doi:10.1534/genetics.114.165613
42. Wendler, N., Mascher, M., Noh, C., Himmelbach, A., Scholz, U., Ruge-Wehling, B., Stein,
N.: Unlocking the secondary gene-pool of barley with next-generation sequencing. Plant
Biotechnol. J. 12(8), 1122–1131 (2014). doi:10.1111/pbi.12219
43. de Sousa, F., Bertrand, Y.J., Nylinder, S., Oxelman, B., Eriksson, J.S., Pfeil, B.E.: Phylo-
genetic properties of 50 nuclear loci in Medicago (Leguminosae) generated using multi-
plexed sequence capture and next-generation sequencing. PLoS One 9(10), e109704 (2014).
doi:10.1371/journal.pone.0109704
44. Mandel, J.R., Dikow, R.B., Funk, V.A., Masalia, R.R., Staton, S.E., Kozik, A., Michelmore,
R.W., Rieseberg, L.H., Burke, J.M.: A target enrichment method for gathering phylogenetic
information from hundreds of loci: an example from the Compositae. Appl. Plant Sci. 2(2)
(2014). doi:10.3732/apps.1300085
45. Stull, G.W., Moore, M.J., Mandala, V.S., Douglas, N.A., Kates, H.R., Qi, X., Brockington,
S.F., Soltis, P.S., Soltis, D.E., Gitzendanner, M.A.: A targeted enrichment strategy for
massively parallel sequencing of angiosperm plastid genomes. Appl. Plant Sci. 1(2) (2013).
doi:10.3732/apps.1200497
Hybridization-Based Enrichment and Next Generation Sequencing to Explore. . . 135
46. Tennessen, J.A., Govindarajulu, R., Liston, A., Ashman, T.L.: Targeted sequence capture
provides insight into genome structure and genetics of male sterility in a gynodioecious
diploid strawberry, Fragaria vesca ssp. bracteata (Rosaceae). G3 3(8), 1341–1351 (2013).
doi:10.1534/g3.113.006288
47. Jupe, F., Witek, K., Verweij, W., Sliwka, J., Pritchard, L., Etherington, G.J., Maclean, D., Cock,
P.J., Leggett, R.M., Bryan, G.J., Cardle, L., Hein, I., Jones, J.D.: Resistance gene enrichment
sequencing (RenSeq) enables reannotation of the NB-LRR gene family from sequenced plant
genomes and rapid mapping of resistance loci in segregating populations. Plant J. 76(3), 530–
544 (2013). doi:10.1111/tpj.12307
48. Andolfo, G., Jupe, F., Witek, K., Etherington, G.J., Ercolano, M.R., Jones, J.D.: Defining the
full tomato NB-LRR resistance gene repertoire using genomic and cDNA RenSeq. BMC Plant
Biol. 14, 120 (2014). doi:10.1186/1471-2229-14-120
49. Uitdewilligen, J.G., Wolters, A.M., D’Hoop B.B., Borm, T.J., Visser, R.G., van Eck, H.J.:
A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous
autotetraploid potato. PLoS One 8(5), e62355 (2013). doi:10.1371/journal.pone.0062355
50. Li, J.W., Robison, K., Martin, M., Sjodin, A., Usadel, B., Young, M., Olivares, E.C., Bolser,
D.M.: The SEQanswers wiki: a wiki database of tools for high-throughput sequencing analysis.
Nucleic Acids Res. 40(Database issue), D1313–D1317 (2012). doi:10.1093/nar/gkr1058
51. Bolger, A.M., Lohse, M., Usadel, B.: Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics 30(15), 2114–2120 (2014). doi:10.1093/bioinformatics/btu170
52. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis,
G., Durbin, R.: Genome project data processing, S.: the sequence alignment/map format and
SAMtools. Bioinformatics 25(16), 2078–2079 (2009). doi:10.1093/bioinformatics/btp352
53. Sims, D., Sudbery, I., Ilott, N.E., Heger, A., Ponting, C.P.: Sequencing depth and cov-
erage: key considerations in genomic analyses. Nat. Rev. Genet. 15(2), 121–132 (2014).
doi:10.1038/nrg3642
54. Hatem, A., Bozdag, D., Toland, A.E., Catalyurek, U.V.: Benchmarking short sequence
mapping tools. BMC Bioinf. 14, 184 (2013). doi:10.1186/1471-2105-14-184
55. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 25(14), 1754–1760 (2009). doi:10.1093/bioinformatics/btp324
56. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4),
357–359 (2012). doi:10.1038/nmeth.1923
57. Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an
improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009).
doi:10.1093/bioinformatics/btp336
58. Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics 26(6), 841–842 (2010). doi:10.1093/bioinformatics/btq033
59. Homer, N., Nelson, S.F.: Improved variant discovery through local re-alignment of short-
read next-generation sequencing data using SRMA. Genome Biol. 11(10), R99 (2010).
doi:10.1186/gb-2010-11-10-r99
60. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella,
K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A.: The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.
20(9), 1297–1303 (2010). doi:10.1101/gr.107524.110
61. Van der Auwera, G.A., Carneiro, M.O., Hartl, C., Poplin, R., Del Angel, G., Levy-Moonshine,
A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K.V., Altshuler, D.,
Gabriel, S., DePristo, M.A.: From FastQ data to high confidence variant calls: the Genome
Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 11(1110), 11.10.11–
11.10.33 (2013). doi:10.1002/0471250953.bi1110s43
62. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker,
R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R., Genomes Project
Analysis, G.: The variant call format and VCFtools. Bioinformatics 27(15), 2156–2158 (2011).
doi:10.1093/bioinformatics/btr330
136 I. Terracciano et al.
63. Li, H.: A statistical framework for SNP calling, mutation discovery, association mapping and
population genetical parameter estimation from sequencing data. Bioinformatics 27(21), 2987–
2993 (2011). doi:10.1093/bioinformatics/btr509
64. Yang, H., Wang, K.: Genomic variant annotation and prioritization with ANNOVAR and
wANNOVAR. Nat. Protoc. 10(10), 1556–1566 (2015). doi:10.1038/nprot.2015.105
65. Cingolani, P., Platts, A., le Wang, L., Coon, M., Nguyen, T., Wang, L., Land, S.J., Lu,
X., Ruden, D.M.: A program for annotating and predicting the effects of single nucleotide
polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2;
iso-3. Fly 6(2), 80–92 (2012). doi:10.4161/fly.19695
66. Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P.,
Kondrashov, A.S., Sunyaev, S.R.: A method and server for predicting damaging missense
mutations. Nat. Methods 7(4), 248–249 (2010). doi:10.1038/nmeth0410-248
67. Ng, P.C., Henikoff, S.: Predicting the effects of amino acid substitutions
on protein function. Annu. Rev. Genomics Hum. Genet. 7, 61–80 (2006).
doi:10.1146/annurev.genom.7.080505.115630
68. Bendl, J., Stourac, J., Salanda, O., Pavelka, A., Wieben, E.D., Zendulka, J., Brezovsky,
J., Damborsky, J.: PredictSNP: robust and accurate consensus classifier for prediction of
disease-related mutations. PLoS Comput. Biol. 10(1), e1003440 (2014). doi:10.1371/jour-
nal.pcbi.1003440
69. Cooper, D.N.: Functional intronic polymorphisms: Buried treasure awaiting discovery within
our genes. Hum. Genomics 4(5), 284–288 (2010)
70. Chow, C.N., Zheng, H.Q., Wu, N.Y., Chien, C.H., Huang, H.D., Lee, T.Y., Chiang-Hsieh,
Y.F., Hou, P.F., Yang, T.Y., Chang, W.C.: PlantPAN 2.0: an update of plant promoter analysis
navigator for reconstructing transcriptional regulatory networks in plants. Nucleic Acids Res.
44(D1), D1154–D1160 (2016). doi:10.1093/nar/gkv1035
71. Thorvaldsdottir, H., Robinson, J.T., Mesirov, J.P.: Integrative genomics viewer (IGV): high-
performance genomics data visualization and exploration. Brief. Bioinform. 14(2), 178–192
(2013). doi:10.1093/bib/bbs017
72. Kumar, G.R., Sakthivel, K., Sundaram, R.M., Neeraja, C.N., Balachandran, S.M., Rani, N.S.,
Viraktamath, B.C., Madhav, M.S.: Allele mining in crops: prospects and potentials. Biotechnol.
Adv. 28(4), 451–461 (2010). doi:10.1016/j.biotechadv.2010.02.007
73. Rose, L.E., Langley, C.H., Bernal, A.J., Michelmore, R.W.: Natural variation in the Pto
pathogen resistance gene within species of wild tomato (Lycopersicon)I. Functional analysis
of Pto alleles. Genetics 171(1), 345–357 (2005). doi:10.1534/genetics.104.039339
74. Raiola, A., Rigano, M.M., Calafiore, R., Frusciante, L., Barone, A.: Enhancing the health-
promoting effects of tomato fruit for biofortified food. Mediators Inflamm. 2014, 139873
(2014). doi:10.1155/2014/139873
75. Giuliano, G.: Plant carotenoids: genomics meets multi-gene engineering. Curr. Opin. Plant
Biol. 19, 111–117 (2014). doi:10.1016/j.pbi.2014.05.006
76. Kavitha, P., Shivashankara, K.S., Rao, V.K., Sadashiva, A.T., Ravishankar, K.V., Sathish,
G.J.: Genotypic variability for antioxidant and quality parameters among tomato cultivars,
hybrids, cherry tomatoes and wild species. J. Sci. Food Agric. 94(5), 993–999 (2014).
doi:10.1002/jsfa.6359
77. Ruggieri, V., Francese, G., Sacco, A., D’Alessandro, A., Rigano, M.M., Parisi, M., Milone,
M., Cardi, T., Mennella, G., Barone, A.: An association mapping approach to identify
favourable alleles for tomato fruit quality breeding. BMC Plant Biol. 14, 337 (2014).
doi:10.1186/s12870-014-0337-9
78. Liu, L., Shao, Z., Zhang, M., Wang, Q.: Regulation of carotenoid metabolism in tomato. Mol.
Plant 8(1), 28–39 (2015). doi:10.1016/j.molp.2014.11.006
79. Tomato Genome Consortium: The tomato genome sequence provides insights into fleshy fruit
evolution. Nature 485(7400), 635–641 (2012). doi:10.1038/nature11119
DecontaMiner: A Pipeline for the Detection
and Analysis of Contaminating Sequences
in Human NGS Sequencing Data
1 Introduction
The study of the human genome and its relationship with the environment is a
crucial task in the context of modern biology.
The application of next generation sequencing technologies makes it possible to
characterize the genome-wide map of organisms. Genome investigation has been made
possible by the construction of reference genomes. Sequencing experiments
produce a large number of short sequences that have to be mapped to the reference.
Alignment is probably the most challenging step of next generation sequencing
(NGS) data analysis. It yields several kinds of information, such as read density,
gene lists, and variant lists, that are crucial to defining the biological meaning
underlying the data.
Typically, the fraction of reads that map correctly onto the human reference
genome ranges between 70 and 90 % [1], leaving in some cases a substantial fraction
of unmapped reads. Neglecting this portion may cause a loss of valuable
information. Unmapped reads can be explained by errors in the sequencing
protocol, by repeat elements that are difficult to map, or by novel transcripts
that can be investigated by de novo assembly; lastly, they can derive from
non-human sequences. Indeed, microbial contamination can occur during sample
processing, or the microorganisms can be part of the microbiome of normal or
pathological tissue [2].
Interest in detecting microorganism-derived sequences has grown
together with the spread of high-throughput approaches, which allow the extraction of
information both about the quality of the experimental procedures and about the
link between diseases and infections. The main appeal of these investigations is
the possibility of finding new pathogen-disease associations. The literature
offers much evidence for the importance of detecting contaminating
organisms. Noteworthy examples are the detection of a polyomavirus in human Merkel
cell carcinoma [3] and of a novel Old World arenavirus in a cluster of patients with
fatal transplant-associated disease [4]. The assembly of a novel bacterial draft genome
from the sequencing of tissue specimens of cord colitis patients suggested an
opportunistic pathogenic role for Bradyrhizobium enterica in humans [5].
In addition, environmental contamination is routinely found in NGS datasets.
Downstream contamination or cross-contamination can compromise the reliability
of the whole experimental procedure. Strong et al. detected bacterial sequences,
belonging to different taxa, in cell line data from different sequencing
experiments and suggested that a good portion of these bacterial reads
did not derive from the specimens themselves but from downstream contamination.
This suggestion is supported by the detection of bacterial sequences in
polyA RNA-seq data [6]. Indeed, the polyA selection step should remove upstream
contamination, since bacterial transcripts are poorly polyadenylated. Moreover, to
strengthen the hypothesis of downstream contamination, the authors analyzed
The authors contributed equally.
alignment to host genome, while CaPSID, in order to reduce the required time and
computational effort, works on user-provided BAM files containing the
resulting alignments to the human and to all the pathogen reference sequences.
Here we propose DecontaMiner, a pipeline designed and developed to detect
contaminating sequences in NGS data. Our main purpose is to understand the nature
of those reads that fail to map to the reference genome, as well as to provide
an automatic pipeline that allows the quality filtering and the processing of these
sequences.
From the output it is straightforward to extract information about
possible sample contamination and/or tissue infection. As in the above-mentioned
papers [6–8], the experimental setup and the study of the detected microorganism
species might suggest the possible contamination sources. In general, it is not
possible to discriminate automatically between upstream and downstream contamination.
In conclusion, DecontaMiner lies midway between
the complex, intensive pipelines of PathSeq and SURPI and the post-alignment
approach of CaPSID.
2 DecontaMiner Pipeline
The BLAST outputs, in table format, are then submitted to the third and last
phase, that involves the collection and extraction of information from the local
alignments.
This module, mainly composed of Perl scripts, is executed according to
user-specified parameters that set the filtering and collecting options. In
particular, the filtering is based on a threshold on the total number of reads
successfully mapped and on a minimum number of reads mapped to a single organism.
The collecting options, in turn, let the user organize the results according
either to genus or to species names.
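The filtering and collecting options described above can be illustrated with a
minimal Python sketch. It is not the pipeline's Perl code: the function names, the
read-to-species dictionary, the way the two thresholds are applied, and the rule
of taking the first word of a binomial name as the genus are all assumptions made
for illustration.

```python
from collections import Counter


def collect_counts(read_to_species, level="genus"):
    """Count reads per organism, grouped by genus or by species.

    read_to_species: dict mapping a read id to a binomial species name
    (hypothetical input layout, e.g. {"r1": "Escherichia coli"}).
    """
    counts = Counter()
    for species in read_to_species.values():
        # Assumption: the genus is the first word of the binomial name.
        name = species.split()[0] if level == "genus" else species
        counts[name] += 1
    return dict(counts)


def apply_thresholds(counts, min_total_reads, min_reads_per_organism):
    """Apply the two filters named in the text.

    Assumption: a sample whose total number of mapped reads is below
    min_total_reads yields no organisms at all; otherwise only organisms
    with at least min_reads_per_organism reads are kept.
    """
    if sum(counts.values()) < min_total_reads:
        return {}
    return {org: n for org, n in counts.items() if n >= min_reads_per_organism}
```

Grouping by genus merges the reads of closely related species into one entry,
which makes the per-organism threshold easier to pass than at species level.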
DecontaMiner stores the output reads in three main files: unaligned, ambiguous,
and aligned. The “unaligned” file contains the reads that do not satisfy
the filtering parameters (i.e., alignment length, number of allowed gaps, and
number of allowed mismatches). The ambiguous reads are those that map to different
genera or, in the case of paired-end reads, those whose mates map to different
genera. Ambiguous reads mapping to more than one genus might derive from
orthologous sequences. Since
Reads matching all the filtering criteria are stored into the “aligned” file.
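The three-way split of the reads can be sketched as follows. This is a hedged
illustration, not DecontaMiner's implementation: the hit records, their field
names, and the default thresholds are hypothetical, and paired-end mates are
assumed to share the same read_id so that mates hitting different genera are
counted as ambiguous.

```python
from collections import defaultdict


def classify_reads(hits, min_length=50, max_gaps=0, max_mismatches=0):
    """Label each read as 'unaligned', 'ambiguous', or 'aligned'.

    hits: iterable of dicts with the (hypothetical) keys
    read_id, genus, length, gaps, mismatches; one dict per local alignment.
    """
    genera_per_read = defaultdict(set)
    seen = set()
    for hit in hits:
        seen.add(hit["read_id"])
        passes = (
            hit["length"] >= min_length
            and hit["gaps"] <= max_gaps
            and hit["mismatches"] <= max_mismatches
        )
        if passes:
            genera_per_read[hit["read_id"]].add(hit["genus"])

    labels = {}
    for read_id in seen:
        genera = genera_per_read.get(read_id, set())
        if not genera:
            labels[read_id] = "unaligned"   # no hit satisfies the filters
        elif len(genera) > 1:
            labels[read_id] = "ambiguous"   # hits in more than one genus
        else:
            labels[read_id] = "aligned"     # all passing hits agree on the genus
    return labels
```

Ambiguity is decided only among the hits that pass the filters, so a weak
secondary hit to another genus does not by itself make a read ambiguous.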
The results are provided as one table per sample, containing
the names of the detected organisms and the corresponding read counts. Furthermore,
DecontaMiner generates a matrix that collects the data of all the samples and can
easily be used to create a barplot or other types of diagrams.
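As a rough illustration of how per-sample tables could be merged into such a
matrix, the following Python sketch builds an organism-by-sample table of read
counts. The function name and the nested-dictionary input are assumptions made
for this example and do not reflect the pipeline's actual file formats.

```python
def build_matrix(per_sample_counts):
    """Merge per-sample organism counts into one organism-by-sample matrix.

    per_sample_counts: dict mapping a sample name to a dict of
    {organism: read count} (hypothetical input layout).
    Returns (organisms, samples, rows), where rows[i][j] is the read count
    of organisms[i] in samples[j]; organisms missing from a sample get 0.
    """
    samples = sorted(per_sample_counts)
    organisms = sorted(
        {org for counts in per_sample_counts.values() for org in counts}
    )
    rows = [
        [per_sample_counts[sample].get(org, 0) for sample in samples]
        for org in organisms
    ]
    return organisms, samples, rows
```

Filling absent organisms with zero keeps every row the same length, which is
what most plotting functions expect for a stacked barplot.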
Lastly, summary statistics on the number of matched, filtered, and discarded
reads and organisms are generated and stored in tabular text files.
3 Case Studies
In order to assess the usefulness of the DecontaMiner pipeline and its efficiency
in detecting non-human sequences in NGS data, we used two publicly available
datasets downloaded from the GEO portal (GSE68086 and GSE69240).
The first study, from which the dataset GSE68086 was generated, concerns
the total RNA-sequencing experiments of blood platelet samples from patients
with six different malignant tumors (non-small cell lung cancer, colorectal cancer,
pancreatic cancer, glioblastoma, breast cancer, and hepato-biliary carcinomas) and
from healthy donors [18]. The experiment was performed with single-end 100 bp
reads.
The second one, GSE69240, derives from expression profiling by high-throughput
sequencing of High-Grade Ductal Carcinoma In Situ (DCIS) [19]. The
dataset contains 25 pure HG-DCIS samples and 10 normal breast organoid samples. The
reads are paired-end and 76 nucleotides long. This second dataset was used to test
our pipeline on polyA RNA-seq data.
3.2 Pre-processing
The Sequence Read Archive (SRA) file of each sample was downloaded and
converted to fastq format using the SRA Toolkit [20]. Poor-quality read ends, if
present, were trimmed with Trimmomatic [21]. The quality
of the trimmed reads was assessed with FastQC [22]. The fast splice junction
mapper TopHat [23] was chosen to align the fastq files to the reference genome
(assembly hg19), guided by the UCSC gene annotation. The sequence features of the
mapped data were checked with SAMStat [24]. The unmapped bam files produced
by TopHat were the input to our pipeline.
The parameter settings used to analyze the two datasets are listed in Table 1.
3.3 Results
The analysis of the overall read mapping rate showed high variability among the
samples of the GSE68086 dataset, with 5–40 % of unmapped reads.
In the GSE69240 dataset, instead, we observed a good mapping rate
in all the samples, with a percentage of unmapped reads below 10 %. The mapping
statistics of the two datasets immediately suggested a different probability of
detecting non-human sequences.
To test the reliability of our pipeline, we also analyzed the
samples with a small number of unmapped sequences.
As we expected, we did not find any significant match to contaminating genomes
for the samples of the GSE69240 dataset. We also re-analyzed the data, lowering the
stringency of the parameters in terms of allowed mismatches and gaps (2 for each),
with the same negative outcome.
This result fully agrees with the type of experimental procedure used. As
mentioned before, an efficient polyA RNA-seq process and a set of samples not
contaminated by the environment should guarantee reads free of contamination.
Hence, this result supports the reliability of the pipeline with respect to
false positives.
144 I. Granata et al.
Fig. 2 Healthy controls barplot. For each healthy sample, a bar reports the detected
contaminating organisms (colors) and the percentage of unmapped reads assigned to each of them
The barplot of the tumor samples shows the presence of some bacterial species that
are absent in the control samples, or present with a very low number of reads. Among
them, the bacterium Acinetobacter baumannii is noteworthy. The percentage of
reads aligned to A. baumannii is particularly high in hepato-biliary carcinoma,
although its presence seems to be independent of cancer type.
The genus Acinetobacter, as currently defined, comprises gram-negative, strictly
aerobic, nonfermenting, nonfastidious, nonmotile, catalase-positive, and oxidase-
negative bacteria [28]. A. baumannii normally inhabits human skin, mucous
membranes, and soil [29]. Acinetobacter baumannii, in particular, has become
one of the major causes of nosocomial infections during the past two decades
[28, 30–32] and its correlation with outcomes of cancer patients is a clinical issue
under study [33, 34].
4 Conclusions
The DecontaMiner pipeline was designed and developed to investigate the presence
of contaminating sequences in NGS data. It has a dual utility: as a filtering tool
that removes foreign reads from the raw sequencing file, usually in fastq format, and
as a detection tool that identifies contaminating sequences among the unmapped reads,
provided as a bam file. To test our pipeline we used two different RNA-seq
datasets. The lack of matches to microorganisms in the polyA RNA dataset
(GSE69240) indicates that the risk of false positive results is very
low. The reliability of our pipeline is further supported by the analysis of the
total RNA dataset (GSE68086), in which we found some background contamination in
almost all the samples. The most abundant organisms are P. acnes and E. coli; in
addition, some tumor samples matched significantly to A. baumannii, a
well-known nosocomial pathogen that is possibly associated with the outcomes of
cancer patients. It is important to underline that DecontaMiner can suggest the
presence of contaminating sequences, but these results must be confirmed by
experimental validation. As an added value, the output fasta files and BLAST tables
can easily be uploaded to MEGAN5 [35], a metagenome analyzer that provides
more detailed information about the taxonomic profile of the samples in
several graphical modes. We are currently working to provide DecontaMiner as a
Bash shell command-line tool, usable on a common laptop as well as in a distributed
computing environment. We also plan to combine the pipeline developed here
with the Transcriptator tool [36], developed in our lab, to provide an
integrated environment for the analysis of omics data.
References
1. Conesa, A., et al.: A survey of best practices for RNA-seq data analysis. Genome Biol. 17(1),
1–19 (2016)
2. Laurence, M., Hatzis, C., Brash, D.E.: Common contaminants in next-generation sequencing
that hinder discovery of low-abundance microbes. PLoS One 9(5), e97876 (2014)
3. Feng, H., et al.: Clonal integration of a polyomavirus in human Merkel cell carcinoma. Science
319(5866), 1096–1100 (2008)
4. Palacios, G., et al.: A new arenavirus in a cluster of fatal transplant-associated diseases. N.
Engl. J. Med. 358(10), 991–998 (2008)
5. Bhatt, A.S., et al.: Sequence-based discovery of Bradyrhizobium enterica in cord colitis
syndrome. N. Engl. J. Med. 369(6), 517–528 (2013)
6. Strong, M.J., et al.: Microbial contamination in next generation sequencing: implications for
sequence-based analysis of clinical samples. PLoS Pathog. 10(11), e1004437 (2014)
7. Tae, H., et al.: Large scale comparison of non-human sequences in human sequencing data.
Genomics 104(6), 453–458 (2014)
8. Ouma, W.Z., et al.: Important biological information uncovered in previously unaligned reads
from chromatin immunoprecipitation experiments (ChIP-Seq). Sci. Rep. 5, 8635 (2015)
9. Kostic, A.D., et al.: PathSeq: software to identify or discover microbes by deep sequencing of
human tissue. Nat. Biotechnol. 29(5), 393–396 (2011)
10. Borozan, I., et al.: CaPSID: a bioinformatics platform for computational pathogen sequence
identification in human genomes and transcriptomes. BMC Bioinf. 13(1), 206 (2012)
11. Altschul, S.F., et al.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
12. Naccache, S.N., et al.: A cloud-compatible bioinformatics pipeline for ultrarapid pathogen
identification from next-generation sequencing of clinical samples. Genome Res. 24(7),
1180–1192 (2014)
13. Li, H., et al.: The sequence alignment/map format and SAMtools. Bioinformatics 25(16),
2078–2079 (2009)
14. Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics 26(6), 841–842 (2010)
15. Gordon, A., Hannon, G.J.: FastX Toolkit (2010). http://hannonlab.cshl.edu/fastx_toolkit/index
16. Kopylova, E., Noé, L., Touzet, H.: SortMeRNA: fast and accurate filtering of ribosomal RNAs
in metatranscriptomic data. Bioinformatics 28(24), 3211–3217 (2012)
17. Zhang, Z., Schwartz, S., Wagner, L., Miller, W.: A greedy algorithm for aligning DNA
sequences. J. Comput. Biol. 7, 203–214 (2000)
18. Best, M.G., et al.: RNA-Seq of tumor-educated platelets enables blood-based pan-cancer,
multiclass, and molecular pathway cancer diagnostics. Cancer Cell 28(5), 666–676 (2015)
19. Abba, M.C., et al.: A molecular portrait of high-grade ductal carcinoma in situ. Cancer Res.
75(18), 3980–3990 (2015)
20. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39,
D19–D21 (2010)
21. Bolger, A.M., Lohse, M., Usadel, B.: Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics 30, 2114–2120 (2014)
22. Andrews, S.: FastQC: a quality control tool for high throughput sequence data. Reference
Source (2010)
23. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25(9), 1105–1111 (2009)
24. Lassmann, T., Hayashizaki, Y., Daub, C.O.: SAMStat: monitoring biases in next generation
sequencing data. Bioinformatics 27(1), 130–131 (2011)
25. Perry, A., Lambert, P.: Propionibacterium acnes: infection beyond the skin. Expert Rev. Anti-
Infect. Ther. 9(12), 1149–1156 (2011)
26. Park, H.J., et al.: Clinical significance of Propionibacterium acnes recovered from blood
cultures: analysis of 524 episodes. J. Clin. Microbiol. 49(4), 1598–1601 (2011)
27. Pitout, J.D.D.: Extraintestinal pathogenic Escherichia coli: an update on antimicrobial resis-
tance, laboratory diagnosis and treatment. Expert Rev. Anti-Infect. Ther. 10(10), 1165–1176
(2012)
28. Peleg, A.Y., Seifert, H., Paterson, D.L.: Acinetobacter baumannii: emergence of a successful
pathogen. Clin. Microbiol. Rev. 21(3), 538–582 (2008)
29. Manchanda, V., Sanchaita, S., Singh, N.P.: Multidrug resistant Acinetobacter. J. Global Infect.
Dis. 2(3), 291 (2010)
30. Fukuta, Y., et al.: Risk factors for acquisition of multidrug-resistant Acinetobacter baumannii
among cancer patients. Am. J. Infect. Control 41(12), 1249–1252 (2013)
31. Al-Hassan, L., El Mehallawy, H., Amyes, S.G.B.: Diversity in Acinetobacter baumannii
isolates from paediatric cancer patients in Egypt. Clin. Microbiol. Infect. 19(11), 1082–1088
(2013)
32. Dijkshoorn, L., Nemec, A., Seifert, H.: An increasing threat in hospitals: multidrug-resistant
Acinetobacter baumannii. Nat. Rev. Microbiol. 5(12), 939–951 (2007)
33. Ñamendys-Silva, S.A., et al.: Outcomes of critically ill cancer patients with Acinetobacter
baumannii infection. World J. Crit. Care Med. 4(3), 258 (2015)
34. Nazer, L.H., et al.: Characteristics and outcomes of Acinetobacter baumannii infections in
critically ill patients with cancer: a matched case-control study. Microb. Drug Resist. 21(5),
556–561 (2015)
35. Huson, D.H., Weber, N.: Microbial community analysis using MEGAN. Methods Enzymol.
531, 465–485 (2012)
36. Tripathi, K.P., Evangelista, D., Zuccaro, A., Guarracino, M.R.: Transcriptator: an automated
computational pipeline to annotate assembled reads and identify non coding RNA. PLoS One
10(11), e0140268 (2015)