DNA Microarrays and Gene Expression
From experiments to data analysis and modeling

Pierre Baldi
University of California, Irvine
and
G. Wesley Hatfield
University of California, Irvine
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
isbn-13 978-0-511-06321-3 eBook (NetLibrary)
isbn-10 0-511-06321-0 eBook (NetLibrary)
isbn-13 978-0-521-80022-8 hardback
isbn-10 0-521-80022-6 hardback
Contents
5 Statistical analysis of array data: Inferring changes
Problems and common approaches
Probabilistic modeling of array data
Simulations
Extensions
Index
Preface
tens of thousands of rows associated with gene probes, and as many
columns or experimental conditions as the experimenter is willing to
collect. As a side note, this may change in the future and one could envision
simple diagnostic arrays that can be read directly by a physician.
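As a toy illustration of this gene-by-condition layout, the sketch below tabulates expression values for a few invented genes under two conditions and computes the log2 ratio between conditions, a common first transformation of array data. All gene names and intensity values are made up for illustration.

```python
import math

# Toy expression matrix: rows are gene probes, columns are conditions.
# Real arrays have tens of thousands of rows; three invented genes and
# two conditions suffice to show the layout.
conditions = ("control", "treated")
expression = {
    "geneA": (120.0, 480.0),   # induced ~4-fold
    "geneB": (300.0, 310.0),   # essentially unchanged
    "geneC": (800.0, 200.0),   # repressed ~4-fold
}

# A common first transformation: the log2 ratio of treated vs. control,
# so that induction and repression are symmetric around zero.
log_ratios = {g: math.log2(t / c) for g, (c, t) in expression.items()}
for gene in expression:
    print(f"{gene} ({conditions[1]} vs {conditions[0]}): "
          f"log2 ratio = {log_ratios[gene]:+.2f}")
```

A 4-fold induction and a 4-fold repression come out as +2 and -2 on this scale, which is why log ratios are preferred to raw ratios in most analyses.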
Clearly, the scale and tools of biological research are changing. The
storage, retrieval, interpretation, and integration of large volumes of data
generated by DNA arrays and other high-throughput technologies, such as
genome sequencing and mass spectrometry, demand increasing reliance on
computers and evolving computational methods. In turn, these demands
are effecting fundamental changes in how research is done in the life sci-
ences and the culture of the biological research community. It is becoming
increasingly important for individuals from both life and computational
sciences to work together as integrated research teams and to train future
scientists with interdisciplinary skills. It is inevitable that as we enter farther
into the genomics era, single-investigator research projects typical of
research funding programs in the biological sciences will become less preva-
lent, giving way to more interdisciplinary approaches to complex biological
questions conducted by multiple investigators in complementary fields.
Statistical methods, in particular, are essential for the interpretation of
high-throughput genomic data. Statistics is no longer a poor province
of mathematics. It is rapidly becoming recognized as the central language
of sciences that deal with large amounts of data, and rely on inferences in
an uncertain environment.
As genomic technologies and sequencing projects continue to advance,
more and more emphasis is being placed on data analysis. For example, the
identification of the function of a gene or protein depends on many things
including structure, expression levels, cellular localization, and functional
neighbors in a biochemical pathway that are often co-regulated and/or
found in neighboring regions along the chromosome. Clearly then, estab-
lishing the function of new genes can no longer depend on sequence analy-
sis alone but requires taking into account additional sources of
information including phylogeny, environment, molecular and genomic
structure, and metabolic and regulatory networks. By contributing to the
understanding of these networks, DNA arrays already are playing a
significant role in the annotation of gene function, a fundamental task of
the genomics era. At the same time, array data must be integrated with
sequence data, with structure and function data, with pathway data, with
phenotypic and clinical data, and so forth. New biological discoveries will
depend strongly on our ability to combine and correlate these diverse data
sets along multiple dimensions and scales. Basic research in bioinformatics
must deal with these issues of systems and integrative biology in a situation
where the amount of data is growing exponentially.
As these challenges are met, and as DNA array technologies progress,
incredible new insights will surely follow. One of the most striking results of
the Human Genome Project is that humans probably have only on the
order of twice the number of genes of other metazoan organisms such as
the fly. While these numbers are still being revised, it is clear that biological
complexity does not come from sheer gene number but from other sources.
For instance, the number of gene products and their interactions can be
greatly amplified by mechanisms such as alternative mRNA splicing, RNA
editing, and post-translational protein modifications. On top of this, addi-
tional levels of complexity are generated by genetic and biochemical net-
works responsible for the integration of multiple biological processes as
well as the effects of the environment on living cells. Surely, DNA and
protein array technologies will contribute to the unraveling of these
complex interactions. At a time when human cloning and organ regenera-
tion from stem cells are on the horizon, arrays should help us to further
understand the old but still largely unanswered questions of nature versus
nurture and perhaps strike a new balance between the reductionist determi-
nism of molecular biology and the role of chance, epigenetic regulation,
and environment on living systems. However, while arrays and other high-
throughput technologies will provide the data, new bioinformatics innova-
tions must provide the methods for the elucidation of these complex
interactions.
As we progress into the genomics era, it is anticipated that DNA array
technologies will assume an increasing role in the investigation of evolution.
For example, DNA array studies could shed light on mechanisms of evolu-
tion directly by the study of mRNA levels in organisms that have fast gener-
ation times and indirectly by giving us a better understanding of regulatory
circuits and their structure, especially developmental regulatory circuits.
These studies are particularly important for understanding evolution for
two obvious reasons: first, genetic adaptation is very constrained since most
non-neutral mutations are disadvantageous; and second, simple genetic
changes can serve as “amplifiers” in the sense that they can produce large
developmental changes, for instance doubling the number of wings in a fly.
On the medical side, DNA arrays ought to help us better understand
complex issues concerning human health and disease. Among other
things, they should help us tease out the effects of environment and life
style, including drugs and nutrition, and help usher in the individualized
molecular medicine of the future. For example, daily doses of vitamin C
recommended in the literature vary over three orders of magnitude. In
fact, the optimal dose for all nutritional supplements is unknown.
Information from DNA array studies should help define and quantify the
impact of these supplements on human health. Furthermore, although we
are accustomed to the expression "the human body", individuals show
large variabilities in response due to genetic and environmental
differences. In time, the information obtained from DNA array studies
should help us tailor nutritional intake and therapeutic drug doses to the
makeup of each individual.
Throughout the second half of the twentieth century, molecular biolo-
gists have predominantly concentrated on single-gene/single-protein
studies. Indeed, this obsessive focus on working with only one variable at a
time while suppressing all others in in vitro systems has been a hallmark of
molecular biology and the foundation for much of its success. As we enter
into the genomics era this basic paradigm is shifting from the study of
single-variable systems to the study of complex interactions. During this
same period, cell biologists have been following mRNA and/or protein
levels during development, and much of what we know about development
has been gathered with techniques like in situ hybridization that have
allowed us to define gene regulatory mechanisms and to follow the expres-
sion of individual genes in multiple tissues. DNA arrays give us the addi-
tional ability to follow the expression levels of all of the genes in the cells of
a given tissue at a given time.
As old and new technologies join forces, and as computational scientists
and biologists embrace the high-throughput technologies of the genomics
era, the trend will be increasingly towards a systems biology approach that
simultaneously studies tens of thousands of genes in multiple tissues
under a myriad of experimental conditions. The goal of this systems
biology approach is to understand systems of ever-increasing complexity
ranging from intracellular gene and protein networks, to tissue and organ
systems, to the dynamics of interactions between individuals, populations,
and their environments. This large-scale, high-throughput, interdiscipli-
nary approach enabled by genomic technologies is rapidly becoming a
driving force of biomedical research particularly apparent in the biotech-
nology and pharmaceutical industries. However, while the DNA array will
be an important workhorse for the attainment of these goals, it should be
emphasized that DNA array technology is still at an early stage of devel-
opment. It is cluttered with heterogeneous technologies and data formats
as well as basic issues of noise, fidelity, calibration, and statistical
significance that are still being sorted out. Until these issues are resolved
and standardized, it will not be possible to define the complete genetic reg-
ulatory network of even a well-studied prokaryotic cell. In the meantime,
most progress will continue to come from focused incremental studies that
look at specific networks and specific interacting sets of genes and proteins
in simple model organisms, such as the bacterium Escherichia coli or the
yeast Saccharomyces cerevisiae.
In short, the promise of DNA arrays is to help us untangle the extremely
complex web of relationships among genotypes, phenotypes, development,
environment, and evolution. On the medical side, DNA arrays ought to
help us understand disease, create new diagnostic tools, and help usher in
the individualized molecular medicine of the future. DNA array technol-
ogy is here and progressing at a rapid pace. The bioinformatics methods to
process, analyze, interpret, and integrate the enormous volumes of data to
be generated by this technology are coming.
1 Brown, P. http://cmgm.stanford.edu/pbrown; Schena, M. (ed.) Microarray Biochip
Technology. 2000. Eaton Publishing Co., Natick, MA.
the milestones over the past 50 years or so that have ushered us into the
genomics era. This history emphasizes the technological breakthroughs –
and the authors’ bias towards the importance of the model organism
Escherichia coli in the development of the paradigms of modern molecular
biology – that have led us from the enzyme period to the genomics era.
In Chapter 2 we describe the various DNA array technologies that are
available today. These technologies range from in situ synthesized arrays
such as the Affymetrix GeneChip™, to pre-synthesized nylon membrane
and glass slide arrays, to newer technologies such as electronic and bead-
based arrays.
In Chapter 3 we describe the methods, technology, and instrumentation
required for the acquisition of data from DNA arrays hybridized with
radioactively or fluorescently labeled targets.
In Chapter 4 we consider issues important for the design and execution
of a DNA array experiment with special emphasis on problems and pitfalls
encountered in gene expression profiling experiments. Special considera-
tion is given to experimental strategies to deal with these problems and
methods to reduce experimental and biological sources of variance.
In Chapter 5 we deal with the first level of statistical analysis of DNA
array data for the identification of differentially expressed genes. Due to the
large number of measurements from a single experiment, high levels of
noise, and experimental and biological variabilities, array data is best
modeled and analyzed using a probabilistic framework. Here we review
several approaches and develop a practical Bayesian statistical framework
to effectively address these problems to infer gene changes. This framework
is applied to experimental examples in Chapter 7.
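The flavor of such a Bayesian treatment can be suggested with a small sketch in the spirit of a regularized t-test, in which each gene's noisy variance estimate is shrunk toward a prior. The helper names, the hyperparameters (`prior_var`, `v0`), and the replicate values below are illustrative assumptions, not the book's actual framework or defaults.

```python
import math

def regularized_variance(values, prior_var, v0):
    """Shrink the empirical variance toward a prior variance by
    treating the prior as v0 pseudo-observations (hypothetical helper
    in the spirit of a Bayesian regularized t-test)."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    return (v0 * prior_var + (n - 1) * s2) / (v0 + n - 2)

def regularized_t(ctrl, treat, prior_var=0.05, v0=10):
    """Two-sample t statistic computed with regularized variances.
    prior_var and v0 are illustrative; in practice the prior variance
    would be estimated, e.g. from genes of similar expression level."""
    m1, m2 = sum(ctrl) / len(ctrl), sum(treat) / len(treat)
    v1 = regularized_variance(ctrl, prior_var, v0)
    v2 = regularized_variance(treat, prior_var, v0)
    return (m2 - m1) / math.sqrt(v1 / len(ctrl) + v2 / len(treat))

# With only three replicates per condition, shrinkage keeps a gene whose
# replicates agree by luck from producing an enormous t statistic.
t = regularized_t([7.1, 7.2, 7.0], [8.0, 8.2, 8.1])
print(f"regularized t = {t:.2f}")
```

The shrinkage matters precisely because array experiments typically have very few replicates per gene, so raw per-gene variance estimates are unreliable.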
In Chapter 6 we move to the next level of statistical analysis involving the
application of visualization, dimensionality reduction, and clustering
methods to DNA array data. The most popular dimensionality reduction
and clustering methods and their advantages and disadvantages are surveyed. We also
examine methods to leverage array data to identify DNA genomic
sequences important for gene regulation and function. Mathematical
details for Chapters 5 and 6 are presented in Appendix B.
In Chapter 7 we present a brief survey of current DNA array applica-
tions and lead the reader through a gene expression profiling experiment
taken from our own work using pre-synthesized (nylon membrane) and in
situ synthesized (Affymetrix GeneChip™) DNA arrays. Here we describe
the use of software tools that apply the statistical methods described in
Chapters 5 and 6 to analyze and interpret DNA array data. Special empha-
sis is given to methods to determine the magnitude and sources of
experimental errors and how to use this information to determine global false
positive rates and confidence levels.
Chapter 8 covers several aspects of what is coming to be known as
systems biology. It provides an overview of regulatory, metabolic, and sig-
naling networks, and the mathematical and software tools that can be used
for their investigation with an emphasis on the inference and modeling of
gene regulatory networks.
The appendices include explicit technical information regarding: (A)
protocols, such as RNA preparation and target labeling methods, for DNA
array experiments; (B) additional mathematical details about, for instance,
support vector machines; (C) a section with a brief overview of current
database resources and other information that are publicly available over
the Internet, together with a list of useful web sites; and (D) an introduction
to CyberT, an online program for the statistical analysis of DNA array
data.
Finally, a word on terminology. Throughout the book we have used for
the most part the word “array” instead of “microarray” for two basic
reasons: first, in our minds DNA arrays encompass DNA microarrays;
second, at what feature density or physical size an array becomes a microar-
ray is not clear. Also, the terms “probe” and “target” have appeared inter-
changeably in the literature. Here we keep to the nomenclature for probes
and targets of northern blots familiar to molecular biologists; we refer to
the nucleic acid attached to the array substrate as the “probe” and the free
nucleic acid as the “target”.
Acknowledgements
Many colleagues have provided us with input, help, and support. At the risk
of omitting many of them, we would like to thank in particular Suzanne B.
Sandmeyer who has been instrumental in developing a comprehensive
program in genomics and the DNA Array Core Facility at UCI, and who
has contributed in many ways to this work. We would like to acknowledge
our many colleagues from the UCI Functional Genomics Group for the
tutelage they have given us at their Tuesday morning meetings, in particu-
lar: Stuart Arfin, Lee Bardwell, J. David Fruman, Steven Hampson, Denis
Heck, Dennis Kibler, Richard Lathrop, Anthony Long, Harry Mangalam,
Calvin McLaughlin, Ming Tan, Leslie Thompson, Mark Vawter, and Sara
Winokur. Outside of UCI, we would like to acknowledge our collaborators
Craig J. Benham, Rob Gunsalus, and David Low. We would like also to
thank Wolfgang Banzhaf, Hamid Bolouri, Hidde de Jong, Eric Mjolsness,
James Nowick, and Padhraic Smyth for helpful discussions. Hamid and
Hidde also provided graphical materials, James and Padhraic provided
feedback on an early version of Chapter 8, and Wolfgang helped proofread
the final version. Additional material was kindly provided by John
Weinstein and colleagues, and by Affymetrix including the image used for
the cover of the book. We would like to thank our excellent graduate stu-
dents Lorenzo Tolleri, Pierre-François Baisnée, and Gianluca Pollastri, and
especially She-pin Hung, for their help and important contributions. We
gratefully acknowledge support from Sun Microsystems, the Howard
Hughes Medical Institute, the UCI Chao Comprehensive Cancer Center,
the UCI Institute for Genomics and Bioinformatics (IGB), the National
Science Foundation, the National Institutes of Health, a Laurel Wilkening
Faculty Innovation Award, and the UCI campus administration, as well as
a GAANN (Graduate Assistantships in Areas of National Need Program)
and a UCI BREP (Biotechnology Research and Education Program) train-
ing grant in Functional and Computational Genomics. Ann Marie Walker
at the IGB helped us with the last stages of this project. We would like also
to thank our editors, Katrina Halliday and David Tranah at Cambridge
University Press, especially for their patience and encouragement, and all
the staff at CUP who have provided outstanding editorial help. And last but
not least, we wish to acknowledge the support of our friends and families.
1
A brief history of genomics
From time to time new scientific breakthroughs and technologies arise that
forever change scientific practice. During the last 50 years, several advances
stand out in our minds that – coupled with advances in the computational
and computer sciences – have made genomic studies possible. In the brief
history of genomics presented here we review the circumstances and conse-
quences of these relatively recent technological revolutions.
Our brief history begins during the years immediately following World
War II. It can be argued that the enzyme period that preceded the modern
era of molecular biology was ushered in at this time by a small group of
physicists and chemists, R. B. Roberts, P. H. Abelson, D. B. Cowie, E. T.
Bolton, and J. R. Britten in the Department of Terrestrial Magnetism of
the Carnegie Institution of Washington. These scientists pioneered the use
of radioisotopes for the elucidation of metabolic pathways. This work
resulted in a monograph titled Studies of Biosynthesis in Escherichia coli
that guided research in biochemistry for the next 20 years and, together
with early genetic and physiological studies, helped establish the bacterium
E. coli as a model organism for biological research [1]. During this time,
most of the metabolic pathways required for the biosynthesis of intermedi-
ary metabolites were deciphered and biochemical and genetic methods
were developed to identify and characterize the enzymes involved in these
pathways.
Much in the way that genomic DNA sequences are paving the way for the
elucidation of global mechanisms for genetic regulation today, the bio-
chemical studies initiated in the 1950s that were based on our technical abil-
ities to create isotopes and radiolabel biological molecules paved the way
for the discovery of the basic mechanisms involved in the regulation of
metabolic pathways. Indeed, these studies defined the biosynthetic path-
ways for the building blocks of macromolecules such as proteins and
nucleic acids and led to the discovery of mechanisms important for meta-
bolic regulation such as end product inhibition, allostery, and modulation
of enzyme activity by protein modifications. However, major advances con-
cerning the biosynthesis of macromolecules awaited another break-
through, the description of the structure of the DNA helix by James D.
Watson and Francis H. C. Crick in 1953 [2]. With this information, the
basic mechanisms of DNA replication, protein synthesis, gene expression,
and the exchange and recombination of genetic material were rapidly
unraveled.
During the enzyme period, geneticists around the world were using the
information provided by biochemists to develop model systems such as
bacteria, fruit flies, yeast, and mice for genetic studies. In addition to estab-
lishment of the basic mechanisms for protein-mediated regulation of gene
expression by F. Jacob and J. Monod in 1961 [3], these genetic studies led to
fundamental discoveries that were to spawn yet another major change in
the history of molecular biology. This advance was based on studies
designed to determine why E. coli cells once infected by a bacteriophage
were immune to subsequent infection. These seemingly esoteric investiga-
tions led by Daniel Nathans and Hamilton Smith [4] resulted in the discov-
ery of new types of enzymes, restriction endonucleases and DNA ligases,
capable of cutting and rejoining DNA at sequence-specific sites. It was
quickly recognized that these enzymes could be used to construct recombi-
nant DNA molecules composed of DNA sequences from different organ-
isms. As early as 1972 Paul Berg and his colleagues at Stanford University
developed a vector based on the animal virus SV40, containing bacteriophage lambda
genes for the insertion of foreign DNA into E. coli cells [5]. Methods of
cloning and expressing foreign genes in E. coli have continued to progress
until today they are fundamental techniques upon which genomic studies
and the entire biotechnology industry are based.
The recent history of genomics also has been driven by technological
advances. Foremost among these advances were the methodologies of the
polymerase chain reaction (PCR) and automated DNA sequencing. PCR
methods allowed the amplification of usable amounts of DNA from very
small amounts of starting material. Automated DNA sequencing methods
have progressed to the point that today the entire DNA sequence of micro-
bial genomes containing several million base pairs can be obtained in less
than one week. These accomplishments set the stage for the human genome
project.
As early as 1984 the small genomes of several microbes and bacterio-
phages had been mapped and partially sequenced; however, the modern era
of genomics was not formally initiated until 1986 at an international con-
ference in Santa Fe, New Mexico sponsored by the Office of Health and
Environmental Research1 of the US Department of Energy. At this
meeting, the desirability and feasibility of implementing a human genome
program was unanimously endorsed by leading scientists from around the
world. This meeting led to a 1988 study by the National Research Council
titled Mapping and Sequencing the Human Genome that recommended the
United States support a human genome program and presented an outline
for a multiphase plan. In that same year, three genome research centers
were established at the Lawrence Berkeley, Lawrence Livermore, and Los
Alamos national laboratories. At the same time, under the leadership of
Director James Wyngaarden, the National Institutes of Health established
the Office of Genome Research which in 1989 became the National Center
for Human Genome Research, directed by James D. Watson. The next ten
years witnessed rapid progress and technology developments in automated
sequencing methods. These technologies led to the establishment of large-
scale DNA sequencing projects at many public research institutions around
the world such as the Whitehead Institute in Boston, MA and the Sanger
Centre in Cambridge, UK. These activities were accompanied by the rapid
development of computational and informational methods to meet chal-
lenges created by an increasing flow of data from large-scale genome
sequencing projects.
In 1991 Craig Venter at the National Institutes of Health developed a
way of finding human genes that did not require sequencing of the entire
human genome. He relied on the estimate that only about 3 percent of the
genome is composed of genes that express messenger RNA. Venter sug-
gested that the most efficient way to find genes would be to use the process-
ing machinery of the cell. At any given time, only part of a cell’s DNA is
transcriptionally active. These “expressed” segments of DNA are con-
verted and edited by enzymes into mRNA molecules. Using an enzyme,
reverse transcriptase, cellular mRNA fragments can be transcribed into
complementary DNA (cDNA). These stable cDNA fragments are called
expressed sequence tags, or ESTs. Computer programs that match overlap-
ping ends of ESTs were used to assemble these cDNA sequences into longer
sequences representing large parts, or all, of many human genes. In 1992,
Venter left NIH to establish The Institute for Genomic Research, TIGR. By
1995 researchers in public and private institutions had isolated over 170000
1 Changed in 1998 to the Office of Biological and Environmental Research of the
Department of Energy.
ESTs, which were used to identify more than half of the then estimated
60000 to 80000 genes in the human genome.2 In 1998, Venter joined with
Perkin-Elmer Instruments (Boston, MA) to form Celera Genomics
(Rockville, MD).
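The overlap-matching idea behind EST assembly can be illustrated with a toy greedy merger: repeatedly join the two fragments sharing the longest suffix/prefix overlap. The EST fragments, the `greedy_assemble` helper, and the minimum-overlap threshold are all invented for illustration; real assemblers must also handle sequencing errors and both DNA strands.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that is a prefix of b
    (at least min_len, otherwise 0)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(fragments, min_len=4):
    """Repeatedly merge the pair of fragments with the longest
    suffix/prefix overlap, as a toy model of EST contig assembly."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:                      # no remaining overlaps: stop
            break
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags

# Three invented ESTs sampled from one hypothetical transcript:
ests = ["ATGGCGTACG", "GTACGGATTA", "GATTACCGTA"]
print(greedy_assemble(ests))    # prints ['ATGGCGTACGGATTACCGTA']
```

Here the three short reads collapse into a single contig spanning the whole toy transcript, which is exactly how overlapping ESTs were pieced into sequences representing large parts of many human genes.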
With the end in sight, in 1998 the Human Genome Program announced a
plan to complete the human genome sequence by 2003, the 50th anniver-
sary of Watson and Crick’s description of the structure of DNA. The goals
of this plan were to:
• Achieve coverage of at least 90% of the genome in a working draft based
on mapped clones by the end of 2001.
• Finish one-third of the human DNA sequence by the end of 2001.
• Finish the complete human genome sequence by the end of 2003.
• Make the sequence totally and freely accessible.
On June 26, 2000, President Clinton met with Francis Collins, the
Director of the Human Genome Program, and Craig Venter of Celera
Genomics to announce that they had both completed “working drafts” of
the human genome, nearly two years ahead of schedule. These drafts were
published in special issues of the journals Science and Nature early in 2001
[6, 7] and the sequence is online at the National Center for Biotechnology
Information (NCBI) of the National Library of Medicine at the National
Institutes of Health.
As of this writing, the NCBI databases also contain complete or in
progress genomic sequences for ten Archaea and 151 bacteria as well as
the genomic sequences of eight eukaryotes including: the parasites
Leishmania major and Plasmodium falciparum; the worm Caenorhabditis
elegans; the yeast Saccharomyces cerevisiae; the fruit fly Drosophila
melanogaster; the mouse Mus musculus; and the plant Arabidopsis thali-
ana. Many more genome sequencing projects are under way in private and
public research laboratories that are not yet available on public databases.
It is anticipated that the acquisition of new genome sequence data will
continue to accelerate. This exponential increase in DNA sequence data
has fuelled a drive to develop technologies and computational methods to
use this information to study biological problems at levels of complexity
never before possible.
2 At the present time (September 2001) the estimate of the number of human genes
has decreased nearly twofold.
1. Roberts, R. B., Abelson, P. H., Cowie, D. B., Bolton, E. T., and Britten, J. R.
Studies of Biosynthesis in Escherichia coli. 1955. Carnegie Institution of
Washington, Washington, DC.
2. Watson, J. D., and Crick, F. H. C. A structure for deoxyribose nucleic acid.
1953. Nature 171:737–738.
3. Jacob, F., and Monod, J. Genetic regulatory mechanisms in the synthesis of
proteins. 1961. Journal of Molecular Biology 3:318–356.
4. Nathans, D., and Smith, H. O. A suggested nomenclature for bacterial host
modification and restriction systems and their enzymes. 1973. Journal of
Molecular Biology 81:419–423.
5. Jackson, D. A., Symons, R. H., and Berg, P. Biochemical method for inserting
new genetic information into DNA of simian virus 40: circular SV40 DNA
molecules containing lambda phage genes and the galactose operon of
Escherichia coli. 1972. Proceedings of the National Academy of Sciences of the
USA 69:2904–2909.
6. Science Human Genome Issue. 2001. 16 February, vol. 291.
7. Nature Human Genome Issue. 2001. 15 February, vol. 409.
2
DNA array formats
selected addresses, and photo-chemical coupling occurs at these sites. For
example, the addresses on the glass surface for all probes beginning with
guanosine are photo-activated and chemically coupled to guanine bases.
This step is repeated three more times with masks for all addresses with
probes beginning with adenosine, thymidine, or cytidine. The cycle is
repeated with masks designed for adding the appropriate second nucleotide
of each probe. During the second cycle, modified phosphoramidite moie-
ties on each of the nucleosides attached to the glass surface in the first step
are light-activated through appropriate masks for the addition of the
second base to each growing oligonucleotide probe. This process is contin-
ued until unique probe oligonucleotides of a defined length and sequence
have been synthesized at each of thousands of addresses on the glass
surface (Figure 2.1).
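The mask-per-base cycle just described can be mimicked in a few lines, as a toy sketch rather than the actual manufacturing process: a "mask" here is simply the set of addresses whose probe needs the base currently being coupled, and at most four masks per cycle suffice in this idealization. All addresses and probe sequences are invented.

```python
# Toy simulation of light-directed in situ synthesis: in each cycle, four
# masks (one per base, guanosine first as in the text) expose only the
# addresses whose probe requires that base at the current position.
target_probes = {        # invented probe sequence at each array address
    (0, 0): "GATC",
    (0, 1): "GGTA",
    (1, 0): "ATCC",
    (1, 1): "CTAG",
}

synthesized = {addr: "" for addr in target_probes}
probe_length = 4
coupling_steps = 0

for position in range(probe_length):    # one cycle per base position
    for base in "GATC":                 # one photolithographic mask per base
        # The mask deprotects exactly the addresses needing this base next.
        mask = [a for a, seq in target_probes.items() if seq[position] == base]
        for addr in mask:               # chemical coupling at exposed sites
            synthesized[addr] += base
        coupling_steps += 1

assert synthesized == target_probes
print(f"{probe_length} cycles x 4 masks = {coupling_steps} coupling steps")
```

The key economy of the scheme is visible even in this sketch: the number of synthesis steps grows with probe length times four, not with the number of distinct probes, which is what makes arrays with hundreds of thousands of features practical.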
Several companies such as Protogene (Menlo Park, CA) and Agilent
Technologies (Palo Alto, CA) in collaboration with Rosetta Inpharmatics
(Kirkland, WA) of Merck & Co. Inc. (Whitehouse Station, NJ) have devel-
oped in situ DNA array platforms through proprietary modifications of a
standard piezoelectric (ink-jet) printing process that unlike the manufac-
turing process for Affymetrix GeneChips™, does not require photolithog-
raphy. These in situ synthesized oligonucleotide arrays are fabricated
directly on a glass support on which oligonucleotides up to 60 nucleotides
are synthesized using standard phosphoramidite chemistry. The ink-jet
printing technology is capable of depositing very small volumes – picoliters
per spot – of DNA solutions very rapidly and very accurately. It also deliv-
ers spot shape uniformity that is superior to other deposition methods.
Researchers in the Nano-fabrication Center at the University of
Wisconsin have developed yet another method for the manufacture of in
situ synthesized DNA arrays that also does not require photolithographic
masks [2]. This technology known as MAS for maskless array synthesizer
capitalizes on existing electronic chips used in overhead projection known
as digital light processors (DLPs). A DLP is an array of up to 500000 tiny
aluminum mirrors arranged on a computer chip. By electronic manipula-
tion of the mirrors, light can be directed to specific addresses on the surface
of a DNA array substrate, thus eliminating the need for expensive photo-
lithographic masks. This technology is being implemented by NimbleGen
Systems, LLC (Madison, WI). DNA arrays containing over 307000 dis-
crete features are currently being synthesized and plans are under way to
synthesize a second-generation MAS array containing over 2 million dis-
crete features. The Wisconsin researchers claim that this method will
greatly reduce the time and cost for the manufacture of high-density in situ
Figure 2.1. The Affymetrix method for the manufacture of in situ synthesized DNA microarrays (cour-
tesy of Affymetrix). (1) A photo-protected glass substrate is selectively illuminated by light passing
through a photolithographic mask. (2) Deprotected areas are activated. (3) The surface is flooded with
a nucleoside solution and chemical coupling occurs at photo-activated positions. (4) A new photolitho-
graphic mask pattern is applied. (5) The coupling step is repeated. (6) This process is repeated until the
desired set of probes is obtained.
Affymetrix1,2,3,4,5,6,7,8,18 X www.affymetrix.com
Agilent Technologies18 X www.chem.agilent.com
AlphaGene1,18 X www.alphagene.com
Clontech1,2,3,18 X X X www.clontech.com
Corning6 X www.corning.com/cmt
Eurogentec5,6,9,11,12,14,15,16,18 X X www.eurogentec.be
Genomic Solutions1,2,3 X www.genomicsolutions.com
Genotech1,2 X www.genotech.com
Incyte Pharmaceuticals1,2,3,4,9,10,18 X X www.incyte.com
Invitrogen1,2,3,6 X X www.invitrogen.com
Iris BioTechnologies1 www.irisbiotech.com
Mergen Ltd1,2,3 X www.mergen-ltd.com
Motorola Life Science1,3,18 X www.motorola.com/lifesciences
MWG Biotech3,6,8,18 X www.mwg-biotech.com
Nanogen X www.nanogen.com
NEN Life Science Products1 X X www.nenlifesci.com
Operon Technologies Inc.1,6,18 X www.operon.com
Protogene Laboratories18 www.protogene.com
Radius Biosciences18 X www.ultranet.com/~radius
Research Genetics1,2,3,6 X www.resgen.com
Rosetta Inpharmatics18 X X www.rii.com
Sigma-Genosys1,2,8,11,12,13,18 X www.genosys.com
Super Array Inc.1,2,18 X www.superarray.com
Takara1,2,4, 8,17,18 X www.takara.co.jp/english/bio_e
Notes:
1 Human; 2 Mouse; 3 Rat; 4 Arabidopsis; 5 Drosophila; 6 Saccharomyces cerevisiae;
7 HIV; 8 Escherichia coli; 9 Candida albicans; 10 Staphylococcus aureus;
11 Bacillus subtilis; 12 Helicobacter pylori; 13 Campylobacter jejuni;
14 Streptomyces lividans; 15 Streptococcus pneumoniae; 16 Neisseria meningitidis;
17 Cyanobacteria; 18 Custom.
1. Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P., and
Adams, C. L. Multiplexed biochemical assays with biological chips. 1993.
Nature 364:555–556.
2. Singh-Gasson, S., Green, R. D., Yue, Y., Nelson, C., Blattner, F., Sussman,
M. R., and Cerrina, F. Maskless fabrication of light-directed oligonucleotide
microarrays using a digital micromirror array. 1999. Nature Biotechnology
17:974–978.
3. Schena, M. (ed.) Microarray Biochip Technology. 2000. Eaton Publishing Co.,
Natick, MA.
4. Arfin, S. M., Long, A. D., Ito, E., Riehle, M. M., Paegle, E. S., and Hatfield,
G. W. Global gene expression profiling in Escherichia coli K12: the effects of
integration host factor. 2000. Journal of Biological Chemistry 275:29672–29684.
3
DNA array readout methods
Once a DNA array experiment has been designed and executed, the data
must be extracted and analyzed. That is, the signal from each address on the
array must be measured and some method for determining and subtracting
the background signal must be employed. However, because there are many
different DNA array formats and platforms, and because hybridization
signals can be generated with fluorescent- or radioactive-labeled targets, no
single DNA array readout device is suitable for all purposes. Furthermore,
many instruments with different advantages and disadvantages for different
types of array formats are available. Therefore, since accurate data acquisi-
tion is a critical step of any array experiment, careful attention must be paid
to the selection of data acquisition equipment.
and efficient detection of fluorescent emissions from two or more
fluorophores in a single experiment. This permits the common practice of
combining cDNA targets prepared from a reference and an experimental
condition, one containing Cy3-labeled and the other Cy5-labeled nucleo-
tides, for analysis on a single array. Of course, instruments for this applica-
tion must be equipped with multiple laser sources, one for each fluorophore
used.
In addition to providing an appropriate light source for fluorophore exci-
tation, attention must be paid to the efficiency and accuracy of fluorescence
emission measurements. Efficiency is an issue because the amount of light
emitted from a fluorophore is generally orders of magnitude (as much as
1,000,000-fold) weaker than the intensity of the light required to excite the
fluorophore. This problem is further exacerbated by the fact that
fluorophores emit light in all directions, not just toward the light detector
system. Thus, some optical method to collect as much of the emitted light
and eliminate as much of the excitation light as possible is required. This is
accomplished with an objective lens that captures the emitted light and
focuses it toward the detector. However, it is not possible to collect this light
simultaneously from all directions. The best that can be accomplished is to
collect the light that is emitted from the hemisphere of the fluorescence
directed toward the detector. The efficiency of an objective lens in accom-
plishing this task is expressed as its numerical aperture. If it were possible to
collect all of the light in a hemisphere, the objective would have a numerical
aperture of 1.0. In reality, objective lenses in array readers have numerical
apertures ranging from 0.5 to 0.9. Obviously, the numerical aperture of the
objective lens directly affects an instrument’s sensitivity. Therefore, since
instrument manufacturers often do not list this information in their
product description, it is something that a purchaser should ask for.
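To make the impact of the numerical aperture concrete, the collected fraction of an isotropic emitter's light can be estimated from the numerical aperture alone. This is a back-of-the-envelope sketch added here, not a calculation from the text; a dry objective (NA = sin θ) is assumed:

```python
import math

def collection_efficiency(na, n_medium=1.0):
    """Fraction of light emitted by an isotropic point source that is
    captured by an objective of numerical aperture `na`.
    For a dry objective, NA = n_medium * sin(theta_max)."""
    theta = math.asin(na / n_medium)                   # half-angle of acceptance cone
    solid_angle = 2 * math.pi * (1 - math.cos(theta))  # steradians subtended
    return solid_angle / (4 * math.pi)                 # fraction of the full sphere

for na in (0.5, 0.9):
    print(f"NA = {na}: {100 * collection_efficiency(na):.1f}% of emitted light collected")
```

With these numbers, moving from NA 0.5 (about 6.7%) to NA 0.9 (about 28%) roughly quadruples the collected signal, which is why this specification is worth asking a manufacturer about.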
All organic molecules fluoresce. Any contamination on an array surface
(commonly glass slides), including the coated surface of the slide itself, will
produce background fluorescence that can seriously compromise the
signal-to-noise ratio. This problem has been addressed by limiting the
focus of the laser to a small field of view in three-dimensional space
centered on the DNA sample on the array. This is accomplished by a
confocal method comprising two lenses in series producing two reciprocal
focal points, one focused on the sample and the other focused at a pinhole
for transmission of the light signal to the detector. The beauty of this
system is that the pinhole permits only the light from a very narrow depth of
focus of the objective lens to be passed through to the detector (Figure 3.1).
This allows very fine discrimination of noise and signal.
Figure 3.1. The principle of confocal scanning laser microscopy. Confocal systems
image only a small point (pixel) in three-dimensional space. This is achieved by
employing a laser as an illumination source and a small aperture in front of the
detector, which usually is a photomultiplier tube. A schematic of a simplified confo-
cal optical path is shown above. The laser beam enters from above and is reflected by
the beam splitter (a dichroic or multichroic filter that reflects light of short wave-
lengths and transmits light of longer wavelengths). The laser beam is focused to a
spot on the surface of the glass slide by an objective lens and excites fluorophores in
the focal plane of the objective. The fluorescent emissions (and the reflected laser
beam rays) are collected by the objective and collimated into parallel rays (solid
lines). Most of the reflected laser rays (95%) are reflected back toward the laser
source by the beam splitter. The remaining laser rays are excluded by a downstream
filter specific for the emission wavelength of the fluorophore. The parallel rays of the
fluorescent light received from the focal plane of the objective lens are focused on a
pinhole in a detector plate and passed through to the photomultiplier tube. The ray
paths indicated with dashed lines shows how light from out-of-focus objects such as
reflections and emissions from the second surface of the glass slide are eliminated by
the design of the optical system. This out-of-focus fluorescent light takes a different
path through the objective lens, the beam splitter, and the emission filter. As a result,
it is not focused into the detector pinhole and, therefore only a small portion of this
polluted light is passed on to the photomultiplier tube. Since confocal microscopy
images only a small point in three-dimensional space proportional to the aperture
of the pinhole in the detector plate, complete images must be obtained by scanning
and digital reconstruction.
[Table: commercial sources for fluorescence array scanners. Recoverable column
headings: Company, Web address, Model, Number of lasers, Scan area, Sensitivity,
Pixel resolution, Filters, Mechanism of scanning, Scan speed, Supported dyes,
Source type. The table body is not reproduced here. Notes: a Not confocal.
b PMT, photomultiplier tube; CCD, charge-coupled device.]
Reading data from a radioactive signal
Figure 3.2. Sigma-Genosys Panorama™ E. coli Gene Array. Each E. coli array con-
tains three fields divided into 24 × 16 (384) primary grids (1–24, A–P). Each
primary grid contains 16 secondary grids separated by 1 mm. Four different full-
length ORF probes are spotted in duplicate in every other secondary grid as shown
in the blowup of Field 1, A2. The remaining eight secondary grids of each primary
grid are blank. In the third field, some primary grids do not contain any ORFs.
[Figure 3.3 (schematic): A. Radiation; B. Scanning laser. Labels: light, transparent
layer, protective layer, photomultiplier tube.]
visible light color center. This structure is stable until the crystal is excited a
second time with laser light absorbed by the color center. This releases the
trapped electron that emits fluorescent light captured by a photomultiplier
tube during a laser scan of the screen. Since not all of the crystals are
returned to the BaFBr:Eu2+ state during the scan, the screen is “erased” for
subsequent use by flashing with visible light. The basic principle of PLS is
schematically illustrated in Figure 3.3.
Thus, when a DNA array hybridized with radioactively labeled targets is
placed on the phosphorimaging screen in the cassette, trapped electrons are
stored in the BaFBr:Eu3+ crystals at a rate proportional to the intensity of
the radioactive signal. After an appropriate exposure time (often 12–48
hours) the DNA array is removed and the phosphorimaging screen is
scanned with a laser and a photo-optical detection system containing a
photomultiplier tube that measures and records the emitted light. This
information is digitized and reconstructed into an image of the type shown
in Figure 3.2.
Since the scanning surfaces accommodated by phosphorimagers are
much larger than those of glass slide arrays scanned with confocal laser
scanners, time constraints dictate that they cannot be scanned at the same
resolution. Therefore, to accommodate this time constraint, the highest
scanning resolution for phosphorimaging instruments is 20 μm; however,
as scanning mechanics improve, scanning resolutions approaching the
5 μm size of the phosphor crystal are anticipated.
Some newer storage phosphorimaging instruments come equipped with
multiple scanning laser sources and filters for the excitation of many com-
monly used fluorophores. For example, fluorescent dye combinations such
as Cy3 and Cy5 can be imaged with these instruments. Nevertheless, while
these detection methods are suitable for many applications, they do not
offer the signal-to-noise and scanning resolution advantages necessary for
high-density DNA arrays that are provided by confocal laser scanning
instruments. The features of several currently available phosphorimagers
are described in Table 3.2.
Table 3.2. Commercial sources for phosphorimagers (columns: Name, Web
address, Model, Resolution, Scan area, Detects, Linear dynamic range; the
table body is not reproduced here).
strains were isogenic; that is, that they contained identical genetic back-
grounds except for the single structural gene for the Lrp protein. This was
accomplished by using a single wild-type E. coli K12 strain for the con-
struction of the two isogenic strains. First, the natural promoter-regulatory
region of the lacZYA operon in the chromosome was replaced with the pro-
moter of the ilvGMEDA operon (known to be regulated by Lrp). This pro-
duced a strain (IH-G2490; ilvPG::lacZYA, lrp+) in which the known effects
of Lrp on ilvGMEDA operon expression could be easily monitored by
enzymatic assay of the gene product of the lacZ gene, β-galactosidase.
Next, the Lrp structural gene was deleted from this strain to produce the
otherwise isogenic strain (IH-G2491; ilvPG::lacZYA, lrp−) [5].
The ability to control this source of biological variation in a model
organism such as E. coli with an easily manipulated genetic system is an
obvious advantage for gene expression profiling experiments. However,
most systems are not as easily controlled. For example, human samples
obtained from biopsy materials will not only differ in genotype but also in
cell types. Nevertheless, the experimenter should strive to reduce this source
of biological variability as much as possible. For example, laser-capture
techniques for the isolation of single cells from animal and human tissues
for isolation and amplification of RNA samples that address this problem
are being developed [6].
An additional source of biological variation in experiments comparing
the gene profiles of two cell types comes from the conditions under which
the cells are cultured. In this regard we have recommended that standard
cell-specific media should be adopted for the growth of cells queried by
DNA array experiments [2]. While this is not possible in every case, many
experimental conditions such as the comparison of two different genotypes
of the same cell line can be standardized. The adoption of such medium
standards would greatly reduce experimental variations and facilitate the
cross-comparison of experimental data obtained from different experi-
ments and/or different experimenters. For E. coli, Neidhardt et al. [7] have
performed extensive studies concerning the conditions required for main-
taining cells in a steady state of balanced growth. Their studies have defined
a synthetic glucose-minimal salts medium (glucose-minimal MOPS) that
minimizes medium changes during the logarithmic growth phase. The
experiments described in Chapter 7 were performed with cells grown in this
medium. Similar studies have described defined media for the growth of
many eukaryotic cell lines that should be agreed upon by researchers and
used when experimental conditions allow.
Figure 4.2. Relationships between the logarithm of hybridization signals and the
logarithm of ORF lengths with targets prepared from genomic DNA. Scatter plots
showing the relationships between hybridization signal intensities with 33P-labeled
cDNA probes generated from genomic DNA with random-hexamer oligonucleo-
tides (A) or 3′-ORF-specific DNA primers (B).
Figure 4.4. A poly(A) independent method for mRNA enrichment. Oligo-
nucleotide primers (short blue lines) specific for ribosomal RNA sequences are used
to generate rRNA/cDNA (represented by the red-dashed arrows) hybrids. The
RNA moiety (black line) of these double-stranded hybrids is digested away with
RNase H. Finally, the cDNA strand is removed with DNase I.
Normalization methods
Before we can determine the differential gene expression profiles between
two conditions obtained from the data of two DNA array experiments, we
must first ascertain that the data sets are comparable. That is, we must
develop methods to normalize data sets in a way that accounts for sources
of experimental and biological variations, such as those discussed above,
that might obscure the underlying variation in gene expression levels attrib-
utable to biological effects. However, with few exceptions, the sources of
these variations have not been measured and characterized. As a conse-
quence, many array studies are reported without statistical definitions of
their significance. This problem is even further exacerbated by the presence
of many different array formats and experimental designs and methods.
While some theoretical studies that address this important issue have
appeared in the literature, the normalization methods currently in common
use are based on more pragmatic biological considerations. Here we con-
sider the pros and cons of these data normalization methods and their
applicability to different experimental designs and DNA array formats.
Basically, these methods attempt to correct for the following variables:
• number of cells in the sample
• total RNA isolation efficiency
• mRNA isolation and labeling efficiency
• hybridization efficiency
• signal measurement sensitivity.
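As a minimal illustration of what such a correction looks like in practice, global total-intensity normalization rescales each array so that multiplicative factors (RNA amount, labeling efficiency, hybridization efficiency, detector gain) cancel. The function name and example values below are ours, not from the text, and this is only one of several methods the chapter considers:

```python
def global_normalize(signals):
    """Scale one array's spot intensities so that the total signal is 1.
    This corrects multiplicatively for differences in RNA amount, labeling
    efficiency, hybridization efficiency, and detector gain, under the
    assumption that total mRNA per cell is roughly constant."""
    total = sum(signals)
    return [s / total for s in signals]

# Two hypothetical arrays of the same four genes; array B was labeled or
# scanned about twice as efficiently overall.
array_a = [100.0, 400.0, 250.0, 250.0]
array_b = [210.0, 820.0, 500.0, 470.0]

norm_a = global_normalize(array_a)
norm_b = global_normalize(array_b)

# After normalization the per-gene fractions are directly comparable:
ratios = [b / a for a, b in zip(norm_a, norm_b)]
print(ratios)
```

Note the key assumption: if a treatment changes the *total* amount of mRNA, global normalization will silently redistribute that change across all genes.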
1. DeRisi, J. L., Iyer, V. R., and Brown, P. O. The metabolic and genetic control
of gene expression on a genomic scale. 1997. Science 278(5338):680–686.
2. Arfin, S. M., Long, A. D., Ito, E., Riehle, M. M., Paegle, E. S., and Hatfield,
G. W. Global gene expression profiling in Escherichia coli K12: the effects of
integration host factor. 2000. Journal of Biological Chemistry
275:29672–29684.
3. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. Quantitative
monitoring of gene expression patterns with a complementary DNA
microarray. 1995. Science 270:467–470.
4. Schena, M. Genome analysis with gene expression microarrays. Review.
1996. Bioessays 18:427–431.
5. Rhee, K., Parekh, B. S., and Hatfield, G. W. Leucine-responsive regulatory
protein–DNA interactions in the leader region of the ilvGMEDA operon of
Escherichia coli. 1996. Journal of Biological Chemistry 271:26499–26507.
6. Dolter, K. E., and Braman, J. C. Small-sample total RNA purification: laser
capture microdissection and cultured cell applications. 2001. Biotechniques
6:1358–1361.
7. Neidhardt, F. C., Bloch, P. L., and Smith, D. F. Culture medium for
enterobacteria. 1974. Journal of Bacteriology 119:736–747.
8. Ito, E. T., and Hatfield, G. W., unpublished data.
9. Decker, C. J., and Parker, R. Mechanisms of mRNA degradation in
eukaryotes. Review. 1994. Trends in Biochemical Sciences 19:336–340.
10. Olivas, W., and Parker, R. The Puf3 protein is a transcript-specific regulator
of mRNA degradation in yeast. 2000. EMBO Journal 19:6602–6611.
11. Yu, H., Tan, M., and Hatfield, G.W., unpublished data.
12. Vawter, M., and Hatfield, G. W., unpublished data.
13. Suzuki, T., Higgins, P. J., and Crawford, D. R. Control selection for RNA
quantitation. 2000. Biotechniques 29:332–337.
14. Li, C., and Wong, W. H. Model-based analysis of oligonucleotide arrays:
expression index computation and outlier detection. 2001. Proceedings of the
National Academy of Sciences of the USA 98:31–36.
5
Statistical analysis of array data:
Inferring changes
from the angle of the conditions and ask how similar are two conditions
from a gene expression standpoint [4], or cluster data according to experi-
ments. If for nothing else than convenience, we shall still follow the simple
roadmap. Thus, in this chapter we deal with the first problem of inferring
significant gene changes. The treatment here largely follows our published
work [5, 6].
To begin with, we assume for simplicity that for each gene X the data D
consists of a set of measurements $x_1^c, \ldots, x_{n_c}^c$ and
$x_1^t, \ldots, x_{n_t}^t$ representing expression levels, or rather their
logarithms, in both a control and treatment situation. For each gene, the
fundamental question we wish to address is whether the level of expression
is significantly different in the two situations.
One approach commonly used in the literature at least in the first wave of
publications (see for instance, [7, 8, 9]), has been a simple-minded fold
approach, in which a gene is declared to have significantly changed if its
average expression level varies by more than a constant factor, typically
two, between the treatment and control conditions. Inspection of gene
expression data suggests, however, that such a simple “twofold rule” is
unlikely to yield optimal results, since a factor of two can have quite
different significance and meaning in different regions of the spectrum of
expression levels, in particular at the very high and very low ends.
Another approach to the same question is the use of a t-test, for instance
on the logarithm of the expression levels. This is similar to the fold
approach because the difference between two logarithms is the logarithm of
their ratio. This approach is not necessarily identical to the first because the
logarithm of the mean is not equal to the mean of the logarithms; in fact it
is always strictly greater, by concavity of the logarithm function. The two
approaches are equivalent only if one uses the geometric mean of the ratios
rather than the arithmetic mean. In any case, with a reasonable degree of
approximation, a test of the significance of the difference between the log
expression levels of two genes is equivalent to a test of whether or not their
fold change is significantly different from 1.
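The distinction between averaging ratios and averaging their logarithms is easy to check numerically. This is a small illustration with made-up ratios, not data from the text:

```python
import math

ratios = [2.0, 8.0]  # two hypothetical fold-change measurements for one gene

arithmetic_mean = sum(ratios) / len(ratios)              # 5.0
geometric_mean = math.prod(ratios) ** (1 / len(ratios))  # 4.0
mean_of_logs = sum(math.log(r) for r in ratios) / len(ratios)

# Averaging the logs is equivalent to taking the geometric mean of the ratios:
assert math.isclose(mean_of_logs, math.log(geometric_mean))
# By Jensen's inequality for the concave log, the log of the arithmetic
# mean is strictly greater than the mean of the logs:
assert math.log(arithmetic_mean) > mean_of_logs
```

So a t-test on log expression levels implicitly compares geometric, not arithmetic, mean fold changes.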
In a t-test, the empirical means $m_c$ and $m_t$ and variances $s_c^2$ and $s_t^2$ are used
to compute a normalized distance between the two populations in the form:

$$t = \frac{m_c - m_t}{\sqrt{s_c^2/n_c + s_t^2/n_t}} \qquad (5.1)$$

where, for each population, $m = \sum_i x_i/n$ and $s^2 = \sum_i (x_i - m)^2/(n-1)$ are the
well-known estimates for the mean and variance. It is well known
in the statistics literature that t follows approximately a Student distribu-
tion (Appendix A), with

$$f = \frac{\left(s_c^2/n_c + s_t^2/n_t\right)^2}{\dfrac{(s_c^2/n_c)^2}{n_c - 1} + \dfrac{(s_t^2/n_t)^2}{n_t - 1}} \qquad (5.2)$$
degrees of freedom. When t exceeds a certain threshold depending on the
confidence level selected, the two populations are considered to be
different. Because in the t-test the distance between the population means is
normalized by the empirical standard deviations, this has the potential for
addressing some of the shortcomings of the simple fixed fold-threshold
approach. The fundamental problem with the t-test for array data,
however, is that the repetition numbers nc and/or nt are often small because
experiments remain costly or tedious to repeat, even with current technol-
ogy. Small populations of size n = 1, 2, or 3 are still very common and lead,
for instance, to poor estimates of the variance. Thus a better framework is
needed to address these shortcomings.
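The t statistic of Equation 5.1 and the Welch degrees of freedom of Equation 5.2 can be written down in a few lines, which makes the small-n instability easy to see. The sample values here are illustrative, not from the text:

```python
import math
from statistics import mean, variance

def welch_t(control, treatment):
    """t statistic (Eq. 5.1) and Welch degrees of freedom (Eq. 5.2)
    for two small samples of log expression levels."""
    nc, nt = len(control), len(treatment)
    mc, mt = mean(control), mean(treatment)
    vc, vt = variance(control), variance(treatment)  # unbiased, (n-1) denominator
    se2 = vc / nc + vt / nt
    t = (mc - mt) / math.sqrt(se2)
    f = se2 ** 2 / ((vc / nc) ** 2 / (nc - 1) + (vt / nt) ** 2 / (nt - 1))
    return t, f

# With n = 2 each variance estimate rests on a single degree of freedom,
# so t swings wildly from one replicate pair to the next:
print(welch_t([-8.1, -7.9], [-8.6, -8.4]))
```

A pair of replicates that happens to land close together produces a tiny variance estimate and hence a spuriously enormous t; this is the failure mode the Bayesian treatment below is designed to regularize.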
Here we describe a Bayesian probabilistic framework for array data,
which bears some analogies with the framework used for sequence data [10]
and can effectively address the problem of detecting gene differences.
Figure 5.1. DNA dice.
$$P(\mu, \sigma^2) = C\,\sigma^{-1}\,(\sigma^2)^{-(\nu_0/2+1)} \exp\!\left[-\frac{\nu_0\,\sigma_0^2 + \lambda_0(\mu_0-\mu)^2}{2\sigma^2}\right] \qquad (5.6)$$

The expectation of the prior is finite if and only if $\nu_0 > 2$. Notice that it
makes perfect sense with array data to assume a priori that $\mu$ and $\sigma^2$ are
dependent, as suggested immediately by visual inspection of typical array
data sets (see Figure 5.3, below). The hyperparameters $\mu_0$ and $\sigma^2/\lambda_0$ can be
interpreted as the location and scale of $\mu$, and the hyperparameters $\nu_0$ and
$\sigma_0^2$ as the degrees of freedom and scale of $\sigma^2$. After some algebra, the poste-
rior has the same functional form as the prior

$$P(\mu, \sigma^2 \mid D, \alpha) = N(\mu;\,\mu_n,\,\sigma^2/\lambda_n)\;I(\sigma^2;\,\nu_n,\,\sigma_n^2) \qquad (5.7)$$

with

$$\mu_n = \frac{\lambda_0}{\lambda_0+n}\,\mu_0 + \frac{n}{\lambda_0+n}\,m \qquad (5.8)$$

$$\lambda_n = \lambda_0 + n \qquad (5.9)$$

$$\nu_n = \nu_0 + n \qquad (5.10)$$

$$\nu_n\,\sigma_n^2 = \nu_0\,\sigma_0^2 + (n-1)\,s^2 + \frac{\lambda_0\,n}{\lambda_0+n}\,(m-\mu_0)^2 \qquad (5.11)$$

The parameters of the posterior combine information from the prior and the
data in a sensible way. The mean $\mu_n$ is a convex weighted average of the prior
mean and the sample mean. The posterior degree of freedom $\nu_n$ is the prior
degree of freedom $\nu_0$ plus the sample size n, and similarly for the scaling
factor $\lambda_n$. The posterior sum of squares $\nu_n\sigma_n^2$ is the sum of the prior sum of
squares $\nu_0\sigma_0^2$, the sample sum of squares $(n-1)s^2$, and the residual uncer-
tainty provided by the discrepancy between the prior mean and the sample
mean.
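Equations 5.8 through 5.11 translate directly into a few lines of code. The sketch below is our own, with illustrative hyperparameter values; it returns the posterior location, pseudo-counts, degrees of freedom, and variance scale:

```python
from statistics import mean, variance

def posterior_params(data, mu0, lam0, nu0, sigma0_sq):
    """Posterior hyperparameters (Eqs. 5.8-5.11) of the normal /
    scaled-inverse-gamma conjugate prior, given observations `data`."""
    n = len(data)
    m = mean(data)
    s_sq = variance(data) if n > 1 else 0.0   # (n-1)-denominator estimate
    mu_n = (lam0 * mu0 + n * m) / (lam0 + n)  # convex combination (Eq. 5.8)
    lam_n = lam0 + n                          # Eq. 5.9
    nu_n = nu0 + n                            # Eq. 5.10
    nu_sigma_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
                   + lam0 * n / (lam0 + n) * (m - mu0) ** 2)  # Eq. 5.11
    return mu_n, lam_n, nu_n, nu_sigma_sq / nu_n

# Three replicates of one gene's log expression level, with the prior
# centered at the sample mean (mu0 = m) and nu0 = 10 pseudo-observations
# of a background variance of 0.04:
data = [-8.2, -8.0, -7.8]
print(posterior_params(data, mu0=mean(data), lam0=1.0, nu0=10, sigma0_sq=0.04))
```

The returned variance, (10 × 0.04 + 2 × 0.04)/13, shows the regularizing effect: with only three observations the prior pseudo-observations dominate the estimate.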
While it is possible to use a prior mean $\mu_0$ for gene expression data, in
many situations it is sufficient to use $\mu_0 = m$. The posterior sum of squares is
then obtained precisely as if one had $\nu_0$ additional observations all asso-
ciated with deviation $\sigma_0^2$. While superficially this may seem like setting the
prior after having observed the data, it can easily be justified [18].
Furthermore, a similar effect is obtained using a preset value $\mu_0$ with $\lambda_0 \to 0$,
i.e., with a very broad standard deviation so that the prior belief about the
location of the mean is essentially uniform and vanishingly small. The
selection of the hyperparameters for the prior is discussed in more detail
below.
It is not difficult to check that the conditional posterior distribution of
the mean $P(\mu \mid \sigma^2, D, \alpha)$ is normal $N(\mu_n, \sigma^2/\lambda_n)$. The marginal posterior
$P(\mu \mid D, \alpha)$ of the mean is Student $t(\nu_n, \mu_n, \sigma_n^2/\lambda_n)$ and the marginal posterior
$P(\sigma^2 \mid D, \alpha)$ for the variance is scaled inverse gamma $I(\nu_n, \sigma_n^2)$.
Finally, it is worth remarking that, if needed, more complex priors could
be constructed using mixtures of conjugate priors, leading to mixtures of
conjugate posteriors.
Figure 5.2. Contour plots for the posterior distribution of two hypothetical genes, or the
same gene in two different conditions, using the posterior defined above, seen in the xy
plane with x = μ and y = σ². In both cases the mean is equal to 100; therefore the two situa-
tions are indistinguishable with a t-test or a fold approach. Yet the behaviors are very different
in terms of standard deviations.
assigning a non-zero prior to the null hypothesis, which seems quite con-
trived (see section “Hypothesis testing” in Appendix B).
To address this decision issue, here we use a compromise between
hypothesis testing and the more general Bayesian framework by leveraging
the simplicity of the t-test, but using parameter and hyperparameter point
estimates derived from the Bayesian framework.
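In code, this compromise amounts to running the t-test of Equation 5.1 with the posterior variance point estimate in place of each empirical variance. The sketch below is in the spirit of the CyberT regularized t-test; the parameter names and example values are ours:

```python
import math
from statistics import mean, variance

def regularized_t(control, treatment, bg_var_c, bg_var_t, nu0=10):
    """t statistic in which each empirical variance is replaced by the
    regularized estimate (nu0*bg_var + (n-1)*s^2)/(nu0 + n - 2), i.e. the
    posterior point estimate with mu0 = m. bg_var_* is the background
    variance of genes with similar expression levels."""
    def reg_var(x, bg_var):
        n = len(x)
        s_sq = variance(x) if n > 1 else 0.0
        return (nu0 * bg_var + (n - 1) * s_sq) / (nu0 + n - 2)
    nc, nt = len(control), len(treatment)
    vc, vt = reg_var(control, bg_var_c), reg_var(treatment, bg_var_t)
    return (mean(control) - mean(treatment)) / math.sqrt(vc / nc + vt / nt)

# Two replicates whose tiny spread would make a plain t-test explode;
# the background variance (0.04) keeps the denominator honest:
print(regularized_t([-8.01, -7.99], [-8.51, -8.49], 0.04, 0.04))
```

Here the plain t-test would divide the same 0.5 log-unit difference by a standard error built from a variance of 0.0002; the regularized version uses roughly 0.04 instead and reports a moderate, credible t of about 2.5.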
Figure 5.3. DNA array experiment on Escherichia coli. Data obtained from reverse
transcribed 33P-labeled RNA hybridized to commercially available nylon arrays
(Sigma Genosys) containing each of the 4290 predicted E. coli genes. The sample
included a wild-type strain (control) and an otherwise isogenic strain lacking the
gene for the global regulatory gene, integration host factor (IHF) (experimental).
n = 4 for both control and experimental situations. The horizontal axis represents
the mean of the logarithm of the expression levels, and the vertical axis shows the
corresponding standard deviations (std). The left column corresponds to raw
data; the right column to regularized standard deviations using Equation 5.13.
Window size is ws = 101 and K = 10. Data are from [20].
Simulations
We have used the Bayesian approach and CyberT to analyze a number of
published and unpublished data sets. We have found that the Bayesian
approach very often compares favorably to a simple fold approach or a
straight t-test and partially overcomes deficiencies related to low replication
in a statistically consistent way [6]. In every high-density array experiment
Figure 5.4. Plain t-test run on simulated data in the same range as data in previous
figure. n points are randomly drawn from two pairs of Gaussians 1000 times and a
plain t-test is applied to decide whether the two populations are different or not with
a confidence of 95%. n is varied from 2 to 5. Vertical axis represents the percentage
of “correct” answers. One of the Gaussians is kept fixed with a mean of −10 and
standard deviation of 0.492. The other Gaussians are taken at regular intervals
away from the first Gaussian. Both Gaussians have the corresponding average stan-
dard deviation derived by linear regression on the data of the previous figure in the
interval [−11, −7] of log-activities. The parameters of the regression line
std = a log(activity) + b for log-activities in the range of [−11, −7] are: a = −0.123 and
b = −0.736. For low values of n, the t-test performs very poorly.
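The simulation behind Figure 5.4 is easy to reproduce in outline. The version below is our own pooled-variance variant with hardcoded Student critical values, not the authors' exact code:

```python
import random
from statistics import mean, stdev

# Two-sided 5% critical values of Student's t for df = 2(n-1):
T_CRIT_95 = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306}

def fraction_detected(mu1, mu2, sigma, n, trials=1000, seed=0):
    """Fraction of trials in which a pooled two-sample t-test declares
    the two Gaussians different at the 95% level (cf. Figure 5.4)."""
    rng = random.Random(seed)
    df = 2 * (n - 1)
    detected = 0
    for _ in range(trials):
        x = [rng.gauss(mu1, sigma) for _ in range(n)]
        y = [rng.gauss(mu2, sigma) for _ in range(n)]
        sp = ((stdev(x) ** 2 + stdev(y) ** 2) / 2) ** 0.5  # pooled std
        t = abs(mean(x) - mean(y)) / (sp * (2 / n) ** 0.5)
        if t > T_CRIT_95[df]:
            detected += 1
    return detected / trials

# Fixed Gaussian at -10 (std 0.492) versus one displaced by a full log unit:
for n in (2, 3, 5):
    print(n, fraction_detected(-10.0, -9.0, 0.492, n))
```

Even with a full log-unit separation, the detection rate collapses as n drops toward 2, which is the qualitative behavior the figure reports.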
Figure 5.5. Same as the previous figure but with a 99% confidence level.
Log expression          Ratio           Plain t-test        Bayes
n     from    to     2-fold  5-fold   p<0.05  p<0.01   p<0.05  p<0.01
2     −8      −8        1       0       38       7       73       9
2     −10     −10      13       0       39      11       60      11
2     −12     −12     509     108       65      10       74      16
2     −6      −6.1      0       0       91      20      185      45
2     −8      −8.5    167       0      276      71      730     419
2     −10     −11     680     129      202      47      441     195
3     −8      −8        0       0       42       9       39       4
3     −10     −10      36       0       51      11       39       6
3     −12     −12     406      88       44       5       45       4
3     −6      −6.1      0       0      172      36      224      60
3     −8      −8.5    127       0      640     248      831     587
3     −10     −11     674      62      296     139      550     261
5     −8      −8        0       0       53      13       39       8
5     −10     −10       9       0       35       6       31       3
5     −12     −12     354      36       65      11       54       4
5     −6      −6.1      0       0      300     102      321     109
5     −8      −8.5     70       0      936     708      966     866
5     −10     −11     695      24      688     357      752     441
2v4   −8      −8        0       0       35       4       39       6
2v4   −10     −10      38       0       36       9       40       3
2v4   −12     −12     446      85       46      17       43       5
2v4   −6      −6.1      0       0      126      32      213      56
2v4   −8      −8.5    123       0      475     184      788     509
2v4   −10     −11     635      53      233      60      339      74
Notes:
Data were generated using a normal distribution on a log scale,
with 1000 replicates for each parameter combination. Means of the log data and
associated standard deviations (in brackets) are as follows: −6 (0.1), −8 (0.2),
−10 (0.4), −11 (0.7), −12 (1.0). For each value of n, the first three experiments
correspond to the case of no change and therefore yield false positive rates.
Analysis was carried out using CyberT with default settings and the hyperparam-
eter set to 10. Regularized t-tests were carried out using degrees of freedom equal
to: reps − 1 + hyperparameter − 1.
Extensions
In summary, the probabilistic framework for array data analysis addresses a
number of shortcomings of current approaches related to small sample bias
and the fact that fold differences have different significance at different expres-
sion levels. The framework is a form of hierarchical Bayesian modeling
with Gaussian gene-independent models. While there can be no perfect
substitute for experimental replication (see also [21]), in simulations and
controlled replicated experiments we have shown that the approach has a
regularizing effect on the data, that it compares favorably to a conventional
t-test, or simple fold approach, and that it can partially compensate for the
absence of replication. New methods discussed in Chapter 7 are being
developed to statistically estimate the rates of false positives and false nega-
tives by modeling the distribution of p-values using a mixture of Dirichlet
distributions and leveraging the fact that, under the null assumption, this
distribution is uniform [22].
Depending on goals and implementation constraints, the present frame-
work can be extended in a number of directions. For instance, regression
functions could be computed offline to establish the relationship between
standard deviation and expression level and used to produce background
standard deviations. Another possibility is to use adaptive window sizes to
compute the local background variance, where the size of the window could
depend, for instance, on the derivative of the regression function. In an
expression range in which the standard deviation is relatively flat (e.g.,
between −8 and −4 in Figure 5.3), the size of the window is less relevant than
in a region where the standard deviation varies rapidly (e.g., between −12
and −10). A more complete Bayesian approach could also be implemented.
The general approach described in this chapter also can be extended to
more complex designs and/or designs involving gradients of an experimen-
tal variable and/or time series designs. Examples would include a design in
which cells are grown in the presence of different stressors (urea, ammonia,
hydrogen peroxide), or when the molarity of a single stressor is varied (0, 5, 10
mmol). Generalized linear and non-linear models can be used in this
context. The most challenging problem, however, is to extend the probabil-
istic framework towards the second level of analysis, taking into account
possible interactions and dependencies amongst genes. Multivariate
normal models, and their mixtures, could provide the starting probabilistic
models for this level of analysis (see also section “Gaussian processes” in
Appendix B).
With a multivariate normal model, for instance, $\mu$ is a vector of means
and $\Sigma$ is a symmetric positive definite covariance matrix with determinant
$|\Sigma|$. The likelihood has the form

$$C\,|\Sigma|^{-n/2} \exp\!\left[-\frac{1}{2}\sum_{i=1}^{n}(X_i-\mu)^T\,\Sigma^{-1}\,(X_i-\mu)\right] \qquad (5.17)$$

with posterior parameters

$$\mu_n = \frac{\lambda_0}{\lambda_0+n}\,\mu_0 + \frac{n}{\lambda_0+n}\,m, \qquad \lambda_n = \lambda_0 + n, \qquad \nu_n = \nu_0 + n$$

$$\Sigma_n = \Sigma_0 + \sum_{i=1}^{n}(X_i-m)(X_i-m)^T + \frac{\lambda_0\,n}{\lambda_0+n}\,(m-\mu_0)(m-\mu_0)^T \qquad (5.18)$$
Thus estimates similar to Equation 5.13 can be derived for this multidimen-
sional case to regularize both variances and covariances. While multivariate
normal and other related models may provide a good starting-point, good
probabilistic models for higher-order effects in array data are still at an
early stage of development. Many approaches so far have concentrated on
more or less ad hoc applications of clustering methods. This is one of the
main topics of Chapters 6 and 7. In this chapter we hope to have provided
a convincing argument to the reader that it is effective in general to model
array data in a probabilistic fashion. Besides DNA arrays, there are several
other kinds of high-density arrays, at different stages of development,
which could benefit from a similar treatment. Going directly to a systematic
probabilistic framework may contribute to the acceleration of the discovery
process by avoiding some of the pitfalls observed in the history of sequence
analysis, where it took several decades for probabilistic models to emerge as
the proper framework for many tasks.
6

Statistical analysis of array data: Dimensionality reduction, clustering, and
regulatory regions
The methods and examples of Chapters 5 and 7 show how noisy the
boundaries between clusters can be, especially in isolated experiments with
low repetition. The key observation, however, is that even when individual
experiments are not replicated many times complex expression patterns can
still be detected robustly across multiple experiments and conditions.
Consider, for instance, a cluster of genes directly involved in the cell divi-
sion cycle and whose expression pattern oscillates during the cycle. For each
individual measurement at a given time t, noise alone can introduce distor-
tions so that a gene which belongs to the cluster may fall out of the cluster.
However, when the measurements at other times are also considered, the
cluster becomes robust, and it is unlikely that a gene will fall outside the
cluster it belongs to at most time steps. The same can be said of course for
genes involved in a particular form of cancer across multiple patients, and
so forth. In fact, it may be argued (Chapter 8) that robustness is a funda-
mental characteristic of regulatory circuits that must somehow transpire
even through noisy microarray data.
In many cases, cells tend to produce the proteins they need simultane-
ously, and only when they need them. The genes for the enzymes that cata-
lyze a set of reactions along a pathway are likely to be co-regulated (and
often somewhat co-located along the chromosome). Thus, depending on
the data and clustering methods, gene clusters can often be associated with
particular pathways and with co-regulation. Even partial understanding
of the available information can provide valuable clues. Co-expression of
novel genes may provide a simple means of gaining leads to the functions of
many genes for which information is not yet available. Likewise, multi-gene
expression patterns could characterize diseases and lead to new precise
diagnostic tools capable of discriminating, for instance, different kinds of
cancers.
Many data analysis techniques have already been applied to problems in
this class, including various clustering methods from k-means to hierarchi-
cal clustering, principal component analysis, factor analysis, independent
component analysis, self-organizing maps, decision trees, neural networks,
support vector machines, and Bayesian networks to name just a few. It is
impossible to review all the methods of analysis in detail in the available
space and counter-productive to try to single out a “best method” because:
(1) each method may have different advantages depending on the specific
task and specific properties of the data set being analyzed; (2) the under-
lying technology is still rapidly evolving; and (3) noise levels do not always
allow for a fine discrimination between methods. Rather, we focus on the
main methods of analysis and the underlying mathematical background.
Visualization
Array data is inherently high-dimensional, hence methods that try to
reduce the dimensionality of the data and/or lend themselves to some form
of visualization remain particularly useful. These range from simple plots
of one condition versus another, to projection on to lower dimensional
spaces, to hierarchical and other forms of clustering. In the next sections,
we focus on dimensionality reduction (principal component analysis) and
clustering since these are two of the most important and widely used
methods of analysis for array data. For clustering, we partially follow the
treatment in [1]. Additional information about other methods of array data
analysis can be found in the last chapter (neural networks and Bayesian net-
works), in the Appendices (support vector machines), and throughout the
references. Clustering methods of course can be applied not only to genes,
but also to conditions, DNA sequences, and other relevant data. From
array-derived gene clusters it is also possible to go back to the correspond-
ing gene sequences and look, for instance, for shared motifs in the regula-
tory regions of co-regulated genes, as described in the last section of this
chapter.
Figure 6.1. A two-dimensional set of points with its principal component axes and
its unit vector u.
the probes/genes. We assume that the xs have already been centered by sub-
tracting their mean value, or expectation, E[x]. The basic idea in PCA is to
reduce the dimension of the data by projecting the xs on to an interesting
linear subspace of dimension K, where K is typically significantly smaller
than M. "Interesting" is defined in terms of variance maximization.
PCA can easily be understood by recursively computing the orthonormal
axes of the projection space. For the first axis, we are looking for a unit
vector u_1 such that, on average, the squared length of the projection of the
xs along u_1 is maximal. Assuming that all vectors are column vectors, this
can be written as

    u_1 = arg max_{||u||=1} E[(u^T x)^2]        (6.1)

and, recursively, the kth axis maximizes the projected variance of the
residual left by the first k − 1 axes:

    u_k = arg max_{||u||=1} E[ (u^T (x − ∑_{i=1}^{k−1} (u_i^T x) u_i))^2 ]        (6.2)
The principal components of the vector x are given by c_i = u_i^T x. By con-
struction, the vectors u_i are orthonormal. In practice, it can be shown that
the u_i are the eigenvectors of the (sample) covariance matrix Σ = E[xx^T]
associated with the K largest eigenvalues λ_k, and satisfy

    Σ u_k = λ_k u_k        (6.3)

In array experiments, these give rise to "eigengenes" and "eigenarrays" [2].
Each eigenvalue λ_k provides a measure of the proportion of the variance
explained by the corresponding eigenvector.
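A minimal sketch of PCA as just described, via eigendecomposition of the sample covariance matrix (numpy assumed; the function name and synthetic data are ours):

```python
import numpy as np

def pca(X, K):
    """PCA by eigendecomposition of the sample covariance matrix.
    X is an (N, M) matrix of N expression vectors; returns the top-K
    eigenvectors (as columns) and the fraction of variance each explains."""
    Xc = X - X.mean(axis=0)               # center the data
    cov = (Xc.T @ Xc) / Xc.shape[0]       # sample covariance E[x x^T]
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:K]   # pick the K largest
    explained = evals[order] / evals.sum()
    return evecs[:, order], explained

# Synthetic data with most variance along the first coordinate axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.5])
U, frac = pca(X, K=2)
# frac[0] reports the proportion of variance captured by the first
# component, which here dominates by construction.
```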
By projecting the vectors onto the subspace spanned by the first eigenvec-
tors, PCA retains the maximal variance in the projection space and mini-
mizes the mean-square reconstruction error. The choice of the number K of
components is in general not a serious problem – basically it is a matter of
inspecting how much variance is explained by increasing values of K. For
visualization purposes, only projections on to two- or three-dimensional
spaces are useful. The first dominant eigenvectors can be associated with
the discovery of important features or patterns in the data. In DNA micro-
array data where the points correspond to genes and the axes to different
experiments, such as different points in time, the dominant eigenvectors can
represent expression patterns. For example, if the first eigenvector has a
large component along the first experimental axis and a small component
along the second and third axis, it can be associated with the experimental
expression pattern “high–low–low”. In the case of replicated experiments,
we can expect the first eigenvector to be associated with the principal diago-
nal (1/√N, . . ., 1/√N).
There are also a number of techniques for performing approximate PCA,
as well as probabilistic and generalized (non-linear and projection pursuit)
versions of PCA (see [3, 4, 5, 6, 7] and references therein). An extensive application
of PCA techniques to array data is described in [2] and in Chapter 7.
Although PCA is not a clustering technique per se, projection on to lower
dimensional spaces associated with the top components can help reveal and
visualize the presence of clusters in the data (see Chapter 7 for an example).
These projections however must be considered carefully since clusters
present in the data can become hidden during the projection operation
(Figure 6.2). Thus, while PCA is a useful technique, it is only one way of
analyzing the data which should be complemented by other methods, and
in particular by methods whose primary focus is data clustering.
Figure 6.2. Schematic representation of three data sets in two dimensions with very
different clustering properties and projections onto principal components. A has
one cluster. B and C have four clusters. A, B, and C have the same principal compo-
nents. A and B have a similar covariance matrix with a large first eigenvalue. The
first and second eigenvalues of C are identical.
Clustering overview
Another direction for visualizing or compressing high-dimensional array
data is the application of clustering methods. Clustering refers to an impor-
tant family of techniques in exploratory data analysis and pattern discov-
ery, aimed at extracting underlying cluster structures. Clustering, however,
is a “fuzzy” notion without a single precise definition. Dozens of clustering
algorithms exist in the literature and a number of ad hoc clustering proce-
dures, ranging from hierarchical clustering to k-means have been applied to
DNA array data [8, 9, 10, 11, 12, 13, 14, 15]. Because of the variety and
“open” nature of clustering problems, it is unlikely that a systematic
exhaustive treatment of clustering can be given. However there are a
number of important general issues to consider in clustering and clustering
algorithms, especially in the context of gene expression.
Data types
At the highest level, clustering algorithms can be distinguished depending
on the nature of the data being clustered. The standard case is when the
data points are vectors in Euclidean space. But this is by no means the only
possibility. In addition to vectorial data, or numerical data expressed in
absolute coordinates, there is the case of relational data, where data is rep-
resented in relative coordinates, by giving the pairwise distance between any
two points. In many cases the data is expressed in terms of a pairwise simi-
larity (or dissimilarity) measure that often does not satisfy the three axioms
of a distance (positivity, symmetry, and triangle inequality). There exist sit-
uations where data configurations are expressed in terms of ternary or
higher order relationships or where only a subset of all the possible pairwise
similarities is given. More importantly, there are cases where the data is not
vectorial or relational in nature, but essentially qualitative, as in the case of
answers to a multiple-choice questionnaire. This is sometimes also called
nominal data. While at the present time gene expression array data is pre-
dominantly numerical, this is bound to change in the future. Indeed, the
dimension “orthogonal to the genes” covering different experiments,
different patients, different tissues, different times, and so forth is at least in
part non-numerical. As databases of array data grow, in many cases the
data will be mixed with both vectorial and nominal components.
Supervised/unsupervised
One important distinction amongst clustering algorithms is supervised
versus unsupervised. In supervised clustering, clustering is based on a set of
given reference vectors or classes. In unsupervised clustering, no predefined
set of vectors or classes is used. Hybrid methods are also possible where an
unsupervised approach is followed by a supervised one. At the current early
stage of gene expression array experiments, unsupervised methods such as
k-means and self-organizing maps [11] are most commonly used. However,
supervised methods have also been tried [12, 16], where clusters are prede-
termined using functional information or unsupervised clustering
methods, and then new genes are classified in the various clusters using a
classifier, such as linear and quadratic discriminant analysis, decision trees,
neural networks, or support vector machines, that can learn the decision
boundaries between data classes. The feasibility of class discrimination
with array expression data has been demonstrated, for instance for tumor
classes such as leukemias arising from several different precursors [17], and
B-cell lymphomas [18] (see also [19, 20]).
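As the simplest possible illustration of the supervised setting, here is a sketch of a nearest-centroid rule that assigns a new gene to one of two predetermined clusters (a deliberately minimal stand-in; the studies cited above use more powerful classifiers such as support vector machines):

```python
import numpy as np

def nearest_centroid(train, labels, x):
    """Assign a new expression vector x to the predefined cluster whose
    centroid is closest in Euclidean distance."""
    classes = sorted(set(labels))
    centroids = {c: train[np.array(labels) == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))

# Two predefined expression clusters: "high-low" and "low-high" patterns
# across two conditions (hypothetical toy data).
train = np.array([[9.0, 2.0], [8.5, 2.5], [2.0, 9.0], [2.5, 8.5]])
labels = ["high-low", "high-low", "low-high", "low-high"]
print(nearest_centroid(train, labels, np.array([8.0, 3.0])))  # prints: high-low
```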
Similarity/distance
The starting-point of several clustering algorithms, including several forms
of hierarchical clustering, is a matrix of pairwise similarities or distances
between the objects to be clustered. In some instances, this pairwise dis-
tance is replaced by a distortion measure between a data point and a class
centroid as in vector quantization methods. The precise definition of simi-
larity, distance, or distortion is crucial and, of course, can greatly impact
the output of the clustering algorithm. In any case, it allows converting the
clustering problem into an optimization problem in various ways, where the
goal is essentially to find a relatively small number of classes with high
intraclass similarity or low intraclass distortion, and good interclass separ-
ation. In sequence analysis, for instance, similarity can be defined using a
score matrix for gaps and substitutions and an alignment algorithm. In
gene expression analysis, different measures of similarity can be used. Two
obvious examples are Euclidean distance (or more generally L^p distances)
and correlation between the vectors of expression levels. The Pearson cor-
relation coefficient is just the dot product of two normalized vectors, or the
cosine of their angle. It can be measured on each pair of genes across, for
instance, different experiments or different time steps. Each measure of
similarity comes with its own advantages and drawbacks depending on the
situation, and may be more or less suitable to a given analysis. The correla-
tion, for instance, captures similarity in shape but places no emphasis on
the magnitude of the two series of measurements and is quite sensitive to
outliers. Consider, for instance, measuring the activity of two unrelated
genes that are fluctuating close to the background level. Such genes are very
similar in Euclidean distance (distance close to 0), but dissimilar in terms of
correlation (correlation close to 0). Likewise, consider the two vectors
1000000000 and 0000000001. In a sense they are similar since they are
almost always identical and equal to 0. On the other hand, their correlation
is close to 0 because of the two “outliers” in the first and last position.
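The second example can be checked directly (a small sketch; numpy assumed):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: the dot product of the two normalized vectors,
    i.e., the cosine of their angle after centering."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# The two vectors from the text: almost always identical and equal to 0,
# hence close in Euclidean distance, yet essentially uncorrelated because
# of the two "outliers" in the first and last positions.
a = np.array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
b = np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])
print(round(float(np.linalg.norm(a - b)), 3))  # prints: 1.414
print(round(pearson(a, b), 3))                 # prints: -0.111
```

The exact correlation here is −1/9, slightly negative but close to 0, while the Euclidean distance is only √2 even though the vectors never take the value 1 in the same position.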
Number of clusters
The choice of the number K of clusters is a delicate issue, which depends,
among other things, on the scale at which one looks at the data. It is safe to
say that an educated partly manual trial-and-error approach still remains
an efficient and widely used technique, and this is true for array data at the
present stage. Because in general the number of clusters is relatively small,
all possible values of K within a reasonable range can often be tried.
Intuitively, however, it is clear that one ought to be able to assess the quality
of K from the compactness of each cluster and how well each cluster is sep-
arated from the others. Indeed there have been several recent developments
aimed at the automatic determination of the number of clusters [13, 21, 22,
23] with reports of good results.
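A sketch of the trial-and-error scan over K described above, using SciPy's k-means routine and the mean distortion as the quality measure (SciPy assumed; the data are synthetic, with three well-separated clusters):

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Three well-separated synthetic clusters in two dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])

# Try a reasonable range of K and inspect the mean distortion: it drops
# sharply until K reaches the true number of clusters, then flattens
# (the familiar "elbow").
np.random.seed(0)  # scipy's kmeans draws its initial centroids via numpy
distortions = [kmeans(data, k)[1] for k in range(1, 7)]
```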
Hierarchical clustering
Clusters can result from a hierarchical branching process. Thus there exist
methods for automatically building a tree from data given in the form of
pairwise similarities. In the case of gene expression data, this is the
approach used in [8].
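A minimal sketch of this kind of tree construction, using SciPy's agglomerative (average-linkage) clustering on a toy expression matrix (SciPy assumed; this is not the specific implementation of [8]):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

# Toy expression matrix: rows are genes, columns are conditions.
# Genes 0 and 1 share one profile; genes 2 and 3 share another.
genes = np.array([[1.0, 2.0, 3.0],
                  [1.1, 2.1, 2.9],
                  [5.0, 1.0, 0.5],
                  [5.2, 0.9, 0.6]])

# Average-linkage hierarchical clustering on pairwise Euclidean distances.
tree = linkage(pdist(genes), method="average")
order = leaves_list(tree)  # leaf ordering used to display the dendrogram
# Similar genes end up adjacent in the leaf ordering.
```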
Tree visualization
In the case of gene expression data, the resulting tree organizes genes or
experiments so that underlying biological structure can often be detected
and visualized [8, 9, 18, 27]. As already pointed out, after the construction
of such a dendrogram there is still a problem of how to display the result
and which clusters to choose. Leaves are often displayed in linear order and
biological interpretations are often made in relation to this order, e.g., adja-
cent genes are assumed to be related in some fashion. Thus the order of the
leaves matters.
At each node of the tree, either of the two elements joined by the node
can be ordered to the left or the right of the other. Since there are N − 1
joining steps, the number of linear orderings consistent with the structure
of the tree is 2^(N−1). Computing the optimal linear ordering maximizing the
combined similarity of all neighboring pairs seems difficult, and therefore
heuristic approximations have been proposed [8]. These approximations
weigh genes using average expression level, chromosome position, and time
of maximal induction.
More recently, it was noticed in [28] that the optimal linear ordering can
be computed in O(N^4) steps simply by using dynamic programming, in a
form which is essentially the well-known inside portion of the
inside–outside algorithm for stochastic context-free grammars [1]. If G_1, . . .,
G_N are the leaves of the tree and φ denotes one of the 2^(N−1) possible order-
ings of the leaves, we would like to maximize

    ∑_{i=1}^{N−1} C[G_φ(i), G_φ(i+1)]        (6.4)

where G_φ(i) is the ith leaf when the tree is ordered according to φ and C is
the matrix of pairwise similarities. Let V
denote both an internal node of the tree as well as the corresponding
subtree. V has two children: Vl on the left and Vr on the right, and four
grandchildren Vll, Vlr, Vrl , and Vrr. The algorithm works bottom up, from the
leaves towards the root by recursively computing the cost of the optimal
ordering M(V,U,W ) associated with the subtree V when U is the leftmost
leaf of Vl and W is the rightmost leaf of Vr (Figure 6.3). The dynamic pro-
gramming recurrence is given by:

    M(V, U, W) = max_{R, S} [ M(Vl, U, R) + C(R, S) + M(Vr, S, W) ]

where R ranges over the leaves of Vl and S over the leaves of Vr.
The optimal cost M(V) for V is obtained by maximizing over all pairs U,
W. The global optimal cost is obtained recursively when V is the root of the
tree, and the optimal ordering can be found by standard backtracking. The algo-
rithm requires computing M(V, U, W) only once for each of the O(N^2) pairs
of leaves (U, W). Each computation of M(V, U, W) requires maximization over
all possible O(N^2) pairs of leaves (R, S). Hence the algorithm requires O(N^4)
steps with O(N^2) space complexity, since only one M(V, U, W) must be com-
puted for each pair (U, W), and this is also the size of the pairwise similarity
matrix.
Figure 6.3. Tree underlying the dynamic programming recurrence of the inside
algorithm: U is the leftmost leaf of Vl, R the rightmost leaf of Vl, S the leftmost
leaf of Vr, and W the rightmost leaf of Vr.
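For small trees, the dynamic program described above can be transcribed directly as a memoized recursion (a sketch; the nested-tuple tree representation and the function names are ours):

```python
from functools import lru_cache

def optimal_leaf_order_cost(tree, C):
    """Best total similarity of adjacent leaves over the 2^(N-1) orderings
    consistent with a binary tree. `tree` is a leaf label or a pair of
    subtrees; C[a][b] is the pairwise similarity between leaves a and b."""
    def leaves(t):
        return (t,) if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

    @lru_cache(maxsize=None)
    def M(t, u, w):
        # Best cost of ordering subtree t with leftmost leaf u, rightmost w.
        if not isinstance(t, tuple):
            return 0.0 if u == w == t else float("-inf")
        best = float("-inf")
        # Either child may be flipped to the left or to the right.
        for left, right in ((t[0], t[1]), (t[1], t[0])):
            if u in leaves(left) and w in leaves(right):
                for r in leaves(left):        # rightmost leaf of the left child
                    for s in leaves(right):   # leftmost leaf of the right child
                        best = max(best, M(left, u, r) + C[r][s] + M(right, s, w))
        return best

    lv = leaves(tree)
    return max(M(tree, u, w) for u in lv for w in lv if u != w)

# Example: the similarity rewards placing b next to c across the two subtrees.
C = {a: {b: 0.0 for b in "abcd"} for a in "abcd"}
C["a"]["b"] = C["b"]["a"] = 1.0
C["b"]["c"] = C["c"]["b"] = 5.0
C["c"]["d"] = C["d"]["c"] = 1.0
print(optimal_leaf_order_cost((("a", "b"), ("c", "d")), C))  # prints: 7.0
```

The optimum here is the ordering a, b, c, d (or its reversal), with cost 1 + 5 + 1 = 7.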
K-means algorithm
Of all clustering algorithms, k-means [29] is among the simplest and most
widely used, and has probably the cleanest probabilistic interpretation as a
form of EM on the underlying mixture model. In a typical implementation
of the k-means algorithm, the number of clusters is fixed to some value
K based, for instance, on the expected number of regulatory patterns. K
representative points, or centers, are initially chosen for each cluster, more or
less at random. These points are also called centroids or
prototypes. Then at each step:
• Each point in the data is assigned to the cluster associated with the closest
representative.
• After the assignment, new representative points are computed for instance
by averaging or taking the center of gravity of each computed cluster.
• The two procedures above are repeated until the system converges or
fluctuations remain small.
Hence, using k-means requires choosing the number of clusters, being able
to compute a distance or similarity between points, and being able to
compute a representative for each cluster given its members.
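The procedure above can be sketched compactly as follows (numpy assumed; initialization, tie-breaking, and stopping rules vary between implementations):

```python
import numpy as np

def kmeans(X, K, n_steps=100, seed=0):
    """Plain (hard-assignment) k-means: alternate between assigning each
    point to its closest centroid and recomputing each centroid as the
    center of gravity of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random start
    for _ in range(n_steps):
        # Assignment step: index of the closest centroid for every point.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign

# Two well-separated groups of "expression profiles".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
centroids, assign = kmeans(X, K=2)
```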
The general idea behind k-means can lead to different software imple-
mentations depending on how the initial centroids are chosen, how symme-
tries are broken, whether points are assigned to clusters in a hard or soft
way, and so forth. A good implementation ought to run the algorithm
multiple times with different initial conditions and possibly also try
different values of K automatically.
When the cost function corresponds to an underlying probabilistic
mixture model [30, 31], k-means is an online approximation to the classical
EM algorithm [1, 32], and as such in general is bound to converge towards a
solution that is at least a local maximum likelihood or maximum posterior
solution. A classical case is when Euclidean distances are used in conjunc-
tion with a mixture of Gaussians model. A related application to a
sequence clustering algorithm is described in [33].
Assume the data points d_i are generated independently by a mixture of K
models M_k with mixing coefficients λ_k. Introducing a Lagrange multiplier μ
for the constraint ∑_{k=1}^{K} λ_k = 1, the quantity to be maximized is the
Lagrangian

    L = ∑_{i=1}^{N} log ( ∑_{k=1}^{K} λ_k P(d_i | M_k) ) + μ ( ∑_{k=1}^{K} λ_k − 1 )        (6.7)

Setting ∂L/∂λ_k = 0 and solving for λ_k gives

    λ_k = (1/N) ∑_{i=1}^{N} P(M_k | d_i)        (6.10)
Thus the maximum likelihood estimate of the mixing coefficients for class k
is the sample mean of the conditional probabilities that di comes from
model k. Consider now that each model Mk has its own vector of parame-
ters (w_kj). Differentiating the Lagrangian with respect to w_kj gives

    ∂L/∂w_kj = ∑_{i=1}^{N} (λ_k / P(d_i)) ∂P(d_i | M_k)/∂w_kj        (6.11)
for each k and j. The resulting maximum likelihood equations

    ∑_{i=1}^{N} P(M_k | d_i) ∂ log P(d_i | M_k)/∂w_kj = 0        (6.12)

are weighted averages of the maximum likelihood equations

    ∂ log P(d_i | M_k)/∂w_kj = 0        (6.13)
arising from each point separately. As in Equation 6.10, the weights are the
probabilities of membership of the di in each class.
The maximum likelihood Equations 6.10 and 6.12 can be used iteratively
to search for maximum likelihood estimates; this is in essence the EM algo-
rithm. In the E step, the membership probabilities (hidden variables) of
each data point are estimated for each mixture component. The M step is
equivalent to K separate estimation problems with each data point contrib-
uting to the log-likelihood associated with each of the K components with a
weight given by the estimated membership probabilities. Different flavors of
the same algorithm are possible depending on whether the membership
probabilities P(M|d ) are estimated in hard or soft fashion during the E
step. The description of k-means given above corresponds to the hard
version where these membership probabilities are either 0 or 1, each point
being assigned to only one cluster. This is analogous to the use of the
Viterbi version of the EM algorithm for hidden Markov models, where
only the optimal path associated with a sequence is used, rather than the
family of all possible paths. Different variations are also possible during the
M step of the algorithms depending, for instance, on whether the parame-
ters wkj are estimated by gradient descent or by solving Equation 6.12
exactly. It is well known that the center of gravity of a set of points mini-
mizes the average quadratic distance to the points of the set. Therefore in the case
of a mixture of spherical Gaussians, the M step of the k-means algorithm
described above maximizes the corresponding quadratic log-likelihood and
provides a maximum likelihood estimate for the center of each Gaussian
component. It is also possible to introduce prior distributions on the
parameters of each cluster and/or the mixture coefficients and create more
complex hierarchical mixture models.
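A sketch of the soft (EM) version for a one-dimensional mixture of two Gaussians (numpy assumed; the quantile-based initialization and the fixed iteration count are our own choices):

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=200):
    """Soft EM for a one-dimensional mixture of Gaussians. The E step
    computes membership probabilities P(M_k | d_i); the M step re-estimates
    mixing coefficients, means, and variances as weighted averages."""
    mix = np.full(K, 1.0 / K)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))  # spread-out initial means
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E step: posterior membership probabilities (soft assignments).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = mix * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood re-estimates.
        nk = resp.sum(axis=0)
        mix = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mix, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
mix, mu, var = em_gmm_1d(x)
# The recovered means approach -3 and 3, with mixing coefficients near 0.5.
```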
PCA, hierarchical clustering, k-means, as well as other clustering and
data analysis algorithms are currently implemented in several publicly or
commercially (Table 6.1) available software packages for DNA array data
analysis. It is important to recognize that many software packages will
output some kind of answer, for instance a set of clusters, on any kind of
data set. These answers should never be trusted blindly. Rather, it is
wise practice, whenever possible, to track down the assumptions underlying
each algorithm/implementation/package and to run the same algorithms
with different parameters, as well as different algorithms, on the same data
set, as well as on other data sets.
Table 6.1. Commercially and publicly available software packages for DNA array
data analysis

Commercial sources
  GeneSpring 4.1 (Win/Mac/Unix). Silicon Genetics, www.sigenetics.com:
    t-test, hierarchical clustering, k-means, SOM, PCA, class predictor,
    experiment tree.
  Spotfire (Win/Unix). Spotfire, www.spotfire.com: t-test, PCA, k-means,
    hierarchical clustering.
  GeneSight 2.1 (Win). Biodiscovery, www.biodiscovery.com: t-test, k-means,
    hierarchical clustering, SOM, PCA, pattern similarity search, non-linear
    normalization.
  Data Mining Tool (Win). Affymetrix, www.affymetrix.com: t-test,
    Mann–Whitney test, SOM, modified Pearson's correlation coefficient.

Web-based sources
  GeneMaths (Win). Applied Maths, www.applied-maths.com: hierarchical
    clustering, k-means, SOM, PCA, discrimination analysis with or without
    variance.
  CyberT (Web-based). UCI, www.genomics.uci.edu: t-test, or t-test with a
    Bayesian framework.
  R Cluster (Web-based). UCI, www.genomics.uci.edu: hierarchical clustering,
    k-means.
  EPCLUST/Expression Profiler (Web-based). European Bioinformatics Institute,
    www.ebi.ac.uk/microarray: hierarchical clustering, k-means, and finding
    the nearest neighbors.

Freeware sources
  J-Express (Mac/Win/Unix). Molmine, www.molmine.com: hierarchical
    clustering, k-means, PCA, SOM, profile similarity search.
  Significance analysis of microarrays (SAM) (Win). Stanford University,
    www-stat-class.stanford.edu/SAM/SAMServlet.
  dChip (Win). Harvard University (Wong),
    www.biostat.harvard.edu/complab/dchip/dchip.exe: t-test, model-based
    analysis for GeneChips™.
  Cluster and TreeView (Win). Stanford University/Berkeley (Eisen),
    rana.lbl.gov/EisenSoftware.htm: hierarchical clustering, k-means, PCA,
    SOM.
  SVM 1.2 (Linux/Unix). Columbia University (Noble),
    www.cs.columbia.edu/~noble/svm/doc/: class prediction.
  Xcluster (Mac/Win/Unix). Stanford University (Sherlock),
    genome-www.stanford.edu/~sherlock/cluster.Html: SOM, k-means.
  GeneCluster (Win). Whitehead Institute/MIT,
    http://www.genome.wi.mit.edu/MPR: SOM.
[Figure: histogram of the number of occurrences of the motif GCGATGAGC as a
function of upstream position (500 to 0).]
1. Baldi, P., and Brunak, S. Bioinformatics: The Machine Learning Approach,
2nd edn. 2001. MIT Press, Cambridge, MA.
2. Alter, O., Brown, P. O., and Botstein, D. Singular value decomposition for
genome-wide expression data processing and modeling. 2000. Proceedings of
the National Academy of Sciences of the USA 97:10101–10106.
3. Baldi, P., and Hornik, K. Neural networks and principal component
analysis: learning from examples without local minima. 1988. Neural
Networks 2:53–58.
4. Roweis, S. EM algorithms for PCA and SPCA. In M. I. Jordan, M. J. Kearns,
and S. A. Solla, editors, Advances in Neural Information Processing Systems,
vol. 10, pp. 626–632. 1998. MIT Press, Cambridge, MA.
5. Scholkopf, B., Smola, A., and Mueller, K. R. Nonlinear component analysis
as a kernel eigenvalue problem. 1998. Neural Computation 10:1299–1319.
6. Bishop, C. M. Bayesian PCA. In M. S. Kearns, S. A. Solla, and D. A. Cohn,
editors, Advances in Neural Information Processing Systems, vol. 11, pp.
382–388. 1999. MIT Press, Cambridge, MA.
7. Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. 2001. MIT
Press, Cambridge, MA.
8. Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster analysis
and display of genome-wide expression patterns. 1998. Proceedings of the
National Academy of Sciences of the USA 95:14863–14868.
9. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and
Levine, A. J. Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
1999. Proceedings of the National Academy of Sciences of the USA
96:6745–6750.
10. Heyer, L. J., Kruglyak, S., and Yooseph, S. Exploring expression data:
identification and analysis of co-expressed genes. 1999. Genome Research
9:1106–1115.
11. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E.,
Lander, E. S., and Golub, T. R. Interpreting patterns of gene expression with
self-organizing maps: methods and application to hematopoietic
differentiation. 1999. Proceedings of the National Academy of Sciences of the
USA 96:2907–2912.
12. Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Walsh Sugnet, C.,
Ares, M. Jr., Furey, T. S., and Haussler, D. Knowledge-based analysis of
microarray gene expression data by using support vector machines. 2000.
Proceedings of the National Academy of Sciences of the USA 97:262–267.
13. Sharan, R., and Shamir, R. CLICK: a clustering algorithm with applications
to gene expression analysis. In Proceedings of the 2000 Conference on
Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, pp.
307–316. 2000. AAAI Press, Menlo Park, CA.
14. Cheng, Y., and Church, G. M. Biclustering of expression data. In
Proceedings of the 2000 Conference on Intelligent Systems for Molecular
Biology (ISMB00), La Jolla, CA, pp. 93–103. 2000. AAAI Press, Menlo
Park, CA.
15. Xing, E. P., Jordan, M. I., and Karp, R. M. Feature selection for high-
dimensional genomic microarray data.
16. Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M.,
and Haussler, D. Support vector machine classification and validation of
cancer tissue samples using microarray expression data. 2000. Bioinformatics
16:906–914.
17. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M.,
Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A.,
Bloomfield, C. D., and Lander, E. S. Molecular classification of cancer: class
discovery and class prediction by gene expression monitoring. 1999. Science
286:531–537.
18. Alizadeh, A., Eisen M., et al. Distinct types of diffuse large B-cell lymphoma
identified by gene expression profiling. 2000. Nature 403:503–510.
19. Poustka, A., von Heydebreck, A., Huber, W., and Vingron, M. Identifying
splits with clear separation: a new class discovery method for gene expression
data. 2001. Bioinformatics 17 (Supplement 1):S107–S114.
20. Tamayo, P., Mukherjee, S., Rifkin, R. M., Angelo, M., Reich, M., Lander, E.,
Mesirov, J., Yeang, C. H., Ramaswamy, S., and Golub, T. Molecular
classification of multiple tumor types. 2001. Bioinformatics 17 (Supplement
1):S316–S322.
21. Tishby, N., Pereira, F., and Bialek, W. The information bottleneck method.
In B. Hajek and R. S. Sreenivas, editors, Proceedings of the 37th Annual
Allerton Conference on Communication, Control, and Computing, pp.
368–377. 1999. University of Illinois.
22. Tishby, N., and Slonim, N. Data clustering by Markovian relaxation and the
information bottleneck method. In T. Leen, T. Dietterich, and V. Tresp,
editors, Neural Information Processing Systems (NIPS 2000), vol. 13. 2001.
MIT Press, Cambridge, MA.
23. Slonim, N., and Tishby, N. The power of word clustering for text
classification. In Proceedings of the European Colloquium on IR Research,
ECIR 2001. 2001.
24. Blatt, M., Wiseman, S., and Domany, E. Super-paramagnetic clustering of
data. 1996. Physical Review Letters 76:3251–3254.
25. Buhmann, J., and Kuhnel, H. Vector quantization with complexity costs.
1993. IEEE Transactions on Information Theory 39:1133–1145.
26. Eppstein, D. Fast hierarchical clustering and other applications of dynamic
closest pairs. In Proceedings of the 9th ACM-SIAM Symposium on Discrete
Algorithms, pp. 619–628. 1998.
27. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen,
M. B., Brown, P. O., Botstein, D., and Futcher, B. Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization. 1998. Molecular Biology of the Cell
9:3273–3297.
28. Bar-Joseph, Z., Gifford, D. K., and Jaakkola, T. S. Fast optimal leaf
ordering for hierarchical clustering. 2001. Bioinformatics 17 (Supplement
1):S22–S29.
29. Duda, R. O., and Hart, P. E. Pattern Classification and Scene Analysis. 1973.
Wiley, New York.
30. Everitt, B. S. An Introduction to Latent Variable Models. 1984. Chapman &
Hall, London.
31. Titterington, D. M., Smith, A. F. M., and Makov, U. E. Statistical Analysis
of Finite Mixture Distributions. 1985. Wiley, New York.
32. Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from
incomplete data via the EM algorithm. 1977. Journal of the Royal Statistical
Society B39:1–22.
33. Baldi, P. On the convergence of a clustering algorithm for protein-coding
regions in microbial genomes. 2000. Bioinformatics 16:367–371.
34. Hampson, S., Baldi, P., Kibler, D., and Sandmeyer, S. Analysis of yeast’s
ORFs upstream regions by parallel processing, microarrays, and
computational methods. In Proceedings of the 2000 Conference on Intelligent
Systems for Molecular Biology (ISMB00), La Jolla, CA, pp. 190–201. 2000.
AAAI Press, Menlo Park, CA.
35. Pevzner, P. A., and Sze, S. Combinatorial approaches to finding subtle signals
in DNA sequences. In Proceedings of the 2000 Conference on Intelligent
Systems for Molecular Biology (ISMB00), La Jolla, CA, pp. 269–278. 2000.
AAAI Press, Menlo Park, CA.
36. Pevzner, P. A. Computational Molecular Biology: An Algorithmic Approach.
2000. MIT Press, Cambridge, MA.
37. Mauri, G., Pavesi, G., and Pesole, G. An algorithm for finding signals of
unknown length in DNA. 2001. Bioinformatics 17 (Supplement
1):S207–S214.
38. van Helden, J., del Olmo, M., and Perez-Ortin, J. E. Statistical analysis of
yeast genomic downstream sequences reveals putative polyadenylation
signals. 2000. Nucleic Acids Research 28:1000–1010.
39. Brazma, A., Jonassen, I. J., Vilo, J., and Ukkonen, E. Predicting gene
regulatory elements in silico on a genomic scale. 1998. Genome Research
8:1202–1215.
40. van Helden, J., Andre, B., and Collado-Vides, J. Extracting regulatory sites
from the upstream region of yeast genes by computational analysis of
oligonucleotide frequencies. 1998. Journal of Molecular Biology 281:827–842.
41. Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull,
M., Matys, V., Michael, H., Ohnhauser, R., Pruss, M., Schacherer, F., Thiele,
S., and Urbach, S. The TRANSFAC system on gene expression regulation.
2001. Nucleic Acids Research 29:281–284.
42. Vilo, J., and Brazma, A. Mining for putative regulatory elements in the yeast
genome using gene expression data. In Proceedings of the 2000 Conference on
Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, pp.
384–394. 2000. AAAI Press, Menlo Park, CA.
43. Bussemaker, H. J., Li, H., and Siggia, E. D. Building a dictionary for
genomes: identification of presumptive regulatory sites by statistical analysis.
2000. Proceedings of the National Academy of Sciences of the USA
97:10096–10100.
44. Hughes, J. D., Estep, P. W., Tavazole, S., and Church, G. M. Computational
identification of cis-regulatory elements associated with groups of
functionally related genes in Saccharomyces cerevisiae. 2000. Journal of
Molecular Biology 296:1205–1214.
45. Hampson, S., Kibler, D., and Baldi, P. Distribution patterns of locally over-
represented k-mers in non-coding yeast DNA. 2002. Bioinformatics 18:513–528.
46. Blanchette, M., and Sinha, S. Separating real motifs from their artifacts.
2001. Bioinformatics 17 (Supplement 1):S30–S38.
47. Martinez-Pastor, M. T., Marchler, G., Schuller, C., Marchler-Bauer, A., Ruis,
H., and Estruch, F. The Saccharomyces cerevisiae zinc finger proteins
MSN2p and Msn4p are required for transcriptional induction through the
stress-response element (STRE). 1996. EMBO Journal 15:2227–2235.
48. Wu, A. L., and Moye-Rowley, W. S. GSH1 which encodes gamma-
glutamylcysteine synthetase is a target gene for YAP-1 transcriptional
regulation. 1994. Molecular and Cellular Biology 14:5832–5839.
49. Fernandes, L., Rodrigues-Pousada, C., and Struhl, K. Yap, a novel family of
eight bzip proteins in Saccharomyces cerevisiae with distinct biological
functions. 1997. Molecular and Cellular Biology 17:6982–6993.
50. Coleman, S. T., Epping, E. A., Steggerda, S. M., and Moye-Rowley, W. S.
Yap1p activates gene transcription in an oxidant-specific fashion. 1999.
Molecular and Cellular Biology 19:8302–8313.
51. Baldi, P., and Baisnée, P.-F. Sequence analysis by additive scales: DNA
structure for sequences and repeats of all lengths. 2000. Bioinformatics
16:865–889.
52. Gorm Pedersen, A., Jensen, L. J., Brunak, S., Staerfeldt, H. H., and Ussery,
D. W. A DNA structural atlas for Escherichia coli. 2000. Journal of
Molecular Biology 299:907–930.
53. Steffen, N. R., Murphy, S. D., Tolleri, L., Wesley Hatfield, G., and Lathrop,
R. H. DNA sequence and structure: Direct and indirect recognition in
protein–DNA binding. In Proceedings of the 2002 Conference on Intelligent
Systems for Molecular Biology (ISMB02). (2002). (in press)
54. Brown, P. O., Chiang, D. Y., and Eisen, M. B. Visualizing associations
between genome sequences and gene expression data using genome-mean
expression profiles. 2001. Bioinformatics 17 (Supplement 1):S49–S55.
55. van Someren, E. P., Wessels, L. F. A., and Reinders, M. J. T. Linear modeling
of genetic networks from experimental data. In Proceedings of the 2000
Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla,
CA, pp. 355–366. 2000. AAAI Press, Menlo Park, CA.
56. Friedman, N., Linial, M., Nachman, I., and Pe’er, D. Using Bayesian
networks to analyze expression data. 2000. Journal of Computational Biology
7:601–620.
57. Zien, A., Kuffner, R., Zimmer, R., and Lengauer, T. Analysis of gene
expression data with pathway scores. In Proceedings of the 2000 Conference
on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, pp.
407–417. 2000. AAAI Press, Menlo Park, CA.
7
The design, analysis, and interpretation of gene expression profiling experiments
be made from data obtained from different DNA microarray formats such
as pre-synthesized nylon filter arrays and in situ synthesized Affymetrix
GeneChips™. In sum, we employ a model system to show that appropriate
applications of statistical methods allow the discovery of genes of complex
regulatory circuits with a high level of confidence.
We have chosen to use experiments performed with E. coli for the discus-
sions of this chapter both because of our own scientific interests and
because this model organism offers several advantages for the evaluation of
DNA array technologies and data analysis methods that are applicable to
gene expression profiling experiments performed with all organisms.
Foremost among these advantages is the fact that 50 years of work with E.
coli have produced a wealth of information about its operon-specific and
global gene regulation patterns. This information provides us with a “gold
standard” which makes it possible to evaluate the accuracy of data
obtained from DNA array experiments, and to identify data analysis
methods that best discriminate genes differentially expressed for
biological reasons from false positives (genes that appear to be
differentially expressed due to chance occurrences exacerbated by experi-
mental error and biological variance). The knowledge we have gained
through these analyses with this well-defined model organism gives us con-
fidence that these methods are equally applicable to other less well-defined
systems. Indeed, Jacques Monod is credited with saying that “What is true
for E. coli is true for elephants, only more so.”
In the second half of this chapter, we turn our attention to the identifica-
tion of more complex gene expression patterns involving groups of genes
that behave similarly in time and/or across different treatment conditions.
In these cases, we emphasize that robustness can be achieved by averaging
out noise either by experiment replication with a limited number of treat-
ment conditions or by measuring many samples across time, genotypes, etc.
To illustrate this point we describe two different types of experiments. In
the first example, we describe experiments that measure the gene expression
patterns of E. coli cells grown under three different treatment conditions, in
the presence and absence of oxygen, and in the presence and absence of a
global regulatory protein for genes of anaerobic metabolism in the absence
of oxygen. With these experiments we show that, when only three samples
(two treatment conditions) are measured, individual experiment replication
is required for accurate clustering of genes with similar regulatory patterns.
In the second example, we describe experiments that measure the effects of
188 drugs on the expression patterns of 1376 genes in 60 cancer cell line
samples. In this case, where 60 samples are measured under 188 conditions,
we demonstrate that direct applications of clustering methods to entire
data sets reveal robust gene regulatory patterns in the absence of individual
experimental replications.
Experimental design
Many experimental designs and applications of gene expression profiling
experiments are possible. However, no matter what the purpose of the
experiment, a sufficient number of measurements must be obtained for sta-
tistical analysis of the data, either through multiple measurements of
homogeneous samples (replication) or multiple sample measurements (e.g.,
across time or subjects). This is basically because each gene expression pro-
filing experiment results in the measurement of the expression levels of
thousands of genes. In such a high-dimensional experiment, many genes
will show large changes in expression levels between two experimental
conditions that are not biologically significant. These false positives arise from chance
occurrences caused by uncontrolled biological variance as well as experi-
mental and measurement errors.
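The scale of this false-positive problem is easy to see in a simulation. The following sketch is illustrative and not from the book; the gene count echoes the 2547 genes discussed later in the chapter, and the four replicates per condition are an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n_reps = 2547, 4

# Two "conditions" drawn from the SAME distribution: no gene truly changes.
control = rng.normal(loc=10.0, scale=1.0, size=(n_genes, n_reps))
treatment = rng.normal(loc=10.0, scale=1.0, size=(n_genes, n_reps))

# Per-gene two-sample t-tests; every "significant" gene is a false positive.
_, p = stats.ttest_ind(control, treatment, axis=1)
false_positives = int(np.sum(p < 0.005))
# Expected by chance alone: 0.005 * 2547, i.e., roughly a dozen genes.
```

With thousands of genes, even a stringent per-gene threshold guarantees a handful of apparent positives, which is why replication and an explicit false-positive estimate are emphasized throughout this chapter.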
Experimental errors include variations in procedures involved in growing
and harvesting cultures or obtaining biological samples discussed in
Chapter 4. These errors can be minimized by appropriate experimental
design and technical procedures. On the other hand, biological variance is
more difficult to control. Even when two cultures of organisms with identical
genotypes are grown under the same conditions, differences in gene
expression profiles are detected. In addition to these false positives,
differentially expressed genes that may not be related to the experimental
conditions will be detected when targets prepared from cells of different
genotypes, or even different tissues of the same genotype, are queried. On
top of these sources of errors, the quality of DNA microarray data can vary
even further depending upon the type of the array and the array manufac-
turing methods and quality. With these considerations in mind, we
employed the general design described below for the E. coli experiments of
this chapter.
The purpose of the first experiment is to identify the network of genes
that are regulated by the global E. coli regulatory protein, leucine-responsive
regulatory protein (Lrp). Lrp is a global regulatory protein that affects the
expression of multiple genes and operons. In most cases, Lrp activates
operons that encode genes for biosynthetic enzymes and represses operons
that encode genes for catabolic enzymes. Interestingly, the intermediary
metabolite, L-leucine, is required for the binding of Lrp at some of its DNA
target sites; however, at other sites L-leucine inhibits DNA binding, and at
yet other sites it exerts no effect at all. While the expression levels of about 75
genes have been reported to be affected by Lrp under different environmental
and nutritional growth conditions, its specific role in the regulation of
cellular metabolism remains unclear. It has been suggested that it might
function to coordinate cellular metabolism with the nutritional state of the
environment by monitoring the levels of free L-leucine in the cell.
In spite of the fact that much remains to be learned about the Lrp regula-
tory network, its many characterized target genes make it an ideal system
for the development and assessment of statistical methods for the identifi-
cation of differentially expressed genes. As these methods are developed
and as a better understanding of the gene network regulated by this impor-
tant protein emerges, a clearer view of its physiological purpose will surely
follow.
The experimental design for our Lrp experiment consists of four inde-
pendent experiments, each performed in duplicate, diagrammed in Figure
4.1. In Experiment 1, Filters 1 and 2 were hybridized with 33P-labeled,
random-hexamer-generated, cDNA fragments complementary to each of
three RNA preparations (Lrp+ RNA1–3) obtained from the cells of three
individual cultures of strain IH-G2490 (Lrp+). These three 33P-labeled,
cDNA preparations were pooled prior to hybridizations. Following phos-
phorimager analysis, these filters were stripped and hybridized with pooled,
33P-labeled cDNA fragments complementary to each of three RNA preparations
(Lrp− RNA1–3) obtained from strain IH-G2491 (Lrp−). In
Experiment 2, these same filters were again stripped and this protocol was
repeated with 33P-labeled cDNA fragments complementary to another set
of three pooled RNA preparations obtained from strains IH-G2490 (Lrp+
RNA 4–6) and IH-G2491 (Lrp− RNA 4–6) as described above. Another set
of filters (Filter 3 and Filter 4) was used for Experiments 3 and 4 as
described for Experiments 1 and 2. This protocol results in duplicate filter
data for four experiments performed with four independently prepared
cDNA target sets. Thus, since each filter contains duplicate spots for each
ORF and duplicate filters are hybridized for each experiment, four meas-
urements for each ORF are obtained from each of four experiments [1].
Table 7.1. Number of genes identified as differentially expressed (control vs. experimental) at various p-value thresholds

p-value    Number of genes
0.0001      12
0.0005      30
0.001       44
0.005      134
0.01       208
[Table: Gene name (a), Control (mean), Experimental (mean), Control (SD), Experimental (SD), p-value, PPDE, Fold; data rows not recovered.
Note: (a) Known Lrp-regulated genes are identified by an asterisk.]
Global false positive level 103
To estimate the reproducibility between different filters, we employed a
statistical t-test to compare data from each pair of filters hybridized with
33P-labeled cDNA targets prepared from the same pooled RNA samples.
For example, each normalized and background subtracted ORF measure-
ment on Filter 1 (IH-G2490 (Lrp+)) of Experiment 1 was compared to the
data on Filter 2 (IH-G2490 (Lrp+)) of Experiment 1 (Figure 4.1). Similar
comparisons were made between filters hybridized with the same RNA
preparations for Experiments 2, 3, and 4. This procedure was repeated for
filter pairs hybridized with RNA preparations from the IH-G2491 (Lrp−)
strain. The results of these comparisons for the control (Lrp+) strain are
shown in Table 7.3. Based on the t-test distribution, and in the absence of
experimental error, 12 of the 2547 genes expressed at a level above back-
ground in all experiments should exhibit a p-value less than 0.005 based on
chance alone (0.005 × 2547 ≈ 12). That is, at this level of measurement accu-
racy 12 false positives (genes that appear to be differentially expressed) out
of the 2547 genes measured are expected, and experimental errors will
increase this number of false positives to even higher levels. The data in
Table 7.3 show that, in fact, an average of 43 genes with a p-value less than
0.005 is observed. This demonstrates that experimental errors introduced
by differences between filters increase the global false positive level of this
experiment about threefold beyond that expected by chance.
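The chance-alone arithmetic in this paragraph can be written out explicitly; this small sketch simply restates the numbers from the text.

```python
# Expected false positives by chance alone at p-value threshold t is t * N.
n_genes, threshold = 2547, 0.005
expected_fp = threshold * n_genes    # 12.735, quoted as 12 in the text
observed_fp = 43                     # average observed across filter pairs
elevation = observed_fp / expected_fp
print(round(elevation, 1))           # about threefold, as stated in the text
```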
The data in Table 7.3 demonstrate that even greater errors are introduced
by differences among RNA preparations. In this case, the global false posi-
tive level expected from chance is elevated nearly tenfold. Thus, while some
error is introduced by differences among filters, the major source of error is
derived from differences among RNA preparations.
These results illustrate the need to reduce experimental and biological
differences among RNA preparations as much as possible. They also illus-
trate the advantage of pooling RNA preparations from independent
samples for each experiment. For example, the RNA preparations used
here were pooled from three independent cultures. However, when single
RNA preparations were used, the false positive levels were more than twice
the levels reported in Table 7.3.
Estimation of the global false positive level for a DNA array experiment
The above results demonstrate the necessity of determining the global false
positive level of a gene expression profiling experiment. The global false
positive level reflects all sources of experimental and biological variation
inherent in a DNA array experiment. With this information, a global level
Table 7.3. False positive level elevations contributed by differences among filters and target preparations [table body not recovered]

Average number of false positives observed in control vs. control comparisons at various p-value thresholds:

p-value    Average number of false positives
0.0001      0.0
0.0005      0.25
0.001       1.00
0.005       3.75
0.01        7.25
other differentially expressed genes we must lower the stringency of our sta-
tistical criterion. The data in Table 7.1 show that as the p-value is raised to
0.005 we observe an additional 122 genes that are differentially expressed at
this threshold level. At the same time, raising the statistical threshold to
0.005 reveals an average of 3.75 genes that appear differentially expressed
with a p-value equal to or less than 0.005 when the control data sets are
compared to one another (Table 7.4). This means that, given this complete
data set from four replicate experiments, we expect at least 3.75 false posi-
tives among the 134 genes differentially expressed with a p-value equal to or
less than 0.005. Therefore, our global confidence in the identification of any
one of these 134 genes as differentially expressed is 97%.
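The 97% global confidence figure is simply the complement of the expected false-positive fraction among the selected genes; this sketch restates the text's numbers.

```python
# Global confidence = 1 - (expected false positives / genes selected).
expected_fp = 3.75   # average control-vs-control genes at p <= 0.005
n_selected = 134     # control-vs-experimental genes at p <= 0.005
confidence = 1.0 - expected_fp / n_selected
print(round(100 * confidence, 1))    # about 97, as stated in the text
```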
It should be emphasized that relaxing the p-value threshold rapidly
increases the average number of false positives in the control (Lrp+ vs.
Lrp+) data sets relative to the number of genes differentially expressed at
the same p-value in the experimental (Lrp+ vs. Lrp−) data set and, there-
fore, decreases the confidence with which differentially expressed genes can
be identified.
A computational method
David Allison and colleagues have described a computational version of
this ad hoc method for estimating the false positive level [2]. The basic idea
is to consider the p-values as a new data set and to build a probabilistic
model for this new data. When there is no change (i.e., no differential gene
expression) it is easy to see that the p-values ought to have a uniform distri-
bution between 0 and 1. In contrast, when there is change, the distribution
Global false positive level 107
Figure 7.1. Distribution of the p-values from the Lrp+ vs. Lrp− data. The fitted
model (dashed curve) is a mixture of one beta and a uniform distribution (dotted
line).
of p-values will tend to cluster more closely to 0 than 1, i.e., there will be a
subset of differentially expressed genes with “significant” p-values (Figure
7.1). One can use a mixture of beta distributions (Chapter 6 and Appendix
B) to model the distribution of p-values in the form
P(p) = Σ_{i=0}^{K} λ_i B(p; r_i, s_i)    (7.1)

where the λ_i are the mixture coefficients and the component with i = 0 (r_0 = s_0 = 1) is the uniform distribution associated with unchanged genes.
Figure 7.2. PPDE values plotted against the p-value (log scale, 10^−11 to 1).
PPDE = P(change | p) = [Σ_{i=1}^{K} λ_i B(p; r_i, s_i)] / [Σ_{i=0}^{K} λ_i B(p; r_i, s_i)]
     = [Σ_{i=1}^{K} λ_i B(p; r_i, s_i)] / [λ_0 + Σ_{i=1}^{K} λ_i B(p; r_i, s_i)]    (7.3)
The distribution of p-values from our Lrp+ vs. Lrp− data is shown in Figure
7.1, and a plot of PPDE values vs. p-values is shown in Figure 7.2.
A comparison of the ad hoc method for determining the global signifi-
cance for the differential expression of a given gene and the computational
method is presented in Table 7.5. It is satisfying to see that these data
compare quite well.
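A minimal sketch of this mixture approach, with one beta component plus the uniform (K = 1 in Equation 7.1), fitted by direct likelihood maximization on synthetic p-values. This is an illustration, not the CyberT implementation; the sample sizes, mixing proportion, and beta parameters are all assumptions.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
# Synthetic p-values: 80% unchanged genes (uniform on [0, 1]) and 20%
# changed genes (a beta distribution clustered near 0).
p = np.concatenate([rng.uniform(size=800), rng.beta(0.2, 4.0, size=200)])
p = np.clip(p, 1e-12, 1 - 1e-12)

def neg_log_lik(theta):
    lam, r, s = theta                 # lam = weight of the uniform component
    mix = lam + (1.0 - lam) * stats.beta.pdf(p, r, s)
    return -np.sum(np.log(mix))

# Constrain the beta component to be decreasing (r <= 1 <= s), so it
# models the excess of small p-values produced by changed genes.
res = optimize.minimize(neg_log_lik, x0=[0.5, 0.5, 2.0],
                        bounds=[(1e-3, 1 - 1e-3), (1e-2, 1.0), (1.0, 50.0)])
lam, r, s = res.x

# PPDE as in Equation 7.3: posterior probability of differential expression.
beta_part = (1.0 - lam) * stats.beta.pdf(p, r, s)
ppde = beta_part / (lam + beta_part)
```

Genes can then be ranked by PPDE, exactly as in the plot of PPDE against p-value in Figure 7.2.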
Improved inference using a Bayesian statistical framework 109
[Table 7.5: p-value; Number of genes (control vs. control); Number of genes (control vs. experimental); % Confidence (ad hoc); Posterior probability of differential expression; data rows not recovered.]
Figure 7.3. Scatter plots showing the mean of the fractional mRNA levels obtained
from eight filters hybridized with 33P-labeled cDNA targets prepared from three
pooled RNA preparations extracted from E. coli K12 strains IH-G2490 (Lrp+) and
IH-G2491 (Lrp−). The larger black dots identify 100 genes differentially expressed
between strains IH-G2490 and IH-G2491 with (A) p-values less than 0.004 and (B)
p-values less than 0.002 based on a simple t-test distribution. The circled black dots
identify genes known to be regulated by Lrp. The gray spots represent the relative
expression levels of each of the 2885 genes expressed at a level above background in
all experiments. The dashed lines demarcate the limits of twofold differences in
expression levels.
expressed between Lrp+ and Lrp− strains with a p-value less than 0.0001
identified by the Bayesian approach, 17 are known to be Lrp-regulated
(Table 7.7). Why does the regularized t-test identify more Lrp-regulated
genes? The answer lies in the fact that all the genes identified to be differen-
tially expressed with a p-value less than 0.005 with the regularized t-test
exhibit fold changes greater than 1.7-fold (Figure 7.3B). However, many
genes identified to be differentially expressed with a p-value less than 0.005
with the simple t-test exhibit fold changes as small as 1.2-fold (Figure
7.3A). Furthermore, the 100 genes with the lowest p-values identified as
differentially expressed by both methods contain only 43 genes in common.
Thus, many of the genes identified by the simple t-test that are excluded by
the Bayesian approach are genes that show small fold changes. In general,
these genes with small fold changes identified by the simple t-test are asso-
ciated with “too good to be true” within-treatment variance estimates,
reflecting underestimates of the within-treatment variance when the
number of observations is small. The elimination of this source of false
positives by the Bayesian approach improves the identification of true posi-
tives. However, although this is desired, genes that are truly differentially
expressed with small fold changes in the range of 1.2- to 1.7-fold will also be
eliminated by the Bayesian approach. For example, of the 16 genes of the
top 100 with the lowest p-values identified by the simple t-test that are
known to be regulated by Lrp, one was not identified by the Bayesian
method. This Lrp-regulated gene that did not pass the regularized t-test
was the sdaC gene, previously reported to be regulated by Lrp and meas-
ured to be regulated 1.9-fold in the experiment performed with the DNA
arrays [5]. Nevertheless, although this gene is lost, the overall performance
of the regularized t-test surpasses that of the simple t-test, and most
researchers are interested in discovering genes that are differentially
expressed with large fold changes.
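The variance-shrinkage idea behind the regularized t-test can be sketched as follows. This is a simplified reimplementation, not the CyberT code: the sliding-window background estimate, the prior weight v0, and the degrees-of-freedom adjustment are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def regularized_ttest(x, y, v0=10, window=101):
    """Two-sample t-test with each gene's variance shrunk toward a background
    variance estimated from genes of similar mean expression level.
    x, y: arrays of shape (n_genes, n_replicates)."""
    nx, ny = x.shape[1], y.shape[1]
    mx, my = x.mean(axis=1), y.mean(axis=1)
    vx, vy = x.var(axis=1, ddof=1), y.var(axis=1, ddof=1)

    def background(var, mean):
        # Running mean of variances over genes sorted by expression level.
        order = np.argsort(mean)
        ones = np.ones(window)
        num = np.convolve(var[order], ones, mode="same")
        den = np.convolve(np.ones(var.size), ones, mode="same")
        bg = np.empty_like(var)
        bg[order] = num / den
        return bg

    # Weighted combination of background and observed variance; the prior
    # acts like v0 extra pseudo-observations per gene.
    vx_r = (v0 * background(vx, mx) + (nx - 1) * vx) / (v0 + nx - 2)
    vy_r = (v0 * background(vy, my) + (ny - 1) * vy) / (v0 + ny - 2)

    t = (mx - my) / np.sqrt(vx_r / nx + vy_r / ny)
    df = nx + ny - 2 + 2 * v0
    return t, 2.0 * stats.t.sf(np.abs(t), df)

# Null data: no true differences, so every low p-value is a false positive.
rng = np.random.default_rng(2)
x = rng.normal(10.0, 1.0, size=(2000, 4))
y = rng.normal(10.0, 1.0, size=(2000, 4))
t_reg, p_reg = regularized_ttest(x, y)
```

Shrinking each gene's variance toward a local background suppresses the "too good to be true" variance estimates that, with few replicates, generate false positives under a simple t-test.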
At first glance it might appear that the Bayesian approach validates the
often-used 2-fold rule for the identification of differentially expressed
genes. That is, the identification of genes differentially expressed between
two experimental treatments with a fold change greater than 2 in, for
example, three out of four experiments. This type of reasoning is based on
the intuition that larger observed fold changes can be more confidently
interpreted as a stronger response to the experimental treatment than
smaller observed fold changes, which of course is not necessarily the case.
An implicit assumption of this reasoning is that the variance among repli-
cates within treatments is the same for every gene. In reality, the variance
varies among genes (for example, see Figure D.3, Appendix D) and it is
Table 7.7. Genes differentially expressed between Lrp+ and Lrp− (control vs. experimental) E. coli strains with a p-value less than 0.0001 identified with a regularized t-test

[Columns: Gene name (a), Control (mean), Experimental (mean), Control (SD), Experimental (SD), p-value, Fold, Posterior probability of differential expression; data rows not recovered.]

Notes:
(a) Known Lrp-regulated genes are identified with an asterisk.
(b) lac genes under the control of the Lrp-regulated ilvPG promoter–regulatory region.
critical to incorporate this information into a statistical test. This is made
clear by simply examining the scatter plots in Figure 7.3. Here many genes
that appear differentially expressed greater than 2-fold do not exhibit p-
values less than 0.005 and a global confidence level of at least 95%. This
does not mean that these might not be regulated genes; it simply means that
they are false negatives that cannot be identified at this level of confidence.
The power and usefulness of the statistical methods described here is that
all of the genes examined in a DNA array experiment can be sorted by their
p-values and global confidence (PPDE) levels based on the accuracy and
reproducibility of each gene measurement. At this point, the researcher can
set his/her own threshold level for genes worthy of further experimentation.
After all, it is best to know the odds when placing a bet.
The Bayesian approach allows the identification of more true positives with
fewer replicates.
As additional replications of DNA array experiments are performed, a
simple t-test analysis results in the identification of a more consistent set of
up- or down-regulated genes. This is of course because better estimates of
the standard deviation for each gene are obtained as the number of experi-
mental replications becomes larger. We have shown above that the use of a
Bayesian prior estimate of the standard deviation for each gene used in the
t-test further improves our ability to identify differentially expressed genes
with a higher level of confidence than the simple t-test. This suggests that a
more consistent set of differentially expressed genes identified with a higher
level of confidence might be identified with fewer replications when the reg-
ularized t-test is employed. Long et al. [3] have demonstrated that this is
indeed the case. In this work, they used CyberT to compare and analyze the
gene expression profiles obtained from a wild-type strain of E. coli and an
otherwise isogenic strain lacking the gene for the global regulatory protein,
integration host factor (IHF), reported by Arfin et al. [6]. IHF is a DNA
architectural protein that is important for the compaction of the bacterial
nucleoid. However, unlike other architectural DNA proteins that bind to
DNA with low specificity, IHF also binds to high-affinity sites to modulate
expression levels of many genes during transitions from one growth condi-
tion to another by its effects on local DNA topology and structure.
Long et al. [3] defined genes whose differential expression is likely to be
due to a direct effect of IHF, and therefore true positives, as those genes that
possess a documented or predicted high-affinity IHF binding site within
500 base pairs upstream of the ORF for each gene, or operon containing
the gene. Of the 120 genes differentially expressed between IHF+ and IHF−
Figure 7.4. Analysis of IHF data with and without Bayesian “treatment”. The
numbers in parentheses represent the number of genes of the 120 genes with the
lowest p-values that contain a documented or predicted IHF binding site less than
500 base pairs upstream of each ORF. In each case the raw or log-transformed data
from two (22) or four (44) independent experiments were analyzed either with a
simple t-test (t-test) or regularized t-test (Bayes).
strains with the lowest p-values identified by a simple t-test based on using
the data from two independent experiments (2 × 2 t-test), 51 genes containing
an upstream IHF site were observed, whereas a regularized t-test (2 × 2
Bayesian) using these same two data sets identified 59 genes, or 15% more
genes with upstream IHF sites (Figure 7.4). Furthermore, comparison of
the differentially expressed genes identified by the simple or regularized
t-test to those identified by a simple t-test performed on four experimental
data sets (4 × 4 t-test) showed that the regularized t-test again identifies
15% more genes (38 vs. 33) in common with the genes identified by the
4 × 4 simple t-test. These data demonstrate that replicating an experiment
twice and performing a Bayesian analysis is comparable in inference to
replicating an experiment four times and
using a traditional t-test. As data from more replicate experiments are
included, the differentially expressed genes identified by the regularized and
simple t-test analyses converge on a common set of differentially expressed
genes.
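The regularized statistic at the heart of this approach can be sketched in a few lines. The form of the variance estimate follows the Baldi–Long regularized t-test [4]; the background variance sigma0_sq and the pseudo-count nu0 used below are purely illustrative values, not fitted ones:

```python
import math

def regularized_tstat(x, y, sigma0_sq, nu0=10):
    """Two-sample t statistic with a Bayesian-regularized variance.

    The per-gene empirical variance is shrunk toward a background variance
    sigma0_sq (estimated, e.g., from genes of similar expression level),
    weighted by the pseudo-count nu0, as in the Baldi-Long estimate:
    sigma^2 = (nu0 * sigma0^2 + (n - 1) * s^2) / (nu0 + n - 2).
    """
    def mean(v):
        return sum(v) / len(v)

    def reg_var(v):
        n = len(v)
        m = mean(v)
        ss = sum((u - m) ** 2 for u in v)  # equals (n - 1) * empirical variance
        return (nu0 * sigma0_sq + ss) / (nu0 + n - 2)

    se = math.sqrt(reg_var(x) / len(x) + reg_var(y) / len(y))
    return (mean(x) - mean(y)) / se

# Two replicates per condition: too few for a stable plain t-test,
# but usable once the variance is regularized (values are illustrative).
control = [8.1, 8.3]
treated = [9.6, 9.9]
print(round(regularized_tstat(control, treated, sigma0_sq=0.05), 2))
```

With only two replicates, the pseudo-count makes the background variance dominate the estimate, which is exactly what protects against the spuriously small variances that plague the simple t-test at low replication.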
Figure 7.5. Distribution of functions for genes differentially expressed
between Lrp⁺ and Lrp⁻ E. coli strains.
As more experiments of this type are performed, as more functions
are assigned to the gene products of these hypothetical ORFs, and as
bioinformatics methods are developed to identify the degenerate protein
binding sites typical of proteins that bind to many DNA sites, a clearer
picture of genetic regulatory networks and their interactions in E. coli will
emerge. These will be the first steps towards the systems biology goals of
Chapter 8.
Figure 7.6. Design of the Affymetrix GeneChip™ experiments. See text for details.
hybridized with biotin-labeled targets obtained from the same total RNA
preparation.
For the GeneChip™ experiments, the exact same four control and
experimental pairs of pooled RNA preparations used in the Lrp⁺ vs. Lrp⁻
filter experiments described above were used for hybridization to four pairs
of E. coli Affymetrix GeneChips™. However, in this case each experiment
was not performed in duplicate, and only one measurement for each gene
was obtained on each chip. Thus, instead of having four measurements for
each gene expression level for each experiment (Figure 4.1), only one meas-
urement was obtained from each GeneChip™ (Figure 7.6). On the other
hand, this single measurement is the average of the difference between
hybridization signals from 15 perfect match (PM) and mismatch (MM)
probe pairs.¹ While these are not equivalent to duplicate measurements
because different probes are used, these data do increase the reliability of
each gene expression level measurement.
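Schematically, this per-gene value is an average of PM − MM differences over the gene's probe pairs; the probe intensities below are invented for illustration:

```python
def average_difference(pm, mm):
    """Average of PM - MM over a gene's probe pairs (the AD value)."""
    assert len(pm) == len(mm)
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Invented intensities for one gene's probe pairs
# (for most E. coli ORFs there are 15 pairs; 5 are shown here).
pm = [520.0, 610.0, 480.0, 700.0, 555.0]
mm = [120.0, 180.0, 150.0, 260.0, 140.0]
print(average_difference(pm, mm))  # 403.0
```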
Because each filter experiment was performed in duplicate and each filter
contained duplicate probes for each target, it was possible to assess the
¹ While the number of perfect match and mismatch probe pairs for the vast
majority of E. coli ORFs is 15, this number can range from 2 to 20 depending
on the length of the ORF.
Notes:
(a) Calculated with 2128 control and experimental gene expression measurements
(AD values from *.CEL file with negative values converted to 0) containing
four non-zero values for four experiments.
(b) Calculated by averaging the control or experimental measurements and
comparing experiments 1 and 3 vs. 2 and 4, or 1 and 4 vs. 2 and 3, averaging
data across chips and RNA preparations.
Table 7.9. Differential gene expression data for nylon filter experiments
using CyberT with a regularized t-test.
Notes:
(a) Calculated with 2775 control and experimental gene measurements containing
four non-zero values for four experiments.
(b) Calculated by averaging the control measurements of experiments 1 and 3
vs. 2 and 4, or 1 and 4 vs. 2 and 3, averaging data across filters and RNA
preparations.
Number of replicates    Number of differentially expressed genes
1                       416–682
2                       118–184
3                       68–95
4                       55
and the mismatch values of the control and experimental chip divided by
the perfect match and the mismatch values of the control chip must be
greater than the percent change threshold divided by 100, where the percent
change threshold is defined by the user (default 80). Based on these criteria
the software identifies (calls) differentially expressed genes as marginally
increased or decreased, increased or decreased, or not changed.
Since the Affymetrix software allows the comparison of only one
GeneChip™ pair at a time, it was run on each of the four independent
experiments comparing Lrp⁺ and Lrp⁻ genotypes. Each comparison identified
between 500 and 700 genes as marginally increased or decreased, or increased
or decreased (Table 7.10). However, filtering identified only 55 genes that
were called differentially expressed in all four experiments. Comparison of
these 55 genes to the 55 genes exhibiting the lowest p-values identified by
the CyberT software employing a Bayesian statistical framework revealed
35 genes in common with both lists. Among these were 17 known Lrp-
regulated genes.
These results illustrate several important points. First, they stress the
importance of replication when only two conditions are compared. Little
can be learned about those genes regulated by Lrp from the analysis of only
one experiment with one GeneChip™ pair since an average of 600 genes
were identified as differentially expressed, only 55 of which can be repro-
duced in four independent experiments. Furthermore, in the absence of sta-
tistical analysis it is not possible to determine the confidence level and rank
the reliability of any differentially expressed gene measurement identified
with the Affymetrix software. This is, of course, important for prioritizing
genes to be examined by additional experimental approaches. In spite of
these limitations, the comparison of the 55 genes identified by examining
the results of four GeneChip™ experiments with the results of four nylon
filter experiments reveals that differentially expressed genes can be identified with the
Affymetrix Microarray Suite software when multiple replicate experiments
are analyzed.
Figure 7.7. Gene expression regulatory patterns expected from the comparison of
DNA microarray experiments with one control and two treatment conditions.
Control condition (Experiment 1), gene expression levels during growth under
aerobic conditions in a Fnr⁺ E. coli strain; Treatment condition 1 (Experiment 2),
gene expression levels during growth under anaerobic conditions in a Fnr⁺ E. coli
strain; Treatment condition 2 (Experiment 3), gene expression levels during growth
under anaerobic conditions in a Fnr⁻ E. coli strain. Each regulatory pattern is
designated by a number, 1–8.
lower confidence levels that can be included at the discretion of the investi-
gator.
To identify genes differentially expressed at a high confidence level that
correspond to each of the patterns diagrammed in Figure 7.7, the genes
differentially expressed due to the treatment condition of Experiments 1
and 2 were sorted in ascending order according to their p-values based on
the regularized t-test as described for the Lrp experiments. Next, the genes
differentially expressed due to the treatment condition of Experiments 2
and 3 were sorted in ascending order according to their p-values. The 100
genes with the lowest p-values present in both lists were selected. These
genes exhibited either an increased or decreased expression level between
both treatment conditions (i.e., between Experiment 1 and 2, and
Experiment 2 and 3). All of these genes possessed p-values less than 0.001
representing a 97% global confidence level (PPDE ≥ 0.97) for these gene
measurements.
To identify those genes differentially expressed with a high level of confi-
dence under the treatment conditions of Experiments 1 and 2 but expressed
at the same or similar levels under the treatment conditions of Experiments
2 and 3, the 500 genes of Experiments 1 and 2 with the lowest p-values were
compared to the 500 genes with the highest (closest to 1) p-values from
Experiments 2 and 3. This comparison identified 57 genes that were present
in both lists, that is, genes whose regulatory pattern fulfill this criterion.
Likewise, to identify those genes differentially expressed under the treat-
ment conditions of Experiments 2 and 3 but expressed at the same or
similar levels under the treatment conditions of Experiments 1 and 2, the
500 genes of Experiments 2 and 3 with the lowest p-values were compared
to 500 genes with the highest p-values from Experiments 1 and 2. This com-
parison identified 48 genes that were present in both lists. These gene lists
were combined into a single list of 205 genes differentially expressed under
at least one treatment condition.
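The selections described above reduce to simple operations on ranked p-value lists: intersect the low-p tails of two comparisons to find genes changed in both, or intersect a low-p tail with a high-p tail to find genes changed in one comparison but flat in the other. A sketch with invented p-values and tiny list sizes:

```python
def top_k_by_pvalue(pvals, k, lowest=True):
    """Return the k gene names with the lowest (or highest) p-values."""
    ranked = sorted(pvals, key=pvals.get, reverse=not lowest)
    return set(ranked[:k])

# Invented p-values for a handful of genes in two comparisons.
exp12 = {"geneA": 0.0002, "geneB": 0.0008, "geneC": 0.4, "geneD": 0.9}
exp23 = {"geneA": 0.0005, "geneB": 0.7, "geneC": 0.3, "geneD": 0.95}

# Changed in both comparisons: low p-values in both lists.
both = top_k_by_pvalue(exp12, 2) & top_k_by_pvalue(exp23, 2)
# Changed in 1 vs. 2 but flat in 2 vs. 3: low p in one, high p in the other.
first_only = top_k_by_pvalue(exp12, 2) & top_k_by_pvalue(exp23, 2, lowest=False)
print(sorted(both))        # ['geneA']
print(sorted(first_only))  # ['geneB']
```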
Figure 7.10. Cluster image map (CIM) relating activity patterns of 118 drug
compounds to the expression patterns of 1376 genes in 60 cell lines. See text for
explanation. (Reproduced with permission from Scherf et al., 2000 [10].)
These results of Scherf and his colleagues [10] clearly demonstrate the
usefulness of unsupervised clustering algorithms for the analysis of large
data sets. Nevertheless, it should be kept in mind that in most cases where
small sets of data are examined unsupervised clustering methods do not
perform as well. In these cases, attention must be paid to the experimental
error and biological variance inherent in DNA microarray experiments,
and statistical methods and supervised clustering procedures of the type
described earlier in this chapter for the Lrp⁺ vs. Lrp⁻ experiment
should be employed.
1. Hung, S., Baldi, P. and Hatfield, G. W. Global gene expression profiling in
Escherichia coli K12: The effects of leucine-responsive regulatory protein.
2002. Journal of Biological Chemistry 277(43):40309–40323.
2. Allison, D. B., Gadbury, G. L., Moonseong, H., Fernandez, J. R., Cheol-Koo,
L., Prolla, T. A., and Weindruch, R. A mixture model approach for the
analysis of microarray gene expression data. 2002. Computational Statistics
and Data Analysis 39:1–20.
3. Long, A. D., Mangalam, H. J., Chan, B. Y. P., Tolleri, L., Hatfield, G. W.,
and Baldi, P. Improved statistical inference from DNA microarray data using
analysis of variance and a Bayesian statistical framework. 2001. Journal of
Biological Chemistry 276:19937–19944.
4. Baldi, P., and Long, A. D. A Bayesian framework for the analysis of
microarray expression data: regularized t-test and statistical inferences of
gene changes. 2001. Bioinformatics 17(6): 509–519.
5. Calvo, J. M., and Matthews, R. F. The leucine-responsive regulatory protein,
a global regulator of metabolism in Escherichia coli. 1994. Microbiological
Reviews 58(3):466–490.
6. Arfin, S. M., Long, A. D., Ito, E. T., Tolleri, L., Riehle, M. M., Paegle, E. S.,
and Hatfield, G. W. Global gene expression profiling in Escherichia coli K12:
The effects of integration host factor. 2000. Journal of Biological Chemistry
275(38):29672–29684.
7. Li, C., and Wong, W. H. Model-based analysis of oligonucleotide arrays:
Expression index computation and outlier detection. 2001. Proceedings of
the National Academy of Sciences of the USA 98(1): 31–36.
8. Salmon, K., Hung, S., Mekjian, K., Baldi, P., Hatfield, G. W., and Gunsalus,
R. P. Global gene expression profiling in Escherichia coli K12: The effects of
oxygen availability and FNR. 2003. Journal of Biological Chemistry
278(32):29837–29855.
9. Guest, J. R., Green, J., Irvine, A. S., and Spiro, S. The FNR modulon and
FNR-regulated gene expression. In E. C. C. Lin and A. S. Lynch, editors,
Regulation of Gene Expression in Escherichia coli, pp. 317–342. 1996.
Chapman & Hall, New York.
10. Scherf, U., Ross, D. T., Waltham, M., Smith, L. H., Lee, J. K., Tanabe, L.,
Kohn, K. W., Reinhold, W. C., Myers, T. G., Andrews, D. T., Scudiero, D. A.,
Eisen, M. B., Sausville, E. A., Pommier, Y., Botstein, D., Brown, P. O., and
Weinstein, J. N. A gene expression database for the molecular pharmacology
of cancer. 2000. Nature Genetics 24(3):236–244.
8
Systems biology
Introduction
In Chapter 7, we studied three global regulatory proteins in E. coli (Lrp,
IHF, and Fnr). These proteins are responsible for the direct regulation of
scores of genes, and through the use of DNA microarrays we were able to
establish a fairly comprehensive list of the genes each protein regulates with
good confidence. These results have an intuitive graphical representation
where nodes represent proteins and directed edges represent direct regula-
tion. Intuitively these simple graphs should capture a portion of the com-
plete “regulatory network” of E. coli. Within this network, Lrp, IHF, and
Fnr are like "hubs", to use an analogy with the well-connected airports of
airline flight charts, three of the two dozen or so hubs in the E. coli regula-
tory chart. In spite of their simplicity, these diagrams immediately suggest a
battery of questions. How can one represent more complex indirect interac-
tions or interactions involving multiple genes at the same time? Is there any
large-scale “structure” in the network associated with, for instance, control
hierarchies, or duplicated circuits, or plain feedback and robustness? What
is the relationship between the global regulatory proteins (i.e., the hubs)
and the less well connected nodes? How are the edges of the hubs distrib-
uted with respect to a functional pie chart classification (biosynthesis,
catabolism, etc.) of all the genes?
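This graphical view can be made concrete with a toy directed graph in which an edge P → G means "protein P directly regulates gene G". The edge list below is an invented stand-in, not the measured regulons, which each contain scores of genes:

```python
from collections import defaultdict

# Invented stand-in edges; the real regulons are far larger.
regulates = [
    ("Lrp", "ilvIH"), ("Lrp", "livJ"), ("Lrp", "serA"),
    ("IHF", "ompC"), ("IHF", "ilvGMEDA"),
    ("Fnr", "narGHJI"), ("Fnr", "ndh"),
    ("IHF", "narGHJI"),  # targets can be shared between regulators
]

out_degree = defaultdict(int)
for regulator, target in regulates:
    out_degree[regulator] += 1

# "Hubs" are simply the most connected regulators.
hubs = sorted(out_degree, key=out_degree.get, reverse=True)
print(hubs[0], out_degree[hubs[0]])  # Lrp 3
```

On the real network, the same degree count run over all regulators picks out the two dozen or so global regulatory proteins mentioned above.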
These questions point towards an ever broader set of problems and ulti-
mately whether we can model and understand regulatory and other
complex biological processes from the molecular level to the systems level.
Such a systems approach is necessary if we are to integrate the large
amounts of data produced by high-throughput technologies into a compre-
hensive view of the organization and function of the complex mechanisms
that sustain life. The dynamic character of these mechanisms and the
prevalence of interactions and feedback regulation strategies suggest that
they ought to be amenable to systematic mathematical analysis, applying
some of the methods used in biophysics, biochemistry, and developmental
biology, coupled with methods from more synthetic sciences ranging from
chemical engineering to control theory, artificial intelligence, and
computer science.
Although there are many kinds and levels of biological systems (e.g.,
immune system, nervous system, ecosystem [1]), the expression “systems
biology” [2] is used today mostly to describe attempts at unraveling molecu-
lar systems, above the traditional level of single genes and single proteins,
focusing on the level of pathways and groups of pathways. A basic method-
ology for the elucidation of these types of systems has been outlined by
Leroy Hood and his group [3].
1. Identify all the players of the system, that is all the components that are
involved (e.g., genes, proteins, compartments).
2. Perturb each component through a series of genetic or environmental
manipulations and record the global response using high-throughput
technologies (e.g. microarrays).
3. Build a global model and generate new testable hypotheses. Return to 2,
and in some cases to 1 when missing components are discovered.
Currently, DNA microarrays are one of the key tools in this process, to
be used in complement with other tools ranging from proteomic tools such
as mass spectrometry and two-hybrid systems for global quantitation of
protein expression and interactions, to bioinformatics tools and databases
for large-scale storage and simulation of molecular information and inter-
actions. These tools are progressively putting us in the position of someone
in charge of reverse-engineering a computer circuit with little knowledge of
the overall design but with the capability of measuring multiple voltages
simultaneously across the circuit.
Indeed, the long-term goal of DNA microarray technology is to allow us
to understand, model, and infer regulatory networks on a global scale,
starting from specific networks all the way up to the complete regulatory
circuitry of a cell [4, 5, 6, 7, 8, 9, 10, 11, 12].
In this chapter, we first briefly review the molecular players and interac-
tions of biological systems involved in metabolic, signaling, and most of all
regulatory networks. The bulk of the chapter focuses on gene regulation
and the computational methods that are available to model regulatory net-
works followed by a brief survey of available software tools and databases.
We end the chapter with the large-scale properties and general principles of
regulatory networks.
Representation and simulation
Molecular reactions
While all bonds are based on electrostatic forces and can in principle be
derived from the first principles of quantum mechanics, the detailed simulation
of large molecules is already at the limit of current computing resources
and molecular dynamics simulations [14]: we still cannot compute the ter-
tiary structure of proteins reliably in the computer, let alone produce
detailed simulations of multiple biomolecules and their interactions. Thus,
today's computational models in systems biology cannot incorporate all the
details of molecular interactions but must often operate several levels of
abstraction above the detailed molecular level, by simplifying interactions
and/or considering pools of interacting species and studying how local con-
centrations evolve in time.
Consider, for instance, the basic problem of modeling the reaction
between an enzyme and a substrate. Substrates are bound to enzymes at
active-site clefts from which water is largely excluded when the substrate is
bound. The specificity of enzyme–substrate interaction arises mainly from
hydrogen bonding and the shape of the active site. Furthermore, the recog-
nition of substrates by enzymes is a dynamic process often accompanied by
conformational changes (allosteric interactions).
In a basic model that ignores the details of the interactions we can consider
that an enzyme E combines with a substrate S with a rate k1 to form an
enzyme–substrate complex ES. The complex in turn can form a product P with a
rate k3, or dissociate back into E and S with a rate k2:

$$E + S \underset{k_2}{\overset{k_1}{\rightleftharpoons}} ES \xrightarrow{k_3} E + P \qquad (8.1)$$

In the model, the rate of catalysis V (number of moles per second) of the
product P is given by

$$V = \frac{dP}{dt} = V_{\max}\,\frac{[S]}{[S] + K_M} \qquad (8.2)$$

where V_max is the maximal rate and K_M = (k_2 + k_3)/k_1. For a fixed enzyme
concentration [E], V is almost linear in [S] when [S] is small, and
V ≈ V_max when [S] is large.
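Equation 8.2 is straightforward to evaluate numerically; in this sketch the values of Vmax and KM are hypothetical, chosen only to show the two regimes:

```python
def michaelis_menten_rate(s, v_max, k_m):
    """Rate of catalysis V = Vmax * [S] / ([S] + K_M) (Equation 8.2)."""
    return v_max * s / (s + k_m)

# Hypothetical parameters for illustration only.
v_max = 10.0   # maximal rate
k_m = 2.0      # Michaelis constant (same units as [S])

# V is nearly linear in [S] for small [S] ...
print(michaelis_menten_rate(0.01, v_max, k_m))   # ~ (v_max / k_m) * 0.01
# ... and saturates near Vmax for large [S].
print(michaelis_menten_rate(1000.0, v_max, k_m))
# At [S] = K_M the rate is exactly half-maximal.
print(michaelis_menten_rate(k_m, v_max, k_m))    # 5.0
```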
This is the Michaelis–Menten model, which dates back to 1913 and
accounts for the kinetic properties of some enzymes in vitro, but is limited
both in the range of enzymes and the range of reactions it can model in vivo.
In the end the catalytic activity of many enzymes is regulated in vivo in mul-
tiple ways including: (1) allosteric interactions; (2) extensive feedback inhi-
bition where, in many cases, the accumulation of a pathway end-product
inhibits the enzyme catalyzing the first step of its biosynthesis [15]; (3)
reversible covalent modifications such as phosphorylation of serine, threo-
nine, and tyrosine side chains; and (4) peptide-bond cleavage (proteolytic
activation). Many enzymes do not obey the Michaelis–Menten formalism
and kinetics. An important example is provided by allosteric enzymes
where the binding of a substrate to one active site can affect the properties
of other active sites in the same protein. The co-operative binding of sub-
strates requires more complex models beyond simple pairwise interactions
[16, 17].
The Michaelis–Menten equations also have a somewhat ad hoc non-linear
form that is resistant to analytical solutions and can only be simulated. To
quantitatively implement even such a highly simplified model, one has to
know the values of the initial concentrations, as well as the rate
constants that govern the reaction, k1, k2, and k3, or at least KM. The value of
KM varies widely, over at least six orders of magnitude, and depends on the
substrate and on environmental conditions including temperature and pH.
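Because the mass-action equations resist analytical solution, they are typically integrated numerically. A minimal forward-Euler sketch of the scheme in Equation 8.1, with arbitrary illustrative rate constants and initial concentrations:

```python
def simulate_enzyme_kinetics(e0, s0, k1, k2, k3, dt=1e-3, steps=50000):
    """Forward-Euler integration of E + S <=> ES -> E + P (mass action)."""
    e, s, es, p = e0, s0, 0.0, 0.0
    for _ in range(steps):
        bind = k1 * e * s      # E + S -> ES
        unbind = k2 * es       # ES -> E + S
        cat = k3 * es          # ES -> E + P
        e += dt * (unbind + cat - bind)
        s += dt * (unbind - bind)
        es += dt * (bind - unbind - cat)
        p += dt * cat
    return e, s, es, p

# Illustrative (not measured) constants and initial concentrations.
e, s, es, p = simulate_enzyme_kinetics(e0=1.0, s0=10.0, k1=1.0, k2=0.5, k3=0.2)
# Mass conservation: total enzyme and total substrate are preserved.
print(round(e + es, 6))       # ~1.0  (free enzyme + complex)
print(round(s + es, 6) + round(p, 6) if False else round(s + es + p, 6))  # ~10.0
```

A fixed-step Euler scheme is the crudest possible choice; it stands in here for the adaptive stiff integrators one would actually use, but it makes the bookkeeping of the four coupled rates explicit.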
Similar problems arise when we try to model biochemical reactions
involved in gene regulation reviewed below. In particular, what is the appro-
priate level of molecular detail? Clearly such a level is dictated by the
system and phenomena one is trying to model, but also by the available
data and associated subtle tradeoffs. While molecular data about in vivo
concentrations and rate constants remain sparse, emerging high-
throughput technologies such as DNA microarrays provide hope that these
data will become available. Furthermore, a fundamental property of bio-
logical systems is their robustness so that it may not be necessary to know
the values of these parameters with high precision. Indeed, models that are
a few levels of granularity above the single molecule level can be very suc-
cessful, as exemplified by the Hodgkin–Huxley model of action potentials
in nerve cells, which was derived before any detailed knowledge of molecu-
lar ion channels became available. Such higher-level models together with
large data sets open the door for machine learning or model fitting
approaches, where interactions and other processes are modeled by para-
meterized classes of generic functions that are capable of approximating
any behavior and can be fit to the data. These models are described in detail
for gene regulation networks below.
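As a toy illustration of such model fitting, one can fit the parameters of a generic saturating function to observed input–output pairs by minimizing squared error. The data, the model class, and the grid-search fit below are all invented for the example:

```python
def hill(x, vmax, k):
    """Generic saturating response function used as the model class."""
    return vmax * x / (x + k)

def fit_by_grid_search(data, vmax_grid, k_grid):
    """Pick the (vmax, k) pair minimizing squared error on the data."""
    best = None
    for vmax in vmax_grid:
        for k in k_grid:
            err = sum((hill(x, vmax, k) - y) ** 2 for x, y in data)
            if best is None or err < best[0]:
                best = (err, vmax, k)
    return best[1], best[2]

# Synthetic "measurements" generated from vmax=4, k=2 (invented values).
data = [(x / 2, hill(x / 2, 4.0, 2.0)) for x in range(1, 21)]
grid = [i / 10 for i in range(1, 81)]
vmax_hat, k_hat = fit_by_grid_search(data, grid, grid)
print(vmax_hat, k_hat)  # recovers 4.0 2.0
```

Real fitting uses gradient-based optimizers and noisy data rather than an exhaustive grid, but the principle is the same: a parameterized generic function class is adjusted until it reproduces the measured behavior.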
Metabolic networks
Metabolic networks represent the enzymatic processes by which the cell
transforms food molecules into energy and into a number of other molecules,
from simple building blocks such as nucleotides or amino acids to complex
polymeric assemblies, from DNA to proteins, from which organelles are
assembled. Individual metabolic pathways involve biosynthesis but also
biodegradation, catalyzed by enzymes.
The major metabolic features of E. coli, which are common to all life, are
well studied and well understood (see, for instance, [20]). All the biosyn-
thetic pathways begin with a small group of molecules called the key pre-
cursor metabolites, of which there are 12 in E. coli, and from which roughly
100 different building-blocks are derived including amino acids,
nucleotides, and fatty acids. Most biosynthetic reactions require energy and
often involve the breakdown of ATP whereas degradative reactions eventu-
ally generate ATP. Escherichia coli has about 4400 genes and on the order of
1000 small metabolites which have fleeting existences between synthetic and
degradative steps. The typical operation along a metabolic pathway is the
combination of a substrate with an enzyme in a biosynthesis or degrada-
tion reaction. Typically a metabolite can combine with two different
enzymes to be broken down into X or to be used in the biosynthesis of Y.
There are however many exceptions to the typical case such as, for example,
reactions that are catalyzed by RNA molecules. It is also not uncommon
for an enzyme to catalyze multiple reactions and for several enzymes to be
needed to catalyze a single reaction.
As we have seen in the discussion of the Michaelis–Menten model, the
catalytic activity of enzymes is regulated in vivo by multiple processes
including allosteric interactions (conformational changes), extensive feed-
back loops, reversible covalent modifications, and reversible peptide-bond
cleavage.
Protein networks
The term “protein networks” is usually meant to describe communication
and signaling networks where the basic reaction is not between a protein
and a metabolite but rather between two or more proteins. These
protein–protein interactions are involved in signal transduction cascades,
for instance to transfer information about the environment of the cell cap-
tured by a G-protein coupled receptor down to the DNA in the nucleus and
the regulation of particular genes. Protein networks can thus be viewed as
information processing networks where information is processed and
transferred dynamically mainly through protein interactions [21]. Proteins
functionally connected by post-translational modifications, allosteric inter-
actions, or other mechanisms into biochemical circuits can perform a
variety of simple computations including integration of multiple inputs,
coincidence detection, amplification, and information storage with fast
switching times, typically in the microsecond range. Symbolically, many of
these operations, as well as those involved in regulation (see below), can be
represented using the formalisms of Boolean circuits and artificial neural
networks. In principle these operations can be combined in circuits to carry
out virtually any computation.
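Symbolically, such a circuit can be sketched as a small Boolean network in which each node's next state is a logical function of the current state. The three-node wiring below is invented purely for illustration:

```python
# Each node's update rule is a Boolean function of the current state.
# This three-node wiring is invented purely for illustration.
rules = {
    "A": lambda s: s["C"],                 # A is activated by C
    "B": lambda s: s["A"] and not s["C"],  # B needs A, is repressed by C
    "C": lambda s: not s["B"],             # C is repressed by B
}

def step(state):
    """Synchronous update: all nodes read the old state, then switch."""
    return {node: bool(rule(state)) for node, rule in rules.items()}

state = {"A": False, "B": False, "C": True}
for _ in range(4):
    state = step(state)
print(state)  # settles into the fixed point {'A': True, 'B': False, 'C': True}
```

Fixed points of such networks are the discrete analogue of stable expression states; cycles correspond to oscillatory behavior.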
The resulting circuits have staggering complexity: a typical mammalian
cell may have on the order of 30,000 genes and one billion protein molecules.
Not only do most proteins exist in multiple copies, but also in multiple
forms due, for instance, to post-translational modifications. Because the
number of genes in a mammal is only one order of magnitude larger than the
number of genes in a bacterium, and about two to three times the number of
genes in the worm or the fly, biological complexity must arise from the
variety of molecular species
and the corresponding interactions [22]. A single mammalian cell has on the
order of a thousand different channel and receptor proteins on its surface, a
hundred G-proteins and second messengers, and a thousand kinases. It has
been estimated that up to one-third of all proteins in a eukaryotic cell may
be phosphorylated.
Indeed, the concentration and activity of thousands of proteins and
other molecules in the cell provide a memory trace of the environment.
The entire short-term behavior of the cell, or for that matter of an entire
organism, relies on circuits of proteins that receive signals, process them,
and produce outputs either as a modification of the internal state of the
cell or as a direct impact on the environment, such as mechanical impact in
the case of movement (e.g., bacterial chemotaxis) or chemical impact in the
case of secretion.
These interactions also can be conceptualized in terms of graphs with
linear structures, but also with extensive branching and cycles. The rela-
tions between these graphs and the underlying biochemical pathways exist
but are not obvious since, for instance, other substrates or metabolites can
be present which are not proteins. Issues of cross-talk between protein
pathways are important although in general, because of their size and inter-
actions, proteins do not diffuse through the cell as rapidly as small mole-
cules (a molecule of cyclic AMP can move anywhere in a cell in less than
one-tenth of a second).
Regulatory networks
Gene expression is regulated at many molecular levels starting from the
DNA level, to the mRNA level, to the protein level [23]. How genes are
encoded in DNA and their relative spatial location along the chromosomes
affect regulation and expression. For instance, in bacteria the genes coding
for the chains of heterodimeric proteins are often co-located in the same
operon, so that they are expressed at the same time, in the same amounts,
and in the same neighborhood. Likewise genes belonging to the same
pathway are often co-located and the spatial sequential arrangement of
related genes often reflects spatial or temporal patterns of activation. In
bacterial chemotaxis, for instance, the spatial arrangement of genes reflects
their temporal pattern of activation.
Cascades of metabolic and signaling reactions determine the concentra-
tions of the proteins directly involved in the transcription of a given gene.
The proteins interact with themselves and with DNA in complex ways to
build transcriptional molecular “processors” that regulate the rate of tran-
scription initiation and transcription termination (attenuation in bacteria).
Transcription is an exceedingly complex molecular mechanism which is
regulated at multiple levels that are not completely understood. Further
regulation occurs at the level of RNA processing and transport, RNA
translation, and post-translational modification of proteins. Increasingly
RNA molecules are being found to participate in regulation [24]. On top of
these complexities, the degradation of proteins and RNA molecules is also
regulated, and new regulatory mechanisms are still being uncovered, such
as purge mechanisms of Salmonella and Yersinia bacteria whereby gene
expression in an infected host cell is regulated by secreting regulatory pro-
teins from the pathogen to the host. Since most of these regulatory steps are
carried out by other proteins, it is obvious that the regulatory networks of a
cell are replete with feedback loops.
Although bacteria are often viewed as simple organisms, their gene regu-
lation is already very complex and exhibits many of the features encoun-
tered in higher organisms. Bacterial response to any significant stress, for
instance, involves adjustments in the rates of hundreds of proteins besides
the specific responders. In bacteria, DNA regulatory regions are relatively
short, in the range of a few hundred base pairs, often but not always
upstream of the operon being regulated. On average a typical bacterial gene
is regulated by less than a handful of transcription factors.
DNA regulatory regions in higher organisms, however, extend over much
longer stretches of DNA involving thousands of nucleotides and in many
cases regulatory elements are found downstream of the gene being regu-
lated, or even inside introns. For example, very distant 3′ regulatory ele-
ments in the fly are described in [25, 26]. Often, dozens of transcription
factors are involved in the regulation [27, 28] and even a fairly specific func-
tion, such as the regulation of blood pressure in humans, is already known
to involve hundreds of genes.
In bacteria, the regulation of several genes is fairly well understood and
a small number of regulatory networks have been studied extensively, to
the point where qualitative and even quantitative modeling is becoming
possible. Classical examples are the sporulation choice in Bacillus subtilis
and the lytic–lysogenic choice in phage lambda [29]. In higher organisms
the situation is considerably more complex and detailed modeling studies
still focus mostly on the regulation of a single gene and its complexity. The
endo 16 gene in the sea urchin Strongylocentrotus purpuratus is probably
one of the best studied examples still in progress [28, 30, 31] (Figure 8.1).
The cis-regulatory region of endo 16 covers over 2000 bp, contains no
fewer than seven different interacting modules, with a protein binding map
that contains dozens of binding sites and regulatory proteins with a highly
Figure 8.1. Regulatory region of the endo 16 gene in sea urchin. The region is 2300 bp long and contains dis-
tinct functional modules (A to G) in addition to the basal promoter (Bp) region. Factors that bind uniquely
in a single region of the sequence are marked above the line representing the DNA. Factors below the line
interact in multiple regions. Module G is a general booster for the whole system; F, E, and DC are repressor
modules that permit ectopic expression. Module B drives expression later in development, and its major
activator is responsible for a sharp, gut specific increase in transcription after gastrulation. Module B inter-
acts strongly with module A which controls the expression of endo 16 during the early stages of develop-
ment. (B) DNA sequence of modules B, A, and the basal promoter region. Beneath are target site
mutations used in perturbation experiments. (Reprinted with permission from Yuh et al., 2001 and
Company of Biologists Ltd.)
non-linear regulatory logic that is complex but amenable to computational
modeling.
In gene regulation, there is also a tension between stochastic and deter-
ministic behavior that is not completely understood quantitatively. On a
large scale, regulatory mechanisms appear to be deterministic and the
short-term behavior of cells and organisms ought to be somewhat pre-
dictable. However, at the molecular level, there is plenty of stochasticity in
nanoscale regulatory chemistry due to thermal fluctuations and their con-
sequences, such as the random distribution of a small number of transcrip-
tion factor molecules between daughter cells during the mechanics of cell
divisions. In some cases, expression of a gene in a cell can literally depend
on a few dozen molecules: a single gene copy and a few transcription factor
molecules present in very few copies. Fluctuations and small numbers of
molecules of course influence the exact start and duration of transcription,
and current evidence suggests that proteins are produced in short random
bursts [32, 33]. There are examples, for instance during development, where
cells or organisms seem to make use of molecular noise. In most cases,
however, regulatory circuits ought to exhibit robustness against perturbations and parameter fluctuations [34, 35, 36]. A quantitative understanding of the role of noise and, for instance, of the tradeoffs between robustness and evolvability remains an important open question.
Finally, at its most fundamental level, gene regulation results from a set
of molecular interactions that can only be properly understood within a
broader context of fundamental molecular processes in the cell, including
metabolism, transport, and signal transduction. It should be obvious that
signaling, metabolic, and regulatory networks overlap and interact with
each other both conceptually and at the molecular level in the cell.
Metabolic and signaling networks can be viewed as providing the boundary
conditions for the regulatory networks. There are additional important cel-
lular processes, such as transport, that are somewhat beyond these cate-
gories and can also play a role in regulation. Furthermore, in the cell, all the
processes and their molecular implementation coexist, interact, and also
depend on geometrical, structural, and even mechanical constraints and
events, such as cellular membranes, compartmentalization, and cell divi-
sion. In the next sections, we describe the computational tools that are cur-
rently available to model regulatory networks.
The term -X_i/\tau_i represents a decay or degradation term with time constant \tau_i and is essential to prevent the system from running away into highly saturated states. Accordingly, the concentration of X_i has a decay component to model degradation, diffusion, and growth dilution at a rate proportional to the concentration itself.
The coefficients Tij are production constants that represent the strengths
of the pairwise interactions between the concentrations and can be parti-
tioned into positive excitatory terms and negative inhibitory terms.
Alternatively, all the interaction terms could be positive by incorporating
the negative signs into interaction-specific activation functions fij and
allowing decreasing activation functions.
The weighted linear average of the activations represents the global influ-
ence of the network. The term Ii (t) can be used to incorporate additional
external inputs or a noise term. An essentially equivalent formalism is
obtained when the non-linearity is applied to the global weighted activation
instead:
dX_i/dt = -X_i/\tau_i + f_i \Big[ \sum_j T_{ij} X_j + I_i(t) \Big]    (8.9)

with an analogous equation (8.10) for second-order interactions. Interestingly, equations like 8.8, 8.9, and 8.10
have been used to model neuronal interactions in the artificial neural
network literature. Today we know that these artificial neurons are too simple
and do not provide good models of the complex dynamics of biological
neurons, i.e., the time behavior of the voltage of a neuron is several orders of
complexity above the time behavior of the concentration of a typical protein.
This is not to say that these artificial neurons do not also oversimplify the
behavior of regulatory networks, but the degree of oversimplification may be
more acceptable. Thus, artificial neural networks may provide better models
of regulatory networks or protein networks [21] than networks of neuronal
cells. It should be remarked, however, that both kinds of networks are essen-
tially carrying computations through their dynamical behavior.
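As a concrete illustration, the dynamics of Equation 8.9 can be integrated numerically. The sketch below uses forward Euler integration with a sigmoidal activation function and a hypothetical two-gene circuit (gene 1 activates gene 2, gene 2 represses gene 1); all parameter values are illustrative, not taken from any real network:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate(T, tau, I, x0, f=sigmoid, dt=0.01, steps=5000):
    """Forward-Euler integration of dX_i/dt = -X_i/tau_i + f(sum_j T_ij X_j + I_i)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (-x / tau + f(T @ x + I))
    return x

# Hypothetical two-gene circuit: T[1,0] > 0 (gene 1 activates gene 2),
# T[0,1] < 0 (gene 2 represses gene 1).
T = np.array([[0.0, -2.0],
              [2.0,  0.0]])
x_final = simulate(T, tau=np.ones(2), I=np.zeros(2), x0=[0.5, 0.1])
```

With \tau_i = 1 the trajectory relaxes to a fixed point satisfying X* = f(TX* + I), illustrating the decay-plus-activation structure of the equation.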
Artificial neural network systems, both discrete and continuous, have
been extensively studied in the literature. It is well known, for these and
other systems of differential equations, that limit cycles require the pres-
ence of feedback loops with odd numbers of inhibitory terms [53, 54] (see
also [55, 56] for a study of delays and stability), and that feedback loops
with even numbers of inhibitory terms tend to give rise to multiple stationary
points. The stability of limit cycles connected with negative feedback loops
has been associated with homeostasis, and bifurcation behavior with differ-
entiation processes seen during development as well as rapid switching
behavior (e.g., bacterial sporulation).
One important observation is that not all the variables in the systems
considered in this section need to be gene product concentrations. In partic-
ular, in addition to other molecular species, it is possible to enrich the
systems of equations by allowing hidden variables, i.e., considering that the
variables Xi can be partitioned into two sets: visible or measurable variables
and non-visible variables, as with the hidden units of standard artificial neural networks or latent variables in statistical modeling. Naturally the presence of
hidden variables increases the number of parameters and the amount of
data required to learn the parameters of the model.
A related and important property of artificial neural network systems is
that they also have universal approximation properties [57, 58]. In simple
terms, this means that any well-behaved (for instance continuously differen-
tiable) function or system can be approximated to any degree of precision by
a neural network, provided one can add hidden state variables (see [59] for a
simple proof). These universal approximation results are typically proved
for feedforward networks but can be extended to recurrent networks by
viewing them as iterations of a forward function. In short, artificial neural
networks provide a general formalism that can be used to approximate
almost any dynamical system. While universal approximation properties
can be reassuring, the central issue becomes learning, i.e., how efficiently we
can find a reasonable approximation from a limited set of training data.
Before we discuss the learning issue, it is worth mentioning another very
general related formalism developed by Savageau and others [41, 42, 43,
60]. This is the power-law formalism or S-systems introduced in the late
1960s, which can be described by
dX_i/dt = \sum_k T_{ik} \prod_j X_j^{g_{ijk}} - \sum_k U_{ik} \prod_j X_j^{h_{ijk}} + I_i(t)    (8.11)

where g_{ijk} (resp. h_{ijk}) are kinetic exponents and T_{ik}, U_{ik} are positive rate constants. Thus in Equation 8.11 the terms are separated into an excitatory and
an inhibitory (degradation) component. Non-linearity emerges from the
exponentiations and the multiplications, which also capture higher-order
interactions as in the case of the artificial neural network equations. In fact,
in many ways this formalism is similar to the neural network formalism
with which it shares the same universal approximation properties as a con-
sequence of Taylor’s formula and polynomial approximation. Indeed, an S-system can be viewed as a higher-order neural network with exponents in the interactions and where the activation functions f_i are the identity function. By transforming to a logarithmic scale, Equation 8.11 can be converted to a more tractable linear system. As discussed in the references, this
formalism has also been applied to model other biochemical pathways,
besides regulatory networks, for instance in pharmacodynamics and
immunology.
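The classic S-system special case, with a single production and a single degradation term per variable, is easy to simulate. The sketch below evaluates the right-hand side of this special case of Equation 8.11 and Euler-integrates a one-variable example with made-up kinetic orders:

```python
import numpy as np

def s_system_rhs(x, alpha, g, beta, h):
    """S-system right-hand side (one-term-per-sign special case of Eq. 8.11):
    dX_i/dt = alpha_i * prod_j x_j**g[i,j] - beta_i * prod_j x_j**h[i,j]."""
    return alpha * np.prod(x ** g, axis=1) - beta * np.prod(x ** h, axis=1)

# One variable with hypothetical power-law kinetics: dX/dt = 2*X**0.5 - X.
# Setting the right-hand side to zero gives the steady state X* = 4.
alpha, beta = np.array([2.0]), np.array([1.0])
g, h = np.array([[0.5]]), np.array([[1.0]])
x = np.array([1.0])
for _ in range(5000):            # forward Euler with dt = 0.01
    x = x + 0.01 * s_system_rhs(x, alpha, g, beta, h)
```

At steady state, taking logarithms turns the balance condition into a linear system in log X_j, which is the tractability the text alludes to.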
In addition to simulating the networks and understanding the structure
of steady states and limit cycles, it is also important to conduct a bifurca-
tion analysis to understand the sensitivity of steady states, limit cycles, and
other dynamical properties with respect to parameter values and noise.
These issues are also related to the learning problem and how much the
available data constrains the model parameters. Thus before proceeding
with other classes of models that can be applied to regulatory networks, it is
worthwhile considering the learning problem briefly.
Qualitative modeling
Lack of data coupled with issues of robustness suggests that it may be
useful to pursue a kind of modeling that is more qualitative in nature. One
approach to qualitative modeling is to partition the state space into cells
delimited by behavioural thresholds and model the behavior of each con-
centration in each cell by a very simple first-order differential equation with
a single rate. Thus, in addition to its robustness, the most attractive feature
of qualitative modeling is that it preserves the Boolean logic aspect of regu-
lation while allowing for continuous concentration levels.
More precisely, consider the piecewise linear differential equations

dX_i/dt = -X_i/\tau_i + F_i(X_1, ..., X_N)    (8.12)

with

F_i(X_1, ..., X_N) = \sum_{j \in J(i)} k_{ij} b_{ij}(X_1, ..., X_N)    (8.13)

where k_{ij} are rate parameters and J(i) is a set of indices, possibly empty,
which depends on i. The functions b_{ij} take their values in the set {0, 1}.
These functions are the numerical equivalent of Boolean functions and
specify the conditions under which a gene is expressed at the rate k_{ij}. More
precisely, they are defined in terms of sums and products of step functions
s^+ and s^- [63, 64] of the form

s^+(X, \theta) = 1 if X \ge \theta, and 0 if X < \theta    (8.14)

with s^-(X, \theta) = 1 - s^+(X, \theta) (Figure 8.2). For instance, the equation
dX_3/dt = k_3 s^+(X_1, a) s^-(X_1, b) s^+(X_2, c) s^-(X_2, d) expresses the fact that X_3 is
expressed or produced at a rate k_3 provided the concentration of X_1 is
between a and b and the concentration of X_2 is between c and d (Figure 8.2).
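The step functions of Equation 8.14 and the production-rate example for X_3 translate directly into code; the threshold values a, b, c, d and the rate k_3 below are illustrative choices, not values from the text:

```python
def s_plus(x, theta):
    """s+(X, theta) of Eq. 8.14: 1 if X >= theta, else 0."""
    return 1.0 if x >= theta else 0.0

def s_minus(x, theta):
    """s-(X, theta) = 1 - s+(X, theta)."""
    return 1.0 - s_plus(x, theta)

def x3_production_rate(x1, x2, k3=1.0, a=0.2, b=0.6, c=0.3, d=0.8):
    """Rate term k3 * s+(X1,a) * s-(X1,b) * s+(X2,c) * s-(X2,d):
    X3 is produced at rate k3 only when a <= X1 < b and c <= X2 < d."""
    return k3 * s_plus(x1, a) * s_minus(x1, b) * s_plus(x2, c) * s_minus(x2, d)
```

The product of step functions acts as the indicator function of a rectangular cell in phase space, exactly as in Figure 8.2.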
More generally, as in the discretized Boolean case, we can associate with
each element X_i a bounded sequence of threshold values 0 < \theta_i^1 < \theta_i^2 < ... < \theta_i^{m_i} < max X_i. These thresholds partition the phase space of all possible states into cells or orthants. The functions b_{ij} take a constant value, 0 or 1, in each orthant. Any combination of orthants can be represented by a
Figure 8.2. Two-dimensional phase space with concentration variables 0 \le X_1 \le max X_1 and 0 \le X_2 \le max X_2. The black area corresponds to the region where the function s^+(X_1, a) s^-(X_1, b) s^+(X_2, c) s^-(X_2, d) is equal to 1, and where the concentration of X_1 is between a and b and the concentration of X_2 is between c and d. In any other cell of the plane, one of the conditions is violated and therefore the function is equal to 0. Any combination of cells in the plane can be described in terms of sums and products of step functions.
generalized Boolean function bij that is its indicator function, i.e., has a
value of 1 on the orthants that belong to the combination. This is easy to
see by analogy with the canonical decomposition of a Boolean function
into a conjunction of disjunctions with negation. Any number of cells can
be described by the sum of the indicator functions associated with each cell.
The indicator function of a cell is a product of s^+ and s^- step functions,
associated with the conjunction of the corresponding cell threshold condi-
tions.
In each orthant, the total contribution from Equation 8.13 is constant so
that the behavior of the system is described by N independent linear equa-
tions of the form
dX_i/dt = k_i - X_i/\tau_i    (8.15)
which have an easy solution X_i(t) = c_i e^{-t/\tau_i} + k_i \tau_i, where the constant c_i is determined by the initial condition. Thus within an orthant, all trajectories evolve towards the steady state X_i = k_i \tau_i. The steady state of each
orthant may lie inside or outside the orthant. If it lies outside, the trajecto-
ries will tend towards one or several of the threshold hyperplanes which
delimit the orthant. Depending on the exact details of Equation 8.12, these
trajectories may be continued in the adjacent orthants or not. Additional
important singular steady states may also be located on the threshold
planes delimiting the orthants.
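The within-orthant behavior described above can be checked directly against the closed-form solution of Equation 8.15; the parameter values in the sketch below are arbitrary:

```python
import math

def orthant_solution(x0, k, tau, t):
    """Solution of dX/dt = k - X/tau inside one orthant:
    X(t) = k*tau + (X(0) - k*tau) * exp(-t/tau),
    which relaxes exponentially towards the steady state X = k*tau."""
    return k * tau + (x0 - k * tau) * math.exp(-t / tau)
```

For long times the trajectory approaches k*tau regardless of the starting point, and whether that steady state lies inside or outside the current orthant decides whether the trajectory crosses a threshold hyperplane.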
Analysis of piecewise linear equations, their behavior and their applica-
tions to qualitative modeling of gene regulation and other biological phe-
nomena, can be found in [51, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74].
The global behavior of piecewise linear differential systems can be
complex and is not fully understood. However they provide an attractive tool
for qualitatively modeling biochemical systems where detailed quantitative
information is often very incomplete. In particular the qualitative behavior
of the system can be analyzed and represented using a directed graph with
one node per qualitative state associated with each orthant. Nodes in the
graph are connected whenever the trajectories can pass between the corre-
sponding orthants. This approach has been used in [75] to build a qualitative
robust model of the sporulation switch in B. subtilis (Figure 8.3). An impor-
tant observation that is relevant for building or learning such models is that
only a very small fraction of the possible cells in phase space is visited by the
system during operation. In this particular model, a few dozen states were
visited out of several thousand possible states.
Figure 8.3. Model of the genetic regulatory network underlying the initiation of sporulation in Bacillus subtilis. The coding region and promoter are
shown for every gene. Promoters are distinguished by the specific σ factor directing their transcription. Signs indicate the type of regula-
tory interaction (activation/inhibition). The network is centered around a phosphorylation pathway (phosphorelay), slightly simplified in
this model, which integrates a variety of environmental, cell cycle, and metabolic signals and phosphorylates Spo0A. The essential function
of the phosphorelay is to modulate the phosphate flux as a function of the competing action of kinases and phosphatases (KinA and
Spo0E). When a sufficient number of inputs in favor of sporulation accumulate, the concentration of Spo0A~P reaches a threshold value
above which it activates several genes that commit the bacterium to sporulation. In order to produce a critical level of Spo0A~P, signals
arriving at the phosphorelay need to be amplified and stabilized. This is achieved through several positive and negative feedback loops.
Thus the decision to enter sporulation emerges from a complex network. (Reprinted with permission from de Jong et al., 2001.)
Consider K cells or compartments arranged along one spatial dimension, where each cell k exchanges products by diffusion with its neighboring cells in position k - 1 and k + 1. The diffusion of products is supposed to be proportional to the concentration differences X_i^{(k+1)} - X_i^{(k)} and X_i^{(k)} - X_i^{(k-1)}. This yields the reaction–diffusion model

dX_i^{(k)}/dt = F_i^{(k)}(X_1, ..., X_N) + D_i^{(k)} (X_i^{(k-1)} - 2 X_i^{(k)} + X_i^{(k+1)})    (8.16)

for 1 \le i \le N and 1 \le k \le K. If the system is invariant under translation, for instance in the case of a homogeneous population of cells, then F_i^{(k)} = F_i and D_i^{(k)} = D_i. Equation 8.16 is for a one-dimensional system of compartments or cells but it can obviously be generalized to two and three dimensions.
If the number of cells or compartments is large enough over the spatial
interval [0, S], we can use a continuous formalism, where the concentrations
X_i(t, s) depend both on time t and space s. Assuming translational invariance, the resulting system of partial differential equations can be written as

\partial X_i/\partial t = F_i(X_1, ..., X_N) + D_i \partial^2 X_i/\partial s^2    (8.17)

for 1 \le i \le N and 0 \le s \le S. Assuming that there is no diffusion across the
spatial boundaries, the boundary conditions are given by the initial concentrations and \partial X_i(t, 0)/\partial s = \partial X_i(t, S)/\partial s = 0. Again two- and three-dimensional versions of these equations can easily be derived.
Alan Turing was the first to clearly suggest the use of reaction–diffusion
equations to study developmental phenomena and pattern formation. In
his original work Turing considered two concentration variables X1 and X2,
also called morphogens, satisfying Equation 8.16 or 8.17. Even in the case
of N = 2, direct analytical solutions are in general not possible. However, if
there exists a unique spatially homogeneous steady state X_1^* and X_2^* with
dX_1/dt = dX_2/dt = 0, the behavior of the system in response to a small perturbation of the steady state can be predicted by simple linearization. In particular, with the boundary conditions above, the deviation \Delta X_i from the
homogeneous steady state is given by

\Delta X_i(s, t) = X_i(s, t) - X_i^* = \sum_{k \ge 0} c_{ik}(t) \cos(k \pi s / S)    (8.18)

for i = 1, 2. The functions \cos(k \pi s / S) are the modes or eigenfunctions of the
Laplacian operator \partial^2/\partial s^2 on [0, S] [76, 81, 82]. The time-dependent coefficients c_{ik}(t) are the mode amplitudes. The steady state is stable with respect
to perturbations if all the mode amplitudes decay exponentially in response
to a perturbation decomposable into a large number of weighted modes.
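The mode decomposition of Equation 8.18 can be verified numerically by projecting a perturbation onto the cosine eigenfunctions; the grid resolution, domain size, and test perturbation below are arbitrary choices:

```python
import numpy as np

S = 1.0                                  # domain size (arbitrary)
s = np.linspace(0.0, S, 1001)
ds = s[1] - s[0]

def trapezoid(y):
    """Composite trapezoidal rule on the grid s."""
    return (y.sum() - 0.5 * (y[0] + y[-1])) * ds

def mode_amplitude(dX, k):
    """Coefficient c_k of the eigenfunction cos(k*pi*s/S) in Eq. 8.18:
    c_k = (2/S) * integral of dX(s)*cos(k*pi*s/S) ds  (with 1/S for k = 0)."""
    phi = np.cos(k * np.pi * s / S)
    return (1.0 / S if k == 0 else 2.0 / S) * trapezoid(dX * phi)

# A perturbation that is exactly mode k = 3 projects onto that mode alone.
dX = np.cos(3 * np.pi * s / S)
coeffs = [mode_amplitude(dX, k) for k in range(6)]
```

Because the cosine modes are orthogonal on [0, S], each amplitude c_{ik}(t) evolves independently under the linearized dynamics, which is what makes the stability analysis tractable.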
When the steady state is unstable, diffusion does not have a homogenizing
effect, resulting instead in spatially heterogeneous gene expression/concentration patterns. It has been shown that for N = 2 this requires one positive
feedback loop \partial F_1/\partial X_1 > 0, one negative feedback loop \partial F_1/\partial X_2 < 0, and
diffusion constants that favor the rapid diffusion of the inhibitor (D_2/D_1 \gg 1),
together with a sufficiently large domain size S > S_0 [83]. These activator–
inhibitor systems have been used to model the emergence of segmentation
patterns during early embryo development in Drosophila [77, 78, 84, 85, 86].
The solutions of continuous reaction–diffusion equations with N = 2 are quite
sensitive to initial and boundary conditions, the shape of the domain, and
the values of the parameters. Furthermore, there is little experimental evi-
dence that the coupled reaction and diffusion of two morphogens underlies
biological pattern formation. The regulatory system that controls embry-
onic development in Drosophila is likely to contain a layered network with
hundreds of genes that mutually interact and whose products diffuse
through the embryo. More realistic models must use Equation 8.16, for
instance, with discrete diffusion and a larger number of interacting species.
In [40, 87, 88, 89, 90], a reaction–diffusion network model such as Equation
8.16, with the neural network reaction term of Equation 8.8, is used to
model Drosophila development.
Although the diffusion component of reaction–diffusion models captures
some spatial aspects of regulatory networks, it does not capture all of them.
In particular a single set of reaction–diffusion equations can only be used as
long as the number of compartments or cells remains fixed. Other events,
such as nuclear or cell division, must be modeled at another level. Likewise,
different models or additional terms are needed to model spatial organization
resulting from processes such as active transport that go beyond diffusion.
Stochastic equations
Ordinary or partial differential equations in general lead to models that are
both continuous and deterministic. Both assumptions can be challenged at
the molecular level due, as we have seen, to thermal fluctuations and the
fact that some reactions may involve only a small number of molecules.
Thus, it may be useful to introduce models that are both discrete and sto-
chastic. One possibility is to let Xi represent the number of molecules of
species i and to describe the state of the system by a joint probability distri-
bution P(X1,..., XN, t) (this can be reformulated in terms of concentration
by dividing by a volume factor). If there are M possible different reactions
in the system, the time evolution of the probability distribution can be
described by a discrete equation of the form
P(X_1, ..., X_N, t + \Delta t) = P(X_1, ..., X_N, t) \Big( 1 - \sum_{j=1}^{M} a_j \Delta t \Big) + \sum_{j=1}^{M} B_j \Delta t    (8.19)

where a_j \Delta t is the probability that reaction j occurs during [t, t + \Delta t] given the state at time t, and B_j \Delta t is the probability that reaction j brings the system into the state (X_1, ..., X_N) during the same interval. Letting \Delta t go to 0 yields

\partial P / \partial t = \sum_{j=1}^{M} (B_j - a_j P)    (8.20)
The master equation describes how the probability distribution, and not
the system itself as in the case of ordinary differential equations, evolves in
time. In general, analytical solutions are not possible and time-consuming
simulations are required. In some cases, a master equation can be approxi-
mated by an ordinary stochastic equation, consisting of an ordinary differ-
ential equation plus a noise term (Langevin equations [92]).
In the stochastic simulation approach [93, 94], simulations of the stochastic evolution (trajectories) of the system are carried out directly rather than
simulating the global overarching distribution encapsulated in the master
equation. To simulate individual trajectories, the stochastic simulation
algorithm needs only to specify the time of occurrence of the next reaction,
and its type. This choice can be made at random from a joint probability
distribution consistent with Equation 8.19. This is akin to simulating an
actual sequence of coin flips rather than studying the averages and vari-
ances of a population of runs. This approach is more amenable to com-
puter simulations and has been used in [95, 96, 97] to analyze the
interactions controlling the expression of a single prokaryotic gene and to
suggest that, at least in some cases, protein production occurs by bursts.
Fluctuations in gene expression rates may provide an explanation for
observed phenotypic variations in isogenic populations and influence
switching behavior, such as lytic and lysogenic growth in phage lambda.
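The logic of the stochastic simulation algorithm is easy to sketch for the simplest case, a single gene product produced at a constant rate and degraded in proportion to its copy number; the rate constants and simulation horizon below are hypothetical:

```python
import random

def gillespie_birth_death(k_prod=10.0, k_deg=1.0, x0=0, t_end=20.0, seed=0):
    """Minimal stochastic simulation of two reactions: production at rate
    k_prod and degradation at rate k_deg * X. Returns the copy number at t_end."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    while True:
        a_prod = k_prod                      # propensity of X -> X + 1
        a_deg = k_deg * x                    # propensity of X -> X - 1
        a_total = a_prod + a_deg
        t += rng.expovariate(a_total)        # exponential waiting time
        if t > t_end:
            return x
        if rng.random() * a_total < a_prod:  # pick reaction by propensity
            x += 1
        else:
            x -= 1

# Averaging many independent trajectories recovers the mean k_prod/k_deg = 10,
# while individual runs fluctuate around it, illustrating molecular noise.
counts = [gillespie_birth_death(seed=s) for s in range(200)]
mean_count = sum(counts) / len(counts)
```

Each iteration draws the waiting time to the next reaction and then the reaction type, exactly the two choices the text says the algorithm must specify; no explicit representation of the master equation's probability distribution is needed.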
While stochastic simulations may seem closer to molecular reality, the
models discussed so far require knowledge of the reactions and the joint
probability distribution of time intervals and reactions. Furthermore, the
corresponding simulations are very time-consuming. Thus the models may
be advantageous for fine-grained analysis but for longer time-scale and more
global effects, the continuous models may be sufficient [92]. In the future,
more complete models and software tools may in fact combine both for-
malisms into hybrid hierarchical models where fine-grained models could be
used to help guide the selection of higher-level models, and vice versa.
P(X_i | X_j : j \in N^-(i)), where N^-(i) denotes all the parents of vertex i. Here
the parents of node i could be interpreted as the “direct” regulators of i.
The Markovian independence assumption of a Bayesian network states
that, conditioned on the state of its parents (immediate past), the state of a
node (present) is independent of the state of all its non-descendants
(distant past). This independence assumption is equivalent to the global
factorization

P(X_1, ..., X_N) = \prod_{i=1}^{N} P(X_i | X_j : j \in N^-(i))    (8.22)
1. Noble, D. Modeling the heart: From genes to cells to the whole organ. 2002.
Science 295:1669–1682.
2. Kitano, H. Systems biology: A brief overview. 2002. Science 295:1662–1664.
3. Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K.,
Bumgarner, R., Goodlett, D. R., Aebersold, R., and Hood, L. Integrated
genomic and proteomic analysis of a systematically perturbed metabolic
network. 2001. Science 292:929–934.
4. Loomis, W. F., and Sternberg, P. W. Genetic networks. 1995. Science 269:649.
5. Thieffry, D. From global expression data to gene networks. 1999. BioEssays
21(11):895–899.
6. Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millán-Zárate, D.,
Blattner, F. R., and Collado-Vides, J. RegulonDB (version 3.0):
transcriptional regulation and operon organization in Escherichia coli K-12.
2000. Nucleic Acids Research 28(1):65–67.
7. Edwards, J. S., and Palsson, B. O. The Escherichia coli MG1655 in silico
metabolic genotype: its definition, characteristics, and capabilities. 2000.
Proceedings of the National Academy of Sciences of the USA
97(10):5528–5533.
8. Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A., and
Krummenacker, M. EcoCyc: encyclopedia of Escherichia coli genes and
metabolism. 1999. Nucleic Acids Research 27(1):55–58.
9. Thieffry, D., Huerta, A. M., Pérez-Rueda, E., and Collado-Vides, J. From
specific gene regulation to genomic networks: a global analysis of
transcriptional regulation in Escherichia coli. 1998. BioEssays 20:433–440.
10. Endy, D., and Brent, R. Modelling cellular behavior. 2001. Nature
409:391–395.
11. Hasty, J., McMillen, D., Isaacs, F., and Collins, J. J. Computational studies of
gene regulatory networks: in numero molecular biology. 2001. Nature Reviews
Genetics 2:268–279.
12. Smolen, P., Baxter, D. A., and Byrne, J. H. Frequency selectivity,
multistability, and oscillations emerge from models of genetic regulatory
systems. 1998. American Journal of Physiology 274:C531–C542.
13. Bower, J. M., and Bolouri, H. (eds.) Computational Modeling of Genetic and
Biochemical Networks. 2001. MIT Press, Cambridge, MA.
14. Schlick, T., Skeel, R., Brünger, A., Kalé, L., Board, J. A. Jr., Hermans, J.,
and Schulten, K. Algorithmic challenges in computational molecular
biophysics. 1999. Journal of Computational Physics 151:9–48.
15. Umbarger, H. E. Evidence for a negative feedback mechanism in the
biosynthesis of isoleucine. 1956. Science 123:848.
16. Monod, J., Wyman, J., and Changeux, J. P. On the nature of allosteric
transitions: a plausible model. 1965. Journal of Molecular Biology 12:88.
17. Koshland, D. E., Nemethy, G., and Filmer, D. Comparison of experimental
binding data and theoretical models in proteins containing subunits. 1966.
Biochemistry 5:365.
18. Goodsell, D. S. The Machinery of Life, reprinted edn. 1998. Copernicus
Books, New York.
19. Levchenko, A., Bruck, J., and Sternberg, P. W. Scaffold proteins may
biphasically affect the levels of mitogen-activated protein kinase signaling
and reduce its threshold properties. 2000. Proceedings of the National
Academy of Sciences of the USA 97:5818–5823.
20. Edwards, J. S., and Palsson, B. O. The Escherichia coli MG1655 in silico
metabolic genotype: Its definition, characteristics, and capabilities. 2000.
Proceedings of the National Academy of Sciences of the USA 97:5528–5533.
21. Bray, D. Protein molecules as computational elements in living cells. 1995.
Nature 376:307–312.
22. Claverie, J. M. Gene number: what if there are only 30000 human genes?
2001. Science 291:1255–1257.
23. Maniatis, T., and Reed, R. An extensive network of coupling among gene
expression machines. 2002. Nature 416:499–506.
24. Schaechter, M., and The View From Here Group. Escherichia coli and
salmonella 2000: the view from here. 2001. Microbiology and Molecular
Biology Reviews 65:119–130.
25. Blackman, R. K., Sanicola, M., Raftery, L. A., Gillevet, T., and Gelbart,
W. M. An extensive 3′ cis-regulatory region directs the imaginal disk
expression of decapentaplegic, a member of the TGF-beta family in
Drosophila. 1991. Development 111:657–666.
26. van den Heuvel, M., Harryman-Samos, C., Klingensmith, J., Perrimon, N.,
and Nusse, R. Mutations in the segment polarity genes wingless and
porcupine impair secretion of the wingless protein. 1993. EMBO Journal
12:5293–5302.
27. Latchman, D. S. Eukaryotic Transcription Factors, 3rd edn. 1998. Academic
Press, San Diego, CA.
28. Yuh, C. H., Bolouri, H., and Davidson, E. H. Cis-regulatory logic in the
endo 16 gene: switching from a specification to a differentiation mode of
control. 2001. Development 128:617–629.
29. McAdams, H. H., and Shapiro, L. Circuit simulation of genetic networks.
1995. Science 269:650–656.
30. Yuh, C. H., Bolouri, H., and Davidson, E. H. Genomic cis-regulatory logic:
experimental and computational analysis of a sea urchin gene. 1998. Science
279:1896–1902.
31. Davidson, E. H. et al. A genomic regulatory network for development. 2002.
Science 295:1669–1678.
32. McAdams, H. H., and Arkin, A. It’s a noisy business! genetic regulation at
the nanomolar scale. 1999. Trends in Genetics 15:65–69.
33. Hasty, J., Pradines, J., Dolnik, M., and Collins, J. J. Noise-based switches and
amplifiers for gene expression. 2000. Proceedings of the National Academy of
Sciences of the USA 97:2075–2080.
34. Becskei, A., and Serrano, L. Engineering stability in gene networks by
autoregulation. 2000. Nature 405:590–593.
35. Barkai, N., and Leibler, S. Robustness in simple biochemical networks.
1997. Nature 387:913–917.
36. Barkai, N., and Leibler, S. Biological rhythms: circadian clocks limited by
noise. 2000. Nature 403:267–268.
37. Kauffman, S. A. Metabolic stability and epigenesis in randomly constructed
genetic nets. 1969. Journal of Theoretical Biology 22:437–467.
38. Kauffman, S. A. The large-scale structure and dynamics of gene control
circuits: an ensemble approach. 1974. Journal of Theoretical Biology
44:167–190.
39. Kauffman, S. A. Requirements for evolvability in complex systems: orderly
dynamics and frozen components. 1990. Physica D 42:135–152.
40. Mjolsness, E., Sharp, D. H., and Reinitz, J. A connectionist model of
development. 1991. Journal of Theoretical Biology 152:429–453.
41. Voit, E. O. Canonical Nonlinear Modeling. 1991. Van Nostrand and
Reinhold, New York.
42. Savageau, M. A. Power-law formalism: a canonical nonlinear approach to
modeling and analysis. In V. Lakshmikantham, editor, World Congress of
Nonlinear Analysts 92, vol. 4, pp. 3323–3334. 1996. Walter de Gruyter
Publishers, Berlin.
43. Hlavacek, W. S., and Savageau, M. A. Completely uncoupled and perfectly
coupled gene expression in repressible systems. 1997. Journal of Molecular
Biology 266:538–558.
44. Friedman, N., Linial, M., Nachman, I., and Pe’er, D. Using Bayesian networks
to analyze expression data. 2000. Journal of Computational Biology 7:601–620.
45. McAdams, H. H., and Arkin, A. Simulation of prokaryotic genetic circuits.
1998. Annual Review of Biophysics and Biomolecular Structure 27:199–224.
46. Smolen, P., Baxter, D. A., and Byrne, J. H. Modeling transcriptional control
in gene networks: methods, recent results, and future directions. 2000.
Bulletin of Mathematical Biology 62:247–292.
47. Kauffman, S. A. Homeostasis and differentiation in random genetic control
networks. 1969. Nature 224:177–178.
48. Kauffman, S. A. Gene regulation networks: a theory for their global
structure and behaviors. In A. N. Other, editor, Current Topics in
Developmental Biology, vol. 6, pp. 145–182. 1977. Academic Press, New
York.
49. Thomas, R. Logical analysis of systems comprising feedback loops. 1978.
Journal of Theoretical Biology 73:631–656.
50. Thomas, R. Boolean formalization of genetic control circuits. 1973. Journal
of Theoretical Biology 42:563–585.
51. Thomas, R., d’Ari, R. Biological Feedback. 1990. CRC Press, Boca Raton,
FL.
52. Thomas, R. Regulatory networks seen as asynchronous automata: a logical
description. 1991. Journal of Theoretical Biology 153:1–23.
53. Baldi, P., and Atiya, A. F. Oscillations and synchronizations in neural
networks: an exploration of the labeling hypothesis. 1989. International
Journal of Neural Systems 1(2):103–124.
54. Baldi, P., and Atiya, A. F. How delays affect neural dynamics and learning.
1994. IEEE Transactions on Neural Networks 5(4):626–635.
55. Arik, S. Stability analysis of delayed neural networks. 2000. IEEE
Transactions on Circuits and Systems I, 47:1089–1092.
56. Joy, M. P. Results concerning the absolute stability of delayed neural
networks. 2000. Neural Networks 13:613–616.
57. Hornik, K., Stinchcombe, M., and White, H. Universal approximation of an
unknown function and its derivatives using multilayer feedforward networks.
1990. Neural Networks 3:551–560.
58. Hornik, K., Stinchcombe, M., White, H., and Auer, P. Degree of
approximation results for feedforward networks approximating unknown
mappings and their derivatives. 1994. Neural Computation 6:1262–1275.
59. Baldi, P., and Brunak, S. Bioinformatics: The Machine Learning Approach.
2001. MIT Press, Cambridge, MA.
60. Savageau, M. A. Enzyme kinetics in vitro and in vivo: Michaelis–Menten
revisited. In E. E. Bittar, editor, Principles of Medical Biology, vol. 4, pp.
93–146. 1995. JAI Press Inc., Greenwich, CT.
61. Baldi, P. Gradient descent learning algorithms overview: a general dynamical
systems perspective. 1995. IEEE Transactions on Neural Networks
6(1):182–195.
62. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A.,
Fleischmann, R. D., Bult, C. J., Kerlavage, A. R., Sutton, G., Kelley, J. M., et
al. The minimal gene complement of Mycoplasma genitalium. 1995. Science
270:397–403.
63. Snoussi, E. H. Qualitative dynamics of piecewise-linear differential
equations: a discrete mapping approach. 1989. Dynamics and Stability of
Systems 4(3–4):189–207.
64. Plahte, E., Mestl, T., and Omholt, S. W. A methodological basis for
description and analysis of systems with complex switch-like interactions.
1998. Journal of Mathematical Biology 36:321–348.
65. Edwards, R., and Glass, L. Combinatorial explosion in model gene networks.
2000. Chaos 10(3):691–704.
172 Systems biology
66. Edwards, R., Siegelmann, H. T., Aziza, K., and Glass, L. Symbolic dynamics
and computation in model gene networks. 2001. Chaos 11(1):160–169.
67. Lewis, J. E., and Glass, L. Steady states, limit cycles, and chaos in models of
complex biological networks. 1991. International Journal of Bifurcation and
Chaos 1(2):477–483.
68. Mestl, T., Plahte, E., and Omholt, S. W. Periodic solutions in systems of
piecewise-linear differential equations. 1995. Dynamics and Stability of
Systems 10(2):179–193.
69. Plahte, E., Mestl, T., and Omholt, S. W. Stationary states in food web models
with threshold relationships. 1995. Journal of Biological Systems
3(2):569–577.
70. Mestl, T., Plahte, E., and Omholt, S. W. A mathematical framework for
describing and analysing gene regulatory networks. 1995. Journal of
Theoretical Biology 176:291–300.
71. Snoussi, E. H., and Thomas, R. Logical identification of all steady states: the
concept of feedback loop characteristic states. 1993. Bulletin of Mathematical
Biology 55(5):973–991.
72. Prokudina, E. I., Valeev, R. Y., and Tchuraev, R. N. A new method for the
analysis of the dynamics of the molecular genetic control systems. II.
Application of the method of generalized threshold models in the
investigation of concrete genetic systems. 1991. Journal of Theoretical
Biology 151:89–110.
73. Omholt, S. W., Kefang, X., Anderson, Ø., and Plahte, E. Description and
analysis of switchlike regulatory networks exemplified by a model of cellular
iron homeostasis. 1998. Journal of Theoretical Biology 195:339–350.
74. Sanchez, L., van Helden, J., and Thieffry, D. Establishment of the Dorso-
ventral pattern during embryonic development of Drosophila melanogaster: a
logical analysis. 1997. Journal of Theoretical Biology 189:377–389.
75. de Jong, H., Page, M., Hernandez, C., and Geiselmann, J. Qualitative
simulation of genetic regulatory networks: method and application. In B.
Nebel, editor, Proceedings of the 17th International Joint Conference on
Artificial Intelligence, IJCAI-01, pp. 67–73. 2001. Morgan Kaufmann, San
Mateo, CA.
76. Britton, N. F. Reaction–Diffusion Equations and Their Applications to
Biology. 1986. Academic Press, London.
77. Lacalli, T. C. Modeling the Drosophila pair-rule pattern by reaction
diffusion: gap input and pattern control in a 4-morphogen system. 1990.
Journal of Theoretical Biology 144:171–194.
78. Kauffman, S. A. The Origins of Order: Self-Organization and Selection in
Evolution. 1993. Oxford University Press, New York.
79. Maini, P. K., Painter, K. J., and Nguyen Phong Chau, H. Spatial pattern
formation in chemical and biological systems. 1997. Journal of the Chemical
Society, Faraday Transactions 93(20):3601–3610.
80. Turing, A. M. The chemical basis of morphogenesis. 1952. Philosophical
Transactions of the Royal Society B 237:37–72.
81. Nicolis, G., and Prigogine, I. Self-Organization in Nonequilibrium Systems:
From Dissipative Structures to Order Through Fluctuations. 1977. Wiley-
Interscience, New York.
82. Segel, L. A. Modeling Dynamic Phenomena in Molecular and Cellular
Biology. 1984. Cambridge University Press, Cambridge, UK.
83. Meinhardt, H. Pattern formation by local self-activation and lateral
inhibition. 2000. BioEssays 22(8):753–760.
84. Lacalli, T. C., Wilkinson, D. A., and Harrison, L. G. Theoretical aspects of
stripe formation in relation to Drosophila segmentation. 1988. Development
103:105–113.
85. Hunding, A., Kauffman, S. A., and Goodwin, B. C. Drosophila
segmentation: supercomputer simulation of prepattern hierarchy. 1990.
Journal of Theoretical Biology 145:369–384.
86. Marnellos, G., Deblandre, G. A., Mjolsness, E., and Kintner, C. Delta-Notch
lateral inhibitory patterning in the emergence of ciliated cells in Xenopus:
experimental observations and a gene network model. In R. B. Altman, K.
Lauderdale, A. K. Dunker, L. Hunter, and T. E. Klein, editors, Proceedings
of the Pacific Symposium on Biocomputing (PSB 2000), vol. 5, pp. 326–337.
2000. World Scientific Publishing, Singapore.
87. Reinitz, J., Mjolsness, E., and Sharp, D. H. Model for cooperative control of
positional information in Drosophila by bicoid and maternal hunchback.
1995. Journal of Experimental Zoology 271:47–56.
88. Sharp, D. H., and Reinitz, J. Prediction of mutant expression patterns using
gene circuits. 1998. BioSystems 47:79–90.
89. Reinitz, J. and Sharp, D. H. Gene circuits and their uses. In J. Collado-Vides,
B. Magasanik, and T. F. Smith, editors, Integrative Approaches to Molecular
Biology, pp. 253–272. 1996. MIT Press, Cambridge, MA.
90. Myasnikova, E., Samsonova, A., Kozlov, K., Samsonova, M., and Reinitz, J.
Registration of the expression patterns of Drosophila segmentation genes by
two independent methods. 2001. Bioinformatics 17(1):3–12.
91. van Kampen, N. G. Stochastic Processes in Physics and Chemistry, revised
edn. 1997. Elsevier, Amsterdam.
92. Gillespie, D. T. The chemical Langevin equation. 2000. Journal of Chemical
Physics 113(1):297–306.
93. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions.
1977. Journal of Physical Chemistry 81(25):2340–2361.
94. Gibson, M. A., and Bruck, J. Efficient exact stochastic simulation of
chemical systems with many species and many channels. 2000. Journal of
Physical Chemistry A 104:1876–1889.
95. McAdams, H. H., and Arkin, A. Stochastic mechanism in gene expression.
1997. Proceedings of National Academy of Sciences of the USA 94:814–819.
96. Arkin, A., Ross, J., and McAdams, H. H. Stochastic kinetic analysis of
developmental pathway bifurcation in phage λ-infected Escherichia coli cells.
1998. Genetics 149:1633–1648.
97. Barkai, N., and Leibler, S. Circadian clocks limited by noise. 2000. Nature
403:267–268.
98. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. 1988. Morgan Kaufmann, San Mateo, CA.
99. Lauritzen, S. L. Graphical Models. 1996. Oxford University Press, Oxford, UK.
100. Jordan, M. I. (ed.). Learning in Graphical Models. 1999. MIT Press,
Cambridge, MA.
101. Heckerman, D. A tutorial on learning with Bayesian networks. In M. I.
Jordan, editor, Learning in Graphical Models, pp. 00–00. 1998. Kluwer,
Dordrecht.
102. van Someren, E. P., Wessels, L. F. A., and Reinders, M. J. T. Linear modeling
of genetic networks from experimental data. In Proceedings of the 2000
Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla,
CA, pp. 355–366. 2000. AAAI Press, Menlo Park, CA.
103. Zien, A., Kuffner, R., Zimmer, R., and Lengauer, T. Analysis of gene
expression data with pathway scores. In Proceedings of the 2000 Conference
on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, pp.
407–417. 2000. AAAI Press, Menlo Park, CA.
104. Elidan, G., Pe’er, D., Regev, A., and Friedman, N. Inferring subnetworks
from perturbed expression profiles. 2001. Bioinformatics 17:S215–S224.
Supplement 1.
105. Gasch, A., Friedman, N., Segal, E., Taskar, B., and Koller, D. Rich
probabilistic models for gene expression. 2001. Bioinformatics 17:S243–S252,
Supplement 1.
106. Tanay, A., and Shamir, R. Computational expansion of genetic networks.
2001. Bioinformatics 17:S270–S278, Supplement 1.
107. Weiss, Y. Correctness of local probability propagation in graphical models
with loops. 2000. Neural Computation 12:1–41.
108. Yedidia, J. S., Freeman, W. T., and Weiss, Y. Generalized belief propagation.
In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural
Information Processing Systems, vol. 13 (Proceedings of the NIPS 2000
Conference). 2001. MIT Press, Cambridge, MA.
109. Goss, P. J. E., and Peccoud, J. Quantitative modeling of stochastic systems in
molecular biology by using stochastic Petri nets. 1998. Proceedings of the
National Academy of Sciences of the USA 95:6750–6755.
110. Regev, A., Silverman, W., and Shapiro, E. Representation and simulation of
biochemical processes using the π-calculus process algebra. In R. B. Altman,
A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors,
Proceedings of the Pacific Symposium on Biocomputing (PSB 2001), vol. 6,
pp. 459–470. 2001. World Scientific Publishing, Singapore.
111. Meyers, S., and Friedland, P. Knowledge-based simulation of genetic
regulation in bacteriophage lambda. 1984. Nucleic Acids Research 12(1):1–9.
112. Shimada, T., Hagiya, M., Arita, M., Nishizaki, S., and Tan, C. L.
Knowledge-based simulation of regulatory action in lambda phage. 1995.
International Journal of Artificial Intelligence Tools 4(4):511–524.
113. Lindenmayer, A. Mathematical models for cellular interaction in
development. I. Filaments with one-sided inputs. 1968. Journal of Theoretical
Biology 18:280–289.
114. Collado-Vides, J., Gutiérrez-Ríos, R. M., and Bel-Enguix, G. Networks of
transcriptional regulation encoded in a grammatical model. 1998.
BioSystems 47:103–118.
115. Hofestädt, R., and Meineke, F. Interactive modelling and simulation of
biochemical networks. 1995. Computers in Biology and Medicine
25(3):321–334.
116. Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I.,
Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson,
C. J., Bell, S. P., and Young, R. A. Genome-wide location and function of
DNA binding proteins. 2000. Science 290:2306–2309.
117. Iyer, V. R., Horak, C. E., Scafe, C. S., Botstein, D., Snyder, M., and Brown,
P. O. Genomic binding sites of the yeast cell-cycle transcription factors SBF
and MBF. 2001. Nature 409:533–538.
118. Pandey, A., and Mann, M. Proteomics to study genes and genomes. 2000.
Nature 405:837–846.
119. Zhu, H., and Snyder, M. Protein arrays and microarrays. 2001. Current
Opinion in Chemical Biology 5:40–45.
120. Tucker, C. L., Gera, J. F., and Uetz, P. Towards an understanding of complex
protein networks. 2001. Trends in Cell Biology 11(3):102–106.
121. Gardner, T. S., Cantor, C. R., and Collins, J. J. Construction of a genetic
toggle switch in Escherichia coli. 2000. Nature 403:339–342.
122. Elowitz, M., and Leibler, S. A synthetic oscillatory network of
transcriptional regulators. 2000. Nature 403:335–338.
123. Systems Biology Workbench Project. http://www.cds.caltech.edu/erato/
124. Karp, P. D., Krummenacker, M., Paley, S., and Wagg, J. Integrated
pathway–genome databases and their role in drug discovery. 1999. Trends in
Biotechnology 17:275–281.
125. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O., and
Eisenberg, D. Detecting protein function and protein–protein interactions
from genome sequences. 1999. Science 285:751–753.
126. Xenarios, I., Fernandez, E., Salwinsky, L., Duan, X. J., Thompson, M. J.,
Marcotte, E. M., and Eisenberg, D. DIP: the database of interacting
proteins: 2001 update. 2001. Nucleic Acids Research 29:239–241.
127. Wojcik, J., and Schachter, V. Protein–protein interaction map inference using
interacting domain profile pairs. 2001. Bioinformatics 17:S296–S305,
Supplement 1.
128. Pazos, F., and Valencia, A. Similarity of phylogenetic trees as indicator of
protein–protein interaction. 2001. Protein Engineering 14:609–614.
129. Niggemann, O., Lappe, M., Park, J., and Holm, L. Generating protein
interaction maps from incomplete data: application to fold assignment. 2001.
Bioinformatics 17:S149–S156, Supplement 1.
130. Xenarios, I., and Eisenberg, D. Protein interaction databases. 2001. Current
Opinion in Biotechnology 12:334–339.
131. Blaschke, C., Andrade, M. A., Ouzounis, C., and Valencia, A. Automatic
extraction of biological information from scientific text: protein–protein
interactions. In Proceedings of the 1999 Conference on Intelligent Systems for
Molecular Biology (ISMB99), Heidelberg, Germany, pp. 60–67. 1999. AAAI
Press, Menlo Park, CA.
132. Marcotte, E. M., Xenarios, I., and Eisenberg, D. Mining literature for
protein–protein interactions. 2001. Bioinformatics 17:359–363.
133. Yu, H., Krauthammer, M., Friedman, C., Kra, P., and Rzhetsky, A.
GENIES: a natural-language processing system for the extraction of
molecular pathways from journal articles. 2001. Bioinformatics 17:S74–S82,
Supplement 1.
134. van Helden, J., Naim, A., Mancuso, R., Eldridge, M., Wernisch, L., Gilbert,
D., and Wodak, S. J. Representing and analysing molecular and cellular
function using the computer. 2000. Biological Chemistry 381:921–935.
135. Kolpakov, F. A., Ananko, E. A., Kolesov, G. B., and Kolchanov, N. A.
GeneNet: a gene network database and its automated visualization. 1998.
Bioinformatics 14(6):529–537.
136. Kanehisa, M., and Goto, S. KEGG: Kyoto Encyclopedia of Genes and
Genomes. 2000. Nucleic Acids Research 28(1):27–30.
137. Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull,
M., Matys, V., Michael, H., Ohnhauser, R., Pruss, M., Schacherer, F., Thiele,
S., and Urbach, S. The TRANSFAC system on gene expression regulation.
2001. Nucleic Acids Research 29:281–284.
138. Salgado, H., Santos, A., Garza-Ramos, U., van Helden, J., Díaz, E., and
Collado-Vides, J. RegulonDB (version 2.0): a database on transcriptional
regulation in Escherichia coli. 1999. Nucleic Acids Research 27(1):59–60.
139. Hartwell, L. H., Hopfield, J. J., Leibler, S., and Murray, A. W. From
molecular to modular cell biology. 1999. Nature 402 (Supp.):C47–C52.
140. Morgan, D. O. Cyclin-dependent kinases: engines, clocks, and
microprocessors. 1997. Annual Review of Cell and Developmental Biology
13:261–291.
141. Barkai, N., and Leibler, S. Robustness in simple biochemical networks. 1997.
Nature 387:913–917.
142. Yi, T., Huang, Y., Simon, M. I., and Doyle, J. Robust perfect adaptation in
bacterial chemotaxis through integral feedback control. 2000. Proceedings of
the National Academy of Sciences of the USA 97:4649–4653.
143. Morohashi, M., Winn, A. E., Borisuk, M. T., Bolouri, H., Doyle, J., and
Kitano, H. Robustness as a measure of plausibility. 2002. Journal of
Theoretical Biology. (in press)
144. Csete, M. E., and Doyle, J. C. Reverse engineering of biological complexity.
2002. Science 295:1664–1669.
145. Houchmandzadeh, B., Wieschaus, E., and Leibler, S. Establishment of
development precision and proportions in the early Drosophila embryo. 2002.
Nature 415:798–802.
146. Sinden, R. R. DNA Structure and Function. 1994. Academic Press, San
Diego, CA.
147. Baldi, P., and Baisnée, P. F. Sequence analysis by additive scales: DNA
structure for sequences and repeats of all lengths. 2000. Bioinformatics
16(10):865–889.
148. Parvin, J. D., McCormick, R. J., Sharp, P. A., and Fisher, D. E. Pre-bending
of a promoter sequence enhances affinity for the TATA-binding factor. 1995.
Nature 373:724–727.
149. Starr, D. B., Hoopes, B. C., and Hawley, D. K. DNA bending is an important
component of site-specific recognition by the TATA binding protein. 1995.
Journal of Molecular Biology 250:434–446.
150. Grove, A., Galeone, A., Mayol, L., and Geiduschek, E. P. Localized DNA
flexibility contributes to target site selection by DNA-bending proteins. 1996.
Journal of Molecular Biology 260:120–125.
151. Pazin, M. J., and Kadonaga, J. T. SWI2/SNF2 and related proteins: ATP-
driven motors that disrupt protein–DNA interactions? 1997. Cell
88:737–740.
152. Tsukiyama, T., and Wu, C. Chromatin remodeling and transcription. 1997.
Current Opinion in Genetics and Development 7:182–191.
153. Werner, M. H., and Burley, S. K. Architectural transcription factors: proteins
that remodel DNA. 1997. Cell 88:733–736.
154. Gorm Pedersen, A., Baldi, P., Brunak, S., and Chauvin, Y. DNA structure in
human RNA polymerase II promoters. 1998. Journal of Molecular Biology
281:663–673.
155. Sheridan, S. D., Benham, C. J., and Hatfield, G. W. Activation of gene
expression by a novel DNA structural transmission mechanism that requires
supercoiling-induced DNA duplex destabilization in an upstream activating
sequence. 1998. Journal of Biological Chemistry 273:21298–21308.
156. Banavar, J. R., Maritan, A., and Rinaldo, A. Size and form in efficient
transportation networks. 1999. Nature 399:130–132.
157. Barabasi, A., and Albert, R. Emergence of scaling in random networks.
1999. Science 286:509–512.
158. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., Barabasi, A.-L. The large-
scale organization of metabolic networks. 2000. Nature 407:651–654.
159. Bollobas, B. Random Graphs. 1985. Academic Press, New York.
Appendix A
Experimental protocols
cDNA synthesis for the preparation of ³³P-labeled bacterial or eukaryotic targets for
hybridization to pre-synthesized nylon filter arrays
cDNA synthesis for the preparation of ³³P-labeled targets for hybridization to the
nylon filters is performed with 20 μg of total RNA and 37.5 ng of random hexamer
primers. To anneal the primers to the RNA, this mixture is heated at 70 °C for 3 min
and quick-cooled on ice in the presence of annealing buffer (1× RT buffer: 50 mM
Tris-HCl, 8 mM MgCl2, 30 mM KCl, 1 mM DTT, pH 8.5). cDNA synthesis is per-
formed at 42 °C for 3 hrs in a 60-μl reaction mixture containing the RNA and
primer mixture, reverse transcriptase buffer (50 mM Tris-HCl, 8 mM MgCl2, 30 mM
KCl, 1 mM DTT, pH 8.5) containing: 1 mM each dATP, dGTP, and dTTP, 50 μCi
[³³P]-dCTP (3000 Ci/mmol), 20 units of ribonuclease inhibitor III, and 4 μl (88
units) of AMV reverse transcriptase. Labeled cDNA is separated from unincorpo-
rated nucleotides on Sephadex G25 spin columns (Roche Biochemical) in a tabletop
clinical centrifuge. Three of these preparations are combined for hybridization to
the DNA microarray filters.
Hybridization of ³³P-labeled targets to pre-synthesized nylon filter arrays
The nylon filters are soaked in 2× SSPE (20× SSPE contains 3 M NaCl, 0.2 M
NaH2PO4, 25 mM EDTA) for 10 min and prehybridized in 10 ml of prehybridization
solution (5× SSPE, 2% SDS, 1× Denhardt’s solution (50× Denhardt’s solution con-
tains 5 g of Ficoll, 5 g of polyvinylpyrrolidone, 5 g of bovine serum albumin and
H2O to 500 ml) and 0.1 mg/ml of sheared herring sperm DNA) for 1 hr at 65 °C.
3–5 × 10⁷ cpm of cDNA targets in 500 μl of prehybridization solution is heated at 95
°C for 10 min, rapidly cooled on ice, and added to an additional 5.5 ml of prehy-
bridization solution. The prehybridization solution is removed and replaced with
the prehybridization solution containing the ³³P-labeled cDNA targets.
Hybridization is carried out for 15–18 hrs at 65 °C. Following hybridization, each
filter is rinsed with 50 ml of 0.5× SSPE containing 0.2% SDS at room temperature
for 3 min, followed by three more washes in 0.5× SSPE containing 0.2% SDS solu-
tion at 65 °C for 20 min each. The filters are partially air-dried, wrapped in Saran
Wrap,® and exposed to a phosphorimager screen for 15–30 hrs. Following phospho-
rimaging, the targets are stripped from the filters by microwaving at 30% of
maximal power (1400 W) in 500 ml of 10 mM Tris solution (pH 8.0) containing 1 mM
EDTA and 1% SDS for 20 min. Stripped filters are wrapped in Saran Wrap® and
stored in the presence of damp paper towels in sealed plastic bags at 4 °C.
Cy3, Cy5 target labeling of poly(A) mRNA for hybridization to glass slide arrays
Although we prefer the method described above (see Chapter 4), this method can be
modified for labeling poly(A) mRNA targets for hybridization to pre-synthesized
full-length ORF glass slide arrays by using oligo(dT) primers instead of random
hexamer primers or a mixture of oligo(dT) and random hexamer primers (5 μg/μl
each).
mRNA enrichment and biotin labeling methods for hybridization of bacterial targets
to in situ synthesized Affymetrix GeneChips™ or glass slide arrays containing
oligonucleotide probes
To enrich the proportion of mRNA in a total RNA preparation, 300 μg of total
RNA is split into 12 aliquots to increase the efficiency of the enrichment procedure.
All reactions are performed in PCR tubes in a thermocycler. For each reaction, 25
μg of total RNA is mixed with 70 pmol of a rRNA-specific primer mix in a final
volume of 30 μl. Each specific primer mix includes three specific primers for 16S
rRNA (5′-CCTACGGTTACCTTGTT-3′, 5′-TTAACCTTGCGGCCGTACTC-
3′, and 5′-TCCGATTAACGCTTGCACCC-3′) and five specific primers for 23S
rRNA (5′-CCTCACGGTTCATTAGT-3′, 5′-CTATAGTAAAGGTTCACGGG-
3′, 5′-TCGTCATCACGCCTCAGCCT-3′, 5′-TCCCACATCGTTTCCCAC-3′,
and 5′-CATGGAAAACATATTACC-3′). The procedure described here for
Escherichia coli can be used for other organisms with appropriate ribosomal-specific
oligonucleotide primers. This mixture is heated to 70 °C for 5 min and cooled to
4 °C. Now 10 μl of 10× MMLV RT buffer (0.5 M Tris-HCl (pH 8.3), 0.1 M MgCl2,
and 0.75 M KCl), 5 μl of 10 mM DTT, 2 μl of 25 mM dNTP mix, 3.5 μl of 20 U/μl
of SUPERase•In™ (Ambion Inc. catalog no. 2694), 6 μl of 50 U/μl of MMLV
reverse transcriptase and water are added to each tube to a final volume of 100 μl.
The reactions are incubated at 42 °C for 25 min and incubation is continued at 45 °C
for 20 min for cDNA synthesis. To remove the rRNA moiety from the rRNA/cDNA
hybrid, 5 μl of 10 U/μl of RNase H is added and the mixture is incubated at 37 °C
for 45 min. RNase H is inactivated by heating at 65 °C for 5 min. Newly synthesized
cDNA is removed by incubation with 4 μl of 2 U/μl of DNase I and 1.2 μl of 20
U/μl of SUPERase•In™ at 37 °C for 2 hrs. Four reactions are combined for RNA
clean-up with a single Qiagen RNeasy mini column (Qiagen catalog no. 74104). The
quantity of enriched mRNA is measured by absorbance at 260 nm. A typical yield is
10–20 μg of RNA from 300 μg of total RNA, constituting a five- to tenfold enrich-
ment of mRNA to rRNA.
For the RNA fragmentation step, a maximum of 20 μg of RNA is added to a
PCR tube containing 10 μl of 10× NEB buffer (0.7 M Tris-HCl (pH 7.6), 0.1 M
MgCl2, 50 mM DTT) for T4 polynucleotide kinase in a final volume of 88 μl. The
tube is incubated at 95 °C for 30 min and cooled to 4 °C.
For the RNA 5′-thiolation and biotin-labeling reaction, 2 μl of 5 mM γ-S-ATP
and 10 μl of 10 U/μl of T4 polynucleotide kinase are incubated with the fragmented
RNA at 37 °C for 50 min. The reaction is inactivated by heating to 65 °C for 10 min
and cooled to 4 °C. Excess γ-S-ATP is removed by ethanol precipitation.
Fragmented and thiolated RNA is collected by centrifugation in the presence of
glycogen (0.25 μg/μl) and resuspended in 90 μl of dH2O. Six μl of 500 mM MOPS
(3-(N-morpholino)propanesulfonic acid, pH 7.5) and 4.0 μl of 50 mM PEO-
iodoacetyl-biotin (Pierce Chemical catalog no. P/N 21334ZZ) are added to the frag-
mented thiolated RNA and incubated at 37 °C for 1 hr. The biotin-labeled RNA is
isolated by ethanol precipitation, washed twice with 70% ethanol, dried, and dis-
solved in 20–30 μl of Milli-Q grade water. The quantity of the biotin-labeled RNA
is measured by absorbance at 260 nm. The total yield for the entire procedure is typ-
ically 2–4 μg of biotin-labeled RNA from 300 μg of total RNA. The efficiency of
RNA fragmentation and biotin labeling is monitored with a gel shift assay in which
the biotin-labeled RNA is pre-incubated with avidin prior to electrophoresis.
Biotin-labeled RNAs are retarded during electrophoresis due to the avidin–biotin
interaction. The position of the RNA in the gel reflects the fragmentation
efficiency, and the amount of shifted RNA indicates the efficiency of the biotin
labeling. Inefficiencies in either of these parameters should be addressed before
proceeding to the hybridization step.
Distributions
In Chapter 5, we have modeled the expression level of a gene under a given condi-
tion using a Gaussian distribution parameterized by its mean and variance. We have
used a conjugate prior for the mean and variance and provided arguments in
support of this choice. Other prior distributions for the parameters of a Gaussian
distribution are possible and, for completeness, we describe here two such alternatives
studied in the literature [1, 2]. We leave it as an exercise for the reader to derive the
corresponding estimates and compare them to those derived in Chapter 5.
The flat prior $P(\mu, \sigma^2) \propto \sigma^{-2}$ leads to the posterior

$$P(\mu, \sigma^2 \mid D) = C\,\sigma^{-(n+2)} \exp\left(-\frac{(n-1)s^2 + n(\mu - m)^2}{2\sigma^2}\right) \qquad \text{(B.1)}$$

with $-\infty < \mu < +\infty$ and $\sigma > 0$. The conditional and marginal posteriors are easily
derived. The conditional posterior $P(\mu \mid \sigma^2, D)$ is normal $N(m, \sigma^2/n)$. The marginal
posterior of $\mu$ is a Student $t(n-1, m, s^2/n)$ distribution. The marginal posterior dis-
tribution of $\sigma^2$ is a scaled inverse gamma density $P(\sigma^2 \mid D) = I(\sigma^2; n-1, s^2)$.
A better prior in the case of gene expression data should assume that $\log \sigma$ is
uniform over a finite interval $[a, b]$ or that $\sigma^2$ has a bell-shaped distribution concen-
trated on positive values, such as a gamma distribution. The conjugate prior
addresses these issues.
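The estimates under the flat prior are easy to check numerically. The sketch below is illustrative (made-up replicate values for a single gene); it assumes SciPy's Student t parameterization with `loc` and `scale`, and uses the marginal posterior of the mean described above:

```python
import numpy as np
from scipy import stats

# Hypothetical log-expression replicates for one gene under one condition.
x = np.array([2.1, 2.4, 1.9, 2.6, 2.2])
n = len(x)
m = x.mean()            # posterior location for the mean
s2 = x.var(ddof=1)      # sample variance s^2

# Marginal posterior of the mean: Student t with n - 1 degrees of freedom,
# location m, and scale sqrt(s^2 / n).
post_mu = stats.t(df=n - 1, loc=m, scale=np.sqrt(s2 / n))
lo, hi = post_mu.interval(0.95)   # 95% posterior credible interval
print(m, (lo, hi))
```

The interval widens as replicates are removed, which is the practical cost of relying on few measurements per gene.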
The semi-conjugate prior
In the semi-conjugate prior distribution, the functional forms of the prior distribu-
tions on $\mu$ and $\sigma^2$ are the same as in the conjugate case (normal and scaled inverse
gamma, respectively) but independent of each other, i.e., $P(\mu, \sigma^2) = P(\mu)P(\sigma^2)$.
However, as previously discussed, this assumption is not optimal for current DNA
array data.
Other more complex priors could be constructed using mixtures. A mixture of
conjugate priors would lead to a mixture of conjugate posteriors.
The scaled inverse gamma density $I(x; \nu, s^2)$ is given by

$$I(x; \nu, s^2) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, s^{\nu}\, x^{-(\nu/2 + 1)}\, e^{-\nu s^2/2x}$$

for $x > 0$. $\Gamma$ represents the gamma function $\Gamma(y) = \int_0^{\infty} e^{-t} t^{y-1}\,dt$. The expectation is
$[\nu/(\nu - 2)]s^2$ when $\nu > 2$; otherwise it is infinite. The mode is always $[\nu/(\nu + 2)]s^2$.
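These moment formulas can be verified numerically. The sketch below assumes the standard identification of $I(x; \nu, s^2)$ with an inverse gamma distribution of shape $\nu/2$ and scale $\nu s^2/2$; the hyperparameter values are illustrative:

```python
import numpy as np
from scipy import stats

nu, s2 = 10.0, 2.0   # illustrative hyperparameters

# I(x; nu, s^2) corresponds to an inverse gamma with shape nu/2
# and scale nu * s^2 / 2.
dist = stats.invgamma(a=nu / 2, scale=nu * s2 / 2)

mean_formula = nu / (nu - 2) * s2   # finite only for nu > 2
mode_formula = nu / (nu + 2) * s2
print(dist.mean(), mean_formula)    # the two means should agree
```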
In the multivariate case, the corresponding density for a $k \times k$ covariance matrix $W$ is the
inverse Wishart

$$I(W; \nu, S) = \left[2^{\nu k/2}\, \pi^{k(k-1)/4} \prod_{i=1}^{k} \Gamma\!\left(\frac{\nu + 1 - i}{2}\right)\right]^{-1} |S|^{\nu/2}\, |W|^{-(\nu + k + 1)/2} \exp\left(-\frac{1}{2}\,\mathrm{tr}(S W^{-1})\right) \qquad \text{(B.4)}$$
$$B(x; r, s) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, x^{r-1} (1 - x)^{s-1} \qquad \text{(B.5)}$$
Mathematical complements 187
with $0 \le x \le 1$. The mean is $r/(r + s)$ and the variance $rs/[(r + s)^2(r + s + 1)]$. It is a useful
distribution for quantities that are constrained to the unit interval, such as probabil-
ities. When $r = s = 1$ it yields the uniform distribution. For most other values, it
yields a “bell-shaped” distribution over the [0, 1] interval.
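A quick numerical check of these beta moments, with illustrative parameter values:

```python
import numpy as np
from scipy import stats

r, s = 3.0, 5.0   # illustrative shape parameters
dist = stats.beta(r, s)

mean = r / (r + s)
var = r * s / ((r + s) ** 2 * (r + s + 1))
print(dist.mean() - mean, dist.var() - var)   # both differences ~ 0

# r = s = 1 recovers the uniform density on [0, 1]
print(stats.beta(1, 1).pdf(0.3))
```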
Hypothesis testing
We have modeled the log-expression level of each gene in each situation using a
Gaussian model. If all we care about is whether a given gene has changed or not, we
could model directly the difference between the log-expression levels in the control
and treatment cases. These differences can be considered pairwise or in paired
fashion, the latter being more likely with current microarray technology, where the
logarithm of the ratio between the expression levels in the treatment and control
situations is measured along two different channels (red and green).
We can model again the differences $x_t - x_c$ with a Gaussian $N(\mu, \sigma^2)$. Then the
null hypothesis H, given the data, is that $\mu = 0$ (no change). To avoid assigning a
probability of 0 to the null hypothesis, a Bayesian approach here must begin by
giving a non-zero prior probability to $\mu = 0$, which is somewhat contrived. In any
case, as in Chapter 5, we can set $P(\sigma^2) = I(\sigma^2; \nu_0, \sigma_0^2)$. For the mean $\mu$, we use the
mixture

$$\mu = \begin{cases} 0 & : \text{ with probability } p \\ N(0, \sigma^2/\lambda_0) & : \text{ with probability } 1 - p \end{cases} \qquad \text{(B.6)}$$
Missing values
It is not uncommon for array data to have missing values resulting, for instance,
from experimental errors. Ideally, missing values should be dealt with probabilisti-
cally by estimating their distribution and integrating them out. When they are not
removed entirely from the analysis, a more straightforward approximation is to
replace missing values by single point estimates. At the crudest level, a missing level
of expression could be replaced by the average of the expression levels of all the
genes in the same experiment. More adequate estimates can be derived using the
techniques of Chapter 5, by looking at “neighboring” genes with similar properties
(see also [3]).
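The neighbor-based point estimates just mentioned can be sketched as follows; the toy matrix, the choice $k = 2$, and the use of Euclidean distance over observed columns as the similarity measure are all illustrative assumptions:

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = experiments;
# NaN marks a missing measurement. Values are illustrative.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.9, 2.8],
              [5.0, 0.5, 1.0]])

def impute(X, k=2):
    """Replace each missing value by the average of the k most similar
    genes in that experiment, falling back to the experiment mean."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        obs = ~np.isnan(X[i])              # columns observed for gene i
        # Squared distance to every gene over those columns (NaNs in other
        # genes contribute zero -- a simplification acceptable for a sketch).
        d = np.nansum((X[:, obs] - X[i, obs]) ** 2, axis=1)
        d[i] = np.inf                      # exclude the gene itself
        d[np.isnan(X[:, j])] = np.inf      # neighbors must have column j
        nbrs = np.argsort(d)[:k]
        vals = X[nbrs, j]
        X[i, j] = vals.mean() if np.isfinite(vals).all() else col_mean[j]
    return X

X_imputed = impute(X)
print(X_imputed[0, 2])  # neighbors are genes 1 and 2 -> (3.0 + 2.8) / 2
```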
Gaussian processes
In the Gaussian process framework, any finite set of output values $Y_K = (y_1, \ldots, y_K)$
associated with inputs $x_1, \ldots, x_K$ is jointly Gaussian,

$$P(y_1, \ldots, y_K) = \frac{1}{Z} \exp\left(-\frac{1}{2}(Y_K - \mu)^T C^{-1}(Y_K - \mu)\right) \qquad \text{(B.7)}$$

for any sequence $\{x_i\}$, where $\mu$ is the mean vector and $C_{ij} = C(x_i, x_j)$ is the covari-
ance of $x_i$ and $x_j$. For simplicity, we shall assume in what follows that $\mu = 0$. Priors
on the noise and the modeling function are combined into the covariance matrix C.
Different sensible parameterizations for C are described below. From Equation B.7,
the predictive distribution for the variable y associated with a test case x is obtained
by conditioning on the observed training examples. In other words, a simple calcu-
lation shows that y has a Gaussian distribution

$$P(y \mid D) = \frac{1}{Z'} \exp\left(-\frac{(y - \hat{y}(x))^2}{2\sigma^2(x)}\right) \qquad \text{(B.8)}$$

with

$$\hat{y}(x) = k(x)^T C_K^{-1} Y_K, \qquad \sigma^2(x) = C(x, x) - k(x)^T C_K^{-1} k(x) \qquad \text{(B.9)}$$

where $k(x) = (C(x_1, x), \ldots, C(x_K, x))^T$, and $C_K$ denotes the covariance matrix based on
the K training samples.
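The predictive mean and variance obtained by conditioning on the training examples can be sketched numerically. The squared-exponential covariance, its hyperparameters, and the noisy sine training data below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cov(a, b, w=3.0, sig2=1.0):
    """Smooth (squared-exponential) covariance for scalar inputs."""
    return sig2 * np.exp(-w * (a - b) ** 2)

# Toy training set: noisy samples of a smooth periodic function.
n_train = 8
xk = np.linspace(0.0, 1.0, n_train)
yk = np.sin(2 * np.pi * xk) + 0.05 * rng.standard_normal(n_train)

noise = 0.05 ** 2
CK = cov(xk[:, None], xk[None, :]) + noise * np.eye(n_train)   # C_K

def predict(x):
    k = cov(xk, x)                             # (C(x_1,x), ..., C(x_K,x))
    mean = k @ np.linalg.solve(CK, yk)         # predictive mean
    var = cov(x, x) + noise - k @ np.linalg.solve(CK, k)  # predictive variance
    return mean, var

m, v = predict(0.25)
print(m, v)   # mean should lie near sin(pi/2) = 1, with small variance
```

Note that the predictive variance depends only on the inputs, not on the observed outputs, which is characteristic of Gaussian process regression.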
Covariance parameterization
A Gaussian process model is defined by its covariance function. The only constraint
on the covariance function $C(x_i, x_j)$ is that it should yield positive semi-definite
matrices for any input sample. In the stationary case, Bochner's theorem in harmonic
analysis [8] provides a complete characterization of such functions in terms of
Fourier transforms. It is well known that the sum of two positive matrices (resp. pos-
itive definite) is positive (resp. positive definite). Therefore the covariance can be
conveniently parameterized as a sum of different positive components. Examples of
useful components have the following forms:
• Noise variance: $\delta_{ij}\sigma_1^2$ or, more generally, $\delta_{ij} f(x_i)$ for an input-dependent noise
model, where $\delta_{ij} = 0$ if $i \ne j$ and $\delta_{ij} = 1$ otherwise
• Smooth covariance: $C(x_i, x_j) = \sigma_2^2 \exp\left(-\sum_{u=1}^{n} \omega_u^2 (x_{iu} - x_{ju})^2\right)$
• And more generally: $C(x_i, x_j) = \sigma_2^2 \exp\left(-\sum_{u=1}^{n} \omega_u^2\, |x_{iu} - x_{ju}|^r\right)$
• Periodic covariance: $C(x_i, x_j) = \sigma_3^2 \exp\left(-\sum_{u=1}^{n} \omega_u^2 \sin^2[\pi(x_{iu} - x_{ju})/\gamma_u]\right)$
Notice that a small value of $\omega_u$ characterizes components u that are largely irrele-
vant for the output, in a way closely related to the automatic relevance determination
framework [7]. For simplicity, we write $\theta$ to denote the vector of hyperparameters of
the model. Short of conducting lengthy Monte Carlo integrations over the space of
hyperparameters, a single value can be estimated by minimizing the negative log-
likelihood
$$\mathcal{E}(\theta) = \frac{1}{2}\log\det C_K + \frac{1}{2} Y_K^T C_K^{-1} Y_K + \frac{K}{2}\log 2\pi \qquad \text{(B.10)}$$

Without any specific shortcuts, this requires inverting the covariance matrix and is
likely to require $O(N^3)$ computations. Prediction or classification can then be
carried out based on Equation B.9. A binary classification model, for instance, is
readily obtained by defining a Gaussian process on a latent variable Z as above
and letting

$$P(y_i = 1) = \frac{1}{1 + e^{-z_i}} \qquad \text{(B.11)}$$
More generally, when there are more than two classes, one can use normalized expo-
nentials instead of sigmoidal functions.
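The negative log-likelihood of Equation B.10 is short to implement; a minimal sketch, with a sanity check against the identity covariance (where it must reduce to a sum of univariate standard normal log densities):

```python
import numpy as np

def neg_log_likelihood(CK, YK):
    """1/2 log det C_K + 1/2 Y_K^T C_K^{-1} Y_K + K/2 log 2*pi (Equation B.10)."""
    K = len(YK)
    _, logdet = np.linalg.slogdet(CK)        # stable log-determinant
    quad = YK @ np.linalg.solve(CK, YK)      # avoids forming C_K^{-1}
    return 0.5 * logdet + 0.5 * quad + 0.5 * K * np.log(2 * np.pi)

# Sanity check: C_K = I gives minus the sum of standard normal log densities.
Y = np.array([0.5, -1.0, 2.0])
print(neg_log_likelihood(np.eye(3), Y))
```

In practice the hyperparameters inside $C_K$ would be adjusted by a gradient-based optimizer applied to this function.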
so that, up to trivial constants, $\log P(H \mid x) = \sum_{i \in H} \lambda_i K(x_i, x)$, and similarly for the
negative examples. K is called the kernel function. The intuitive idea is to base the
classification of the new examples on all the previous examples weighted by two
factors: a coefficient $\lambda_i \ge 0$ measuring the importance of example i, and the kernel
$K(x_i, x)$ measuring how similar x is to example $x_i$. Thus in a sense the expression for
the discrimination depends directly on the training examples. This is different from
the case of neural networks, for instance, where the decision depends indirectly on
the training examples via the trained neural network parameters. Thus in an appli-
cation of kernel methods two fundamental choices must be made regarding the
kernel K and the weights $\lambda_i$. Variations in these choices lead to a spectrum of
different methods, including generalized linear models and SVMs.
Kernel selection
To a first approximation, from the mathematical theory of kernels, a kernel must be
positive definite. By Mercer's theorem of functional analysis, $K$ can be represented
as an inner product of the form

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
Thus another way of looking at kernel methods is to consider that the original $x$
vectors are mapped to a “feature” space via the function $\phi(x)$. Note that the feature
space can have very high (even infinite) dimension and that the vectors $\phi(x)$ have the
same length even when the input vectors $x$ do not. The similarity of two vectors is
assessed by taking their inner product in feature space. In fact, we can compute the
Euclidean distance $\|\phi(x_i) - \phi(x_j)\|^2 = K_{ii} - 2K_{ij} + K_{jj}$, which also defines a
pseudo-distance on the original vectors.
The fundamental idea in kernel methods is to define a linear or non-linear decision
surface in feature space rather than the original space. The feature space does
not need to be constructed explicitly, since all decisions can be made through the
kernel and the training examples. In addition, as we are about to see, the decision
surface depends directly on a subset of the training examples, the support vectors.
Notice that a dot product kernel provides a way of comparing vectors in feature
space. When used directly in the discrimination function, it corresponds to looking
for linear separating hyperplanes in feature space. However, more complex decision
boundaries in feature space (quadratic or higher order) can easily be implemented
using more complex kernels $K'$ derived from the inner product kernel $K$, such as:
• Polynomial kernels: $K'(x_i, x_j) = (1 + K(x_i, x_j))^m$.
• Radial basis kernels: $K'(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\,(\phi(x_i) - \phi(x_j))^t(\phi(x_i) - \phi(x_j))\right)$.
• Neural network kernels: $K'(x_i, x_j) = \tanh(x_i^t x_j + \Theta)$.
Another important class of kernels that can be derived from probabilistic generative
models are Fisher kernels [11, 13].
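The derived kernels above, and the kernel-induced pseudo-distance, can be sketched as follows. This is a hypothetical illustration built on the plain inner-product kernel (identity feature map); the function names and the parameters `m` and `sigma` are assumptions for the sake of the example.

```python
import numpy as np

def base_kernel(xi, xj):
    # Inner-product (Mercer) kernel K(x_i, x_j) = <phi(x_i), phi(x_j)>,
    # here with the identity feature map phi(x) = x.
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, m=2):
    # K'(x_i, x_j) = (1 + K(x_i, x_j))^m
    return (1.0 + base_kernel(xi, xj)) ** m

def rbf_kernel(xi, xj, sigma=1.0):
    # Radial basis kernel via the feature-space distance
    # ||phi(x_i) - phi(x_j)||^2 = K_ii - 2 K_ij + K_jj.
    d2 = base_kernel(xi, xi) - 2 * base_kernel(xi, xj) + base_kernel(xj, xj)
    return np.exp(-d2 / (2 * sigma**2))

def feature_distance(kernel, xi, xj):
    # Pseudo-distance induced on the original vectors by any kernel.
    return np.sqrt(kernel(xi, xi) - 2 * kernel(xi, xj) + kernel(xj, xj))
```

Swapping `base_kernel` for a more complex positive definite kernel changes the implicit feature space without ever constructing it explicitly.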
Figure B.1. Hyperplane separating two classes of points with the corresponding
margins and support vectors (black circles and triangles).
Weight selection
The weights $\lambda_i$ are typically obtained by iterative optimization of an objective
function (classification loss), corresponding in general to a quadratic optimization
problem. With large training sets, at the optimum many of the weights are equal to
0. The only training vectors that matter in a given decision are those with non-zero
weights; these are called the support vectors (Figure B.1).
To see this, consider an example $x_i$ with target classification $y_i$. Since the decision
is based on the sign of $D(x_i)$, ideally we would like the margin for example $i$,
$y_i D(x_i)$, to be as large as possible. Because the margin can be rescaled by rescaling
the $\lambda$s, it is natural to introduce additional constraints such as $0 \le \lambda_i \le 1$ for every $i$.
In the case where an exact separating manifold exists in feature space, a reasonable
criterion is to maximize the margin in the worst case. This is sometimes called margin
maximization and corresponds to $\max_\lambda \min_i y_i D(x_i)$. SVMs can be defined as a
class of kernel methods based on structural risk minimization [9, 10, 14].
Substituting the expression for $D$ in terms of the kernel yields $\max_\lambda \min_i
\sum_j \lambda_j y_i y_j K_{ij}$. This can be rewritten as $\max_\lambda \min_i \sum_j A_{ij} \lambda_j$, with $A_{ij} = y_i y_j K_{ij}$ and
$0 \le \lambda_j \le 1$. It is clear that in each minimization procedure all weights $\lambda_j$ associated
with a non-zero coefficient $A_{ij}$ will be either 0 or 1. With a large training set, many of
them will be zero for each $i$, and this will remain true at the optimum. When the
margins are violated, as in most real-life examples, we can maximize the average
margin, the average being taken with respect to the weights $\lambda_i$ themselves, which are
intended to reflect the relevance of each example. Thus, in general, we want to maximize
a quadratic expression of the form $\sum_i \lambda_i y_i D(x_i)$ under a set of linear constraints
on the $\lambda_i$. Standard techniques exist to carry out such optimization. For
example, a typical function used for minimization in the literature is:
$$\mathcal{L}(\lambda) = \sum_i \lambda_i \left[\, y_i D(x_i) - 2 \,\right] \qquad \text{(B.15)}$$
The solution to this constrained optimization problem is unique provided that, for
any finite set of examples, the corresponding kernel matrix $K_{ij}$ is positive definite. It
can be found with standard iterative methods, although the convergence can sometimes
be slow. To accommodate training errors or biases in the training set, the
kernel matrix $K$ can be replaced by $K + D$, where $D$ is a diagonal matrix whose
entries are either $d_+$ or $d_-$ in the locations corresponding to positive and negative
examples [9, 10, 13, 14]. An example of the application of SVMs to gene expression
data can be found in [12].
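The constrained optimization can be sketched with projected gradient descent on the objective of Equation B.15, a simple stand-in for the standard quadratic-programming solvers; the toy data, learning rate, and function names below are illustrative assumptions.

```python
import numpy as np

def fit_weights(K, y, lr=0.01, steps=2000):
    # Minimize L(lambda) = sum_i lambda_i [y_i D(x_i) - 2]   (Equation B.15)
    # with D(x_i) = sum_j lambda_j y_j K_ij, subject to 0 <= lambda_i <= 1.
    A = np.outer(y, y) * K                 # A_ij = y_i y_j K_ij
    lam = np.full(len(y), 0.5)
    for _ in range(steps):
        grad = 2 * A @ lam - 2             # gradient of lam^T A lam - 2 sum(lam)
        lam = np.clip(lam - lr * grad, 0.0, 1.0)   # project onto the box [0, 1]
    return lam

def decision(K_rows, y, lam):
    # D(x) = sum_i lambda_i y_i K(x_i, x); classification is sign(D).
    return K_rows @ (lam * y)

# Toy 1-D example with a linear kernel.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
lam = fit_weights(np.outer(X, X), y)
```

Here the weights concentrate on the two examples nearest the boundary, the support vectors, while the outer points receive weight near zero and drop out of the decision function.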
In summary, kernel methods and SVMs have several attractive features. As
presented, these are supervised learning methods that can leverage labeled data. These
methods can build flexible decision surfaces in high-dimensional feature spaces; the
flexibility is related to the choice of the kernel function. Overfitting
can be controlled through some form of margin maximization. These methods can
handle inputs of variable length, such as biological sequences, as well as large
feature spaces. Feature spaces need not be constructed explicitly, since the decision
surface is entirely defined in terms of the kernel function and, typically, a sparse
subset of relevant training examples, the support vectors. Learning is typically
achieved through iterative solution of a linearly constrained quadratic optimization
problem.
1. Box, G. E. P., and Tiao, G. C. Bayesian Inference in Statistical Analysis. 1973. Addison
Wesley.
2. Pratt, J. W., Raiffa, H., and Schlaifer, R. Introduction to Statistical Decision Theory.
1995. MIT Press, Cambridge, MA.
3. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein,
D., and Altman, R. Missing value estimation methods for DNA microarrays. 2001.
Bioinformatics 17:520–525.
4. Williams, C. K. I., and Rasmussen, C. E. Gaussian processes for regression. In D. S.
Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information
Processing Systems, vol. 8. 1996. MIT Press, Cambridge, MA.
5. Gibbs, M. N., and MacKay, D. J. C. Efficient implementation of Gaussian processes.
1997. Technical Report, Cavendish Laboratory, Cambridge.
6. Neal, R. M. Monte Carlo implementation of Gaussian process models for Bayesian
regression and classification. 1997. Technical Report No. 9702, Department of
Computer Science, University of Toronto.
7. Neal, R. M. Bayesian Learning for Neural Networks. 1996. Springer-Verlag, New York.
8. Feller, W. An Introduction to Probability Theory and its Applications, vol. 2, 2nd edn.
1971. Wiley, New York.
9. Vapnik, V. The Nature of Statistical Learning Theory. 1995. Springer-Verlag, New
York.
10. Cristianini, N., and Shawe-Taylor, J. An Introduction to Support Vector Machines. 2000.
Cambridge University Press, Cambridge.
11. Jaakkola, T. S., Diekhans, M., and Haussler, D. Using the Fisher kernel method to
detect remote protein homologies. In T. Lengauer, R. Schneider, P. Bork, D. Brutlag, J.
Glasgow, H. W. Mewes, and R. Zimmer, editors, Proceedings of the 7th International
Conference on Intelligent Systems for Molecular Biology (ISMB99), pp. 149–155. 1999.
AAAI Press, Menlo Park, CA.
12. Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Walsh Sugnet, C., Ares, M.
Jr., Furey, T. S., and Haussler, D. Knowledge-based analysis of microarray gene
expression data by using support vector machines. 2000. Proceedings of the National
Academy of Sciences of the USA 97:262–267.
13. Baldi, P., and Brunak, S. Bioinformatics: The Machine Learning Approach, 2nd edn.
2001. MIT Press, Cambridge, MA.
14. Burges, C. J. C. A tutorial on support vector machines for pattern recognition. 1998.
Data Mining and Knowledge Discovery 2:121–167.
Appendix C
Internet resources
This Appendix provides a short list of Internet resources and pointers. By its very
nature, it is bound to be incomplete and should serve only as a starting-point.
BioCatalog
http://www.ebi.ac.uk/biocat/e-mail_Server_ANALYSIS.html
The publisher has used its best endeavors to ensure that the URLs for external
websites referred to in this book are correct and active at the time of going to
press. However, the publisher has no responsibility for the websites and can make
no guarantee that a site will remain live or that the content is or will remain
appropriate.
DNA microarray technology to identify genes controlling spermatogenesis
http://www.mcb.arizona.edu/wardlab/microarray.html
NHGRI ArrayDB
http://genome.nhgri.nih.gov/arraydb
yMGV: an interactive on-line tool for visualization and data mining of published
genome-wide yeast expression data
http://www.biologie.ens.fr/fr/genetiqu/puces/publications/ymgv/index.html
DNA array databases
ArrayExpress: The ArrayExpress Database at EBI
http://www.ebi.ac.uk/arrayexpress
Genex: CyberT - Statistical Analysis for Large Scale Gene Expression Data
Figure D.1. The CyberT interface at the UCI genomics web site.
Gene C1 C2 C3 C4 E1 E2 E3 E4
Index
Affymetrix Gene Chips™ nylon filter data vs. GeneChipTM data 122
bacterial target preparation 39–40 replicate reduction 116–117, 117
non-polyadenylated mRNA 40–41, 41 beta distribution 107, 107, 186–187
data acquisition 48 (Appendix)
hybridization of biotinylated mRNA biotin labeling methods
targets 181 bacterial targets 180–181 (Appendix)
modeling probe pair set data 124 eukaryotic targets 44, 181–183
mRNA enrichment/biotin labeling (Appendix)
methods 40, 41, 180–181 Boolean networks 146, 147–149
nylon filter data vs. GeneChip™ data attractors 148, 149
119–124 connectivity 148
photolithography 7, 8, 9 transient states 148
Archaea 4 Brown method of DNA array manufacture
Arabidopsis thaliana 4, 10 11–12
attractors 148, 149, 150
automated DNA sequencing 2 Caenorhabditis elegans 4
Campylobacter jejuni 10
Bacillus subtilis Candida albicans 10
DNA 10 cDNA synthesis/labeling
regulatory networks 143 33
P-labeled target preparation 177
sporulation 157, 158 pre-synthesized DNA arrays 45
bacteria charge-coupled device (CCD) 20
gene expression profiling 32–41 Chlamydia trachomatis 43
target preparation 32 clustering 78–87
message-specific vs. random hexamer cluster image map 131, 132
primers 33–37, 34, 36 cost function 81
non-poly(A) mRNA 40–41, 41 data types 79
for in situ synthesized DNA arrays Euclidean distance 80
39–40 hierarchical 82–84
see also Escherichia coli; other specific see also hierarchical clustering
bacteria method application 124
bacteriophage T7 44–45 multiple-variable perturbation
Bayesian networks 162–163 experiments 131–132, 132
Bayesian probabilistic framework 55–56 number of clusters (K) 80–81
improved statistical inference 109–119 overview 78–81
clustering (cont.) CyberT software 63, 65, 67, 199–205
parametric 81 improved statistical inference 110
Pearson correlation coefficient 80 nylon filter data vs. Affymetrix
probabilistic interpretation 81 GeneChip™ data 122–123
results interpretation 128–131
similarity/distance 80 data acquisition
supervised/unsupervised 79–80 Affymetrix GeneChip™ experiments 48
two-variable perturbation experiments commercial software 46–47
127–128, 129 fluorescent signals 17–21
CombiMatrix, electrical addressing system nylon filter experiments 45, 48
10–11, 12 radioactive signals 21, 24–27
commercial sources data analysis see statistical analysis
confocal laser scanning microscopy 22–23 databases
data acquisition software 46–47 DNA arrays 197
DNA arrays 10 National Centre for Biotechnology
glass slide arrays 10 Information (NCBI) 4,197
nylon-filter-based DNA arrays 10 systems biology 163–165
phosphorimaging instruments 27 design, experimental see experimental
computational models of regulatory design/analysis
networks 146–163 die (Markov) model 56, 57
Bayesian networks 162–163 differential equations, computational
Boolean networks 146, 147–149 models 149–153
see also Boolean networks neural networks 151–153
continuous models 149–153 partial differential equations 157–160
differential equations 149–153 repellors 150
see also differential equations saddle points 150
discrete models 147–149 stable attractors 150
learning or model fitting 153–155 digital light processors (DLPs) 8
neural networks 151–153 Dirichlet prior 62–63
partial differential equations 157–160 distributions
power-law formalism (S-systems) 153 beta 186–187
probabilistic models 162–163 hypothesis testing 187
qualitative modelling 155–157, 158 inverse Wishart 186
reaction-diffusion models 157–160 non-informative improper prior 185
stochastic equations 160–161 scaled inverse gamma 186
confocal laser scanning microscopy 18–20, semi-conjugate prior 186
19 Student t 186
commercial sources 22–23 DLPs (digital light processors) 8
principle 19 DNA array readout methods
conjugate prior 58–60, 70 Affymetrix GeneChip™ experiments 48
connectivity 167 commercial software 46–47
Boolean networks 148 fluorescent signals 17–21
Corning, array production method 12 nylon filter experiments 45, 48
covariance parameterization 188–189 radioactive signals 21, 24–27
culture conditions 30 DNA arrays viii–ix
Cy3/Cy5 (cyanine) dyes 17–18 applications viii, x–xii
Cy3, Cy5 poly(A) mRNA target labeling cDNA synthesis, 33P-labeled target
179 preparation 177
hybridization of Cy3-, Cy5-labeled commercial sources 10
targets to glass slide arrays 179 data acquisition see data acquisition
RNA target labeling 178–179 databases 197
Cyanobacteria 10 definition 7
fibre optic based 14–15 Sigma-Genosys PanoramaTM E. coli Gene
filter-based see filter-based DNA arrays Array 21, 24, 33
fluid arrays 13–14 Euclidean distance, clustering 80
formats 7–15 eukaryotic cells, target preparation
glass slide arrays poly(A) mRNA, problems 42–44
Cy3, Cy5 poly(A) mRNA target total RNA solution 44–45
labeling 179 evolution, DNA array technologies role x
Cy3, Cy5 RNA target labeling 178–179 experimental design/analysis 99–100
micro-bead based 13–14 ad hoc empirical method 105–106
non-conventional technologies 13–15 Bayesian statistical framework 109–119
pre-synthesized 11–12 clustering/visualization method
see also pre-synthesized DNA arrays application 124
quantum dots 15 computational method 106–109
regulatory regions 90–93 cutoff levels 117–119
in situ synthesized see in situ synthesized design 99–100
oligonucleotide DNA arrays differentially expressed gene identification
DNA chips see DNA arrays 100–101, 101, 102
DNA dice 56, 57 E. coli regulatory genes 97–133
DNA ligases 2 error source determination 101–103
DNA microarrays see DNA arrays global false positive rate estimation 103,
DNA probes 7 104, 105–109
in situ synthesis 7–11 leucine-responsive regulatory protein
DNA regulatory regions, systems biology (Lrp) 99–100
143–145 modeling probe pair set data 124
DNA sequencing, automated 2 nylon filter data vs. Affymetrix
DNA structure 2, 167 GeneChip™ data 119–124
Drosophila melanogaster 4 replicate reduction 116–117, 117
dynamic programming recurrence 83, 84 two-variable perturbation experiments
125–132
eigenvectors 77 experimental error 29–52, 99
electrical addressing systems 10–11 E. coli regulatory genes studies 99,
EM algorithms 85–87 101–103
enzyme–substrate interaction 137–138 error source determination 101–103
errors, experimental see experimental error normalization methods 49–51
Escherichia coli primary sources of variation 29–32
Affymetrix GeneChipTM 39, 40–41, 124 culture conditions 30
experimental design/analysis, regulatory RNA isolation procedures 31–32, 103
genes 97–133 sample differences 29–30
Fnr, regulatory protein 125,130–131 special considerations for bacteria 32–41
gene regulation, hierarchical organization mRNA enrichment 40–41, 41
165–166 mRNA turnover 37–39, 38
history of genomics 1, 2 target preparation with non-
integration host factor (IHF) 116–117, polyadenylated mRNA 40–41, 41
117 target preparation with poly(A) mRNA
leucine-responsive regulatory protein from eukaryotic cells 42–44
(Lrp) 99–100 see also false positive rates
as a model organism 1, 2, 10, 97–98 expressed sequence tags (ESTs) 3–4
open reading frame (ORFs) arrays 13
regulatory genes false positive rates
differentially expressed genes 100–101 Bayesian regularization 110, 113
errors 99, 101–103 error source determination 103
experimental design/analysis 97–133 filter/target preparation differences 104
false positive rates (cont.) National Centre for Biotechnology
global false positive rate estimation Information (NCBI) databases 4,
ad hoc empirical method 105–106 197
computational method 106–109 genomics, history see history, genomics
global rate estimation 103, 104, 105–109 glass slide arrays
statistical analysis 68–69 commercial sources 10
fiber-optic-based DNA arrays 14–15 Cy3, Cy5 poly(A) mRNA target labeling
filter-based DNA arrays 12–13 179
33
P-labeled target hybridization 178 Cy3, Cy5 RNA target labeling 178–179
cDNA synthesis, 33P-labeled target G-protein coupled receptors 141, 142
preparation 177
data acquisition 45, 48 Helicobacter pylori 10
nylon filter data vs. Affymetrix hierarchical clustering 82–84
GeneChip™ data 119–124 algorithms 82
fluorescent signal detection 17–21 dynamic programming recurrence 83, 84
background fluorescence 18 tree visualization 82–84
charge-coupled device (CCD) 20 two-variable perturbation experiments
confocal laser scanning 18–20, 19, 22–23 127–128, 129
see also confocal laser scanning history, genomics viii-ix, 1–5
microscopy automated DNA sequencing 2
emission measurements 18 DNA ligases 2
fluorophore excitation 17–18 DNA structure 2
photomultiplier tubes 20 E. coli as a model organism 1, 2
see also fluorophores HIV DNA 10
fluorophores expressed sequence tags (ESTs) 3–4
Cy3/Cy5 (cyanine) dyes 17–18 gene expression regulation 2
Cy3, Cy5 poly(A) mRNA target Human Genome Project 3–4
labeling 179 polymerase chain reaction (PCR) 2
hybridization of Cy3-/Cy5-labeled targets restriction endonucleases 2
to glass slide arrays 179 Hotelling transform 75–77, 78
RNA target labeling 178–179 housekeeping genes, normalization methods
excitation 17–18 50
limitations 15 human DNA 10
quantum dots 15 Human Genome Project x, 3–4
Fnr, regulatory protein 125, 130–131 sequencing 3–4
hybridization
Gaussian models 56–58, 187–188 bacterial targets, biotin labeling methods
covariance parameterization 188–189 180–181
GeneChips™ see Affymetrix GeneChipsTM biotinylated mRNA targets to Affymetrix
gene expression profiling 7 GeneChips™ 181
experimental design/analysis 97–133 Cy3-, Cy5-labeled targets to glass slide
see also experimental design/analysis arrays 179
experimental problems/pitfalls 29–52 eukaryotic targets, biotin labeling
mammalian systems 43 methods 181–183
non-conventional technologies 13–15 hyperparameter point estimates 63–65, 64
normalization methods see normalization hypothesis testing 187
methods
special considerations for bacteria 32–41 inkjet (piezoelectric) printing 8
gene expression regulation 2, 142–145 in situ synthesized oligonucleotide DNA
hierarchical organization in E.coli 165–166 arrays 7–11
genome sequencing Affymetrix Inc. 7, 9
Human Genome Project 3–4 digital light processors (DLPs) 8
electrical addressing systems 10–11 Cy3, Cy5 labeling 179
maskless array synthesizer (MAS) 8,10 experimental error sources 42–44
mRNA enrichment/biotin labeling multiple gene expression analysis 53
methods 180–181 multiple-variable perturbation experiments
photolithographic method 7–8, 9 131–132, 132
piezoelectric (inkjet) printing 8 Mus musculus 4, 10
preparation of bacterial targets 39–40
integration host factor (IHF), E. coli Nanogen, electrical addressing system
116–117, 117 10–11, 12
Internet resources 195–197 National Centre for Biotechnology
inverse Wishart distribution 70, 186 Information (NCBI) databases 4,
197
Karhunen–Loève transform 75–77, 78 negative feedback loops 166
kernel methods 189–193 Neisseria meningitidis 10
kernel selection 190 neural networks 151–153
weight selection 191–192 normalization methods 49–51
k-means 85–87 by global scaling 50–51
to housekeeping genes 50
lasers, fluorescent signal readout 17–18 to reference RNA 50
learning or model fitting computational to total or ribosomal RNA 49
models 153–155 nylon-filter-based DNA arrays 12–13
33
Leishmania major 4 P-labeled target hybridization 178
leucine-responsive regulatory protein (Lrp) cDNA synthesis, 33P-labeled target
99–100 preparation 177
luminescence, photo-stimulated 21, 24, 25 commercial sources 10
data acquisition 45, 48
Markov models 56, 57, 162, 163 nylon filter data vs. Affymetrix
maskless array synthesizer (MAS) 8, 10 GeneChip™ data 119–124
mean of the posterior (MP) 62–63
metabolic networks 1, 140–141 parameter point estimates 62–63
Michaelis–Menten model 138 Dirichlet prior 62–63
Microarray Suite software 122, 123 mean of the posterior (MP) 62–63
micro-bead-based DNA arrays 13–14 mode of the posterior (MAP) 62
mixture models 85–87 parametric clustering 81
model fitting or learning computational partial differential equations 157–160
models 153–155 Pearson correlation coefficient 80
mode of the posterior (MAP) 62 perturbation experiments
motif recognition 90 multiple-variable 131–132, 132
mRNA two-variable 125–132
bacterial 32 phosphorimaging instruments 21, 24–26
degradation 38 commercial sources 27
enrichment 40, 41, 180–181 photo-stimulated luminescence 25
experimental error sources 31–32, 103 resolution 45
message-specific vs. random hexamer Sigma-Genosys PanoramaTM E. coli
primers, target preparation 33–37, Gene Array 21, 24
34, 36 photolithography 7–8, 9
non-polyadenylated, target preparation photomultiplier tubes 20
with 40–41, 41 photo-stimulated luminescence 21, 24,
RNA isolation 32, 177 25
turnover rapidity 37–39, 38 piezoelectric (ink-jet) printing 8, 12
eukaryotic, poly(A) mRNA target Plasmodium falciparum 4
preparation polymerase chain reaction (PCR) 2
posterior probability for differential see also computational models of
expression (PPDE) 108, 108, regulatory networks
117–118 regulatory region identification 90–93
power-law formalism (S-systems) 153 repellors 150
pre-synthesized DNA arrays 11–12 replicate number
33
P-labeled target hybridization 178 reduction, Bayesian probabilistic
cDNA synthesis, 33P-labeled target framework 116–117, 117
preparation 177 statistical analysis 65, 67–69
Cy3, Cy5 poly(A) mRNA target labeling restriction endonucleases 2
179 RNA
Cy3, Cy5 RNA target labeling 178–179 messenger see mRNA
target cDNA synthesis/labeling 45 reference, normalization methods 50
primers, message-specific vs. random ribosomal, normalization method 49
hexamer 33–37, 34, 36 total
principal component analysis 75–77, 76 bacterial, isolation procedure 177
eigenvectors 77 normalization method 49
two-variable perturbation experiments target preparation using 43–44
127–128, 130 RNA isolation procedures
printing, piezoelectric (inkjet) 8, 12 experimental error 31–32, 103
priors 58–60 total RNA, bacterial 177
conjugate 58–60, 70
Dirichlet 62–63 Saccharomyces cerevisiae 4, 10
non-informative improper 185 target preparation with poly(A) mRNA
semi-conjugate 186 42–43
probabilistic modeling 55–65 saddle points 150
Bayesian probabilistic framework 55–56 scaled inverse gamma distribution 186
computational models 162–163 scaling laws 167–168
conjugate prior 58–60, 70 sea urchin, DNA regulatory regions 143,
full Bayesian treatment vs. hypothesis 144, 145
testing 60, 61, 62 Sigma-Genosys PanoramaTM E. coli Gene
Gaussian models 56–58 Array, phosphorimaging 21, 24, 33
hyperparameter point estimates 55–65 signal detection
Markov models 56, 57 fluorescent signals 17–21
parameter point estimates 62–63 radioactive signals 21, 24–27
prior 58–60 see also data acquisition; fluorescent
protein networks 140–141 signal detection
signal transduction cascades 141
qualitative modeling 155–157, 158 simulations 65–69, 137–145
quantum dots 15 single-gene expression analysis 53
singular value decomposition (SVD) 75–77,
radioactive labeling, for pre-synthesized 78
DNA arrays 45 S-systems (power-law formalism) 153
radioactive signal detection 21, 24–27 stable attractors 150
phosphorimaging 21, 24, 24–26, 27 Staphylococcus aureus 10
photo-stimulated luminescence 21, 24, statistical analysis ix, 53–96
25 Bayesian probabilistic framework 55–56
rat DNA 10 improved statistical inference 109–119
reaction–diffusion models of development nylon filter data vs. Affymetrix
157–160 GeneChip™ data 122
regulation of gene expression 2, 142–145 replicate reduction 116–117, 117
regulatory networks 142–145 clustering 78–87
computational models 146–163 commercial software 88–89
conjugate prior 58–60, 70 Streptococcus pneumoniae 10
covariance parameterization 188–189 Strongylocentrotus purpuratus 143
dimensionality reduction 75–77 Student t distribution 186
distributions 185–187 support vector machines 189–193
see also distributions systems biology 135–175
EM algorithms 85–87 basic methodology 136
extensions 69–71 computational models 146–163
false positive rates 68–69 see also computational models of
see also false positive rates regulatory networks
full Bayesian treatment vs. hypothesis DNA regulatory regions 143–145
testing 60, 61, 62 enzyme-substrate interaction 137–138
Gaussian models 56–58, 187–188 gene expression regulation 142–145
hyperparameter point estimates 63–65, 64 G-protein-coupled receptors 141,142
inferring changes 53–72 graphs/pathways 139–140
problems/common approaches 53–55 Hodgkin–Huxley model 139
inverse Wishart distribution 70, 186 metabolic networks 1, 140–141
kernel methods 189–193 Michaelis–Menten model 138
kernel selection 190 molecular reactions 137–139
weight selection 191–192 protein networks 140–141
k-means 85–87 regulatory networks 142–145
Markov models 56, 57 representation/simulation 137–145
missing values 187 search for general principles 165–168
mixture models 85–87 DNA structure 167
motif finding 90 hierarchical organization of gene
multiple gene expression analysis 53, 73 regulation 165–166
overrepresentation 90–93, 92 negative feedback loops 166
parameter point estimates 62–63 scaling laws 167–168
Dirichlet prior 62–63 signal transduction cascades 141
mean of the posterior (MP) 62–63 software/databases 163–165
mode of the posterior (MAP) 62
principal component analysis 75–77, 76 target preparation see bacteria; eukaryotic
probabilistic modeling 55–65 cells, mRNA
see also probabilistic modeling total RNA isolation, bacterial 177
regulatory region identification 90–93 transcription, regulation 143
replicate number 65, 67–69 t-test 54–55
simulations 65–69 full Bayesian treatment vs. hypothesis
single-gene expression analysis 53 testing 60, 61, 62
support vector machines 189–193 hyperparameter point estimates 65
t-test 54–55 parameter point estimates 62–63
full Bayesian treatment vs. hypothesis simple vs. regularized 110, 111
testing 60, 61, 62 twofold rule 54, 113
hyperparameter point estimates 65 two-variable perturbation experiments
parameter point estimates 62–63 125–132
simple vs. regularized 110, 111
twofold rule 54, 113 variation, experimental/biological 29–32
visualization 75–77, 78
stochastic equations 160–161 Wishart distribution, inverse 70