0% found this document useful (0 votes)
14 views20 pages

2000-Genetic Network Inference-From Coexpression Clustering To Reverse Engineering

Uploaded by

Haimabati Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views20 pages

2000-Genetic Network Inference-From Coexpression Clustering To Reverse Engineering

Uploaded by

Haimabati Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

Vol. 16 no.

8 2000
BIOINFORMATICS Pages 707–726

Genetic network inference: from co-expression


clustering to reverse engineering
Patrik D’haeseleer 1, Shoudan Liang 2, 3 and Roland Somogyi 3
1 Universityof New Mexico, Department of Computer Science, Albuquerque, NM
87131, USA, 2 NASA Ames Research Center, Moffett Field, CA 94035, USA and
3 Incyte Pharmaceuticals, Inc., 3174 Porter Dr, Palo Alto, CA 94304, USA

Received on December 20, 1999; revised on February 29, 2000; accepted on March 3, 2000

Abstract • Which genes are responsible for this disease?


motivation: Advances in molecular biological, analytical
and computational technologies are enabling us to sys- • Which drugs will treat this disease?
tematically investigate the complex molecular processes
underlying biological systems. In particular, using high- Beginning with gene sequencing, we are identifying the
throughput gene expression assays, we are able to measure structure of thousands of genes, and a variety of structural
the output of the gene regulatory network. We aim here to and regulatory features that provide functional clues.
review datamining and modeling approaches for concep- However, only the molecular machinery of the cell is
tualizing and unraveling the functional relationships im- able to consistently interpret the sequence information
plicit in these datasets. Clustering of co-expression pro- into the functions which determine the complex genetic
files allows us to infer shared regulatory inputs and func- and biochemical networks that define the behavior of an
tional pathways. We discuss various aspects of clustering, organism. Since we ultimately seek understanding of the
ranging from distance measures to clustering algorithms regulatory skeleton of these networks, we are also taking
and multiple-cluster memberships. More advanced anal- steps to monitor the molecular activities on a global level
ysis aims to infer causal connections between genes di- to reflect the effective functional state of a biological
rectly, i.e. who is regulating whom and how. We discuss system. Several technologies, ranging from hybridization
several approaches to the problem of reverse engineer- microarrays, automated RTPCR, to two-dimensional gel
ing of genetic networks, from discrete Boolean networks, electrophoresis and antibody arrays, allow us to assay
to continuous linear and non-linear models. We conclude RNA and protein expression profiles with differing levels
that the combination of predictive modeling with system- of precision and depth.
atic experimental verification will be required to gain a But how should we organize this process of activity-
deeper insight into living organisms, therapeutic targeting data acquisition? How should we interpret these results
and bioengineering. to enhance our understanding of living organisms, and
Contact: [email protected]; [email protected]; identify therapeutic targets and critical processes for
[email protected] bioengineering? Here it is crucial to find the proper
abstractions around which to build modeling frameworks
Introduction and data analysis tools. These abstractions must be
centered on two important principles:
Novel, high-throughput technologies are opening global
perspectives of living organisms on the molecular level. 1. Genetic information flow: Defining the mapping
Together with a vast experimental literature on biomolec- from sequence space to functional space.
ular processes, these data are now providing us with the
The genome contains the information for constructing the
challenge of multifunctionality, implying regulatory net-
complex molecular features of an organism, as reflected
works as opposed to isolated, linear pathways of causality
in the process of development. In information terms, the
(Szallasi, 1999). Questions which have traditionally been
complexity of the fully developed organism is contained
posed in the singular are now being addressed in the plu-
in that of the genome. But which processes link these two
ral.
types of information, in other words, what are the codes
• What are the functions of this gene? that translate sequence into structure and function? We
need to represent these codes in a form understandable
• Which genes regulate this gene? to us, so that we may apply them to model building and


c Oxford University Press 2000 707
P.D’haeseleer et al.

prediction. We therefore seek methods that allow us to One way to make progress in understanding the princi-
extract these codes from gene sequence and activity data. ples of network behavior is to radically simplify the indi-
vidual molecular interactions, and focus on the collective
2. Complex dynamic systems: From states to outcome. Boolean networks (Kauffman, 1969) represent
trajectories to attractors. such a simplification: each gene is considered as a binary
When addressing biological function, we usually refer variable—either ON or OFF—regulated by other genes
to functions in time, i.e. causality and dynamics. On a through logical or Boolean functions (as can be found in
molecular level, function is manifested in the behavior the biochemical literature: see e.g. Yuh et al., 1998). Even
of complex networks. The dynamics of these networks with this simplification, the network behavior is already
resemble trajectories of state transitions—this is what we extremely rich (Kauffman, 1993). Many useful concepts
are monitoring when conducting temporal gene expression naturally emerge from such a simple mathematical model.
surveys. The concept of attractors is what really lends For example, cell differentiation corresponds to transitions
meaning to these trajectories, i.e. the attractors are the from one global gene expression pattern to another. Stabil-
high-dimensional dynamic molecular representations of ity of global gene expression patterns can be understood as
stable phenotypic structures such as differentiated cells a consequence of the dynamic properties of the network,
and tissues, either healthy or diseased (Kauffman, 1993; namely that all networks fall into one or more attractors,
Somogyi and Sniegoski, 1996). Our goal is to understand representing stable states of cell differentiation, adaptation
the dynamics to the point where we can predict the or disease.
attractor of a molecular network, and know enough about For a Boolean network with N genes, the total number
the network architecture to direct these networks to of global gene expression patterns can be very large
attractors of choice, e.g. from a cancerous cell type to a even for moderate N (2 N : each gene can be either on
benign cell type, from degeneration to regeneration etc. or off independently). We assume that each gene is
The goal of this review is to discuss principles of genetic controlled by up to K other genes in the network. The
network organization, and computational methods for ex- connectivity K is a very important parameter for network
tracting network architectures from experimental data. In dynamics (with large K , the dynamics tends to be more
Section A conceptual approach to complex network dy- chaotic). In Random Boolean Network models, these K
namics we will present a brief introduction to Boolean inputs, and a K -input Boolean function, are chosen at
Networks, as a useful conceptual framework to think about random for each gene (for a justification of this choice,
the dynamic behavior of large, interacting regulatory net- see Kauffman, 1993). The Boolean variables (ON/OFF
works. Section Inference of regulation through cluster- states of the genes) at time t + 1 are determined by the
ing of gene expression data describes methods and ap- state of the network at time t through the K inputs as
plications for the clustering of genes based on their ex- well as the logical function assigned to each gene. An
pression patterns. In Section Modeling methodologies, excellent tool for calculating and visualizing the dynamics
we give an overview of the different modeling method- of these networks is the DDLAB software (Wuensche,
ologies that may be used to model regulatory networks. 1993, 1998).
Finally, Section Gene network inference: reverse engi- Because the total number of expression patterns is finite,
neering shows how genetic regulatory networks may be the system will eventually return to an expression pattern
inferred directly from large-scale gene expression data, in- it has visited earlier. Since the system is deterministic, it
cluding estimates of how much data is needed to do so. will from then on keep following the exact same cycle
of expression patterns. This periodic state cycle is called
the attractor of the network. For example, in Figure 1 we
A conceptual approach to complex network show the repeating 6-state cycle attractor pattern within
dynamics the trajectory of a 12-element network (right panel). This
In higher metazoa, each gene or protein is estimated on trajectory is only one of many alternative trajectories also
average to interact with four to eight other genes (Arnone leading to the same attractor, as shown in the ‘basin of
and Davidson, 1997), and to be involved in 10 biological attraction’ graph (lower panel). If a state within the basin
functions (Miklos and Rubin, 1996). The global gene of attraction is perturbed to any other state within it, the
expression pattern is, therefore, the result of the collective dynamics inexorably converge to the same attractor. This
behavior of individual regulatory pathways. In such highly feature helps explain why cell types or cellular adaptations
interconnected cellular signaling networks, gene function to particular environments are stable with respect to small
depends on its cellular context; thus understanding the perturbations from a variety of internal or external noise
network as a whole is essential. However, dynamic sources.
systems with large numbers of variables present a difficult The long-term dynamics are determined by the attrac-
mathematical problem. tors. In the non-chaotic case, which is the only case of

708
Genetic network inference

Fig. 1. Dynamics of the Boolean model of a genetic network illustrated using the DDLAB software (Wuensche, 1993, 1999). Wiring,
trajectory and basin of attraction. Top panel, wiring diagram of binary gene interactions. The network consists of hypothetical regulatory
gene modules (1, 3, 6) and dependent modules of structural genes (S, U, T—share identical wiring and rules). Right inset, trajectory.
As determined by the wiring and rules, the network passes through a series of states from a given starting state, finally arriving at a repeating
attractor pattern, a six-state cycle in this case (dark grey = ON, light grey = OFF; time points numbered at left; modules S, U and T have
been collapsed into a single element for simplicity). Bottom panel, basin of attraction. The states of the trajectory shown in the inset are
displayed as a series of connected points (labeled by time points) arriving in a cyclic graph resembling the attractor. The additional nodes in
the graph resemble other states which also lead to the same attractor, therefore the term ‘basin of attraction’—all states in this graph merge
into a single attractor (Somogyi and Sniegoski, 1996).

practical interest, the number of states in the attractor is Bhattacharjya and Liang, 1996), rather than exponentially
typically only a very small fraction of the total number of with N . The notion that gene expression patterns are
states, growing as a square root of N (Kauffman, 1993; constrained is in general agreement with the experimental

709
P.D’haeseleer et al.

findings of large-scale gene expression data. For example, been a significant aggregate change over all conditions,
genes from the same sequence or functional family do (c) or whether the fluctuation pattern shows high diversity
not act independently, but tend to fluctuate in parallel, according to Shannon entropy (Fuhrman et al., 2000).
reducing effective N (Wen et al., 1998). In terms of A comparison of these criteria in the analysis for toxic
attractor cycles, it is rare for a gene to oscillate more response genes has shown that the Shannon entropy allows
than once during the cell cycle in the yeast (Chu et al., the clearest distinction of drug-specific expression patterns
1998). Also, the change in gene expression pattern during (Cunningham et al., 2000).
diauxic shift—when yeast metabolism switches from Beyond straightforward scoring methods, we would like
glucose to ethanol (DeRisi et al., 1997)—has been found to classify gene expression patterns to explore shared
to be very similar to the change in expression pattern functions and regulation. This can be accomplished using
during adaptive evolution to growth in a low glucose clustering methods. The simplest approach to clustering,
medium (Ferea et al., 1999). sometimes referred to as GBA (Guilt By Association), is
The number of distinct attractors of a Random Boolean to select a gene and determine its nearest neighbors in
Network also tends to grow as a square root of the number expression space within a certain user-defined distance
of genes (Kauffman, 1993). If we equate the attractors cut-off (Walker et al., 1999). Genes sharing the same
of the network with individual cell types, as Kauffman expression pattern are likely to be involved in the same
suggests, this explains why a large genome of a few billion regulatory process. Clustering allows us to extract groups
base pairs is capable of a few hundred stable cell types. of genes that are tightly co-expressed over a range of
This convergent behavior implies immense complexity
different experiments. If the promoter regions of the genes
reduction, convergence and stabilization in networks of
are known—as is the case for yeast—it is possible to
constrained architecture.
identify the cis-regulatory control elements shared by
Boolean networks provide a useful conceptual tool for
the co-expressed genes. Several algorithms to extract
investigating the principles of network organization and
dynamics (Wuensche, 1999). We can study the role of common regulatory motifs from gene clusters have been
various constraints on global behavior in terms of network developed (Brazma et al., 1998; Roth et al., 1998;
complexity, stability and evolvability. From experimental van Helden et al., 1998; Wolfsberg et al., 1999). For
studies we are learning about constraints in the number example, Tavazoie et al. (1999) identified 18 biologically
of inputs and outputs per gene, input and output sharing significant DNA-motifs in the promoter region of gene
among genes evolved within a gene family or pathway, clusters based on cell-cycle expression patterns. Most
and restrictions on rule types (thresholding, no ‘exclusive motifs were highly selective for the cluster in which
or’ rules etc.). Investigations into abstract models will they were found, and seven were known regulatory
help us understand the cybernetic significance of network motifs for the genes in their respective clusters. In an
features, and provide meaningful questions for targeted example from brain development (Figure 2), correlations
experimental exploration. between cis elements and expression profiles can be
established, but are sensitive to the clustering method
used.
Inference of regulation through clustering of gene We will briefly review important issues involved in
expression data clustering and some of the main clustering methods
Introduction used, as well as a few classical clustering methods which
Large-scale gene screening technologies such as mRNA have not yet been adopted in gene expression analysis.
hybridization micro-arrays have dramatically increased We should caution the reader that different clustering
our ability to explore the living organism at the genomic methods can have very different results (see for example
level (Zweiger, 1999). Large amounts of data can be Figure 2), and—at this point—it is not yet clear which
routinely generated. In order to identify genes of interest, clustering methods are most useful for gene expression
we need software tools capable of selecting and screening analysis. Claverie (1999) provides a preliminary review
candidate genes for further investigation (Somogyi, 1999). of gene expression analysis techniques with a focus on
At the simplest level, we can determine which genes coexpression clustering. Niehrs and Pollet (1999) provide
show significant expression changes compared with a an overview of very tightly coexpressed groups of genes
control group in a pair-wise comparison. As data sets (which they call ‘synexpression groups’) that have been
become more complex, covering a variety of biological identified based on large-scale gene expression data.
conditions or time series, one may consider several scoring For further reading, some useful textbooks on cluster-
methods for selecting the most interesting genes; e.g. ing include Massart and Kaufman (1983), Aldenderfer
according to (a) whether there has been a significant and Blashfield (1984) and Kaufman and Rousseeuw
change at any one condition, (b) whether there has (1990).

710
Genetic network inference

Hippocampal Development (B) Hippocampal Development


(A) Min/Max Normalization - Agglomerative Clustering Max Normalization - K-means Clustering
Cis- Cis- Cluster

E15
E18
E15
E18

P10
P13
P25
P10
P13
P25
Cluster bifurcation Gene Cluster identifier Gene

P0
P3
P7
P0
P3
P7

A
A
element element no.
cyclinA # #
# # # # # # 000 00000000000000000 MOG # # # # # # # # # 1
cyclinB # ## # # # 000 00000000000001000 5HT3 # # # # # # # # # 2
G67I8086 # #
# # # # # # 000 00000000000001010 aFGF # # # # # # # # # 2
G67I86 # #
# # # # # # 000 00000000000001011 GRb3 # # # # # # # # # 2
Brm # # #
# # # # # # 000 00000000000001100 IP3R3 # # # # # # # # # 2
nestin # # # # # # 000 00000000000001101 PDGFb # # # # # # # # # 2
IGFR2 # # # # # # 000 00000000000002000 PDGFR # # # # # # # # # 2
IGFR1 # # # # # # # # # 000 00000000000010000 mAChR1 # # # # # # # # # 3
H4 # # # 000 00000100000000000 nAChRa3 # # # # # # # # # 3
ODC # # # # # # # # # 000 00000100000000010 trkC # # # # # # # # # 3
IGF2 # # # # 000 00000100000000100 BDNF # # # # # # # # # 4
GAD65 # # # # # # # # # 000 00000100000001000 Brm # # # # # # # # # 5
GRg3 # # # # # # # # # 000 00000100000100000 cyclinB # # # # # # # # 5
GRa3 # # # # # # 000 00000100000100010 G67I8086 # ## # # # # # # 5
actin # # # # # # 000 00000100010000000 G67I86 # # # # # # # # # 5
cellubrevin # # # # # # 000 00000100010000100 IGFR1 # # # # # # # # # 5
GRg2 # # # # # 000 00000100010000110 nestin # # # # # # # 5
IP3R3 # # # # # # # # # 000 00000100100000000 TH # # # # # # # # # 5
5HT3 # # # # # # # # # 000 00000200000000000 ACHE # # # # # # # # # Krox-24 6
TCP # # # # # # # # # 000 00001000000000000 actin # # # # # # # # # 6
EGF # # # # # # # # # 000 00001000000000010 CCO1 # # # # # # # # # 6
GRa5 # # # # # # # # # 001 00000000000000000 CCO2 # # # # # # # # # 6
GRa1 # # # # 0 01 00000000000000010 cellubrevin # # # # # # # # # 6
IP3R2 # # # # # # 001 00000000001000000 cjun # # # # # # # # # 6
S100beta # # # # # # # # # 001 00000000001000001 CNTF # # # # # # # # # 6
NFL # # # # # Krox-24 001 00000000001000100 CNTFR # # # # # # # # # 6
nAChRa5 # # # # # 001 00000000001000101 CX43 # # # # # # # # # 6
mGluR3 # # # # # 001 00000000001100000 cyclinA # # 6
GRa4 # # # # 001 00000000001100001 EGF # # # # # # # # # 6
NT3 # # # # # 001 00000000001100100 FABP # # # # # # # # # 6
GRb1 # # 001 00000000001100101 GAD65 # # # # # # # # # 6
GAD67 # # # # # # # # # Krox-24 001 00000000001101000 GAP43 # # # # # # # # # 6
GRa2 # # 001 00000000001101010 GMFb # # # # # # # # # 6
GRg1 # # 001 00000000001101011 GRa3 # # # # # # # # # 6
5HT1b # # # # # # # # # 001 00000000001102000 GRg2 # # # # # # # # # 6
NFM # # # # # # # # # 001 00000000001110000 GRg3 # # # # # # # # # 6
bFGF # # # Krox-24 001 00000000002000000 H4 # # # # # # # # # 6
GFAP # # # # # # 001 00000000003000000 IGF1 # # # # # # # # # 6
mAChR2 # # # # # # # # # 001 00000000010000000 IGF2 # # # # # # # # # 6
GRb2 # # # # # # # # # 001 00000000020000000 IGFR2 # # # # # # # # # 6
nAChRa7 # # # # # # # # # Krox-24 001 00000000020000010 InsR # # # # # # # # # 6
preGAD67 # # # # # # # # # 001 00000000020000100 MK2 # # # # # # # # # 6
NFH # # # # # # 001 00000000100000000 neno # # # # # # # # # Krox-24 6
synaptophysin # # # # # # # # # 001 00000000100000010 ODC # # # # # # # # # 6
FABP # # # # # # # # # 001 00000010000000000 synaptophysin # # # # # # # # # 6
PDGFb # # # # # # # # # 001 00000010000000010 TCP # # # # # # # # # 6
PDGFR # # # # # # # # # 001 00000010000001000 GFAP # # # # # # # # # 7
GRb3 # # # # # # # # # 001 00000010000010000 mGluR8 # # # # # # # # # 7
mAChR1 # # # # # # # # # 001 00000020000000000 NFH # # # # # # # # # 7
ACHE # # # # # # Krox-24 001 00000100000000000 NFL # # # # # # # # # Krox-24 7
mGluR8 # # # # # # # # # 001 00000200000000000 S100beta # # # # # # # # # 7
5HT2 # # # # # # # # # 001 00000300000000000 5HT1b # # # # # # # # # 8
CX43 # # # # # # # # # 001 00001000000000000 5HT2 # # # # # # # # # 8
neno # # # # # # # # # Krox-24 001 00001000000001000 bFGF # # # # # # # # # Krox-24 8
IGF1 # # # # # # # # # 001 00001000000001010 GAD67 # # # # # # # # # Krox-24 8
aFGF # # # # # # # # # 002 00000000000000000 GRa1 # # # # # # # # # 8
GAP43 # # # # # # # # # 003 00000000000000000 GRa2 # # # # # # # # # 8
CCO1 # # # # # # 010 00000000000000000 GRa4 # # # # # # # # # 8
MOG # # # 020 00000000000000000 GRa5 # # # # # # # # # 8
cjun # # # # 020 00000000000010000 GRb1 # # # # # # # # # 8
BDNF # # # # # # # # # 020 00000000000010010 GRb2 # # # # # # # # # 8
MK2 # # # # # # # # # 100 00000000000000000 GRg1 # # # # # # # # # 8
CCO2 # # # # # # # # # 100 00000000000000100 IP3R2 # # # # # # # # # 8
TH # # # # # # # # # 100 00000000000010000 mAChR2 # # # # # # # # # 8
CNTF # # # # # 100 00000000001000000 mGluR3 # # # # # # # # # 8
CNTFR # # ## # # # # # 100 00000010000000000 nAChRa5 # # # # # # # # # 8
GMFb # # # # 100 00000010000001000 nAChRa7 # # # # # # # # # Krox-24 8
nAChRa3 # # # # # # 200 00000000000000000 NFM # # # # # # # # # 8
InsR # # # # # # # # 200 00000000000000100 NT3 # # # # # # # # # 8
trkC # # # # # # # # # 200 00000000001000000 preGAD67 # # # # # # # # # 8

Fig. 2. Comparison of clustering methods on hippocampal development gene expression data. Genes were grouped into eight clusters with
both methods. Note the relative positions of the genes sharing a Krox-24 transcriptional regulatory element among the clusters. (A) Gene
expression patterns were normalized to their respective minima and maxima, and clustered using an agglomerative algorithm. The cluster
bifurcation pattern is shown on the left; cluster boundaries were drawn at the depth shaded in grey (left), based on the 20-level cluster
identifier that captures the branching pattern for each gene within the dendrogram (right). (B) Gene expression patterns were normalized to
their respective maxima, and clustered using a numerical K -means algorithm (Fuhrman et al., 2000).

711
P.D’haeseleer et al.

Distance measures and preprocessing Figure 2 for an example of a single data set clustered using
Most clustering algorithms take a matrix of pairwise different normalizations and clustering methods. When
distances between genes as input. The choice of distance using relative expression levels (for example, microarray
measure—used to quantify the difference in expression data), the data will tend to be log-normally distributed,
profiles between two genes—may be as important as so the logarithm of the relative expression values should
the choice of clustering algorithm. Distance measures be used. Califano et al. (2000) suggest using a nonlinear
can be divided into at least three classes, emphasizing transformation into a uniform distribution for each gene
different regularities present within the data: (a) similarity instead, which will tend to spread out the clusters more
according to positive correlations, which may identify effectively.
similar or identical regulation; (b) similarity according to
Clustering algorithms
positive and negative correlations, which may also help
identify control processes that antagonistically regulate All clustering algorithms assume the pre-existence of
downstream pathways; (c) similarity according to mutual groupings of the objects to be clustered. Random noise and
information, which may detect even more complex other uncertainties have obscured these groupings. The
relationships. objectives of the clustering algorithm are to recover the
So far, most clustering studies in the gene expression original grouping among the data.
literature use either Euclidean distance or Pearson cor- Clustering algorithms can be divided into hierarchical
relation between expression profiles as a distance mea- and non-hierarchical methods. Non-hierarchical methods
sure. Other measures used include Euclidean distance be- typically cluster N objects into K groups in an iterative
tween expression profiles and slopes (for time series Wen process until certain goodness criteria are optimized.
et al., 1998), squared Pearson correlation (D’haeseleer et Examples of non-hierarchical methods include K -means,
al., 1997), Euclidean distance between pairwise correla- expectation-maximization (EM) and Autoclass. Hierar-
tions to all other genes (Ewing et al., 1999), Spearman chical methods return an hierarchy of nested clusters,
rank correlation (D’haeseleer et al., 1997), and mutual in- where each cluster typically consists of the union of two
formation (D’haeseleer et al., 1997; Michaels et al., 1998; or more smaller clusters. The hierarchical methods can
Butte and Kohane, 2000). be further distinguished into agglomerative and divisive
Conspicuously absent so far are distance measures methods, depending on whether they start with single-
that can deal with the large numbers of highly related object clusters and recursively merge them into larger
measurements in the data sets. For example, clustering clusters, or start with the cluster containing all objects and
yeast genes based on all publicly available data will recursively divide it into smaller clusters. In this section,
be highly biased towards the large cell-cycle data sets: we review several clustering methods for gene expression
73 data points in four time series, containing almost eight (see also Figure 2 for comparison of agglomerative and
complete cell cycles (Spellman et al., 1998), whereas K -means clustering).
only a single data point may be present for various The K -means algorithm (MacQueen, 1967) can be
stress conditions, mutations, etc. Correlation between the used to partition N genes into K clusters, where K is
experiments will also lead to highly elliptical clusters, pre-determined by the user (see Tavazoie et al., 1999)
which form a problem for clustering methods that are for an application to yeast gene expression). K initial
biased towards compact, round clusters (such as K - cluster ‘centroids’ are chosen—either by the user, to
means). A distance measure that can deal with the reflect representative expression patterns, or at random—
covariance between experiments in a principled way (e.g. and each gene is assigned to the cluster with the nearest
Mahalanobis distance Mahalanobis, 1936) may be more centroid. Next, the centroid for each cluster is recalculated
appropriate here. For even longer time series, distance as the average expression pattern of all genes belonging
measures based on Fourier or wavelet transforms may be to the cluster, and genes are reassigned to the closest
considered. centroid. Membership in the clusters and cluster centroids
A related issue is normalization and other preprocessing are updated iteratively until no more changes occur, or
of the data. Distance measures that are sensitive to scaling the amount of change falls below a pre-defined threshold.
and/or offsets (such as Euclidean distance) may require K -means clustering minimizes the sum of the squared
normalization of the data. Normalization can be done with distance to the centroids, which tends to result in round
respect to the maximum expression level for each gene, clusters. Different random initial seeds can be tried to
with respect to both minimum and maximum expression assess the robustness of the clustering results.
levels or with respect to the mean and standard deviation The Self-Organized Map (SOM) method is closely re-
of each expression profile. From a statistical point of lated to K -means and has been applied to mRNA expres-
view, we recommend using the latter, unless there is a sion data of yeast-cell cycles as well as hematopoietic dif-
good reason to preserve the mean expression values. See ferentiation of four well-studied model cell lines (Tamayo

712
Genetic network inference

et al., 1999). The method is more structured than K -means the distance between their average expression pattern.
in that the cluster centers are located on a grid. At each After N − 1 steps, all the genes are merged together
iteration, a randomly selected gene expression pattern at- into an hierarchical tree. Other hierarchical methods may
tracts the nearest cluster center, plus some of its neighbors calculate the distance between clusters differently. In
in the grid. Over time, fewer cluster centers are updated (unweighted pair-group method using arithmetic aver-
at each iteration, until finally only the nearest cluster is ages (UPGMA) Sneath and Sokal, 1973) for example, the
drawn towards each gene, placing the cluster centers in the distance between two clusters is defined as the average
center of gravity of the surrounding expression patterns. distance between genes in the two clusters.
Drawbacks of this method are that the user has to specify Ben-Doret al. (1999) have developed a clustering algo-
a priori the number of clusters (as for K -means), as well rithm based on random graph theory. Their method shares
as the grid topology, including the dimensions of the grid features with both agglomerative hierarchical clustering
(typically one, two or three dimensional) and the number and K -means. Clusters are constructed one at a time. The
of clusters in each dimension (e.g. eight clusters could be gene with the largest ‘affinity’ (smallest average distance
mapped to a 2×4 two-dimensional grid or a 2×2×2 three- to all other genes in the cluster) is added to the cluster,
dimensional cube). The artificial grid structure makes it if the affinity is larger than a cut-off. A gene can also be
very easy to visualize the results, but may have residual removed from the cluster if its affinity drops below the cut-
effects on the final clustering. Optimization techniques for off. A finite number of clusters are constructed depending
selecting the number of clusters developed for K -means on the cut-off. The ability to remove ill-fitting genes from
can presumably be used here too. the cluster is an attractive feature of this algorithm. Zhu
The Expectation-Maximization (EM) algorithm (Demp- and Zhang (2000) used a similar algorithm to cluster yeast
ster et al., 1977) for fitting a mixture of Gaussians (also sporulation data.
known as Fuzzy K -Means Bezdek, 1981) is very similar to Alon et al. (1999) used a divisive hierarchical algorithm
K -means, and has been used by Mjolsness et al. (1999b) to cluster gene expression data of colon cancer. The
to cluster yeast data. Rather than classifying each gene into method relies on the maximum entropy principle and
one specific cluster, we can assign membership functions attempts to find the most likely partition of data into
(typically Gaussians, or any other parametric probability clusters at a given ‘cost’ (sum of squared within-cluster
distribution) to each cluster, allowing each gene to be part distances). Starting from a single cluster with large cost,
of several clusters. As in K -means, we alternately update as the allowed cost is lowered, the cluster breaks up
the membership for each expression pattern, and then the spontaneously into multiple clusters in order to maximize
parameters associated with each cluster: centroid, covari- the entropy for the configuration, within the constraint of
ance and mixture weight. Cluster boundaries are sharp and fixed total cost.
linear in K -means, smooth and rounded in EM. Califano et al. (2000) have developed a clustering al-
Autoclass is also related to EM, in that it finds a mixture gorithm to identify groups of genes which can be used
of probability distributions. In addition, it uses Bayesian for phenotype classification of cell types, by searching for
methods to derive the maximum posterior probability clas- clusters of microarray samples that are highly correlated
sification, and the optimum number of clusters (Cheese- over a subset of genes. Only the most significant clusters
man and Stutz, 1996). are returned. The same technique could be used to find
Wen et al. (1998) used the FITCH hierarchical cluster- clusters of genes that are highly coexpressed over a sub-
ing algorithm (Felsenstein, 1993) to group the expression set of their expression profiles. Han et al. (1997) used a
patterns of 112 genes in spinal-cord development, pro- similar, partial matching approach to group objects into a
ducing a graph similar to the phylogenetic trees famil- hypergraph based on correlations over subsets of the data.
iar to most biologists Sokal and Michener (1958). The In a hypergraph, each hyperedge (corresponding to a sin-
expression clusters captured the main waves of gene ex- gle cluster) connects several nodes (genes), so each node
pression in development. While the algorithm used in this (gene) can be part of several hyperedges (clusters). Mjol-
study minimizes the overall distance in the tree, the com- sness et al. (1999a) developed a hierarchical algorithm that
putational requirement grows with the fourth power of the places objects into a directed, acyclic graph, where each
number of elements, making it impractical for much larger cluster can be part of several parent clusters. The algo-
data sets. rithm optimizes the number of clusters, cluster positions
Eisen et al. (1998) applied a standard agglomerative and partial cluster memberships of objects, such as to pro-
hierarchical clustering algorithm, average-linkage anal- vide the most compact graph structure. All three clustering
ysis, to large-scale gene expression data. Starting with methods allow genes to be part of several clusters, possi-
N clusters containing a single gene each, at each step bly coinciding with multiple regulatory motifs or multi-
in the iteration the two closest clusters are merged into ple functional classifications for each gene. This makes
a larger cluster. Distance between clusters is defined as them especially appropriate for eukaryotic gene expres-

713
P.D’haeseleer et al.

sion where genes are controlled by complex inputs from


multiple transcription factors and enhancers.

Other applications of co-expression clusters


Gene expression clustering is potentially useful in at
least three areas: (i) extraction of regulatory motifs (co-
regulation from co-expression); (ii) inference of functional
annotation; (iii) as a molecular signature in distinguishing
cell or tissue types.
Genes in the same expression cluster will tend to
share biological functions. In a system as complex as
the developing rat spinal cord, expression clustering
clearly led to a segregation according to functional gene
families (Wen et al., 1998). Moreover, cluster–function
relationships exist over several methods of classification;
for example, neurotransmitter receptor ligand classes
and sequence/pathway groups each selectively map to
expression waves (Figure 3). Tavazoie et al. (1999),
used K -means to partition yeast genes during the cell
cycle into 30 distinct clusters, and found the members
of each cluster to be significantly enriched for genes
with similar functions. Functions of unknown genes may
be hypothesized from genes with known function within
the same cluster. Yeast genes with previously unknown
functions have been identified from their temporal pattern Fig. 3. Gene expression clusters reflect gene families and pathways.
of expression during spore morphogenesis and their Neurotransmitter receptors follow particular expression waveforms
functional role in sporulation has been confirmed in according to ligand and functional class. Waves 1–4 correspond
deletion experiments (Chu et al., 1998). to major expression clusters found in rat spinal-cord development
mRNA expression can be regarded as a molecular signa- (Wen et al., 1998). Note that the early expression waves 1 and 2
ture of a cell’s phenotype. Clustering of gene expression are dominated by ACh and GABA receptors, and by receptor ion-
patterns helps differentiate different cell types, which is channels in general. Each line represents a gene, and can be traced
by following its reflection at each node (Agnew, 1998).
useful, for example, in recognizing subclasses of cancers
(Alon et al., 1999; Golub et al., 1999; Perou et al., 1999;
Alizadeh et al., 2000). Two-way clustering of both the of any arbitrary shapes and sizes in a multidi-
genes and experiments allows for easy visualization (Eisen mensional pattern space. Each clustering cri-
et al., 1998; Alon et al., 1999; Alizadeh et al., 2000; Wein- terion imposes a certain structure on the data,
stein et al., 1997). Because activities of genes are often re- and if the data happens to conform to the re-
lated to each other, gene expression is highly constrained, quirements of a particular criterion, the true
and gene expression patterns under different conditions clusters are recovered.
can be very similar. Clustering is necessary for identify-
ing the coherent patterns. It is impossible to objectively evaluate how ‘good’ a spe-
cific clustering is without referring to what the clustering
Which clustering method to use? will be used for. However, once an application has been
We have discussed several different distance measures identified, it may be possible to evaluate objectively the
and clustering algorithms. Each combination of distance quality of the clustering for that particular application.
measure and clustering algorithm will tend to emphasize For example, if we want to extract regulatory motifs
different types of regularities in the data. Some may be from clusters, we can compare clustering methods based
useless for what we want to do. Others may give us on the P-values of the resulting motifs. Similarly, for
complementary pieces of information. Jain and Dubes functional classification, we can compare P-values asso-
(1988) state: ciated with enrichment of clusters in certain functional
categories. It is unlikely that there would be a single
There is no single best criterion for obtaining best-clustering method for all applications. Considering
a partition because no precise and workable the overwhelming number of combinations of distance
definition of ‘cluster’ exists. Clusters can be measures and clustering algorithms—far too many to

714
Genetic network inference

try them all each time—the field is in dire need of a (Arkin et al., 1998) included a total of 67 parameters, and
comparison study of the main combinations for some of required supercomputers for its stochastic simulation.
the standard applications, such as functional classification In-depth biochemical modeling is very important in un-
or extraction of regulatory motifs. If we want to use gene derstanding the precise interactions in common regulatory
clusters to infer regulatory interactions, synthetic data mechanisms, but clearly we cannot expect to construct
generated from small but detailed models of regulatory such a detailed molecular model of, say, an entire yeast
networks could provide a useful touchstone for comparing cell with some 6000 genes by analysing each gene
clustering methods. Preliminary results comparing SOM, and determining all the binding and reaction constants
K -means, FITCH and Autoclass—all using Euclidean one-by-one. Likewise, from the perspective of drug target
distance—showed very poor performance of all clus- identification for human disease, we cannot realistically
tering methods in identifying a metabolic pathway with hope to characterize all the relevant molecular interactions
associated regulation of the enzymes by the metabolites one-by-one as a requirement for building a predictive
(Mendes, 1999). disease model. There is a need for methods that can
The greatest challenge in cluster analysis lies in faith- handle large-scale data in a global fashion, and that can
fully capturing complex relationships in biological net- analyse these large systems at some intermediate level,
works. As stated above, a gene may participate in mul- without going all the way down to the exact biochemical
tiple functions, and is controlled by the activities of many reactions.
other genes through a variety of cis-regulatory elements.
Therefore, for complex datasets spanning a variety of bio- Boolean or continuous
logical responses, a gene should by definition be a mem- The Boolean approximation assumes highly cooperative
ber of several clusters, each reflecting a particular aspect binding (very ‘sharp’ activation response curves) and/or
of its function and control. As more data becomes avail- positive feedback loops to make the variables saturate
able to accurately delineate expression behavior under dif- in ON or OFF positions. However, examining real gene
ferent conditions, we should consider using some of the expression data, it seems clear that genes spend a lot of
clustering methods that partition genes into non-exclusive their time at intermediate values: gene expression levels
clusters. Alternatively, several clustering methods could tend to be continuous rather than binary. Furthermore,
be used simultaneously, allocating each gene to several important concepts in control theory that seem indis-
clusters based on the different regularities emphasized by pensable for gene regulation systems either cannot be
each method. implemented with Boolean variables, or lead to a radically
different dynamical behavior: amplification, subtraction
Modeling methodologies and addition of signals; smoothly varying an internal
Cluster analysis can help elucidate the regulation (or co- parameter to compensate for a continuously varying
regulation) of individual genes, but eventually we will environmental parameter; smoothly varying the period of
have to consider the integrated behavior of networks of a periodic phenomenon like the cell cycle, etc. Feedback
regulatory interactions. Various types of gene regulation control (see e.g. Franklin et al., 1994) is one of the most
network models have been proposed, and the model of important tools used in control theory to regulate system
choice is often determined by the question one is trying variables to a desired level, and reduce sensitivity to both
to answer. In this section we will briefly address some of external disturbances and variation of system parameters.
the decisions that need to be made when constructing a Negative feedback with a moderate feedback gain has a
network model, and the trade-offs associated with each. stabilizing effect on the output of the system. However,
negative feedback in Boolean circuits will always cause
Level of biochemical detail oscillations, rather than increased stability, because the
Gene-regulation models can vary from the very abstract— Boolean transfer function effectively has an infinite slope
such as Kauffman’s random Boolean networks (Kauff- (saturating at 0 and 1). Moreover, Savageau (1998) iden-
man, 1993)—to the very concrete—such as the full tified several rules for gene circuitry (bacterial operons)
biochemical interaction models with stochastic kinetics that can only be captured by continuous analysis methods.
in Arkin et al. (1998). The former approach is the most In this study, positive and negative modes of regulation
mathematically tractable, and its simplicity allows ex- were respectively linked to high and low demand for
amination of very large systems (thousands of genes). expression, and a relationship was established between
The latter fits the biochemical reality better and may the coupling of regulator and effector genes and circuit
carry more weight with the experimental biologists, capacity and demand.
but its complexity necessarily restricts it to very small Some of these problems can be alleviated by hybrid
systems. For example, the detailed biochemical model Boolean systems. In particular, Glass (1975, 1978) has
of the five-gene lysis–lysogeny switch in Lambda phage proposed sets of piecewise linear differential equations,

715
P.D’haeseleer et al.

where each gene has a continuous-valued internal state, proach to unpredictability is to construct models that can
and a Boolean external state. Researchers at the Free manipulate entire probability distributions rather than just
University of Brussels (Thomas, 1991; Thieffry and single values. Stochastic differential equations could be
Thomas, 1998) have proposed an asynchronously updated used for example. Of course, this does add a whole new
logic with intermediate threshold values. These systems level of complexity to the models. Alternatively, a deter-
allow for easy analysis of certain properties of networks, ministic model can sometimes be extended by a simplified
and have been used for qualitative models of small gene analysis of the variance on the expected behavior.
networks, but still do not seem appropriate for quantitative
modeling of real, large-scale gene expression data. Spatial or non-spatial
Spatiality can play an important role, both at the level of
Deterministic or stochastic intercellular interactions, and at the level of cell compart-
One implicit assumption in continuous-valued models is ments (e.g. nucleus versus cytoplasm versus membrane).
that fluctuations in the range of single molecules can Most processes in multicellular organisms, especially
be ignored. Differential equations are already widely during development, involve interactions between dif-
used to model biochemical systems, and a continuous ferent cells types, or even between cells of the same
approach may be sufficient for a large variety of interesting type. Some useful information may be extracted using
mechanisms. However, molecules present at only a few a nonspatial model (see for example D’haeseleer et al.
copies per cell do play an important role in some (1999); D’haeseleer and Fuhrman (2000) for a non-spatial
biological phenomena, such as the lysis–lysogeny switch model of CNS development and injury), but eventually a
in Lambda phage (Ptashne, 1992). In that case, it may be spatial model will be needed.
impossible to model the behavior of the system exactly Spatiality adds a whole extra dimension of complex-
with a purely deterministic model. ity to the models: spatial development, cell-type interac-
These stochastic effects—which have mainly been tions, reservoirs, diffusion constants, etc. In some cases,
observed in prokaryotes—may not play as much of a the abundance of data—spatial patterns—can more than
role in the larger eukaryotic cells. In yeast, most mRNA make up for the extra complexity of the model. For ex-
species seem to occur at close to one mRNA copy per cell ample, Mjolsness et al. (1991) used a time series of one-
(Velculescu et al., 1997; Holstege et al., 1998a), down to dimensional spatial patterns to fit a simple model of eve
0.1 mRNA/cell or less (i.e. the mRNA is only present 10% stripe formation in Drosophila. Models like the ones pro-
of the time or less in any one cell). Low copy numbers like posed by Marnellos and Mjolsness (1998) for the role of
these could be due to leaky transcription and not have any lateral interactions in early Drosophila neurogenesis pro-
regulatory role. Also, if the half-life of the corresponding vide experimentally testable predictions about potentially
protein (typically measured in hours or days) is much important interactions.
larger than the half-life of the mRNA (averaging around
20 min in yeast (Holstege et al., 1998b)), the protein Data availability
level may not be affected by stochastic fluctuations in In general, we must also realize that molecular activity
mRNA. Analysis of mRNA and protein decay rates and measurements are limited and are carried out over popula-
abundances may allow us to identify those few genes for tion averages of cells, not on individual cell circuits. Mod-
which stochastic modeling may prove necessary. eling methodologies must, therefore, be designed around
Particle-based models can keep track of individual the level of organization for which data is actually accessi-
molecule counts, and often include much biochemical ble. An exhaustive model must take into account RNA and
detail and/or spatial structure. Of course, keeping track protein concentrations, phosphorylation states, molecular
of all this detail is computationally expensive, so they are localization and so forth, since each molecular variable
typically only used for small systems. A related modeling carries unique information. However, due to present lim-
technique is Stochastic Petri Nets (SPNs), which can be itations in measuring technology, these data are not rou-
considered a subset of Markov processes, and can be tinely accessible. Modeling is then challenged with pro-
used to model molecular interactions (Goss and Peccoud, viding as much predictive power as possible given limited
1998). Whereas fitting the parameters of a general particle data on molecular states. The constraints and redundan-
model to real data can be quite difficult, optimization cies in biological networks suggest that much may still be
algorithms exist for SPNs. Hybrid Petri Nets (Mounts gained even though not all parameters involved in the pro-
and Liebman, 1997; Matsuno et al., 2000) include both cess may be modeled.
discrete and continuous variables, allowing them to model
both small-copy number and mass action interactions. Forward and inverse modeling
Additional sources of unpredictability can include ex- Some of the more detailed modeling methodologies listed
ternal noise, or errors on measured data. The Bayesian ap- above have been used to construct computer models of

716
Genetic network inference

small, well-described regulatory networks. Of course, covering problem is solved using the branch and bound
this requires an extensive knowledge of the system in technique. They devise a perturbation strategy that may be
question, often resulting from decades of research. In used by laboratory scientists for systematic experimental
this review, we will not focus on this forward modeling design. It should be pointed out that the problem of design-
approach, but rather on the inverse modeling, or reverse ing a Boolean circuit that corresponds to certain input–
engineering problem: given an amount of data, what output mappings is well studied in electrical engineering,
can we deduce about the unknown underlying regulatory and several efficient algorithms exist (e.g. Espresso Bray-
network? Reverse engineering typically requires the use ton et al., 1984) that could provide inspiration for reverse
of a parametric model, the parameters of which are then engineering of Boolean regulatory networks.
fit to the real-world data. If the connection structure of
the regulatory network (i.e. which genes have a regulatory Data requirements
effect on each other) is unknown, the parametric model To correctly infer the regulation of a single gene, we need
will necessarily have to be very general and simplistic. to observe the expression of that gene under many dif-
The results of this sort of model only relate to the ferent combinations of expression levels of its regulatory
overall network structure. While this will imply little inputs. This implies sampling a wide variety of different
about the actual molecular mechanisms involved, much environmental conditions and perturbations. Gene expres-
helpful information will be gained on genes critical for a sion time series yield a lot of data, but all the data points
biological process, sufficient for the identification of drug tend to be about a single dynamical process in the cell, and
targets for example. Once the network structure is well will be related to the surrounding time points. A 10-point
known, a more detailed model may be used to estimate time series generally contains less information than a data
individual mechanism-related parameters, such as binding set of 10 expression measurements under dissimilar envi-
and decay constants. ronmental conditions, or with mutations in different path-
ways. The advantage of the time series is that it can pro-
Gene network inference: reverse engineering vide insight into the dynamics of the process. On the other
Clustering is a relatively easy way to extract useful hand, data sets consisting of individual measurements pro-
information out of large-scale gene expression data sets, vide an efficient way to map the attractors of the network.
however, it typically only tells us which genes are Both types of data, and multiple data sets of each, will be
co-regulated, not what is regulating what. In network needed to unravel the regulatory interactions of the genes.
inference, the goal is to construct a coarse-scale model of Successful modeling efforts will probably have to use
the network of regulatory interactions between the genes. data from different sources, and will have to be able to deal
This requires inference of the causal relationships among with different data types such as time series and steady-
genes, i.e. reverse engineering the network architecture state data, different error levels, incomplete data, etc.
from its activity profiles. As the molecular dynamics data Whereas clustering methods can use data from different
we acquire becomes more copious and complex, we may strains, in different growth media etc., combining data sets
need to routinely consult reverse engineering methods to for reverse engineering of regulatory networks requires
provide the guiding hypotheses for experimental design. that differences between the experimental conditions be
One may wonder whether it is at all possible to reverse quantified much more precisely. Likewise, data will have
engineer a network from its activity profiles. A straight- to be calibrated properly to allow comparison with other
forward answer to this question should be obtainable from data sets. In this respect, there is a growing need for
model networks, e.g. Boolean networks, for which we un- a reliable reference in relative expression measurements.
derstand the network architectures and can easily gener- An obvious approach could be to agree on a standard
ate activity profiles. In a first attempt, a simple method strain and carefully controlled growth conditions to use
was introduced that showed that reverse engineering is in all data collection efforts on the same organism.
possible in principle (Somogyi et al., 1997). A more sys- Alternatively, a reference mRNA population could be
tematic and general algorithm was developed by Liang et derived directly from the DNA itself.
al. (1998), using Mutual Information to identify a mini-
mal set of inputs that uniquely define the output for each Estimates for various network models
gene at the next time step. Akutsu et al. (1999) proposed The ambitious goal of network reverse engineering comes
a simplified reverse engineering algorithm and rigorously at the price of requiring more data points. The space of
examined the input data requirements. For more realis- models to be searched increases exponentially with the
tic applications, a further modification was introduced by number of parameters of the model, and therefore with
Akutsu et al. (2000) that allows the inference of Boolean the number of variables. Narrowing the range of models
networks given noisy input data. Ideker et al. (2000) de- by adding extra constraints can simplify the search for the
veloped an alternative algorithm in which the minimal set- best model considerably. Including such information into

717
P.D’haeseleer et al.

Table 1. Fully connected: each gene can receive regulatory inputs from all 1.0000
other genes

solution)
Model Data needed 0.1000

Boolean, fully connected 2N


Boolean, connectivity K 2 K (K + log(N )) 0.0100

P (incorrect
Boolean, connectivity K , linearly separable K log(N /K )
Continuous, fully connected, additive N +1 k=1
Continuous, connectivity K , additive K log(N /K ) (∗) 0.0010 k=2
Pairwise correlation comparisons (clustering) log(N ) k=3

Connectivity K : at most K regulatory inputs per gene. Additive, linearly 0.0001


separable: regulation can be modeled using a weighted sum. Pairwise
0 20 40 60 80 100
correlation: significance level for pairwise comparisons based on
State transitions
correlation must decrease inversely proportional to number of variables.
(∗): conjecture.

Fig. 4. Dependence of Boolean network reverse engineering algo-


the inference process is the true art of modeling. rithm on depth of training data. The probability of finding an incor-
How many data points are needed to infer a gene rect solution is graphed versus the number of state transition pairs
network of N genes depends on the complexity of used as input for the algorithm for a N = 50 element network. More
the model used to do the inference. Constraining the training data is required for networks of k = 3 inputs per gene than
for networks of lower connectivity to minimize incorrect solutions.
connectivity of the network (number of regulatory inputs
However, only a small fraction e.g. 80 state transition pairs from
per gene) and the nature of the regulatory interactions can a total of 250 = 1.13 × 1015 is required to obtain reliable results
dramatically reduce the amount of data needed. Table 1 (Liang et al., 1998).
provides an overview of some of the models considered,
and estimates of the amount of data needed for each. These
estimates hold for independently chosen data points, with the equivalent Boolean model, we speculate it to
and only indicate asymptotic growth rates, ignoring any be of the form (K log(N /K )). A promising avenue of
constant factors. further research in this area may be the results on sample
To completely specify a fully connected Boolean net- complexity for recurrent neural networks, which have a
work model, where the output of each gene is modeled very similar structure to the models presented here (Koiran
as a Boolean function of the expression levels of all N and Sontag, 1998; Sontag, 1997).
genes, we need to measure all possible 2 N input–output Finally, to allow for comparison with gene clustering
pairs. This is clearly inconceivable for realistic numbers methods, we examined data requirements for clustering
of genes. If we reduce the connectivity of the Boolean based on pairwise correlation comparisons (see Ap-
network to an average of K regulatory inputs per gene, pendix Data requirements for pairwise correlation
the data requirements decrease significantly. For a slightly comparisons). As the number of genes being compared
simpler model, we can derive a lower bound of (2 K (K + increases, the number of data points will have to in-
log(N ))) (see Appendix Data requirements for sparse crease proportional to log(N ), in order to maintain a
Boolean networks), which agrees well with preliminary constant, low level of false positives. Claverie (1999)
experimental results by Liang et al. (1998) and Akutsu et arrived at a similar logarithmic scaling for binary data
al. (1999) (see Figure 4). Further constraining the Boolean (absent/detected).
model to use only linearly separable Boolean functions Note that, for reasonably constrained models, the num-
(those that can be implemented using a weighted sum of ber of data points needed will scale with log(N ) rather
the inputs, followed by a threshold function) reduces the than N , and that the data requirements for network in-
data requirements to (K log(N /K )) (Hertz, 1998). ference are at least a factor K larger than for clustering.
For models with continuous expression levels, the In practice, the amount of data may need to be orders of
data requirements are less clear. In the case of linear magnitude higher because of non-independence and large
(D’haeseleer et al., 1999) or quasi-linear (Weaver et al., measurement errors (see also Szallasi, 1999). Higher ac-
1999) additive models, fitting the model is equivalent curacy methods such as RT-PCR yield more bits of infor-
to performing a multiple regression, so at least N + 1 mation per data point than cDNA microarrays or oligonu-
data points are needed for a fully connected model of N cleotide chips, so fewer data points may be required to
genes. Data requirements for sparse additive regulation achieve the same accuracy in the model. Modeling real
models are as yet unknown, but based on the similarity data with Boolean networks discards a lot of information

718
Genetic network inference

in the data sets, because the expression levels need to be connectionist model (Mjolsness et al., 1991), linear model
discretized to one bit per measurement. Continuous mod- (D’haeseleer et al., 1999), linear transcription model
els will tend to use the available information in the data set (Chen et al., 1999), weight matrix model (Weaver et
better. al., 1999). The core of these seems to be the additive
interaction between the regulatory inputs to each gene,
Correlation metric construction so we suggest calling these models collectively additive
Adam Arkin and John Ross have been working on a regulation models.
method called Correlation Metric Construction (Arkin and In the simplest case, we can think of these models as
Ross, 1995; Arkin et al., 1997), to reconstruct reaction being similar to multiple regression:
networks from measured time series of the component 
chemical species. This approach is based in part on xi ≈ w ji x j + bi
electronic circuit theory, general systems theory and i
multivariate statistics. Although aimed more towards cell or
signaling or metabolic networks, the same methodology 
xi (t + t) = w ji x j (t) + bi
could be applied to regulatory networks.
j
The system (a reactor vessel with chemicals im-
plementing glycolysis) is driven using random (and where xi is the expression level of gene i at time t, bi
independent) inputs for some of the chemical species, is a bias term indicating whether gene i is expressed or
while the concentration of all the species is monitored not in the absence of regulatory inputs, and weight w ji
over time. A distance matrix is constructed based on indicates the influence of gene j on the regulation of
the maximum time-lagged correlation between any two gene i. This can be written equivalently as a difference
chemical species. This distance matrix is then fed into or differential equation. Given an equidistant time series
a simple clustering algorithm to generate a tree of con- of expression levels (or an equidistant interpolation of a
nections between the species, and the results are mapped non-equidistant time series), we can use linear algebra
into a two-dimensional graph for visualization. It is also to find the least-squares fit to the data. Weaver et al.
possible to use the information regarding the time lag (1999) showed how a non-linear transfer function can be
between species at which the highest correlation was incorporated into the model as well, and demonstrated
found, which could be useful to infer causal relationships. that some randomly generated networks can be accurately
More sophisticated methods from general systems theory, reconstructed using this modeling technique. D’haeseleer
based on mutual information, could be used to infer et al. (1999); D’haeseleer and Fuhrman (2000) showed
dependency. that even a simple linear model can be used to infer
biologically relevant regulatory relationships from real
Systems of differential equations data sets (Figure 5).
Simple systems of differential equations have already Chen et al. (1999) presented a number of linear differen-
proven their worth in modeling simple gene regulation tial equation models which included both mRNA and pro-
systems. For example, the seminal work of Mjolsness tein levels. They showed how such models can be solved
et al. (1991) used a spatial ‘gene circuit’ approach to using linear algebra and Fourier transforms. Interestingly,
model a small number of genes involved in pattern they found that mRNA concentrations alone were not suf-
formation during the blastoderm stage of development ficient to solve their model, without at least the initial pro-
in Drosophila. The change in expression levels at each tein levels. Conversely, the model can be solved given only
point in time depended on a weighted sum of inputs a time series of protein concentrations.
from other genes, and diffusion from neighboring ‘cells’. Models that are more complex may require more general
Synchronized cell divisions along a longitudinal axis methods for fitting the parameters to the expression
(under the control of a maternal clock) were alternated data. Mjolsness et al. (1991) used Simulated Annealing
with updating the gene expression levels. The model was to fit their hybrid model—incorporating both reaction-
able to successfully replicate the pattern of eve stripes in diffusion kinetics and cell divisions—to spatial data.
Drosophila, as well as some mutant patterns on which the Mjolsness et al. (1999b) used a similar approach to fit
model was not explicitly trained. a recurrent neural network with weight decay to clusters
of yeast genes. Genetic algorithms (GAs) have been
Additive regulation models used to model the interaction between four ‘waves’ of
The differential equation systems described above model coordinately regulated genes (Wahde and Hertz, 1999)
gene networks using an update rule based on a weighted previously identified in rat CNS development (Wen et
sum of inputs. Several variants of such models have al., 1998). Similarly, Tominaga et al. (1999) used a GA
been proposed, with each group coining a different name: to fit a power-law model (Savageau, 1995; Tominaga

719
P.D’haeseleer et al.

(A)
1

Nestin

0.5

0.5
GABARalpha4

aFGF
0.5

0
-10 0 10 20 30 40 50 60

GAD65
(B)

GAD67 GRg2 GRa3 GRa2

GRa1 preGAD67

GRb3 GRa4 GRb1 GRb2 GRg1

G67I8086 GRg3

G67I86 nestin GRa5

Fig. 5. Continuous-valued reverse engineering of a CNS genetic network using a linear additive method. (A) Experimental gene expression
data (circles; development and injury), and simulation using a linear model (lines). The model faithfully reproduces the time series of
the training data sets. Dotted line: spinal cord, starting 11 days before birth. Solid line: hippocampus development, starting seven days
before birth. Dashed line: hippocampus kainate injury, starting at postnatal day 25. (B) Hypothetical gene interaction diagram for the
GABA signaling family inferred from developmental gene expression data (spinal cord and hippocampus data). While individual proposed
interactions have not yet been experimentally verified, the high predicted connectivity within this gene family appears biologically plausible.
The positive feedback interaction of the GAD species has been proposed independently in another study (Somogyi et al., 1995). Solid lines
correspond to positive interactions, broken lines suggest inhibitory relationships (D’haeseleer et al., 1999).

720
Genetic network inference

and Okamoto, 1998) of a small gene network. Networks for clustering according to co-expression profiles
with larger numbers of genes will likely require stronger should select the appropriate experimental sets for
optimization algorithms. Akutsu et al. (2000) proposes analysis, and provide flexible solutions with multi-
using Linear Programming for both power-law models ple cluster memberships that more accurately reflect
and qualitative hybrid Boolean-like networks. Efficient the biological reality. Well-designed cluster analysis
gradient descent algorithms developed for continuous- promises to identify new pathway relationships
time recurrent neural networks (Pearlmutter, 1995) may and gene functions that may be critical to cellular
be useful for even larger networks (D’haeseleer et al., control in health and disease.
1999). Alternatively, the size of the problem can be
drastically reduced by combining gene clustering with 3. Reverse engineering. Since it is the ultimate goal
network inference, deriving a regulation model only for to identify the causative relationships between
the cluster centers (Wahde and Hertz, 1999; Mjolsness et gene products that determine important phenotypic
al., 1999b). parameters, top priority should be given to develop
reverse engineering methods that provide significant
Conclusions and outlook predictions. Alternative computational approaches
should be applied to given data sets, and their pre-
We are participating in the transition of biology into an
information-driven science. However, this transition can dictions tested in targeted experiments to identify
be meaningful only if we focus on generating models the most reliable methods.
that allow us to systematically derive predictions about 4. Integrated modeling. While the current focus is on
important biological processes in disease, development the analysis of large-scale gene expression data,
and metabolic control. These will find important applica- there are other established sources of information
tions in pharmaceutical development and bioengineering on gene function, ranging from sequence homology
(Zweiger, 1999). We have reviewed conceptual founda- and cis-regulatory sequences, to disease association
tions for understanding complex biological networks, and and a wide variety of functional knowledge from
several practical methods for data analysis. There are still targeted experiments. Ideally, all of these categories
major challenges ahead, which may be divided into five of information should be included in model build-
areas: ing. A major challenge here lies in the reliability
1. Measurement quantity, depth and quality. Any at- and compatibility of these data sets.
tempt at predictive data analysis and model building
5. Coupling of modeling with systematic experimental
critically depends on the scope and quality of the in-
design. Discovery of a novel gene function through
put data. Ideally, we would like to gain access to the
expression profiling and computational inference
activities of all important molecular species in a bio-
depends on the optimal coordination of experimen-
logical process (ranging from mRNA to metabolites
tal technology with data-analysis methods. While
and second messengers), with adequate quantitative,
data-analysis methods must be centered around
anatomical and temporal resolution. However, even
data that are realistically accessible, critical pre-
though our analytical measurement technologies
dictions from the models must guide experimental
are undergoing transformations in precision and
throughput, there will always be limitations to the design. The hope is that progressive iteration of
amount of data and resolution we can acquire and predictions, experimental measurements and model
process. Computational data analysis must therefore updates will result in increased fidelity and depth of
identify the most essential molecular parameters to computational models of biological networks.
guide experimental measurements, and critically
evaluate measurement precision and reproducibility Acknowledgements
with appropriate statistical measures. P.D. gratefully acknowledges the support of the National
2. Clustering and functional categorization. One Science Foundation (grant IRI-9711199), the Office of
priority in this area is to compare the large variety Naval Research (grant N00014-99-1-0417), and the Intel
of existing clustering methods (including different Corporation. S.L. gratefully acknowledges support from
normalizations and distance measures), and identify NASA Collaborative Agreement NCC 2-794. The authors
those that give the most biologically relevant results. would like to express their appreciation to Xiling Wen,
Just as a gene can play multiple functional roles in Millicent Dugich and Stefanie Fuhrman for providing data
various pathways and is subject to different regula- on hippocampal development and injury, and to Stefanie
tory inputs, co-expression patterns vary according Fuhrman for critically reading and commenting on the
to the cellular and experimental context. Methods manuscript.

721
P.D’haeseleer et al.

Appendix A: Data requirements for sparse Boolean algorithms: we say that two genes cluster together if their
networks correlation is significantly greater (with a significance
To fully specify a Boolean network with limited connec- level α) than a certain cut-off value ρc . We test whether
tivity, we need to specify the connection pattern and the we can exclude the null hypothesis ρ < ρc based on
rule table for a function of K inputs at each. A lower bound the measured correlation coefficient r over the available
of (2 K + K log(N )) can be derived using information data points. Because of the large number of comparisons
theory. For a slightly simpler model, where we assume the being made, we need to reduce the significance level for
pattern of connectivity is given, we can calculate how the the correlation test, proportional to the number of tests
number of independently chosen data points should scale each gene is involved in: α = α /N (this will keep
with K and N . Since this is a simpler model, its data re- the expected number of false positives for each gene
quirements should be a lower bound to the requirements constant). In order to be able to use the same cut-off value
for the model with unknown connections. for the measured correlation rα to decide whether two
Every data point (i.e. every input–output pair, specifying genes cluster together, the number of data points will have
the state of the entire Boolean network at time t and t + 1), to increase as the significance level for each test grows
specifies exactly one of 2 K entries in each rule table. Given smaller.
this particular combination of the K inputs to each gene at If the real correlation coefficient ρ is close to 1.0,
time t, the output of the gene is given by its state at time the distribution of the measured correlation coefficient
t + 1. We will estimate the probability P that all N rule r is very asymmetrical. The following z-transformation,
tables are fully specified by n data points, and calculate developed by Fisher et al. (1969), is approximately
how the number of data points n needs to scale with P, normally distributed with mean z(ρ) and variance 1/(n −
the number of genes N , and connectivity K . For P ≈ 1 3) (with n the number of data points):
(i.e. we have enough data to have a good chance at a fully  
1 1+r
specified model), the probability for a single rule table to z(r ) = ln .
2 1−r
be fully specified by n data points is approximately:
We can now devise a single-sided test on z(r ) to answer
1 − 2 K (1 − 2−K )n . the following question. If z(r ) > z(rα ), what is the
The probability that all N rule tables are fully specified by significance level with which we can reject the hypothesis
n data points then becomes: z(ρ) < z(ρc ) (and thus ρ < ρc )? At the tail of the normal
distribution, the area under the normal curve to the right
P ≈ (1 − 2 K (1 − 2−K )n ) N . of z(rα ) can be approximated by:
 ∞ 2
Taking base-2 logarithms, and further simplifying using 1 − (z−z(ρ2c ))
α= √ e 2σ dz
log2 (1 − z) ≈ −z log2 (e) for z 1, we find: z=z(rα ) σ 2π

C1 = − log(P) σ − (z(rα )−z(ρ c ))


2
≈√ e 2σ 2

≈ −N log(1 − 2 K (1 − 2−K )n ) 2π (z(rα ) − z(ρc ))



≈ N 2 K (1 − 2−K )n log(e) with σ = 1/ n − 3, and taking natural logs:
C2 = − log(C1 / log(e))
ln(α) ≈ ln(α ) − ln(N )
≈ − log(N ) − K − n log(1 − 2−K ) √
≈ − 12 ln(n − 3) − ln( 2π (z(rα ) − z(ρc )))
≈ − log(N ) − K + n2−K log(e).
−(n − 3)(z(rα ) − z(ρc ))2 /2
If P ≈ 1, C1 will be a small, and C2 a large positive 2
constant. We can now express n, the number of data points n≈3+ (ln(N ) + ln(1/α )
(z(rα ) − z(ρc ))2
needed, in terms of N , K and C2 : √
− ln(n − 3)/2 − ln( 2π (z(rα ) − z(ρc ))))
n ≈ 2 K (K + log(N ) + C2 )/ log(e)
= O(log(N )).
which is O(2 K (K + log(N ))). In other words, if we want to use the same cut-off value
rα to decide whether ρ > ρc , we need to scale the number
Appendix B: Data requirements for pairwise of data points logarithmically with the number of genes.
correlation comparisons Strictly speaking, this analysis only holds for correlation
Let us examine a very simple form of clustering as a tests, but we can expect similar effects to play a role in
representative example of the wide variety of clustering other clustering algorithms.

722
Genetic network inference

References Cheeseman,P. and Stutz,J. (1996) Bayesian classification (auto-


class): theory and results. In Fayyad,U.M., Piatetsky-Shapiro,G.,
Agnew,B. (1998) NIH plans bioengineering initiative. Science, 280, Smyth,P. and Uthurusamy,R. (eds), Advances in Knowledge Dis-
1516–1518. covery and Data Mining AAAI Press/MIT Press, http://ic-www.
Akutsu,T., Miyano,S. and Kuhara,S. (1999) Identification of genetic arc.nasa.gov/ic/projects/bayes-group/images/kdd-95.ps
networks from a small number of gene expression patterns under Chen,T., He,H.L. and Church,G.M. (1999) Modeling gene ex-
the Boolean network model. Pacific Symposium on Biocomput- pression with differential equations. Pacific Symposium on
ing, 4, 17–28. http://www.smi.stanford.edu/projects/helix/psb99/ Biocomputing, 4, 29–40. http://www.smi.stanford.edu/projects/
Akutsu.pdf helix/psb99/Chen.pdf
Akutsu,T., Miyano,S. and Kuhara,S. (2000) Algorithms for infer- Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O.
ring qualitative models of biological networks. Pacific Sympo- and Herskowitz,I. (1998) The transcriptional program of sporu-
sium on Biocomputing, 5, 290–301.http://www.smi.stanford.edu/ lation in budding yeast. Science, 282, 699–705.
projects/helix/psb00/akutsu.pdf Claverie,J.-M. (1999) Computational methods for the identification
Aldenderfer,M.S. and Blashfield,R.K. (1984) Cluster Analysis. Sage of differential and coordinated gene expression. Hum. Mol.
Publications, Newbury Park, CA. Genet., 8, 1821–1832.
Alizadeh,A.A. et al. (2000) Distinct types of diffuse large B- Cunningham,M.J., Liang,S., Fuhrman,S., Seilhamer,J.J. and Somo-
cell lymphoma identified by gene expression profiling. Nature, gyi,R. (2000) Gene expression microarray data analysis for toxi-
403(6769), 503–511. cology profiling. Ann. New York Acad. Sci., in press.
Alon,U., Barkai,N., Notterman,D.A., Gish,K., Ybarra,S., Mack,D. Dempster,A.P., Laird,N.M. and Rubin,D.B. (1977) Maximum like-
and Levine,A.J. (1999) Broad patterns of gene expression lihood from incomplete data via the EM algorithm. J. R. Stat.
revealed by clustering analysis of tumor and normal colon tissues Soc., B 39, 1–38.
probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the
96(12), 6745–6750. metabolic and genetic control of gene expression on a genomic
Arkin,A. and Ross,J. (1995) Statistical construction of chemical scale. Science, 278(5338), 680–686.
reaction mechanism from measured time-series. J. Phys. Chem., D’haeseleer,P. and Fuhrman,S. (2000) Gene network inference
99, 970–979. using a linear, additive regulation model. In preparation.
Arkin,A., Ross,J. and McAdams,H.H. (1998) Stochastic kinetic D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1997) Mining
analysis of developmental pathway bifurcation in phage lambda- the gene expression matrix: inferring gene relationships from-
infected Escherichia coli cells. Genetics, 149, 1633–1648. large scale gene expression data. In: M.Holcombe, R.Paton
Arkin,A., Shen,P. and Ross,J. (1997) A test case of correlation (eds) Information processing in cells and Tissues. Plenum,
metric construction of a reaction pathway from measurements. pp. 203–212. http://www.cs.unm.edu/∼patrik/networks/IPCAT/
Science, 277, 1275–1279. ipcat.html
Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of develop- D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1999) Linear
ment: organization and function of genomic regulatory systems. modeling of mRNA expression levels during CNS development
Development, 124, 1851–1864. and injury. Pacific Symposium on Biocomputing, 4, 41–52. http:
Ben-Dor,A., Shamir,R. and Yakhini,Z. (1999) Clustering gene //www.smi.stanford.edu/projects/helix/psb99/Dhaeseleer.pdf
expression patterns. J. Comput. Biol., 6(3–4), 281–297. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998)
Bezdek,J.C. (1981) Pattern Recognition with Fuzzy Objective Cluster analysis and display of genome-wide expression patterns.
Function Algorithms. Plenum Press, New York. Proc. Natl. Acad. Sci. USA, 95(25), 14863–14868.
Bhattacharjya,A. and Liang,S. (1996) Power laws in some random Erb,R.S. and Michaels,G.S. (1999) Sensitivity of biological mod-
Boolean networks. Phys. Rev. Lett., 77, 1644. els to errors in parameter estimates. Pacific Symposium on
Brayton,R.K., Hachtel,G.D., McMullen,C.T. and Sangiovanni- Biocomputing, 4, 53–64. http://www.smi.stanford.edu/projects/
Vincentelli,A.L. (1984) Algorithms for VLSI Synthesis. Kluwer helix/psb99/Erb.pdf
Academic Publishers, ftp://ic.eecs.berkeley.edu/pub/Espresso/ Ewing,R.M., Kahla,A.B., Poirot,O., Lopez,F., Audic,S. and
Brazma,A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting Claverie,J.M. (1999) Large-scale statistical analyses of rice ESTs
gene regulatory elements in silico on a genomic scale. Genome reveal correlated patterns of gene expression. Genome Res., 9,
Res., 8, 1202–1215. 950–959.
Brown,P.O. and Botstein,D. (1999) Exploring the new world of Felsenstein,J. (1993) PHYLIP (Phylogeny Inference Package). ver-
genome with DNA microarrays. Nature Genet., 21, 33–37. sion 3.5c, distributed by the author, Department of Genetics, Uni-
Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance versity of Washington, Seattle.
networks: functional genomic clustering using pairwise entropy Ferea,T.L., Botstein,D., Brown,P.O. and Rosenzweig,R.F. (1999)
measurements. Pacific Symposium on Biocomputing, 5, 415– Systematic changes in gene expression patterns following adap-
426.http://www.smi.stanford.edu/projects/helix/psb00/butte.pdf tive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96, 17 9721–
Califano,A., Stolovitzky,G. and Tu,Y. (2000) Analysis of gene ex- 9726.
pression microarrays for phenotype classification. 8th Interna- Fisher,R.A., In Kendall,M.G. and Stuart,A. (1969) The Advanced
tional Conference on Intelligent Systems for Molecular Biology, Theory of Statistics. Vol. 1, 3rd edn., Hafner Press, New York,
in press. pp. 391.

723
P.D’haeseleer et al.

Franklin,G.F., Powell,J.D. and Emami-Naeini,A. (1994) Feedback An Introduction to Cluster Analysis. John Wiley and Sons, New
Control of Dynamic Systems. 3rd edn, Addison-Wesley. York.
Fuhrman,S., Cunningham,M.J., Wen,X., Zweiger,G., Seilhamer,J.J. Koiran,P. and Sontag,E.D. (1998) Vapnik-Chervonenkis dimension
and Somogyi,R. (2000) The application of Shannon entropy in of recurrent neural networks. Discrete Appl. Math., 86, 63–79.
the identification of putative drug targets. Biosystems, 55(1–3), Lance,G.N. and Williams,W.T. (1966) A general theory of classifi-
5–14. catory sorting strategies: 1. hierarchical systems. Comput. J., 9,
Fuhrman,S., D’haeseleer,P. and Somogyi,R. (1999) Tracing genetic 373–380.
information flow from gene expression to pathways and molec- Liang,S., Fuhrman,S. and Somogyi,R. (1998) REVEAL, a gen-
ular networks. 1999 Society for Neuroscience Short Course Syl- eral reverse engineering algorithm for inference of genetic net-
labus, DNA Microarrays: The New Frontier in Gene Discovery work architectures. Pacific Symposium on Biocomputing, 3, 18–
and Gene Expression Analysis. 29. http://www.smi.stanford.edu/projects/helix/psb98/liang.pdf
Glass,L. (1975) Combinatorial and topological methods in nonlin- MacQueen,J. (1967) Some methods for classification and anal-
ear chemical kinetics. J. Chem. Phys., 63, 1325–1335. ysis of multivariate observation. In Le Cam,L.M. and Nye-
Glass,L. (1978) Stable oscillations in mathematical models of man,J. (eds), Proceedings of the Fifth Berkeley Symposium
biological control systems. J. Math. Biol., 6, 207–223. on Mathematical Statistics and Probability Vol. I, University of
Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasen- California Press.
beek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Mahalanobis,P.C. (1936) On the generalized distance in statistics.
Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molec- Proc. Nat. Inst. Sci. India 12, pp. 49–55.
ular classification of cancer: class discovery and class prediction Marnellos,G. and Mjolsness,E. (1998) A gene network approach
by gene expression monitoring. Science, 286, 531–537. to modeling early neurogenesis in Drosophila. Pacific Sympo-
Goss,P.J. and Peccoud,J. (1998) Quantitative modeling of stochastic sium on Biocomputing, 3, 30–41. http://www.smi.stanford.edu/
systems in molecular biology by using stochastic Petri nets. Proc. projects/helix/psb98/marnellos.pdf
Natl. Acad. Sci. USA, 95, 6750–6755. Massart,D.L. and Kaufman,L. (1983) The Interpretation of Analyt-
Han,E.-H., Karypis,G., Kumar,V. and Mobasher,B. (1997) ical Chemical Data by the Use of Cluster Analysis. John Wiley
Clustering in a high-dimensional space using hyper- and Sons, New York.
graph models. Technical Report # 97-019 Department Matsuno,H., Doi,A. and Nagasaki,M. (2000) Hybrid petri net
of Computer Science, University of Minnesota. http: representation of genetic regulatory network. Pacific Sympo-
//www.cs.umn.edu/tech reports/1997/TR 97-019 Clustering sium on Biocomputing, 5, 338–349.http://www.smi.stanford.edu/
Based on Association Rule Hypergraphs.html. projects/helix/psb00/matsuno.pdf
Hertz,J. (1998) Statistical issues in reverse engineering of genetic Mendes,P. (1999) Metabolic simulation as an aid in understanding
networks. Poster for Pacific Symposium on Biocomputing, http: gene expression data. In Bornberg-Bauer,E., De Beuckelaer,A.,
//www.nordita.dk/∼hertz/papers/dgshort.ps.gz. Kummer,U. and Rost,U. (eds), Workshop on Computation of
Heyer,L.J., Semyon Kruglyak, and Shibu Yooseph, (1999) Explor- Biochemical Pathways and Genetic Networks Logos Verlag,
ing expression data: identification and analysis of coexpressed Berlin, pp. 27–33.
genes. Genome Res., 9(11), 1106–1115. Michaels,G., Carr,D.B., Wen,X., Fuhrman,S., Askenazi,M. and So-
Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- mogyi,R. (1998) Cluster analysis and data visualization of large-
ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. scale gene expression data. Pacific Symposium on Biocomput-
(1998a) Dissecting the regulatory circuitry of a eukaryotic ing, 3, 42–53. http://www.smi.stanford.edu/projects/helix/psb98/
genome. Cell, 95, 717–728. michaels.pdf
Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- Miklos,G.L. and Rubin,G.M. (1996) The role of the genome project
ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. in determining gene function: insights from model organisms.
(1998b) Genome-Wide Expression Page, http://web.wi.mit.edu/ Cell, 86, 4 521–529.
young/expression/ Mjolsness,E., Castaño,R. and Gray,A. (1999a) Multi-parent clus-
Huang,S. (1999) Gene expression profiling, genetic networks, tering algorithms for large-scale gene expression analy-
and cellular states: an integrating concept for tumorigenesis sis. Technical Report JPL-ICTR-99-5 Jet Propulsion Lab-
and drug discovery. J. Mol. Med., 77, 6 469–480. oratory Section 367. http://www-aig.jpl.nasa.gov/public/mls/
Ideker,T.E., Thorsson,V. and Karp,R.M. (2000) Discovery of regula- papers/emj/multiparentPreprint.pdf.
tory interactions through perturbation: inference and experimen- Mjolsness,E., Mann,T., Castaño,R. and Wold,B. (1999b) From co-
tal design. Pacific Symposium on Biocomputing, 5, 302–313.http: expression to co-regulation: an approach to inferring transcrip-
//www.smi.stanford.edu/projects/helix/psb00/ideker.pdf tional regulation among gene classes from large-scale expres-
Jain,A.K. and Dubes,R.C. (1988) Algorithms for Clustering Data. sion data. Technical Report JPL-ICTR-99-4 Jet Propulsion
Prentice Hall, Englewood Cliffs, NJ. Laboratory Section 365. http://www-aig.jpl.nasa.gov/public/mls/
Kauffman,S.A. (1969) Metabolic stability and epigenesis in ran- papers/emj/GRN99prprnt.pdf.
domly connected nets. J. Theoret. Biol., 22, 437–467. Mjolsness,E., Sharp,D.H. and Reinitz,J. (1991) A connectionist
Kauffman,S.A. (1993) The Origins of Order, Self-Organization and model of development. J. Theor. Biol., 152, 429–454.
Selection in Evolution. Oxford University Press. Mounts,W.M. and Liebman,M.N. (1997) Application of petri nets
Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: and stochastic activity nets to modeling biological pathways and

724
Genetic network inference

processes. Int. J. Comput. Simul.. Church,G.M. (1999) Systematic determination of genetic


Niehrs,C. and Pollet,N. Synexpression groups in eukaryotes. Na- network architecture. Nature Genet., 22, 281–285.
ture, 402, 483–487. Thieffry,D. and Thomas,R. (1998) Qualitative analysis of gene
Pearlmutter,B.A. (1995) Gradient calculations for dynamic recur- networks. Pacific Symposium on Biocomputing, 3, 77–88. http:
rent neural networks: a survey. IEEE Trans. Neural Netw., 6, //www.smi.stanford.edu/projects/helix/psb98/thieffry.pdf
1212–1228. Thomas,R. (1991) Regulatory networks seen as asynchronous
Perou,C.M., Jeffrey,S.S., van de Rijn,M., Rees,C.A., Eisen,M.B., automata: a logical description. J. Theor. Biol., 153, 1–23.
Ross,D.T., Pergamenschikov,A., Williams,C.F., Zhu,S.X., Tominaga,D. and Okamoto,M. (1998) Design of canonical model
Lee,J.C., Lashkari,D., Shalon,D., Brown,P.O. and Botstein,D. describing complex nonlinear dynamics. In Yoshida,T. and
(1999) Distinctive gene expression patterns in human mam- Shioya,S. (eds), Computer Applications in Biotechnology 1998
mary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. Elsevier Science, pp. 85–90.
USA, 96, 16 9212–9217. Tominaga,D., Okamoto,M., Kami,Y., Watanabe,S. and Eguchi,Y.
Ptashne,M. (1992) A genetic Switch. Cell Press and Blackwell (1999) Nonlinear Numerical Optimization Technique Based
scientific publications. on a Genetic Algorithm, http://www.bioinfo.de/isb/gcb99/talks/
Roth,P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding tominaga/
DNA regulatory motifs within unaligned noncoding sequences van Helden,J., Andre,B. and Collado-Vides,J. (1998) Extracting
clustered by whole-genome mRNA quantitation. Nature Biotech- regulatory sites from the upstream region of yeast genes by
nol., 16, 939–945. computational analysis of oligonucleotide frequencies. J. Mol.
Savageau,M.A. (1995) Power-law formalism: a canonical nonlinear Biol., 281, 827–842.
approach to modeling and analysis. In Proceedings of the World Velculescu,V.E., Zhang,L., Zhou,W., Vogelstein,J., Basrai,M.A.,
Congress of Nonlinear Analysts ’92 pp. 3323–3334. Bassett,D.E.,Jr, Hieter,P., Vogelstein,B. and Kinzler,K.W. (1997)
Savageau,M.A. (1998) Rules for the evolution of gene circuitry. Characterization of the yeast transcriptome. Cell, 88, 243–251.
Pacific Symposium on Biocomputing, 3, 54–65. http://www.smi. Wahde,M. and Hertz,J. (1999) Course-grained reverse engineering
stanford.edu/projects/helix/psb98/savageau.pdf of genetic regulatory networks. Proceedings of Information
Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Free- Processing in Cells and Tissues (IPCAT) ’99.
man, San Francisco. Walker,M.G., Volkmuth,W., Sprinzak,E., Hodgson,D. and Klin-
Sokal,R.R. and Michener,C.D. (1958) Univ. Kans. Sci. Bull., 38, gler,T. (1999) Prediction of gene function by genome-scale
1409–1438. expression analysis: prostate-cancer associated genes. Genome
Somogyi,R. and Sniegoski,C. (1996) Modeling the complexity of Res., 9, 1198–1203.
genetic networks. Complexity, 1(6), 45–63. Weaver,D.C., Workman,C.T. and Stormo,G.D. (1999) Modeling
Somogyi,R., Fuhrman,S., Askenazi,M. and Wuensche,A. (1997) regulatory networks with weight matrices. Pacific Symposium
The gene expression matrix: towards the extraction of genetic on Biocomputing, 4, 112–123. http://www.smi.stanford.edu/
network architectures. Nonlinear Analysis, Proc. of Second projects/helix/psb99/Weaver.pdf
World Cong. of Nonlinear Analysts (WCNA96), 30(3), 1815– Weinstein,J.N., Myers,T.G., O’Connor,P.M., Friend,S.H., For-
1824. http://rsb.info.nih.gov/mol-physiol/reprints/WCNA.pdf nace,A.J.,Jr, Kohn,K.W., Fojo,T., Bates,S.E., Rubinstein,L.V.,
Somogyi,R., Wen,X., Ma,W. and Barker,J.L. (1995) Developmental Anderson,N.L., Buolamwini,J.K., van Osdol,W.W., Monks,A.P.,
kinetics of GAD family mRNAs parallel neurogenesis in the rat Scudiero,D.A., Sausville,E.A., Zaharevitz,D.W., Bunow,B.,
spinal cord. J. Neurosci., 15, 2575–2591. Viswanadhan,V.N., Johnson,G.S., Wittes,R.E. and Paull,K.D.
Somogyi,R. (1999) Making sense of gene expression data. Phar- (1997) An information-intensive approach to the molecular phar-
mainformatics: A Trends Guide (Trends Supplement). pp. 17–24. macology of cancer. Science, 275, 343–349.
Sontag,P. (1997) Shattering all sets of k points in general position re- Wen,X., Fuhrman,S., Michaels,G.S., Carr,D.B., Smith,S.,
quires (k − 1)/2 parameters. Neural Comput., 9, 337–348. Barker,J.L. and Somogyi,R. (1998) Large-scale temporal gene
Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., expression mapping of central nervous system development.
Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Proc. Natl. Acad. Sci. USA, 95, 1 334–339.
Comprehensive identification of cell cycle-regulated genes of Wolfsberg,T.G., Gabrielian,A.E., Campbell,M.J., Cho,R.J.,
the yeast Saccharomyces cerevisiae by microarray hybridization. Spouge,J.L. and Landsman,D. (1999) Candidate regulatory
Mol. Biol. Cell., 9, 3273–3297. sequence elements for cell cycle-dependent transcription in
Szallasi,Z. (1999) Genetic network analysis in light of massively saccharomyces cerevisiae. Genome Res., 9, 775–792.
parallel biological data acquisition. Pacific Symposium on Bio- Wuensche,A. (1993) Discrete Dynamics Lab (DDLab) http://www.
computing, 4, 5–16. http://www.smi.stanford.edu/projects/helix/ santafe.edu/∼wuensch/ddlab.html.
psb99/Szallasi.pdf Wuensche,A. (1998) Genomic regulation modeled as a network
Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitro- with basins of attraction. Pacific Symposium on Biocom-
vsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns puting, 3, 89–102. http://www.smi.stanford.edu/projects/helix/
of gene expression with self-organizing maps: methods and ap- psb98/wuensche.pdf
plication to hematopoietic differentiation. Proc. Natl. Acad. Sci. Wuensche,A. (1999) Classifying Cellular Automata Automatically:
USA, 96, 2907. Finding gliders, filtering, and relating space-time patterns, attrac-
Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and tor basins, and the Z parameter. Complexity, 4(3), 47–66. http:

725
P.D’haeseleer et al.

//www.santafe.edu/∼wuensch/cplex ab.html puting, 5, 476–487.http://www.smi.stanford.edu/projects/helix/


Yuh,C.H., Bolouri,H. and Davidson,E.H. (1998) Genomic cis- psb00/zhu.pdf
regulatory logic: experimental and computational analysis of a Zweiger,G. (1999) Knowledge discovery in gene-expression-
sea urchin gene. Science, 279, 1896–1902. microarray data: mining the information output of the genome.
Zhu,J. and Zhang,M.Q. (2000) Cluster, function and promoter: Trends Biotech., 17, 429–436.
analysis of yeast expression array. Pacific Symposium on Biocom-

726

You might also like