Anomaly Detection in Gene Expression via Stochastic Models of Gene Regulatory Networks
from Asia Pacific Bioinformatics Network (APBioNet) Eighth International Conference on Bioinformatics (InCoB2009)
Singapore 7-11 September 2009
Abstract
Background: The steady-state behaviour of gene regulatory networks (GRNs) can provide
crucial evidence for detecting disease-causing genes. However, monitoring the dynamics of GRNs is
particularly difficult because biological data only reflects a snapshot of the dynamical behaviour of
the living organism. In addition, most GRN data and methods provide only limited structural
inferences.
Results: In this study, the theory of stochastic GRNs, derived from G-Networks, is applied to
GRNs in order to monitor their steady-state behaviour. The approach is first applied to a simulation
dataset generated with a stochastic gene expression model, where we observe that the
G-Network properly detects the abnormally expressed genes. In the
analysis of real data from the cell cycle microarray of budding yeast, our approach finds that
the steady-state probability of CLB2 is lower than that of other agents, while most of the genes
have similar steady-state probabilities. These results lead to the conclusion that the key regulatory
genes of the cell cycle can be expressed in the absence of CLB-type cyclins, which was also the
conclusion of the original microarray experiment study.
Conclusion: G-networks provide an efficient way to monitor the steady-state of GRNs. Our method
produces more reliable results than the conventional t-test in detecting differentially expressed
genes. G-networks are also successfully applied to the yeast GRNs. This study will serve as the basis for
further studies of GRN dynamics in combination with conventional GRN inference algorithms.
BMC Genomics 2009, 10(Suppl 3):S26 http://www.biomedcentral.com/1471-2164/10/S3/S26
methods focus on the inference of network structure which only provides a snapshot of a given dataset. Probabilistic Boolean Networks (PBNs) represent the dynamics of GRNs [3], but PBNs are limited by the computational complexity of the related algorithms [4].

In [5], a new approach to the steady-state analysis of GRNs based on G-Network theory [6,7] is proposed, while G-Networks were first applied to GRNs with simplifying assumptions concerning gene expression in [8]. However, the G-Network approach also exhibits specific difficulties because of the large number of parameters that are needed to compute their steady-state solution. Thus, in this study we reduce the number of model parameters on the basis of biological assumptions and focus on estimating two parameters in particular: the total input rate and steady-state probability of a gene.

Although queueing theory is a common computational tool, G-Networks are an essential departure from queueing theory; in particular, conventional queues could not possibly be applied to GRNs because the notion of inhibition does not exist in queueing theory but was introduced by G-Network theory. There are two other essential novelties in our work. First, our approach enables us to obtain the steady-state of GRNs with only polynomial computational complexity due to the product form solution of G-Networks; the computational cost due to large memory space and non-polynomial computational complexity are basic limitations in conventional methods such as PBN. Also our method can provide more reliable measures to detect differentially expressed genes in microarray analysis (as shown in our simulation study).

G-networks and gene regulatory networks
The GRN model used in this study is the probabilistic gene regulatory model introduced in [5]. In this model, let Ki(t) be integer-valued random variables which represent a quantity (mRNA) of the gene i at time t. If Ki(t) is zero, the gene i cannot interact with other genes. Then we have the following probabilities,

the signal of gene i exits the system so Ki(t) is depleted by 1

Let us define a random process K(t) = [K1(t), ..., Kn(t)], t ≥ 0 and an n-vector of non-negative integers k = [k1, ..., kn]. P(k, t) is the probability that K(t) takes the value k at time t, P(k, t) = P(K(t) = k). Then the probability that K(t) takes the value k at time t + Δt is defined by
$$
\begin{aligned}
P(k, t+\Delta t) = \sum_{i=1}^{n} \Big[\, & (\Lambda_i \Delta t + o(\Delta t))\, P(k_i^{-}, t)\, I(k_i > 0) + (\mu_i \Delta t + o(\Delta t))\, P(k_i^{+}, t) \\
& + (d_i r_i \Delta t + o(\Delta t))\, P(k_i^{+}, t) \\
& + \sum_{j=1}^{n} \Big\{ (P^{+}(i,j)\, r_i \Delta t + o(\Delta t))\, P(k_{ij}^{+-}, t)\, I(k_j > 0) \\
& \qquad + (P^{-}(i,j)\, r_i \Delta t + o(\Delta t))\, P(k_{ij}^{++}, t) \\
& \qquad + (P^{-}(i,j)\, r_i \Delta t + o(\Delta t))\, P(k_i^{+}, t)\, I(k_j = 0) \\
& \qquad + \sum_{l=1}^{n} \big( (Q(i,j,l)\, r_i \Delta t + o(\Delta t))\, P(k_{ijl}^{++-}, t) \\
& \qquad\qquad + (Q(j,i,l)\, r_j \Delta t + o(\Delta t))\, P(k_{ijl}^{++-}, t) \big)\, I(k_l > 0) \Big\} \\
& + (1 - \Lambda_i \Delta t - \mu_i \Delta t - r_i \Delta t + o(\Delta t))\, P(k, t)\, I(k_i > 0) \Big]
\end{aligned}
\tag{1}
$$

where $k_i^{+}$ ($k_i^{-}$) is the vector whose ith element is $k_i + 1$ ($k_i - 1$) and I(x) is the indicator function, which is 1 if the condition x is satisfied and 0 otherwise. The first and second terms describe the increment and decrement of the length of queue i, respectively. The third term is the probability that gene i is activated but nothing happens except that queue i loses one mRNA. The fourth to sixth terms are the probabilities that gene i is activated and interacts with gene j. The remaining terms of (1) represent the probabilities that the interaction of gene i and gene j affects gene l (the length of the lth queue). Dividing (1) by Δt and introducing the equilibrium probability distribution of the system, $P(k) = \lim_{t \to \infty} P(k, t)$, we obtain the following dynamic behaviour,

$$
\begin{aligned}
\frac{\partial P(k)}{\partial t} = \sum_{i=1}^{n} \Big[\, & \Lambda_i P(k_i^{-})\, I(k_i > 0) + (\mu_i + d_i r_i)\, P(k_i^{+}) \\
& + \sum_{j=1}^{n} \Big\{ P^{+}(i,j)\, r_i\, P(k_{ij}^{+-})\, I(k_j > 0) + P^{-}(i,j)\, r_i \big( P(k_{ij}^{++}) + P(k_i^{+})\, I(k_j = 0) \big) \\
& \qquad + \sum_{l=1}^{n} \big( r_i Q(i,j,l) + r_j Q(j,i,l) \big)\, P(k_{ijl}^{++-})\, I(k_l > 0) \Big\} \\
& - (\Lambda_i + \mu_i + r_i)\, P(k)\, I(k_i > 0) \Big]
\end{aligned}
\tag{2}
$$

$$
\begin{aligned}
\Lambda_i^{+} &= \Lambda_i + \sum_{j=1}^{n} q_j r_j P^{+}(j,i) + \sum_{j,l=1,\, l \neq j}^{n} q_j q_l r_j Q(j,l,i) \\
\Lambda_i^{-} &= \mu_i + \sum_{j=1}^{n} q_j r_j P^{-}(j,i) + \sum_{j,l=1,\, l \neq j}^{n} q_l q_i r_l Q(l,i,j)
\end{aligned}
\tag{3}
$$

where $q_i = \Lambda_i^{+} / (r_i + \Lambda_i^{-})$ represents the probability that gene i is expressed in steady-state. Using (2) and (3), E. Gelenbe showed that the following product form is satisfied [5,7].

$$
\begin{aligned}
P(K_m = k_m) &= q_m^{k_m} (1 - q_m), \\
P\big( (K_{m_1}, \ldots, K_{m_{|I|}}) = (k_{m_1}, \ldots, k_{m_{|I|}}) \big) &= \prod_{i=1}^{|I|} q_{m_i}^{k_{m_i}} (1 - q_{m_i})
\end{aligned}
\tag{4}
$$

for any subset $I \subset \{1, \ldots, n\}$ such that $q_m < 1$ for each $m \in I$, where $I = \{m_1, \ldots, m_{|I|}\}$.

Results and discussion
Simple gene regulatory networks using stochastic gene expression model
In order to assess our G-Network model, we construct a simple GRN structure and generate the expression data using a synthetic stochastic gene expression model [13,14]. This stochastic gene expression model has several important features such as protein dimerization [15] and time delay for protein signalling [13]. Figure 1 shows the simulated network structure, which is based on the following basic principles: the number of proteins per cell chases the number of mRNAs, which in turn chases the number of active genes [14]. Figure 2 depicts the assumptions of our model and (5)~(11) give the corresponding processes (RPo: RNA open complex, Pro: promoter, R: mRNA, P: protein monomer, PP: protein dimer, 0: degradation, t: time, and Δt: time increment):

$$ \mathrm{RPo}_i(t) + \mathrm{Pro}_i(t) \xrightarrow{\lambda_2} \mathrm{RPo}_i(t) + \mathrm{Pro}_i(t) + R_i(t) \tag{5} $$

$$ R_i(t) \xrightarrow{\lambda_3} R_i(t) + P_i(t) \tag{6} $$

$$ P_i(t) + P_j(t) \xrightarrow{k_{a2}} PP_{ij}(t) \tag{7} $$

$$ \mathrm{Pro}_l PP_{ij}(t) + PP_{mn}(t) \xrightarrow{d_1} \mathrm{Pro}_l(t + \Delta t) + PP_{ij}(t + \Delta t) + PP_{mn}(t + \Delta t) \tag{10} $$

$$
\begin{aligned}
R_i(t) &\xrightarrow{\gamma_2} 0(t) \\
P_i(t) &\xrightarrow{\gamma_3} 0(t) \\
PP_{ij}(t) &\xrightarrow{\gamma_4} 0(t)
\end{aligned}
\tag{11}
$$
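As a rough illustration of how expression counts might be generated from processes such as (5), (6) and (11), the sketch below runs a standard Gillespie stochastic simulation for a single, constitutively active gene. It is not the simulator used in this study: it omits dimerization, promoter binding and the time delays of [13], and the function name and all rate constants are hypothetical placeholders rather than values from the paper.

```python
import random

def gillespie_single_gene(lam2, lam3, gamma2, gamma3, t_end, seed=0):
    """Simulate mRNA (R) and protein (P) counts for one gene.

    Simplified from (5), (6) and (11); no delays, no dimerization:
      transcription  0 -> R      at rate lam2
      translation    R -> R + P  at rate lam3 * R
      mRNA decay     R -> 0      at rate gamma2 * R
      protein decay  P -> 0      at rate gamma3 * P
    All rate values are illustrative assumptions.
    """
    rng = random.Random(seed)
    t, R, P = 0.0, 0, 0
    trajectory = [(t, R, P)]
    while True:
        rates = [lam2, lam3 * R, gamma2 * R, gamma3 * P]
        total = sum(rates)  # always > 0 because lam2 > 0
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        u = rng.random() * total  # pick which reaction fires
        if u < rates[0]:
            R += 1
        elif u < rates[0] + rates[1]:
            P += 1
        elif u < rates[0] + rates[1] + rates[2]:
            R -= 1
        else:
            P -= 1
        trajectory.append((t, R, P))
    return trajectory

# "Normal" gene versus a gene transcribed at half the normal rate.
normal = gillespie_single_gene(0.5, 0.2, 0.1, 0.05, t_end=200.0, seed=1)
case = gillespie_single_gene(0.25, 0.2, 0.1, 0.05, t_end=200.0, seed=1)
print(normal[-1], case[-1])
```

Halving `lam2` in the second call mirrors the idea of a case group generated with one half of the normal transcription rate; the specific numbers carry no biological meaning.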
Table 2: Steady-state probability and total input rate of dataset showing significant p-value of GA

Dataset                  Gene  q (normal)  Λ (normal)  q (case)  Λ (case)  q ratio (case/normal)  Λ ratio (case/normal)  p-value (t-test)
Dataset 3 (500 samples)  GA    0.474       0.725       0.319     0.495     0.67                   0.68                   0.000
                         GB    0.503       0.745       0.584     0.775     1.16                   1.04                   0.000
                         GC    0.460       0.695       0.443     0.705     0.96                   1.01                   0.304
                         GD    0.521       0.765       0.541     0.785     1.03                   1.02                   0.122
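The case/normal ratios in Table 2 can be used as a simple screening rule. The sketch below assumes the numeric columns are ordered q (normal), Λ (normal), q (case), Λ (case), which is consistent with the reported ratios, and flags genes whose q and Λ ratios both fall below a cut-off. The cut-off of 0.8 and the function names are illustrative assumptions, not part of the method described here.

```python
# Values copied from the Dataset 3 rows of Table 2:
# gene -> (q_normal, lambda_normal, q_case, lambda_case)
table2 = {
    "GA": (0.474, 0.725, 0.319, 0.495),
    "GB": (0.503, 0.745, 0.584, 0.775),
    "GC": (0.460, 0.695, 0.443, 0.705),
    "GD": (0.521, 0.765, 0.541, 0.785),
}

def case_normal_ratios(row):
    """Return (q ratio, Lambda ratio), each computed as case / normal."""
    q_n, lam_n, q_c, lam_c = row
    return q_c / q_n, lam_c / lam_n

def flag_low_genes(table, threshold=0.8):
    """List genes whose q and Lambda ratios are both below `threshold`.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    flagged = []
    for gene, row in table.items():
        q_ratio, lam_ratio = case_normal_ratios(row)
        if q_ratio < threshold and lam_ratio < threshold:
            flagged.append(gene)
    return flagged

print(flag_low_genes(table2))  # ['GA']
```

Note that GB has a significant t-test p-value but ratios above one, so a rule based on q and Λ separates it from GA.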
which are drawn from all the data points. In Dataset 1, the expression of GA is significantly different (p-value of t-test <0.01 in Table 2) while the difference of the GA expression in Dataset 2 is not significant. The third dataset consists of 500 samples which are randomly chosen from the original observations.

Table 2 summarizes the results of the three datasets. In the case groups of Datasets 1 and 2, both qA and ΛA have the lowest values among the four nodes, while the t-test of the GA expression in Dataset 2 shows that it is not significant (p-value = 0.202). In the small sample results (Datasets 1 and 2), our method provides results consistent with the large sample analysis (Dataset 3). The ratios (case/normal) also show that qA and ΛA, in the case group, are smaller than one while the other ratios stay around one. In Dataset 3, the p-value of GB is significant along with that of GA because the expression of GA directly affects the expression of GB. However, GB is not the causal gene in this study. Our G-Network analysis reveals that only GA has lower q and Λ values than other nodes including GB. All these results concur with the simulation data generated with one half of the normal transcription rate.

Modeling cell cycle gene regulatory networks in budding yeast
The cell cycle regulated transcription and its overall controls have been studied in detail for budding yeast [19]. Recent developments in high-throughput microarray techniques have helped to reveal many of the yeast genes controlling the cell cycle [20], which consists of four distinct phases: Gap1 (G1), Synthesis (S), Gap2 (G2), and Mitosis (M). The cells grow during their G1 and G2 phases and their DNA is replicated during the S phase. In the M phase, cell growth stops and the cell divides into two daughter cells, a process that includes nuclear division. Many genes are involved with specific cell cycle phases, but the number of key regulators that are responsible for the control of the cell cycle process is much smaller. Thus, based on published information, we build a cell cycle GRN with the key regulators in budding yeast as shown in Figure 3, although the relationships that contribute to the true regulatory network structure of the cell cycle still remain uncertain. Therefore we simplify the cell cycle network structure by selecting thirteen key regulatory genes (the gray circles in Figure 3) and connect the genes without regard to the transcriptional and post-transcriptional processes. Figure 4 shows the reconstructed regulatory network structure.

The activity of cyclin-dependent kinases (CDKs) plays an important role in controlling periodic events during the cell cycle. Some studies of the cell cycle with high-throughput technologies have suggested alternative regulation models of periodic transcription [20]. Orlando et al. [12] measured the transcription levels of cell cycle related genes with the use of the Yeast 2.0 oligonucleotide array and determined the manner in which transcription factor networks contribute to CDKs and to global regulation of the cell-cycle transcription process. This microarray dataset is used in our study with the cell cycle network structure of Figure 4; it consists of two groups: one group is obtained from wild-type (WT) cells and the other is from cyclin-mutant (CM) cells, which are disrupted for all S-phase and mitotic cyclins (mutated clb1, 2, 3, 4, 5, and 6).

The microarray data consist of a total of 30 data points taken over 270 minutes. We subdivide them into five states (groups), each consisting of 6 data points. The expression levels are transformed by taking the natural logarithm. Figure 5 depicts the transformed expression profiles of the 13 genes with 5 states. The black and gray solid lines are the expression profiles from WT and CM
Figure 3
Cell cycle regulatory network structure in budding yeast. The genes are represented by circles. Complexes
consisting of two or more proteins are represented by white rectangles. The gray and black boxes are transcription and post-
transcription processes, respectively. Activation processes are depicted by solid lines and inhibitions or repressions are
shown by dashed lines. The genes with gray circles are used to model the G-Networks.
cells, respectively, and S1, S2, ..., S5 represent the five states. It is obvious that the profiles of CLB2 are different between WT and CM cells because the CM dataset is designed to monitor the cell cycle processes without the clb cyclins.

i.e. that the steady-state of the 12 genes does not entirely depend on the expression of CLB2. Table 4 shows the estimated total input rate of the 13 genes. These results also show that only the input rates of CLB2 decrease in the CM group.
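The preprocessing described for the yeast dataset (30 time points split into five states of six points each, followed by a natural-log transform) can be sketched as follows. The sketch assumes that the states are consecutive, equally sized blocks of time points; the function name and the toy expression matrix are hypothetical and stand in for the real microarray values.

```python
import math

def split_into_states(expr, n_states=5):
    """Log-transform a genes x time-points table and split it into states.

    expr: list of rows, one per gene, each a list of positive expression values.
    Returns states[s][g] = natural-log values of gene g in state s.
    Assumes consecutive, equally sized blocks of time points per state.
    """
    n_points = len(expr[0])
    if n_points % n_states != 0:
        raise ValueError("time points must divide evenly into states")
    per_state = n_points // n_states
    states = []
    for s in range(n_states):
        start, stop = s * per_state, (s + 1) * per_state
        states.append([[math.log(v) for v in row[start:stop]] for row in expr])
    return states

# Synthetic stand-in for 13 genes measured at 30 time points.
toy = [[100.0 + 10 * g + t for t in range(30)] for g in range(13)]
states = split_into_states(toy)
print(len(states), len(states[0]), len(states[0][0]))  # 5 13 6
```

Each `states[s]` block can then be analysed separately, which is how per-state values such as those for S1 to S5 would be obtained.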
Figure 4
Cell cycle regulatory network structure with selected 13 genes. Each node represents a queue. Signals are transferred through the edges. Solid and dashed lines are positive and negative interactions, respectively.

can be expressed without the clb cyclins; this result is consistent with the original experimental study.

However, the unchanged steady-state probabilities in all the five states may need to be considered, because the cell cycle has four phases (G1, S, G2, M) and expressions of genes involved with a specific phase are expected to be different from those in other phases. Also the small decrease rate and relatively large total input rates of CLB2 may require a more careful analysis of the G-Network approach in relation to cell cycle GRN structure. The manner in which we have used G-Network models in this paper does not currently include simultaneous interactions with three or more nodes. However, this is not really a limitation of the model, since it suffices to include chain representations of dependencies in the G-Network model, as has been done for neuronal networks [9], to cover excitatory and inhibitory effects that involve three or more nodes, and in fact random chains of nodes of any length. Although in this study the probabilities that gene i affects gene j, P+(i, j) and P-(i, j) in (3), are fixed at the value one, we think that the conventional reverse engineering GRN methods using the “Ensemble” method [21] can provide these probabilities more accurately for an improved steady-state analysis of GRNs.

In conclusion, our study has illustrated the use of G-Networks as a new approach for the steady-state analysis

Methods
Once a GRN structure is determined, it is necessary to estimate the total input rate (Λi) of the ith queue and its steady-state probability (qi). For simplicity, the probabilities P+(i, j), P-(i, j), and Q(i, j, l) in (3) are set to one. Then, it can be rewritten as follows

$$
\begin{aligned}
q_i &= \frac{\Lambda_i + f_i^{+}(q_j)}{R_i + f_i^{-}(q_j)} \\
\text{where}\quad f_i^{+}(q_j) &= \sum_{j=1}^{n} q_j r_j + \sum_{j,l=1,\, l \neq j}^{n} q_j q_l r_j \\
f_i^{-}(q_j) &= \sum_{j=1}^{n} q_j r_j + \sum_{j,l=1,\, l \neq j}^{n} q_l q_i r_l
\end{aligned}
\tag{12}
$$

In (12), Λi and Ri are the total input rate (Λi = λi + Ii) and the total output rate (Ri = ri + μi), respectively. $f_i^{+}$ is a function of the activation probabilities of the genes that affect gene i positively and $f_i^{-}$ is a function of the activation probabilities of the genes that affect gene i negatively. We fix ri as the number of out-degrees of gene i and the degradation rate of mRNA, μi, as a constant (Table 1), because the total output rate Ri is not of interest here. Therefore, we need to estimate two parameters, the total input rate, Λi, and the steady-state probability, qi.

Let $\Lambda_i^{lower}$ be the lower bound of Λi, which is larger than zero. The lower bound of the total input is regarded as an initial transcription rate without any external input. In this study, we use $\Lambda_i^{lower} = 0.0025$ [16]. The upper bound of Λi, $\Lambda_i^{upper}$, is obtained by assuming that inputs from other nodes are zero and the queues fully work. That is

$$ \Lambda_i^{upper} = q_i^{*} \big( R_i + f_i^{-}(q_j^{*}) \big) $$

where the probabilities $q_i^{*}$ and $q_j^{*}$ are one.

Let $q_i^{(0)}$ be the initial value of qi. Then $q_i^{(0)}$ can be obtained as follows,

$$ q_i^{(0)} = E_j\big[ q_{ij}^{(0)} \big] = E_j\big[ x_{ij} / \max(x_{ij}) \big] $$
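To make the quantities in (12) concrete, the sketch below evaluates the initial probability $q_i^{(0)}$, the upper bound $\Lambda_i^{upper}$ and the right-hand side of (12) for a toy two-gene network. It rests on several assumptions that are not taken from the paper: only pairwise terms of $f_i^{+}$ and $f_i^{-}$ are kept (the three-node sums are omitted), the sums run over the listed activators and inhibitors of each gene, the maximum in $q_i^{(0)}$ is taken over the samples of gene i, and the network, rates and function names are hypothetical. It is not the authors' estimation procedure.

```python
def initial_q(x_i):
    """q_i^(0): mean over samples j of x_ij / max_j(x_ij)."""
    peak = max(x_i)
    return sum(v / peak for v in x_i) / len(x_i)

def f_terms(i, q, r, activators, inhibitors):
    """Pairwise parts of f_i^+ and f_i^- in (12); three-node terms omitted."""
    f_plus = sum(q[j] * r[j] for j in activators[i])
    f_minus = sum(q[j] * r[j] for j in inhibitors[i])
    return f_plus, f_minus

def steady_state_q(i, lam, q, r, mu, activators, inhibitors):
    """Right-hand side of (12) for gene i, with R_i = r_i + mu_i."""
    f_plus, f_minus = f_terms(i, q, r, activators, inhibitors)
    return (lam[i] + f_plus) / (r[i] + mu[i] + f_minus)

def lambda_upper(i, r, mu, inhibitors):
    """Upper bound of Lambda_i: q_i* = q_j* = 1 in q_i* (R_i + f_i^-)."""
    ones = {j: 1.0 for j in r}
    _, f_minus = f_terms(i, ones, r, {i: []}, inhibitors)
    return 1.0 * (r[i] + mu[i] + f_minus)

# Hypothetical network: A activates B, B inhibits A.
activators = {"A": [], "B": ["A"]}
inhibitors = {"A": ["B"], "B": []}
r = {"A": 1.0, "B": 1.0}        # out-degree of each gene
mu = {"A": 0.1, "B": 0.1}       # assumed degradation rates
lam = {"A": 0.5, "B": 0.0025}   # B sits at the lower bound 0.0025
q = {"A": initial_q([2.0, 4.0, 3.0]), "B": 0.6}

print(q["A"])                                   # 0.75
print(lambda_upper("A", r, mu, inhibitors))     # upper bound for gene A
print(steady_state_q("B", lam, q, r, mu, activators, inhibitors))
```

In this toy setting gene B reaches a sizeable steady-state probability even with its own input at the lower bound, because the activation from A enters through $f_i^{+}$.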
Figure 5
Expression profiles of selected 13 genes. The black and gray lines represent the wild-type (WT) and clb-mutant (CM)
groups' expression levels.
State Cells CLN3 WHI5 SWI4 MBP1 CLB2 YOX1 YHP1 HCM1 FKH2 NDD1 SWI5 ACE2 SIC1
S1 WT 0.880 0.813 0.829 0.839 0.784 0.99 0.803 0.843 0.855 0.836 0.799 0.99 0.99
CM 0.878 0.814 0.818 0.848 0.770 0.99 0.802 0.842 0.864 0.839 0.787 0.99 0.99
C/W 0.998 1.001 0.987 1.011 0.981 1.00 0.999 0.999 1.011 1.004 0.986 1.00 1.00
S2 WT 0.882 0.845 0.845 0.840 0.847 0.99 0.850 0.870 0.863 0.863 0.825 0.99 0.99
CM 0.876 0.837 0.846 0.847 0.769 0.99 0.853 0.873 0.865 0.861 0.807 0.99 0.99
C/W 0.994 0.990 1.000 1.008 0.909 1.00 1.004 1.004 1.002 0.998 0.978 1.00 1.00
S3 WT 0.890 0.840 0.826 0.846 0.886 0.99 0.844 0.855 0.863 0.854 0.871 0.99 0.99
CM 0.880 0.846 0.820 0.849 0.751 0.99 0.863 0.863 0.869 0.870 0.840 0.99 0.99
C/W 0.989 1.008 0.993 1.003 0.847 1.00 1.022 1.010 1.007 1.019 0.964 1.00 1.00
S4 WT 0.890 0.841 0.837 0.845 0.866 0.99 0.839 0.870 0.862 0.853 0.857 0.99 0.99
CM 0.879 0.835 0.821 0.849 0.757 0.99 0.864 0.864 0.859 0.863 0.845 0.99 0.99
C/W 0.988 0.993 0.982 1.005 0.874 1.00 1.029 0.994 0.996 1.012 0.986 1.00 1.00
S5 WT 0.891 0.850 0.837 0.846 0.877 0.99 0.839 0.869 0.862 0.856 0.865 0.99 0.99
CM 0.869 0.830 0.823 0.842 0.756 0.99 0.862 0.862 0.857 0.861 0.845 0.99 0.99
C/W 0.976 0.977 0.983 0.995 0.862 1.00 1.027 0.991 0.994 1.006 0.976 1.00 1.00
Table 4: Estimated total input rate of the 13 genes in cell cycle GRNs
State Cells CLN3 WHI5 SWI4 MBP1 CLB2 YOX1 YHP1 HCM1 FKH2 NDD1 SWI5 ACE2 SIC1
S1 WT 4.127 2.248 5.309 0.763 1.278 2.006 0.914 0.995 0.884 1.015 1.783 1.006 1.006
CM 4.127 2.248 5.238 0.793 1.217 2.006 0.914 0.995 0.914 1.036 1.702 1.006 1.006
C/W 1.000 1.000 0.987 1.040 0.953 1.000 1.000 1.000 1.034 1.020 0.955 1.000 1.000
S2 WT 4.187 2.339 5.521 0.763 1.430 2.006 0.995 1.036 0.854 0.995 1.945 1.006 1.006
CM 4.187 2.309 5.521 0.793 1.187 2.006 0.995 1.036 0.854 1.056 1.743 1.006 1.006
C/W 1.000 0.987 1.000 1.040 0.830 1.000 1.000 1.000 1.000 1.061 0.896 1.000 1.000
S3 WT 4.187 2.339 5.379 0.763 1.551 2.006 0.995 1.015 0.884 0.955 2.187 1.006 1.006
CM 4.187 2.339 5.379 0.793 1.127 2.006 1.036 1.036 0.884 1.096 1.824 1.006 1.006
C/W 1.000 1.000 1.000 1.040 0.726 1.000 1.041 1.020 1.000 1.148 0.834 1.000 1.000
S4 WT 4.187 2.339 5.450 0.763 1.490 2.006 0.975 1.036 0.854 0.955 2.106 1.006 1.006
CM 4.187 2.309 5.379 0.793 1.157 2.006 1.036 1.036 0.854 1.076 1.864 1.006 1.006
C/W 1.000 0.987 0.987 1.040 0.776 1.000 1.062 1.000 1.000 1.127 0.885 1.000 1.000
S5 WT 4.187 2.369 5.450 0.763 1.521 2.006 0.975 1.036 0.854 0.955 2.147 1.006 1.006
CM 4.127 2.278 5.379 0.793 1.157 2.006 1.036 1.036 0.854 1.076 1.864 1.006 1.006
C/W 0.986 0.962 0.987 1.040 0.761 1.000 1.062 1.000 1.000 1.127 0.868 1.000 1.000