Address: 1Center for Evolutionary Functional Genomics, The Biodesign Institute, Tempe, Arizona State University, USA and 2Instituto de
Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas (IBV-CSIC), Valencia, Spain
Email: Antonio Marco - [email protected]; Ignacio Marín* - [email protected]
* Corresponding author
Abstract
Background: The characterization of the global functional structure of a cell is a major goal in
bioinformatics and systems biology. Gene Ontology (GO) and the protein-protein interaction
network offer alternative views of that structure.
Results: This study presents a comparison of the global structures of the Gene Ontology and the
interactome of Saccharomyces cerevisiae. Sensitive, unsupervised clustering methods applied to a
large fraction of the proteome yielded a GO-interactome correlation value of +0.47 for a
general dataset that contains both high- and low-confidence interactions and +0.58 for a smaller,
high-confidence dataset.
Conclusion: The structures of the yeast cell deduced from GO and interactome are substantially
congruent. However, some significant differences were also detected, which may contribute to a
better understanding of cell function and also to a refinement of the current ontologies.
BMC Systems Biology 2009, 3:69 http://www.biomedcentral.com/1752-0509/3/69
the PPI data generated in massive, high-throughput experiments [7-11].

GO and interactome provide alternative views of how an organism is structured and functions. It is thus logical to explore whether they are congruent. This is, however, problematic, because GO and PPI data are very different. On the one hand, gene products may be either annotated or not with GO terms. Thus, from the point of view of each GO term, the classification is dichotomous. On the other hand, PPI data are best expressed as a graph or network of units (proteins) connected by edges (known interactions). How, then, can these two very different types of information be compared? The simplest way to collate GO and interactome data is to characterize from PPI results groups of densely connected units, i.e. modules [12-15], and then to establish whether modules are statistically enriched for particular GO terms. This strategy has been followed with success by several groups [12,15-18]. Discussions currently center on the best way to define modules so that they make sense from either the mathematical or the biological point of view (e.g. refs. [18-20]), but it is generally accepted that modules are often enriched for particular GO terms. This congruence between GO and PPI data has led to works in which proteins are assigned functions according to the GO annotations of their interaction partners [21-23]. Similarity in GO annotations has also been used to predict interactions among pairs of proteins [24,25].

we described novel strategies of graph analysis and we showed their usefulness to explore the structures of different complex biological graphs, such as the interactome or protein domain graphs [15,28-30]. Our methods generate hierarchical structures (dendrograms) based on the average strength of the connections among the units of a graph, and then establish whether clusters in the dendrograms are enriched for units with particular features. These procedures open the way for a global comparison of interactome and GO. In particular, they avoid the need to select modules to compare with GO. In interactome-based dendrograms, it is possible to include all the proteins that we wish to analyze (without dividing them into those highly connected, included in modules, and those excluded from them) and to establish whether any cluster of proteins, no matter the number of direct interactions among its members, is enriched for GO terms. As we will show, this allows for a precise mathematical determination of the similarity between the GO-based and the interactome-based classifications.

In this study, we obtained a hierarchical representation of large fragments of the interactome of Saccharomyces cerevisiae. Then, we determined and quantified the global similarity between a significant part of the structures of interactome and GO in the yeast. Our results greatly enrich our knowledge of the relationships between the alternative views of the yeast cell that its gene ontology and interactome provide.
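
To make the dendrogram-based strategy described above concrete, the following minimal sketch enumerates every cluster of a small hierarchical tree and reports the share of its members that carry a given annotation. The nested-tuple tree, the protein names (P1 to P6), the annotated set and the `collect_clusters` helper are all hypothetical illustrations; this is not the UVCLUSTER or TreeTracker implementation.

```python
# Illustrative sketch only: a toy dendrogram written as nested tuples (hypothetical data).
def collect_clusters(node, out):
    """Return the leaves under `node`; append the leaf set of each internal node to `out`."""
    if isinstance(node, str):          # a leaf is a single protein name
        return frozenset([node])
    leaves = frozenset().union(*(collect_clusters(child, out) for child in node))
    out.append(leaves)                 # every internal node defines one cluster
    return leaves

tree = ((("P1", "P2"), "P3"), ("P4", ("P5", "P6")))
annotated = {"P1", "P2", "P3"}         # proteins carrying a hypothetical GO term

found = []
collect_clusters(tree, found)
for members in sorted(found, key=len):
    share = len(members & annotated) / len(members)
    print(sorted(members), f"{share:.0%} annotated")
```

Because every internal node of the tree is examined, no prior selection of densely connected modules is needed before asking whether a group of proteins is enriched for a term.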
those trees were significantly enriched for some child GO terms, hierarchically situated just below the parent term in the GO structure. If interactome and GO are congruent, we would expect to detect in a tree clusters of units enriched for the child GO terms. A significant technical point is that, because we use each parent term in isolation, we avoid the analytical problems that would derive from the fact that a GO term sometimes has several parent terms.

Table 1 summarizes the data for the nine parent terms selected for this study (see Methods for the criteria used for choosing them). Interactome data were obtained from two different databases. First, we used all the information available for S. cerevisiae at the Database of Interacting Proteins (DIP; http://dip.doe-mbi.ucla.edu). This dataset contains both low- and high-throughput data, although about 80% of the interactions derive from massive experiments. Second, we used the "Binary gold standard dataset" (which we will call from now on the "GOLD dataset"), a set of 1318 high-confidence binary interactions selected by Yu et al. [31]. The comparison between the results obtained with the DIP dataset and those obtained with the GOLD dataset will allow us to determine whether using massive data creates biases that may affect our general conclusions.

About 79% of the proteins annotated with the nine selected parent terms were included in the interactome dataset that we obtained from the DIP database. The final groups of proteins included in both the GO and the DIP interactome dataset contained from 230 to 632 units (average: 354 units; Table 1). This means that each comparison included from 4 to 11% of all S. cerevisiae proteins. The nine comparisons together included about 44% of the proteins present in the yeast (percentages derived from [32]; notice that a protein may be annotated with multiple terms). The GOLD dataset is much smaller. Only 28% of the proteins annotated with one of the nine parent GO terms were found in that dataset. The average size of the groups analyzed was correspondingly much smaller than in DIP, including on average just 130 proteins (range 63 – 257; Table 1). In the next sections, we will first discuss the results obtained for the DIP dataset and, later, we will show that our main findings are confirmed with the smaller, high-confidence GOLD dataset.

Table 1: Parent GO terms selected for the analysis, and number of elements included.
BP: Biological Process; CC: Cellular Component; MF: Molecular Function. 1: Levels of the parent GO terms. Level 1 terms are hierarchically located
just below the three main categories (BP, CC and MF) while Level 2 terms are below a Level 1 term. 2: Number of genes selected for the analysis,
i.e. those ascribed to the parent GO term which are also included in one of the selected child GO terms. 3: Genes among those in the previous
column that contain ORFs and therefore encode for proteins. 4: Number of products among those in the selected ORFs for which interactions
were compiled in the DIP database. 5: Same as 4, but for the GOLD dataset.

Interactome and GO structures are substantially congruent: DIP data
The nine selected parent GO terms were subdivided into child terms, which are detailed in Table 2. Using DIP data, we found that each child GO term included an average of 96.7 proteins. Table 2 also shows an important preliminary point, namely that interactome and GO data are largely independent. Less than 5% of the proteins analyzed in the DIP dataset were assigned to a particular GO term because of PPI data in the absence of other evidence (i.e. assignations annotated as "inferred from physical interaction" in GO databases). Moreover, this percentage diminishes to only 3% if two exceptional child GO terms (Small nucleolar ribonucleoprotein complex and Structural constituent of cytoskeleton) are excluded and is 0.0% for 19 of the 46 child GO terms. Therefore, we can confidently assume that, if we find evidence for global congruence between the GO and interactome structures, this will not be caused by PPI being systematically used to define to which GO terms the proteins are assigned.

Once the data had been chosen, UVCLUSTER was used to obtain dendrograms, one for each of the nine parent GO terms (see Methods). Then, we searched for clusters of units significantly enriched for child GO terms using TreeTracker (see again the Methods section for the details). In Table 3 and Additional File 1, we describe the results obtained. Table 3 contains the summary of results for parent GO terms and Additional File 1, the details for child GO terms. We used four parameters (coverage, purity, ambiguity and Φ coefficient; see Methods for precise definitions) to quantify the results obtained. The summary of the results detailed in Table 3 is as follows: 1) Confirming that our methodology indeed detects clusters highly enriched for the corresponding GO terms, the purity of the clusters (i.e. the percentage of proteins included in a positive cluster, detected as significantly enriched for a given GO term, which indeed belong to that GO term) was high (62 – 96%, average: 80.1%). This is good evidence that our approach is very sensitive, in agreement with our previous work [30]; 2) Coverage (a measure of the extent to which a given GO term is detected in the interactome data) was considerable, ranging from 34 to 67%, with a global average of 51.2%. This means that a significant fraction of proteins in the examined GO classes are recovered in the interactome-based clusters. Interestingly, GO terms in the Biological Process category had higher coverages (average: 61.2%) than those in the Cellular Component (average: 49.7%) or Molecular Function (average: 39.0%) categories; 3) Ambiguity, which measures cluster overlap, was variable, ranging from 0 to 20% (average: 7.7%); and 4) Phi coefficients (Φ), a precise measure of correlation between GO and interactome data (see Methods), are all positive and quite high (+0.39 to +0.64), with an average of +0.47 ± 0.03. This last
GO term (identifier)                                          DIP N (P)   GOLD N (P)
Developmental process (32502)                                 632 (16)    257 (8)
  Reproductive developmental process (3006)                   26 (0)      13 (0)
  Anatomical structure development (48856)                    186 (15)    94 (8)
  Cellular developmental process (48869)                      450 (1)     169 (0)
  Aging (7568)                                                40 (0)      22 (0)
Reproduction (3)                                              245 (7)     111 (4)
  Sexual reproduction (19953)                                 95 (0)      41 (0)
  Asexual reproduction (19954)                                74 (6)      44 (4)
  Reproductive process (22414)                                207 (7)     88 (4)
  Rep. of a single-celled organism (32505)                    220 (7)     99 (4)
Establishment of cellular localization (51649)                452 (21)    188 (10)
  Secretion by cell (32940)                                   206 (9)     84 (3)
  Establishment of nucleus localization (40023)               17 (0)      ---
  Intracellular transport (46907)                             409 (21)    175 (10)
Response to stimulus (50896)                                  514 (3)     207 (0)
  Response to endogenous stimulus (9719)                      197 (3)     101 (0)
  Cellular response to stimulus (51716)                       13 (0)      ---
  Response to abiotic stimulus (9628)                         83 (0)      32 (0)
  Response to external stimulus (9605)                        27 (0)      13 (0)
  Response to biotic stimulus (6907)                          19 (0)      ---
  Response to chemical stimulus (42221)                       212 (0)     65 (0)
  Response to stress (6950)                                   370 (3)     159 (0)
Ribonucleoprotein complex (30529)                             318 (64)    96 (12)
  Small nuclear ribonucleoprotein complex (30532)             58 (2)      24 (0)
  Preribosome (30684)                                         12 (4)      ---
  Spliceosome (5681)                                          74 (12)     33 (2)
  Small nucleolar ribonucleoprotein complex (5732)            49 (43)     10 (9)
  Ribosome (5840)                                             156 (5)     45 (1)
  Polysome (5844)                                             11 (0)      ---
Organelle envelope (31967)                                    230 (12)    69 (2)
  Organelle inner membrane (19866)                            105 (8)     27 (2)
  Organelle outer membrane (31968)                            24 (0)      ---
  Organelle envelope lumen (31970)                            25 (0)      ---
  Nuclear envelope (5635)                                     86 (3)      35 (0)
  Mitochondrial envelope (5740)                               148 (9)     34 (2)
Transcription regulator activity (30528)                      276 (14)    107 (5)
  Transcriptional activator activity (16563)                  50 (0)      24 (0)
  Transcriptional repressor activity (16564)                  35 (2)      13 (1)
  Transcription factor activity (3700)                        45 (2)      13 (1)
  RNA polymerase II transcription factor activity (3702)      112 (4)     44 (1)
  Transcriptional elongation regulator activity (3711)        14 (6)      ---
  Transcription cofactor activity (3712)                      36 (1)      16 (0)
Structural molecule activity (5198)                           231 (29)    75 (18)
  Structural constituent of ribosome (3735)                   115 (0)     21 (0)
  Structural constituent of cytoskeleton (5200)               50 (29)     31 (18)
Transporter Activity (5215)                                   297 (8)     63 (1)
  Ion transport activity (15075)                              111 (5)     16 (0)
  Carbohydrate transporter activity (15144)                   26 (0)      ---
  ATPase activity, coupled to movement of substances (43492)  41 (2)      ---
  Amine transporter activity (5275)                           27 (0)      ---
  Organic acid transporter activity (5342)                    32 (0)      ---
  Carrier activity (5386)                                     67 (0)      13 (0)
  Intracellular transporter activity (5478)                   28 (0)      17 (0)
  Protein transporter activity (8565)                         48 (1)      29 (1)
  Lipid transporter activity (5319)                           11 (2)      ---
Results for both the DIP and GOLD datasets are indicated. Parent GO terms are flush left and the child GO terms are listed, indented, below them. The numbers in parentheses adjacent to the names refer to the numerical identifiers of the GO terms. N: number of proteins for which we obtained PPI data and whose genes were annotated to the GO term. (P): in parentheses, number of proteins among those N that are annotated with the GO term based exclusively on PPI evidence. The child GO terms with fewer than 10 proteins found when analyzing the GOLD dataset were not further examined (dashes).
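
The independence figures quoted in the text can be checked against the DIP columns of Table 2 with a few lines of Python. The sketch below is a reader's check, not the authors' procedure: it assumes that the percentages were obtained by summing N and P over the 46 child terms, so that a protein annotated to several child terms is counted once per term.

```python
# (GO identifier, N, P) for the 46 child GO terms, DIP columns of Table 2.
DIP = [
    (3006, 26, 0), (48856, 186, 15), (48869, 450, 1), (7568, 40, 0),
    (19953, 95, 0), (19954, 74, 6), (22414, 207, 7), (32505, 220, 7),
    (32940, 206, 9), (40023, 17, 0), (46907, 409, 21),
    (9719, 197, 3), (51716, 13, 0), (9628, 83, 0), (9605, 27, 0),
    (6907, 19, 0), (42221, 212, 0), (6950, 370, 3),
    (30532, 58, 2), (30684, 12, 4), (5681, 74, 12), (5732, 49, 43),
    (5840, 156, 5), (5844, 11, 0),
    (19866, 105, 8), (31968, 24, 0), (31970, 25, 0), (5635, 86, 3), (5740, 148, 9),
    (16563, 50, 0), (16564, 35, 2), (3700, 45, 2), (3702, 112, 4),
    (3711, 14, 6), (3712, 36, 1),
    (3735, 115, 0), (5200, 50, 29),
    (15075, 111, 5), (15144, 26, 0), (43492, 41, 2), (5275, 27, 0),
    (5342, 32, 0), (5386, 67, 0), (5478, 28, 0), (8565, 48, 1), (5319, 11, 2),
]
# Small nucleolar ribonucleoprotein complex, Structural constituent of cytoskeleton
EXCEPTIONAL = {5732, 5200}

def ppi_only_fraction(rows):
    """Share of annotations supported exclusively by PPI evidence (assumed: sum P / sum N)."""
    return sum(p for _, _, p in rows) / sum(n for _, n, _ in rows)

mean_size = sum(n for _, n, _ in DIP) / len(DIP)
all_terms = ppi_only_fraction(DIP)
without_exceptions = ppi_only_fraction([r for r in DIP if r[0] not in EXCEPTIONAL])
no_ppi_only = sum(1 for _, _, p in DIP if p == 0)

print(f"{len(DIP)} child terms, mean size {mean_size:.1f}")
print(f"PPI-only: {all_terms:.1%}; without the two exceptional terms: {without_exceptions:.1%}")
print(f"terms with no PPI-only annotation: {no_ppi_only}")
```

Under that assumption the script gives a mean size of 96.7 proteins, a PPI-only share of about 4.8% (3.2% without the two exceptional terms) and 19 terms with no PPI-only annotation, in line with the values reported in the text.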
result demonstrates that the GO and interactome classifications are, when globally considered, significantly similar.

Additional File 1 details the results for all child terms. In addition to the purity, coverage and Φ coefficient values, that table also details how many significant, non-overlapping clusters were detected for each GO term and how many proteins corresponding to the GO child term were present on average in each cluster. In summary, positive clusters were detected for 45 of the 46 child GO terms. Purities larger than 70% were observed for 31 of those 45 child GO terms, and 22 of the 46 child GO terms had coverages larger than 50%. Φ values were positive for all 45 child GO terms for which we found significant clusters. After setting aside the two already mentioned child GO terms with a high number of assignments based on PPI data, which may therefore be spuriously significant (Small nucleolar ribonucleoprotein complex and Structural constituent of cytoskeleton; see above), we determined the significance level for the other 43 child GO terms using a chi-square test and Bonferroni's correction (see Methods). Φ was highly significant for 41 of those 43 terms (Additional File 1). These results further confirm that GO and interactome are notably congruent.

Table 3: General results for the parent GO terms. Analyses using the DIP dataset.
Parent GO term (identifier)                     Coverage          Purity   Ambiguity        Φ
Developmental process (32502)                   63.6% (402/632)   62.2%    13.0% (74/570)   0.46 ± 0.02
Reproduction (3)                                58.4% (142/245)   94.1%    0% (0/25)        0.38 ± 0.11
Establishment of cellular localization (51649)  66.8% (302/452)   88.4%    1.1% (3/264)     0.43 ± 0.10
Response to stimulus (50896)                    56.4% (290/514)   77.5%    19.5% (32/164)   0.46 ± 0.05
Ribonucleoprotein complex (30529)               59.7% (190/318)   77.8%    12.8% (31/242)   0.64 ± 0.06
Organelle envelope (31967)                      39.6% (91/230)    84.9%    1.2% (1/83)      0.47 ± 0.09
Transcription regulator activity (30528)        43.5% (120/276)   67.6%    15.0% (30/200)   0.40 ± 0.08
Structural molecule activity (5198)             39.8% (92/231)    95.6%    0% (0/165)       0.53
Transporter Activity (5215)                     33.7% (100/297)   72.5%    6.4% (12/186)    0.43 ± 0.06

Figures 2 and 3 graphically show typical results. Figure 2 depicts the UVCLUSTER-based dendrogram of the parent GO term Ribonucleoprotein complex, which includes well-known cellular components such as the ribosome or the spliceosome. Significant clusters for its six child terms are indicated. Interestingly, significant clusters for four out of the six child GO terms (Spliceosome, Ribosome, Small nucleolar ribonucleoprotein complex and Preribosome) were almost completely independent, while significant clusters for the other two (Small nuclear ribonucleoprotein complex and Polysome) appeared included in more comprehensive clusters positive for other child GO terms (Spliceosome and Preribosome, respectively). This overlap explains the relatively high ambiguity of the Ribonucleoprotein complex term (12.8%; Table 3). Figure 3 shows the graph with all the known direct PPI among the proteins in the parent GO term. The color codes allow visualizing why the Spliceosome and Small nuclear ribonucleoprotein complex terms overlap in the UVCLUSTER analyses: a large number of proteins are annotated with both GO terms (shown in Figure 3 as blue/yellow dots). The high degree of purity (77.8%) for the Ribonucleoprotein complex GO term can also be easily visualized in this representation: notice the very few dots with a color different from that of the clusters (surrounded by the polygons). Those correspond to the few proteins included in a cluster but not annotated with the corresponding child GO term.

Analyses of the GOLD dataset: confirming the congruence between GO and interactome
While the results shown in the previous section provide the general picture of the congruence between the GO and interactome classifications that we were interested in determining, we performed additional analyses using the GOLD dataset in order not only to validate those results, but also to check for the potential effects of low-confidence interactions on our conclusions. First, we repeated the screening for assignations to GO terms based only on PPI data, again finding that only 5.6% of the proteins included in our parent GO terms according to the GOLD dataset were in that class and that the percentage again went down to 2.7% when we excluded the same two exceptional terms, Structural constituent of cytoskeleton and Small nucleolar ribonucleoprotein complex, mentioned above. Having demonstrated the almost complete independence of the GO and interactome data, we performed the same analyses that we did before for the DIP dataset. In this case, there were just 33 child GO terms containing 10 or more units. We again focused our analyses on determining whether those 33 groups appeared in the general dendrograms generated with all the proteins annotated to the parent GO terms. Table 4 shows the average results for the nine parent GO terms using the GOLD dataset. They are in general quite similar to those shown before for the DIP dataset (Table 3). As happened in the DIP analyses, both the purity (average: 76.9%; range 64.7% – 93.6%) and coverage (average: 78.9%; range 39.3% – 96.4%) were high. Ambiguity was higher than in the DIP analyses (average 28.1%; range 0% – 46.2%). This result was, however, expected, considering that the number of proteins in the GOLD-based trees is much smaller than in the DIP-based trees, favoring the overlap of the significant clusters. Finally, the positive correlation between GO and interactome measured by the Φ coefficient was also highly significant and a bit higher than in the DIP-based analyses, with an average of +0.58 ± 0.06 (range: +0.37 – +0.91). This difference in average Φ coefficients for the two datasets is, however, not statistically significant (t test). The results for all child GO terms are detailed in Additional File 2. They were very similar to those shown before for the DIP dataset (Additional File 1). We detected significant clusters for all (n = 33) the child GO terms of size ≥ 10. Both purities above 70% and coverages larger than 50% were found in 24 of those 33 terms. After eliminating the two terms with
Figure 2
Hierarchical representation of the protein interaction network for the Ribonucleoprotein complex term
Hierarchical representation of the protein interaction network for the Ribonucleoprotein complex term. On the
left, tree based on secondary distances. The tree on the right is shown to make the topology easier to visualize. At the bottom,
"Unconnected proteins" are those with no direct interactions, which are separated from the rest by UVCLUSTER. Numbers
refer to different clusters found for the same child GO term, which are again shown in Figure 3. snoRNP complex: Small
nucleolar ribonucleoprotein complex; snRNP complex: Small nuclear ribonucleoprotein complex. NMD: nonsense-mediated mRNA
decay. LSM: like-SM protein complex.
Figure 3
Ribonucleoprotein complex protein interaction network
Ribonucleoprotein complex protein interaction network. All the proteins (dots) in this parent GO term that have at
least one direct connection are shown. Colors refer to the child GO terms to which the proteins are annotated. White dots
are proteins that do not belong to any of the analyzed child GO terms. The clusters detected in our analyses are framed with
colored polygons. Color codes and cluster numbers as in Figure 2.
Table 4: General results for the parent GO terms. Analyses using the GOLD dataset.
Parent GO term (identifier)                     Coverage          Purity   Ambiguity        Φ
Developmental process (32502)                   83.3% (214/257)   82.0%    7.2% (16/222)    0.51 ± 0.06
Reproduction (3)                                96.4% (107/111)   82.5%    8.3% (1/12)      0.45 ± 0.03
Establishment of cellular localization (51649)  86.7% (163/188)   76.8%    46.2% (49/106)   0.37 ± 0.02
Response to stimulus (50896)                    78.3% (162/207)   73.2%    32.1% (18/56)    0.48 ± 0.07
Ribonucleoprotein complex (30529)               82.3% (79/96)     70.7%    56.2% (41/73)    0.72 ± 0.03
Organelle envelope (31967)                      87.0% (60/69)     79.5%    26.5% (9/34)     0.70 ± 0.05
Transcription regulator activity (30528)        39.3% (42/107)    64.7%    33.8% (26/77)    0.42 ± 0.03
Structural molecule activity (5198)             69.3% (52/75)     68.8%    42.3% (22/52)    0.91
Transporter Activity (5215)                     87.3% (55/63)     93.6%    0.0% (0/50)      0.63 ± 0.13
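
For readers unfamiliar with the Φ column of Tables 3 and 4, the sketch below shows how a Φ coefficient and its chi-square significance can be computed from a 2×2 table that cross-classifies proteins by cluster membership and GO annotation. It assumes the standard textbook definition of Φ (the exact procedure used in the paper is described in its Methods); the four counts are hypothetical.

```python
from math import erfc, sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table: a = in cluster and annotated, b = in cluster only,
    c = annotated only, d = neither."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_p(phi, n):
    """P value of the chi-square statistic n * phi**2 with one degree of freedom."""
    return erfc(sqrt(n * phi ** 2 / 2))

a, b, c, d = 40, 10, 20, 130           # hypothetical counts
n = a + b + c + d
phi = phi_coefficient(a, b, c, d)
p_value = chi_square_p(phi, n)
n_tests = 43                           # child terms tested in the DIP analysis
print(f"phi = {phi:+.2f}, significant after Bonferroni: {p_value < 0.05 / n_tests}")
```

The Bonferroni step simply divides the nominal threshold (0.05 here, an assumed value) by the number of terms tested.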
a high number of assignments based solely on PPI data, we found that 29 of the 31 remaining child GO terms had significant Φ coefficients. All these results confirm the major findings obtained analyzing the DIP dataset.

Differences between the interactome and GO structures
In spite of the clear general congruence between GO and interactome described in the previous sections, some significant structural differences were also detected in our analyses. We will base the following description mainly on results obtained from the DIP dataset, but similar considerations arose when considering the GOLD data (see some details below).

First of all, several GO terms had low coverages, meaning that PPI data to connect proteins annotated with those terms are limited or absent. The fact that PPI data are still partial obviously contributes to this problem. For example, the GO term Ribonucleoprotein complex had a quite high coverage (59.7% using DIP data; 82.3% using GOLD data) largely because it included several large multiprotein complexes (e.g. both units of the mitochondrial ribosome; spliceosome), for which interactome information is abundant. However, coverage could have been even higher had PPI data for proteins of the cytoplasmic ribosome not been scarce. In fact, no clusters for the cytoplasmic ribosome units were detected (Figure 2). Even so, lack of PPI data does not explain all cases of low coverage. Often, proteins were annotated with particular terms for reasons unrelated to their collaborating in the cell. This explains the especially low coverage values for some terms in the Molecular Function category, which put together proteins with related biochemical properties even if their functions are, from a biological point of view, totally unrelated. Typical in this sense were our results for the child GO term Transcriptional activator activity. In the DIP dataset, this term included 50 proteins, but only 4 proteins were detected in the UVCLUSTER dendrograms (Additional File 1). Coverage was thus one of the lowest in the whole DIP dataset, a mere 8.0%. When we searched for direct interactions among the 50 proteins annotated with this GO term, we found that just 23 loosely interacted (none of those had more than 2 interactions with other proteins in the set). It is extremely unlikely that this is solely due to PPI data for all these proteins having been missed so far. The simplest explanation is that proteins included in this GO term function alone or at most in small groups; they do not form any functional module.

A second significant difference between GO and interactome structures is that most child GO terms were fragmented into multiple significant PPI clusters. For the DIP dataset, we detected on average 4.1 significant clusters for each child GO term, with 14.9 proteins per cluster (Additional File 1). Similar results were obtained for the GOLD dataset (Additional File 2). This fragmentation may be due to three different causes. First, lack of PPI data connecting the clusters, due to the incompleteness of the current PPI information. Alternatively, it could be due to an artifactual division into clusters due to methodological limitations. Finally, it could also be caused by the lumping of several independent cellular modules into single GO terms. Results shown in Figures 2, 3, 4 and 5 for the Ribonucleoprotein complex GO term, using the DIP dataset, suggest an important role for lumping (similar results were obtained for other GO terms). The GO term in those figures for which fragmentation is largest (Ribosome, 5 clusters) is composed of groups of proteins that belong to as many independent functional units: translation initiation factors, ribosome stalk, elongation factors and small and large mitochondrial ribosomal subunits. These functional units are largely independent according to PPI data (Figures 2 and 3). The structure deduced from the interactome is summarized in Figure 4, in which the relationships among the significant clusters of size ≥ 5 are detailed. Five of them correspond to the Ribosome GO term. When we then determined which GO terms among those included in the general GO term Ribonucleoprotein complex contained a significant number of proteins belonging to the five detected Ribosome clusters (see Methods), we found the results summarized in Figure 5. The fact that four clusters (nos. 1, 2, 3, 5) are detected as significantly enriched in different low-level GO terms demonstrates that the detection of multiple clusters is not spurious, but caused by real heterogeneity among the functions of the proteins included in different clusters. The appearance of multiple clusters may thus be ascribed to the fact that the general Ribosome GO term indeed includes independent functional units.

Figure 4 also shows the third main characteristic discrepancy that we have observed between interactome and GO: some clusters (snRNP, snoRNP 1, Ribosome 2) are included within others. This is due to multiple proteins being annotated with two or more GO terms (Figure 3). The high degree of overlap among GO terms can be best detected when we again determine the GO terms to which the proteins in the clusters are annotated (Figures 5 and 6). In some cases (Figure 5), the degree of overlap is limited. However, in others the overlap is very considerable. For example, to generate Figure 6 we took the clusters of size ≥ 5 detected for the GO terms Spliceosome, snRNP and snoRNP shown in Figures 2 and 3 (a total of 4 clusters; DIP dataset) and we determined all the GO terms for which a significant enrichment of proteins in those clusters was present. Notably, all 11 GO terms detected as carrying a higher than expected number of proteins present in those clusters were actually significant for proteins included in two or even three of them (Figure 6). Similar results were found for some other GO terms.
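
Whether a cluster contains more proteins of a GO term than expected by chance can be assessed in several ways; one common choice is a hypergeometric test with a Bonferroni correction. The sketch below uses that test as an assumption (the test actually applied in this study is described in its Methods and may differ). It takes the DIP sizes of Ribonucleoprotein complex (318) and Spliceosome (74) from Table 2 and a hypothetical cluster of 15 proteins, 12 of them annotated; the 0.05 threshold is also assumed.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(X >= k): chance of at least k annotated proteins in a cluster of n
    drawn from N proteins of which K carry the GO term."""
    top = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1)) / comb(N, n)

N, K = 318, 74          # parent term size and child term size (DIP, Table 2)
n, k = 15, 12           # hypothetical cluster: 15 proteins, 12 annotated
n_terms = 6             # child terms of the parent, used for the Bonferroni threshold
p = enrichment_p(k, n, K, N)
print(f"p = {p:.2e}, enriched: {p < 0.05 / n_terms}")
```

The same function can be applied to every cluster and term pair, tightening the threshold as the number of comparisons grows.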
Figure 4
Interactome-based structure of the GO term Ribonucleoprotein complex, as deduced from Figure 2
Interactome-based structure of the GO term Ribonucleoprotein complex, as deduced from Figure 2. For simplicity,
significant clusters of size < 5 are omitted. This eliminates the term Polysome, for which only one cluster of size = 3 was
found.
Figure 5
GO terms for which a significant enrichment was found for proteins in the clusters detected when analyzing the Ribosome child GO term
GO terms for which a significant enrichment was found for proteins in the clusters detected when analyzing
the Ribosome child GO term. Notice how this structure, directly taken from the GO, differs from that shown in Figure 4.
Numbers refer to the five clusters shown also in the other figures (1: Translation initiation factors; 2: Ribosome stalk; 3: Large
mitochondrial subunit; 4: Elongation factors; 5: Small mitochondrial subunit).
Figure 6
GO terms for which a significant enrichment for proteins in the clusters detected for the child GO terms snRNP, snoRNP and Spliceosome was detected
GO terms for which a significant enrichment for proteins in the clusters detected for the child GO terms
snRNP, snoRNP and Spliceosome was detected. The names below the boxes refer to the child GO terms from which
derive the clusters of proteins detected as significant. Notice the obvious overlap due to many proteins belonging to two or
even the three child GO terms.
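
The overlap highlighted in Figure 6 amounts to counting the proteins that carry more than one child GO term. A minimal sketch follows; the protein names (A to F) and their memberships are invented for illustration and do not come from the dataset.

```python
from collections import Counter

terms = {                      # hypothetical annotations
    "Spliceosome": {"A", "B", "C", "D"},
    "snRNP": {"B", "C", "D", "E"},
    "snoRNP": {"D", "F"},
}
# Count, for each protein, how many of the child terms it is annotated with.
membership = Counter(p for proteins in terms.values() for p in proteins)
shared = sorted(p for p, count in membership.items() if count >= 2)
in_all = sorted(p for p, count in membership.items() if count == len(terms))
print("annotated with two or more terms:", shared)
print("annotated with all three terms:", in_all)
```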
term, above). We think that to annotate with a GO term careful reconsideration of these GO terms attending to the
proteins that do not work together in the cell may be PPI data may generate a more natural classification.
acceptable for terms in the Molecular Function category, Finally, a third significant discrepancy between GO and
useful just for obtaining a biochemical classification of interactome regards the overlaps and the hierarchical rel-
gene products. In fact, terms in that category generally had ative position of terms. The knowledge of biological net-
the lowest coverages (see Tables 3, 4). However, low cov- works may be very useful to define the levels in biological
erages for terms in the Biological Process or Cellular Com- ontologies. One of the first goals may be to avoid as much
ponent categories should be regarded with suspicion. A as possible to establish at the same level two terms that
contain many common proteins (e. g. Figure 6). Also, as we have seen (Figures 2 and 4), according to PPI
data, a cluster for one GO term often contains a smaller cluster for another GO term of the same level.
Those two terms may be based, at least in part, on just one functional module, being thus substantially
redundant. This situation should also be avoided as much as possible.

Conclusion
In summary, in Saccharomyces cerevisiae, GO and the global structure of the interactome show a substantial
degree of congruence. This is comforting, given that both classifications have been obtained almost
independently. We conclude that our current "curated" view of the yeast cell, as schematized in the GO, is
globally confirmed by the unsupervised type of analysis developed here. However, the discrepancies detected
mean that the current development of the Saccharomyces Gene Ontology is still incomplete and a better
integration of PPI data may contribute to its improvement.

Methods
We searched the GO annotations compiled in the Saccharomyces Genome Database (SGD;
http://www.yeastgenome.org) for large parent GO terms including 200–1000 proteins and
with at least 4 child GO terms, each with 10 or more proteins. All proteins not included in a child GO term
(i. e. annotated only with the parent GO term) were excluded from the cluster analyses. The UVCLUSTER
program [15] (see http://www.uv.es/~genomica/UVCLUSTER) was then used to obtain the
hierarchical structure of the graphs for each set of proteins annotated with a GO term. The starting point
for obtaining the hierarchical trees in UVCLUSTER analyses is the set of "primary distances" among the
proteins (shortest path lengths in the interactome graph). They were obtained from two sources. The first
was the Database of Interacting Proteins (DIP; http://dip.doe-mbi.ucla.edu). We used the full S. cerevisiae
dataset in DIP, which compiles information from multiple sources, although about 80% of the included
protein-protein interactions derive from high-throughput experiments, either using the yeast two-hybrid
method or affinity purification of protein complexes. The second source was the "Binary gold standard set"
described by Yu et al. [31], which includes only high-confidence data, mostly based on direct physical
interactions characterized by the two-hybrid method. For UVCLUSTER analyses, 10000 iterations, generating as
many alternative topologies, and an affinity coefficient of 100 were used to estimate the "secondary
distances" that are used to build the final dendrograms (see [15] for details on these parameters).
Secondary distances, obtained by weighting the 10000 alternative trees, have clear advantages over primary
distances [15]. Dendrograms using secondary distances were obtained using the UPGMA routine in Mega 3 [33].

UVCLUSTER analyses are very time consuming when the number of units is higher than 1000 [15]. That is why we
selected parent GO terms with at most 1000 annotated proteins. Moreover, we selected parent GO terms
subdivided into multiple child GO terms to speed up the collection and analysis of the data. We finally
centered our analysis on the child GO terms containing at least 10 proteins for which interactome data were
available, discarding smaller child GO terms, to avoid biases that could be caused by a few missing or a few
false positive links in small groups of proteins. Some child GO terms were excluded specifically from the
GOLD analyses, given that in the GOLD dataset they contained fewer than 10 proteins.

GO is divided into three main categories: Biological Process, Cellular Component and Molecular Function. The
first of these groups reflects the known information about the cellular functions in which gene products are
involved, the second refers to the locations (subcellular structures, macromolecular complexes) in which
those products act and the third refers to the biochemical task that the products perform (e. g. they have
certain enzymatic activity, act as receptors, etc.). We retrieved four parent GO terms from the Biological
Process category and three more from the Molecular Function category that comply with our criteria of
selection and were hierarchically located just below these two main categories (these are often called
"level 1 GO terms"). However, none of the level 1 GO terms of the Cellular Component category matched our
criteria of size and number of child terms. We thus selected as parents two level 2 GO terms of that
category that indeed comply with those criteria. The selected parent GO terms are summarized in Table 1.

Explorations of the dendrograms to estimate the enrichment for GO terms were performed as described in [30].
This highly sensitive method, implemented in the TreeTracker program, compares the enrichments for child GO
terms in the observed tree with those in random simulations based on the same tree topology. Whenever the
probability of finding by chance a particular enrichment was sufficiently low (in this study, p < 0.001;
i. e. only 1/1000 of significant clusters detected are expected to be false positives) and provided that the
cluster contained 2 or more units belonging to the analyzed GO term, the cluster was labeled as positive.

To quantify the congruence between GO and interactome, we used four parameters. The first one is the coverage,
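
As an illustration of the "primary distances" mentioned in the Methods (shortest path lengths in the interactome graph), the following is a minimal Python sketch; it is not the UVCLUSTER implementation. The function name `primary_distances` and the toy edge list are assumptions made for this example. The sketch treats the interactome as an unweighted, undirected graph and runs a breadth-first search from every protein.

```python
from collections import deque


def primary_distances(edges):
    """Return shortest path lengths between all connected pairs of nodes.

    `edges` is an iterable of (protein_a, protein_b) interaction pairs.
    The graph is assumed to be unweighted and undirected (an assumption
    of this sketch). The result maps each protein to a dict of
    {reachable protein: number of links on the shortest path}.
    """
    # Build an adjacency map; self-interactions add the node but no link.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    distances = {}
    for source in adjacency:
        # Breadth-first search visits nodes in order of increasing distance.
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen[neighbour] = seen[node] + 1
                    queue.append(neighbour)
        distances[source] = seen
    return distances


if __name__ == "__main__":
    # Hypothetical toy interactome: a chain A-B-C and a separate pair D-E.
    toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
    table = primary_distances(toy_edges)
    print(table["A"])  # distances from A to every protein reachable from it
```

Pairs of proteins that lie in different connected components have no entry in the result, so any downstream clustering step has to decide explicitly how to treat unreachable pairs.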
18. Luo F, Yang Y, Chen CF, Chang R, Zhou J, Scheuermann RH: Modu-
lar organization of protein interaction networks. Bioinformat-
ics 2007, 23:207-214.
19. Brohée S, van Helden J: Evaluation of clustering algorithms for
protein-protein interaction networks. BMC Bioinformatics 2006,
7:488.
20. Marín I, Hoyas S: Basic networks: definition and applications.
Journal of Theoretical Biology 2009, 258:53-59.
21. Deng M, Zhang K, Mehta S, Chen T, Sun F: Prediction of protein
function using protein-protein interaction data. J Comput Biol
2003, 10:947-960.
22. Letovsky S, Kasif S: Predicting protein function from protein/
protein interaction data: a probabilistic approach. Bioinformatics
2003, 19 Suppl 1:i197-204.
23. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif
S: Whole-genome annotation by using evidence integration
in functional-linkage networks. Proc Natl Acad Sci USA 2004,
101:2888-2893.
24. Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M: Assessing the limits
of genomic data integration for predicting protein networks.
Genome Res 2005, 15:945-953.
25. Wu X, Zhu L, Guo J, Zhang DY, Lin K: Prediction of yeast
protein-protein interaction network: insights from the Gene
Ontology and annotations. Nucleic Acids Res 2006, 34:2137-2150.
26. Barabási A, Oltvai ZN: Network biology: understanding the
cell's functional organization. Nat Rev Genet 2004, 5:101-113.
27. Albert R: Scale-free networks in cell biology. J Cell Sci 2005,
118:4947-4957.
28. Arnau V, Marín I: A hierarchical clustering strategy and its
application to proteomic interaction data. Lec Notes Comp Sci
2003, 2652:62-69 [http://www.springerlink.com/content/
mdne0nbmtypjjl6j/].
29. Lucas JI, Arnau V, Marín I: Comparative genomics and protein
domain graph analyses link ubiquitination and RNA metabo-
lism. J Mol Biol 2006, 357:9-17.
30. Marco A, Marín I: A general strategy to determine the congruence
between a hierarchical and a non-hierarchical classification.
BMC Bioinformatics 2007, 8:442.
31. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hiro-
zane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot
A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N,
Fan C, de Smet AS, Motyl A, Hudson ME, Park J, Xin X, Cusick ME,
Moore T, Boone C, Snyder M, Roth FP, Barabási AL, Tavernier J, Hill
DE, Vidal M: High-quality binary protein interaction map of
the yeast interactome network. Science 2008, 322:104-110.
32. Dolinski K, Botstein D: Changing perspectives in yeast research
nearly a decade after the genome sequence. Genome Res 2005,
15:1611-1619.
33. Kumar S, Tamura K, Nei M: MEGA3: Integrated software for
Molecular Evolutionary Genetics Analysis and sequence
alignment. Brief Bioinform 2004, 5(2):150-63.
34. Sokal RR, Rohlf FJ: Biometry: the principles and practice of statistics in
biological research. New York: WH Freeman and Co; 1995.
35. Burset M, Guigó R: Evaluation of gene structure prediction
programs. Genomics 1996, 34:353-367.
36. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov
AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS,
Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden
J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing
computational tools for the discovery of transcription factor
binding sites. Nat Biotechnol 2005, 23:137-144.
37. Zeeberg BR, Qin H, Narasimhan S, Sunshine M, Cao H, Kane DW,
Reimers M, Stephens RM, Bryant D, Burt SK, Elnekave E, Hari DM,
Wynn TA, Cunningham-Rundles C, Stewart DM, Nelson D, Weinstein JN:
High-Throughput GOMiner, an 'industrial-strength'
integrative Gene Ontology tool for interpretation of
multiple-microarray experiments, with application to studies of
Common Variable Immune Deficiency (CVID). BMC Bioinformatics
2005, 6:168.