A Comprehensive Analysis of The Structure-Function Relationship in Proteins Based On Local Structure Similarity
Abstract
Background: Sequence similarity to characterized proteins provides testable functional hypotheses for less than 50% of the
proteins identified by genome sequencing projects. With structural genomics it is believed that structural similarities may
give functional hypotheses for many of the remaining proteins.
Methodology/Principal Findings: We provide a systematic analysis of the structure-function relationship in proteins using
the novel concept of local descriptors of protein structure. A local descriptor is a small substructure of a protein which
includes both short- and long-range interactions. We employ a library of commonly reoccurring local descriptors general
enough to assemble most existing protein structures. We then model the relationship between these local shapes and Gene
Ontology using rule-based learning. Our IF-THEN rule model offers legible, high resolution descriptions that combine local
substructures and is able to discriminate functions even for functionally versatile folds such as the frequently occurring TIM
barrel and Rossmann fold. By evaluating the predictive performance of the model, we provide a comprehensive
quantification of the structure-function relationship based only on local structure similarity. Our findings are, among others,
that conserved structure is a stronger prerequisite for enzymatic activity than for binding specificity, and that structure-
based predictions complement sequence-based predictions. The model is capable of generating correct hypotheses, as
confirmed by a literature study, even when no significant sequence similarity to characterized proteins exists.
Conclusions/Significance: Our approach offers a new and complete description and quantification of the structure-function
relationship in proteins. By demonstrating how our predictions offer higher sensitivity than using global structure, and
complement the use of sequence, we show that the presented ideas could advance the development of meta-servers in
function prediction.
Citation: Hvidsten TR, Lægreid A, Kryshtafovych A, Andersson G, Fidelis K, et al. (2009) A Comprehensive Analysis of the Structure-Function Relationship in
Proteins Based on Local Structure Similarity. PLoS ONE 4(7): e6266. doi:10.1371/journal.pone.0006266
Editor: Joel L. Sussman, Weizmann Institute of Science, Israel
Received October 7, 2008; Accepted June 10, 2009; Published July 15, 2009
This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration which stipulates that, once placed in the public
domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
Funding: This work was supported by The Knut and Alice Wallenberg Foundation, The Swedish Foundation for Strategic Research, The Swedish Research
Council, The Swedish Governmental Agency for Innovation Systems (VINNOVA) and The US National Institutes of Health/National Library of Medicine (LM007085
to KF). These funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Introduction
Revealing functions of proteins is one of the major challenges of molecular biology. Sequence similarity search tools such as BLAST [1] revolutionized biological research by providing functional hypotheses that could be tested experimentally. However, identifying functionally characterized homologues using sequence similarity is only possible for less than 50% of the proteins predicted from genome sequencing projects. Since structure is evolutionarily more conserved than sequence, it is believed that structural information provides a solution for many of the remaining proteins [2,3]. Indeed, the extended goal of structural genomics is to systematically solve protein structures for new protein families [4], use these structures as templates for in silico structure prediction methods [5,6], and then use the solved and predicted structures to infer function [7,8]. However, this requires new computational methods that utilize structure for function prediction. Thus understanding and predicting structure-function relationships in proteins is considered by many to be the holy grail of computational biology.
Approaches to the analysis of the structure-function relationships in proteins either rely on global similarities (fold) or local similarities (motifs) [9–12]. Fold similarities have been shown to associate with function [13,14], and have also been used to infer function-specific sequence patterns [15]. However, many folds such as the TIM barrel and the Rossmann fold are found in
proteins with several different functions [2], and this has led to various local structure-motif methods based on, for example, known functional sites or function-specific sequence patterns [16–21]. Recently, meta-servers have obtained functional predictions by allowing a large number of different types of evidence (including global and local properties) to independently vote for a particular function [22–24].
Here, we provide a comprehensive analysis of the structure-function relationship in proteins, in which a library of recurring multi-fragment structural motifs called local descriptors of protein structure [25,26] is used to learn IF-THEN rules [27,28] that associate combinations of local substructures with specific protein functions. Unlike previous studies, we investigate all recurring motifs and all annotated proteins using no prior knowledge of functional sites or any sequence information. Thus, we induce a rule-model that constitutes a complete representation of the structure-function relationship in proteins based only on structure similarity. By a computational evaluation of the model’s ability to generalize and predict the function of unseen proteins, we offer a full quantification of the structure-function relationship. This enables us to make critical observations about the importance of structure in various aspects of protein function. Our findings can be summarized as follows: (a) nearly two-thirds of all molecular functions are predicted with a statistically significant accuracy, (b) biological processes and cellular components are considerably harder to predict from structure than molecular function, (c) combining local similarities results in better predictive power than using global similarity, in particular for functionally versatile folds, and also allows prediction of the function of new folds, (d) catalytic activities are better predicted than most functions involving binding and this is related to protein dynamics and disorder, and (e) structure-based predictions complement sequence-based predictions and are shown through literature-validation to provide many correct predictions even when no significant sequence similarities exist.
Results
Library of annotated local substructures of proteins
A local descriptor of protein structure is a set of short continuous backbone fragments (segments) centered in three dimensions around a particular amino acid (Figure 1A, B). We built a library of 4197 such recurring local substructures [25] from a representative set of all experimentally determined protein structure domains in the Protein Data Bank (PDB) with less than 40% sequence identity to each other [29,30]. The library was used to automatically represent all protein structures in terms of matching or not matching each of the local substructures. We then organized the Gene Ontology (GO) annotations [31,32] of all characterized proteins into 113 classes of molecular functions, 139 classes of biological processes, and 30 classes of cellular components (see Table 1 and Materials and Methods for details).
Model induction
The relationship between structure and function was modeled using IF-THEN rules [27,28] where the IF-part of each rule specifies a minimal combination of local substructures discerning a particular protein structure from structures annotated to other GO classes (Figure 1C, D). The rule model was induced using only substructures observed in protein structures statistically overrepresented in at least one GO class (Table S1). The GO classes are not mutually exclusive. For example, the catalytic activity of a metalloendopeptidase involving a zinc ion will give rise to the GO molecular function annotations GO:0004222: metalloendopeptidase activity and GO:0008270: zinc ion binding. In addition, some functions are not completely discernible in terms of structure because, e.g., the functionally discriminating properties are too rare to be singled out by general rules. Consequently, the THEN-part of the rules often contains several GO classes with different probabilities (Figure 1D). Our model for GO molecular function encompasses ~20,000 rules describing various overlapping structure-function relationships at different levels of specificity (Table S2). As a point of reference, we also induced rules based on domain-specific global structural similarity in terms of orientations and connectivity of the main secondary structure elements (CATH fold, see Materials and Methods) [33].
Quantification of the structure-function relationship
We argue that a rigorous evaluation of the ability of structure-based models to predict function for unseen proteins is the best way to quantify the degree to which function depends on structure. To this end, we estimated the predictive performance of the models using cross-validation and Receiver Operating Characteristic (ROC) analysis, and report the Area Under the ROC Curve (AUC) [34] for each class of molecular function, biological process and cellular component (Figure 2, Table S3).
Both the local and the global structure-based methods are better at predicting molecular function than at predicting biological process and cellular component (Figure 2). This is not unexpected since proteins sharing a cellular location or being part of a broad biological process need not be structurally related. This adds complementary evidence to other studies that have shown that gene-expression time profiles are needed to explain biological processes [35]. Consequently, we will focus our detailed analysis on molecular function.
For a selected set of decision thresholds, the local substructure approach correctly predicts 51% of the annotations, and at least one annotation for 56% of the proteins, with 37% of the predictions being correct (i.e., precision). The local approach consistently outperforms the global approach (Figure 2B) due to the flexibility associated with combining several local substructures to obtain function-specific rules. In particular, we see a pronounced difference for proteins with the same fold, but different function. For example, 69% of 169 proteins with the Rossmann fold had one function correctly predicted by the local substructure method (precision = 27%), compared to only 17% for CATH (precision = 9%), while corresponding numbers for the 50 TIM barrel proteins were 66% (precision = 21%) for local substructures and 50% (precision = 12%) for CATH. Clearly, the use of local substructures increases the resolution and allows us to functionally discriminate proteins with the same fold.
Catalytic activities rely on conserved structure
Using local substructures, we obtain significant AUC values (i.e. AUC > 0.7) for 82 of the 113 GO molecular function classes. However, not all aspects of molecular function are equally dependent on structure. When the predictive quality of GO classes was investigated in relation to groups of wider functional categories given by the hierarchical nature of GO, we found that 53 of the 63 GO molecular function classes located under GO:0003824: catalytic activity were significantly predicted (P < 0.0020). On the other hand, 15 of 37 classes under GO:0005488: binding (P < 0.027) and all four classes located under GO:0030528: transcription regulator activity (P < 0.0049), three of which also were located under binding, were not significantly predicted. The same tendency was observed in the CATH-based predictions. Our results thus indicate that properties related to binding are difficult to model from the employed representations
Figure 1. Local substructure group with central descriptor 1qama_#37. Descriptors are named: ‘PDB protein domain name’#‘central amino
acid’. A) Cartoon of the secondary structure of the central descriptor and its structural alignment with the ten closest descriptors in the group. B) The
sequence alignment resulting from the structural alignment in A. C) Location in Gene Ontology of the significantly overrepresented (FDR controlled
at 0.05 [39]) molecular functions annotated to the 68 proteins matching the local substructure in A (marked in red). In total, 28 molecular functions
were annotated to the 68 proteins. D) The rule IF (1qama_#37 AND 1xvaa_#68) THEN (GO:0008757 OR GO:0000287) combining the substructure
1qama_#37 in A with the substructure 1xvaa_#68 to uniquely describe 12 of the proteins annotated with GO:0008757: S-adenosylmethionine-
dependent methyltransferase activity. Two of these proteins are additionally annotated with GO:0000287: magnesium ion binding. The rule thus
effectively combines local substructures to address only one of the three statistically significant GO classes related to 1qama_#37.
doi:10.1371/journal.pone.0006266.g001
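The overrepresentation test behind panel C can be illustrated with a minimal sketch, assuming a one-sided hypergeometric test followed by a Benjamini-Hochberg style FDR procedure. The function names, the parameter naming (N, n, k, x) and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
from math import comb

def hypergeom_upper_tail(N, n, k, x):
    """P(X >= x): chance that at least x of the n proteins matching a
    descriptor group belong to a GO class of size k, out of N proteins."""
    total = comb(N, n)
    upper = min(n, k)
    hits = sum(comb(k, i) * comb(N - k, n - i) for i in range(x, upper + 1))
    return hits / total

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if significant at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Hypothetical counts: 68 matching proteins out of 2815, tested against
# three GO classes with made-up sizes (k) and overlaps (x).
tests = [(2815, 68, 40, 12), (2815, 68, 300, 9), (2815, 68, 25, 1)]
p_vals = [hypergeom_upper_tail(*t) for t in tests]
print(list(zip(p_vals, benjamini_hochberg(p_vals))))
```

The exact tail definition and FDR variant used in the paper are not specified in this section, so both are assumptions of the sketch.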
Table 1. Gene Ontology annotations for molecular function, biological process and cellular component.
The second column gives the number of annotated proteins and the number of annotations for these proteins. The third column gives the number of GO classes
selected and the related numbers of proteins and annotations.
doi:10.1371/journal.pone.0006266.t001
Figure 2. Model prediction performance using cross-validation and ROC analysis. A) List of the ten best predicted GO molecular function
classes as measured by the AUC and its standard error [34]. We also report sensitivity (SENS), specificity (SPEC), and the number of true positives (TP),
false positives (FP), true negatives (TN), and false negatives (FN) at one specific decision threshold (THR). See Materials and Methods for details. (B, C
and D) Performance for all GO classes and all three GO subontologies using local substructures or CATH folds at different decision thresholds
(resulting from varying the costs on false positives, see Materials and Methods for details). Coverage is the percentage of proteins with at least one
correct prediction or the percentage of annotations correctly predicted, and precision is the percentage of predictions that are correct. Numbers
corresponding to the decision thresholds in A are circled.
doi:10.1371/journal.pone.0006266.g002
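The coverage, precision, sensitivity and specificity measures defined in this caption can be written out as a short sketch. The data layout (dictionaries mapping a protein identifier to a set of GO classes) and all names are assumptions made for illustration only.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0

def coverage_and_precision(true_ann, predicted):
    """true_ann, predicted: dict protein id -> set of GO class ids."""
    n_annotations = sum(len(gos) for gos in true_ann.values())
    n_predictions = sum(len(predicted.get(p, set())) for p in true_ann)
    correct = {p: gos & predicted.get(p, set()) for p, gos in true_ann.items()}
    n_correct = sum(len(c) for c in correct.values())
    n_proteins_hit = sum(1 for c in correct.values() if c)
    return {
        # share of annotations that were correctly predicted
        "annotation_coverage": _ratio(n_correct, n_annotations),
        # share of proteins with at least one correct prediction
        "protein_coverage": _ratio(n_proteins_hit, len(true_ann)),
        # share of predictions that are correct
        "precision": _ratio(n_correct, n_predictions),
    }

def sensitivity_specificity(tp, fp, tn, fn):
    """Per-class SENS and SPEC at one decision threshold."""
    return _ratio(tp, tp + fn), _ratio(tn, tn + fp)

# Toy example with hypothetical proteins and GO labels.
truth = {"p1": {"GO:A", "GO:B"}, "p2": {"GO:C"}, "p3": {"GO:A"}}
preds = {"p1": {"GO:A", "GO:D"}, "p2": {"GO:B"}}
print(coverage_and_precision(truth, preds))
```

In the toy example one of four annotations is recovered, one of three proteins receives a correct prediction, and one of three predictions is correct.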
of structure while catalytic mechanisms seem to associate well with conserved structural similarity (Table S4). This may be related to the fact that the catalytic action of enzymes is not restricted to the catalytic site, but is connected to inner protein dynamics [36]. CATH folds and, to some degree, local substructures primarily describe protein cores. Thus they may be well suited for modeling catalytic activity. Binding, on the other hand, mainly requires that the protein has a surface with appropriate properties as defined by electrostatic, hydrophobic and van der Waals interactions, and such a surface may be generated by alternative structures. Exceptions from the observation that binding is hard to predict include some of the interactions with metal ions (AUCs of 0.95, 0.92, 0.80, 0.75), which are often involved in the catalytic mechanism, and GTP and ATP binding (AUCs of 0.89 and 0.77), which play very important roles in the enzymatic activity.
Local descriptors that co-occur in rules in the model are selected because they are function-specific. Hence, it is intriguing to observe that such co-occurring substructures, significantly more often than randomly selected substructures (P < 2.2×10^-16), form connected complexes in which one or more residues from each substructure are within 5 Å of each other (see Materials and Methods for details). The recently published contact between a loop region and a hydrophobic cluster associated with the inner dynamics of the enzyme cyclophilin A (CypA) is exactly described by one of our rules (Figure 3A) [36]. We expect that rules that combine local substructures representing stable contact surfaces found in many proteins may turn out to describe general mechanisms behind protein functions (Figure 3B). The fact that local substructure complexes emerge from rules that can predict protein function indicates that the approach chosen here is capable of generalizing and describing protein function beyond approaches based on global similarity. This also demonstrates the advantage of modeling structure-function relationships using explicit and legible IF-THEN rules.
Protein disorder
It has become increasingly clear that some protein functions require intrinsic disorder [37]. By using gaps of three or more residues in X-ray characterized proteins in PDB as an indication of disorder (http://www.disprot.org/), we found a significant correlation between the AUC value of each molecular function class and the degree of disorder in proteins from these classes (correlation coefficient of -0.36, which is different from no correlation at P < 9.9×10^-5). Furthermore, we found that for wider functional categories in GO such as catalytic activity and binding, GO classes that are not significantly predicted display a consistently higher degree of disorder compared to proteins in GO classes that are well predicted (Table S5). The same tendency was observed in the CATH-based prediction. This indicates that some aspects of protein function violate the assumption that sequence determines a specific structure as a prerequisite for function, and is in line with other results reported recently [38]. Examples include
Figure 3. Rules combining local substructures into connected complexes. A) Structure of CypA (PDB id: 1aka). The loop region represented
by Phe 67 is correlated with the dynamics of the core represented by the hydrophobic cluster including Leu 39, Phe 46, Phe 48 and Ile 78 [36]. The
rule combining local substructures 1elva1#604 and 1bif_2#398 describes exactly this mechanism. 1elva1#604 (yellow) matches the loop region
including Phe 67 (space filled yellow), while 1bif_2#398 (blue) matches parts of the core including Leu 39 (space filled blue). The overlap between
the two local substructures is in green. Although all the residues in the hydrophobic cluster are described in our local substructure library, the
minimal IF-THEN rule only needs one residue in the cluster to discriminate the function. B) Two local substructures in the rule in Figure 1D matching
the enzyme Cytosine-N4-Specific (PDB id: 1boo): 1qama_#37 in yellow, 1xvaa_#68 in blue, the overlap in green and the residues in contact as space
filled. The combined local substructures have a very similar number of residues in contact in the 12 matching proteins (the average contact surface
included 25% of the non-overlapping residues in the two local substructures with standard deviation 6.4).
doi:10.1371/journal.pone.0006266.g003
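The contact surface reported in panel B can be approximated with a small sketch, assuming that two residues are in contact when their closest atoms are less than 5 Å apart and that residues shared by both substructures are ignored. The data layout (residue identifier mapped to a list of atom coordinates) and the way the fraction is counted are illustrative assumptions.

```python
import math

def residues_in_contact(atoms_a, atoms_b, cutoff=5.0):
    """True if any atom pair across the two residues is closer than cutoff (in Å)."""
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)

def contact_fraction(sub_a, sub_b, cutoff=5.0):
    """Fraction of non-overlapping residues of two local substructures
    that touch a residue of the other substructure.

    sub_a, sub_b: dict residue id -> list of (x, y, z) atom coordinates.
    """
    only_a = {r: atoms for r, atoms in sub_a.items() if r not in sub_b}
    only_b = {r: atoms for r, atoms in sub_b.items() if r not in sub_a}
    touching = set()
    for res_a, atoms_a in only_a.items():
        for res_b, atoms_b in only_b.items():
            if residues_in_contact(atoms_a, atoms_b, cutoff):
                touching.add(("a", res_a))
                touching.add(("b", res_b))
    total = len(only_a) + len(only_b)
    return len(touching) / total if total else 0.0

# Hypothetical single-atom residues; residue 3 is shared and therefore ignored.
sub_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)], 3: [(20.0, 0.0, 0.0)]}
sub_b = {3: [(20.0, 0.0, 0.0)], 4: [(0.0, 3.0, 0.0)], 5: [(50.0, 0.0, 0.0)]}
print(contact_fraction(sub_a, sub_b))
```

Averaging this fraction over all proteins matched by a rule, together with its standard deviation, would give per-rule figures of the kind quoted in the caption.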
GO:0046983: dimerization activity (AUC = 0.69, disorder = 9.8%), GO:0005261: cation channel activity (AUC = 0.42, disorder = 8.1%) and GO:0003713: transcription coactivator activity (AUC = 0.58, disorder = 8.1%). Thus, such functions may only be predicted correctly by incorporating information on disorder in the rules.
Complementarities of sequence and structure in function prediction
The ultimate validation of predictions is done experimentally. However, in silico validation offers advantages in that a much larger number of hypotheses may be tested and statistically sound conclusions may be drawn. We applied our model to functionally characterized proteins that were not structurally solved at the time of model induction. We divided this test set into proteins with a weak but statistically significant sequence similarity to the training set and proteins with no statistically significant similarity.
We predicted the molecular function of 429 protein structures (with 634 annotations) with a weak but statistically significant sequence similarity (less than 40% sequence identity and E-score less than 0.05) to the training set. For these proteins we were able to predict 45% of all the annotations, and at least one correct annotation for 53% of the proteins, with a precision of 29%. Since this performance is comparable to the cross-validation estimates obtained from the training set (Figure 2), we may conclude that rules based on the library of local substructures generalize well to unseen structures across the whole continuum of sequence similarity. By combining the predictions from the local descriptor approach with predictions derived from the annotations of the closest sequence-neighbor in the training set (detected by PSI-BLAST [1], see Materials and Methods for details), we could correctly predict 70% of all the annotations, and at least one correct annotation for 76% of the proteins, with a precision of 30%. Of all 444 correct predictions, 398 were made by PSI-BLAST (62%) while the remaining 46 were made exclusively by the descriptor-based method. Thus the approach that combines PSI-BLAST and our structure-based method correctly predicts more annotations than PSI-BLAST alone, even when sequence similarity exists.
We finally challenged the system to predict function for 167 unseen proteins (with 224 annotations) with no significant sequence similarity to the training set (E-score greater than 0.05). For these rather demanding targets, the local descriptor method obtained coverage and precision of only around 10%, showing that the model is not independent from sequence even though it is based purely on structure. However, automatic annotations constitute 92.4% of the database and these annotations are generally known to be incomplete. Hence, we manually validated all predictions made by the descriptor approach for these 167 targets. This analysis revealed that out of 190 predictions made for 93 proteins, 91 predictions made for 57 proteins found some support in the scientific literature (Table S6). One example is the protein alanyl-tRNA synthetase (PDB id. 1riq) with four predictions: GO:0000049: tRNA binding, GO:0000287: magnesium ion binding, GO:0005524: ATP binding and GO:0004812: tRNA ligase activity. Only the last two predictions were annotated. However, all of them were verified as correct by literature search. Furthermore, the fold of this protein was not represented in the training set and thus this protein could not have been correctly predicted using global structural similarity. The fact that the local substructure method is fully automatic is also an advantage over methods that rely on manual assignments since predictions can be made for newly solved structural genomics targets. Only 58 of the 167 recently solved structures discussed here have so far been assigned a fold in CATH. Finally, although structural similarity in the
Figure 4. Overview of the function prediction method. A library of local descriptors of protein structure is built from a representative subset of
PDB (i.e. training set). The library is used to represent protein structures, and a model that discriminates classes of Gene Ontology annotations is
induced using combinations of local substructures. The model is evaluated both internally and on an external test set.
doi:10.1371/journal.pone.0006266.g004
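The prediction step in this overview can be sketched as a vote tally, assuming each protein is represented by the set of local descriptors it matches and each rule carries its support per GO class. The first rule reuses the descriptors and counts from the Figure 1D caption; the second rule and its descriptor name are hypothetical, and the conversion of votes to p-values is left out.

```python
from collections import Counter

# Each rule: IF all descriptors match THEN vote for the listed GO classes,
# weighted by the number of training proteins supporting each class.
RULES = [
    {   # rule from the Figure 1D caption (12 proteins, 2 of them also Mg binding)
        "if": frozenset({"1qama_#37", "1xvaa_#68"}),
        "support": {"GO:0008757": 12, "GO:0000287": 2},
    },
    {   # hypothetical rule, included only to show vote accumulation
        "if": frozenset({"hypothetical_#1"}),
        "support": {"GO:0005524": 5},
    },
]

def tally_votes(matched_descriptors, rules):
    """Sum the support of every rule whose IF-part is fully matched."""
    votes = Counter()
    for rule in rules:
        if rule["if"] <= matched_descriptors:  # subset test on descriptor sets
            votes.update(rule["support"])
    return votes

protein = {"1qama_#37", "1xvaa_#68", "hypothetical_#9"}
print(tally_votes(protein, RULES).most_common())
```

Representing the IF-part as a set makes the matching test a plain subset check, which keeps rules with one descriptor and rules with several descriptors on the same code path.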
products), biological processes (i.e., broad biological goals accomplished by an ordered assembly of molecular functions), and cellular components (i.e., locations where gene products are active).
We obtained annotations from the GO homepage (http://www.geneontology.org) [32] for the proteins used to build the local substructure library described earlier (2878 proteins with 4006 protein domains in ASTRAL). We distributed these annotations (upwards) in the GO graph (version 1.419), and discarded all GO terms (nodes) used to annotate fewer than ten proteins. We then selected, among the remaining terms, the most specific terms as our training classes (Table 1, Table S3). By only considering GO terms used to annotate at least ten proteins, some annotations were lost. However, a majority of the proteins kept at least one annotation, indicating that there is a set of large classes providing at least one annotation for almost every protein, and that the additional annotations often are from less populated GO terms. Furthermore, selecting specific classes as training classes resulted in the loss of some general annotations.
Significant GO classes in descriptor groups
We used the hypergeometric distribution to calculate p-values reflecting to which degree proteins annotated to a particular GO class were over-represented in the descriptor group. We then used false discovery rate (FDR) [39] controlled at 0.05 to define statistically significant local descriptors. FDR is a method for correcting for multiple hypotheses in statistical hypothesis testing.
In the library, 84% of the descriptor groups had a significant overrepresentation of proteins annotated to at least one of the 113 molecular function classes (FDR controlled at 0.05). Corresponding numbers for the 139 biological processes and 30 cellular components were 77% and 29%, respectively. All GO classes for all three parts of GO were significantly overrepresented in at least one descriptor group (with the exception of the cellular component GO:0005938: cell cortex). See Table S1 for details.
It is a fundamental principle in machine learning that a higher ratio of examples to features produces models that perform better on unseen cases (given the same class separability). To cope with the large number of structural features (i.e., 4197 local substructures) compared to the number of proteins (2815 for molecular function) in this study, we only used FDR significant descriptor groups to induce rule models.
CATH
CATH [33] (version 2.6.0) is a classification tree that classifies domain structures, in increasing specificity, according to class (C), architecture (A), topology (T) and homologous superfamily (H). Class is assigned according to the secondary structure composition and packing of the structure domain. This is done automatically in 90% of the cases. Architecture refers to the overall shape of a domain structure in terms of the relative orientations of the secondary structure elements. Architecture is assigned manually. Topology refers to the connectivity of the secondary structure elements in otherwise similar architectures, and the assignment is done automatically. Finally, homologous superfamily refers to the proteins that are homologues as determined by sequence similarity. These assignments are also done manually.
Local descriptors are classified into groups according to the relative positioning and orientation of their segments. Hence this corresponds to architecture in CATH. However, CATH architecture is too general for function prediction (results not shown). Moreover, CATH homologous superfamily would introduce sequence similarity into the analysis and would therefore obscure the pure structure-function signal. Hence we opt for using CATH topology (i.e. fold) in this paper rather than CATH homologous superfamily or other, more manually inferred databases [11].
Sequence-based predictions
The sequence comparison program PSI-BLAST [1] was used to obtain sequence-based predictions. Each domain in the external test set was blasted against the training set with a sequence profile obtained using the non-redundant sequence database of NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr) (PSI-BLAST was run with three iterations and an E-value threshold of 0.005 for including a sequence in the model). The annotations for the closest match in the training set (determined by E-value) were used as predictions, and predictions for a protein were taken to be the predictions for all its domains.
Contacts between local substructures
For each rule and each matching protein, we computed the average fraction of residues in pairs of local substructures that were in contact (residues common to both of the local substructures were not considered). We defined two residues to be in contact if the shortest distance between atoms in these amino acids was less than 5 Å. This threshold is based on the hydrophobic contact distance between ligands and proteins [40]. Hence, for each rule we obtained the average number of contacts between pairs of substructures in matching proteins and the standard deviation indicating the stability of these contact surfaces over different proteins. The average contact surface of a pair of local substructures in rules encompassed 11% of the non-overlapping residues in this pair with an average standard deviation of 9.3. For comparison, we randomly sampled 1000 pairs of local substructures matching at least two proteins and where the substructures occurred in at least one of the rules. The contact surfaces for function-specific rules were significantly greater than for these randomly sampled pairs (at P < 2.2×10^-16 using the Kolmogorov-Smirnov test), while the standard deviations were significantly smaller (also at P < 2.2×10^-16). Some large contact conformations were particularly stable; 8.5% of the rules were associated with an average contact surface that included more than 20% of the residues and where the standard deviation was less than 5%. This was only true for 2.8% of the randomly sampled local substructure pairs.
Rule-learning
The rough set theory [27,28] constitutes a mathematical framework for inducing rules from examples. We used this framework, as implemented in the ROSETTA rough set system [28] (http://rosetta.lcb.uu.se), for learning IF-THEN rules associating combinations of local substructures of proteins with particular GO classes. The framework has previously been used to learn GO biological process from gene-expression time profiles [35,41] (see Hvidsten et al. (2003) for a more theoretical/mathematical treatment of the rule-learning method).
In principle, the method finds the minimal sets of local substructures that discern a particular protein from all other proteins annotated to a different GO class. One rule is then constructed from each such set, so that the IF-part is the combination of these local substructures and the THEN-part is all GO classes used to annotate proteins matching the IF-part. If the rule includes several GO classes, it means that the corresponding protein is annotated with a GO class that cannot be uniquely defined from the local substructure data (i.e., the class is said to be rough). In this study, we used a genetic algorithm to find approximate minimal sets that discern each protein from a sufficiently large fraction (at least 90%) of the proteins from other
GO classes. Rules from such approximate solutions are less likely Found at: doi:10.1371/journal.pone.0006266.s001 (1.28 MB
to overfit the data and handle noise better than exact solutions. PDF)
We compared the approach using rules based on minimal,
Table S2 All induced rules for molecular function. For each
discerning subsets of local substructures, with the approach of
Gene Ontology molecular function class in the THEN-part, the p-
using all rules based on one single local substructure. Such very
value is given together with the parameters for the hypergeometric
simple decision rules, called 1R rules, were proposed by Holte
[42]. Using this approach we found that combinations were distribution used to compute the p-values: N,n,k,x, where
important for the local descriptor approach, but did not help when N = 2725 is the number of protein-GO class pairs in the data
using CATH folds. set, n is the number of proteins matched the IF-part of the rule, k is
the number of proteins in the GO class and x is the number of
proteins matched by the rule and in the GO class.
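The rule construction described in this section (an IF-part made of local substructures that discern a protein from proteins in other GO classes, and a THEN-part listing all GO classes of the proteins matching the IF-part) can be illustrated with a small sketch. This is not the implementation used in the study: the genetic algorithm is replaced here by a simple greedy search, "proteins from other GO classes" is taken to mean proteins sharing no GO class with the protein in question, and the function name `induce_rule`, the toy proteins `p1` to `p4`, the substructures `d1` to `d4` and the classes `GO:A` and `GO:B` are all hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (IF-part, THEN-part)

# Hypothetical toy data: protein -> local substructures, protein -> GO classes.
SUBSTRUCTURES: Dict[str, Set[str]] = {
    "p1": {"d1", "d2", "d3"},
    "p2": {"d1", "d2"},
    "p3": {"d1", "d4"},
    "p4": {"d3", "d4"},
}
GO: Dict[str, Set[str]] = {
    "p1": {"GO:A"}, "p2": {"GO:A"}, "p3": {"GO:B"}, "p4": {"GO:B"},
}


def induce_rule(protein: str,
                substructures: Dict[str, Set[str]],
                go: Dict[str, Set[str]],
                min_fraction: float = 0.9) -> Rule:
    """Build one IF-THEN rule for `protein` (greedy stand-in for the genetic algorithm)."""
    # Assumption: discern from proteins that share no GO class with `protein`.
    others = {q for q in substructures
              if q != protein and not (go[q] & go[protein])}
    chosen: Set[str] = set()
    # Other-class proteins that still match every chosen substructure.
    still_matching = set(others)
    while others and 1 - len(still_matching) / len(others) < min_fraction:
        candidates = sorted(substructures[protein] - chosen)
        if not candidates:
            break
        # Number of still-matching proteins each candidate would exclude.
        gain = {d: sum(d not in substructures[q] for q in still_matching)
                for d in candidates}
        best = max(candidates, key=gain.__getitem__)
        if gain[best] == 0:
            break  # no remaining substructure discerns anything further
        chosen.add(best)
        still_matching = {q for q in still_matching if best in substructures[q]}
    if_part = frozenset(chosen)
    # THEN-part: all GO classes annotating any protein that matches the IF-part.
    matched = [q for q in substructures if if_part <= substructures[q]]
    then_part = frozenset(c for q in matched for c in go[q])
    return if_part, then_part


if __name__ == "__main__":
    print(induce_rule("p1", SUBSTRUCTURES, GO))
```

If the loop stops before the 90% target is reached, the THEN-part may contain several GO classes, which corresponds to the rough case described in the text.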
Prediction and evaluation
Found at: doi:10.1371/journal.pone.0006266.s002 (1.16 MB
We tested the generalizing capability of our rule approach using
ten-fold cross-validation. The set of proteins was randomly divided PDF)
into ten equally sized subsets. A rule classifier was induced from Table S3 Prediction performance. (a) Prediction performance.
nine subsets (the training set) and used to classify the proteins in 10-fold cross-validation AUC estimates for all molecular function
the remaining subset (the test set). This procedure was repeated ten classes (b) Prediction performance. 10-fold cross-validation AUC
times, so that each protein was in the test set once and in the estimates for all biological process classes (c) Prediction perfor-
training set nine times. mance. 10-fold cross-validation AUC estimates for all cellular
A protein was classified by letting each matching rule cast votes component classes
to the GO classes specified by the rule. The number of votes cast Found at: doi:10.1371/journal.pone.0006266.s003 (0.04 MB
by each rule to each class corresponded to the number of proteins PDF)
in the training set from that class that matched that rule (i.e., the
rule support). A p-value was then calculated for each class based Table S4 The overrepresentation of GO classes with significant
on the votes using the hypergeometric distribution. These p-values AUC values. (a) Local substructures. The overrepresentation of
were obtained during cross-validation and a ROC curve was GO classes with significant AUC values (AUC. = 0.7) and not
computed for each class plotting sensitivity against specificity for significant values (AUC,0.7). P-values are calculated based on the
all possible p-value thresholds. Sensitivity is TP/(TP+FN) and number of proteins and the number of GO classes in each of the
specificity is TN/(TN+FP) where TP is True Positives, FP is False more general GO terms. (b) CATH folds. The overrepresentation
Positives, TN is True Negatives and FN is False Negatives. The of GO classes with significant AUC values (AUC. = 0.7) and not
ROC curve evaluates the threshold-independent performance of significant values (AUC,0.7). P-values are calculated based on the
the classifier. We reported the area under the ROC curve (AUC) number of proteins and the number of GO classes in each of the
as a measure of performance. This value is between 0 and 1, where more general GO terms.
1 signifies perfect discrimination while 0.5 signifies no discrimina- Found at: doi:10.1371/journal.pone.0006266.s004 (0.01 MB
tory power at all. When doing actual function predictions we used PDF)
p-value thresholds from the ROC curves corresponding to the Table S5 Protein disorder. (a) Local substructures. Protein
points maximizing sensitivity plus specificity (specificity was always
disorder. Average disorder in the top level of Gene Ontology
greater than 0.90 to control the number of false positives due to
and correlation between predictive performance in terms of AUC
the large number of classes).
cross validation and protein disorder. (b) CATH folds. Protein
By randomly shuffling the molecular function annotations we
disorder. Average disorder in the top level of Gene Ontology and
showed that cross-validation AUC values equal to or greater than
correlation between predictive performance in terms of AUC cross
0.7 are unlikely to be obtained by chance (P,0.01). Thus
validation and protein disorder.
AUC$0.7 was denoted statistically significant in this study.
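The evaluation steps described under Prediction and evaluation can be sketched in a few lines. The example below is an illustration under stated assumptions, not the code used in the study. It computes an upper-tail hypergeometric p-value from the four counts N, n, k, x as defined in the Supporting Information captions, computes the AUC as the fraction of (positive, negative) pairs ranked correctly by p-value (equivalent to the area under the ROC curve), and picks the p-value threshold maximizing sensitivity plus specificity. How the votes of several rules are combined into the four counts is not detailed in the text, so the counts are taken as given. The function names and the toy p-values are hypothetical.

```python
from math import comb
from typing import Optional, Sequence, Tuple


def hypergeom_pvalue(N: int, n: int, k: int, x: int) -> float:
    """P(X >= x) when n items are drawn from N, of which k belong to the class."""
    upper = min(n, k)
    tail = sum(comb(k, i) * comb(N - k, n - i)
               for i in range(max(x, 0), upper + 1))
    return tail / comb(N, n)


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUC from p-values: a smaller p-value means a stronger prediction.

    Counts the fraction of (positive, negative) pairs in which the positive
    protein has the smaller p-value; ties count as one half.
    """
    wins = sum(1.0 if p < q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_threshold(pos: Sequence[float],
                   neg: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing sens + spec.

    A protein is predicted to belong to the class if its p-value <= threshold.
    """
    best: Optional[Tuple[float, float, float]] = None
    for t in sorted(set(pos) | set(neg)):
        sens = sum(p <= t for p in pos) / len(pos)   # TP / (TP + FN)
        spec = sum(q > t for q in neg) / len(neg)    # TN / (TN + FP)
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    assert best is not None
    return best


if __name__ == "__main__":
    # Hypothetical cross-validation p-values for one GO class.
    positives = [0.01, 0.2]       # proteins annotated with the class
    negatives = [0.1, 0.5, 0.9]   # proteins not annotated with the class
    print(hypergeom_pvalue(10, 3, 4, 2))
    print(auc(positives, negatives))
    print(best_threshold(positives, negatives))
```

The sketch does not enforce the additional constraint mentioned in the text that specificity stayed above 0.90; that could be added by skipping thresholds whose specificity falls below the bound.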
Found at: doi:10.1371/journal.pone.0006266.s005 (0.03 MB
PDF)
Supporting Information
Table S6 Literature evaluation. Predictions and literature
Table S1 All FDR significant local descriptor-GO class pairs. (a) evaluation of the 167 proteins with no homology to the training
Molecular function: All FDR significant local descriptor-GO class set.
pairs. (b) Biological process: All FDR significant local descriptor-
Found at: doi:10.1371/journal.pone.0006266.s006 (0.06 MB
GO class pairs. (c) Cellular component: All FDR significant local
PDF)
descriptor-GO class pairs. PARAMETERS refer to the parame-
ters in the hypergeometric distribution used to compute the p-
values: N,n,k,x, where N is the number of protein-GO class pairs Author Contributions
in the data set, n is the number of proteins matched by the local Conceived and designed the experiments: TRH. Performed the experi-
descriptor, k is the number of proteins in the GO class and x is the ments: TRH. Analyzed the data: TRH AL AK GA KF. Contributed
number of proteins matched by the local descriptor and in the GO reagents/materials/analysis tools: TRH AK KF JK. Wrote the paper:
class. TRH AL AK GA KF JK.
References
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
2. Kinoshita K, Nakamura H (2003) Protein informatics towards function identification. Curr Opin Struct Biol 13: 396–400.
3. Skolnick J, Fetrow JS (2000) From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol 18: 34–39.
4. Chandonia JM, Brenner SE (2006) The impact of structural genomics: expectations and outcomes. Science 311: 347–351.
5. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294: 93–96.
6. Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T (2007) Assessment of CASP7 predictions for template-based modeling targets. Proteins 69 Suppl 8: 38–56.
7. Zhang C, Kim SH (2003) Overview of structural genomics: from structure to function. Curr Opin Chem Biol 7: 28–32.
8. Murzin AG, Patthy L (1999) Sequences and topology: From sequence to structure to function. Curr Opin Struct Biol 9: 359–362.
9. Orengo CA, Todd AE, Thornton JM (1999) From protein structure to function. Curr Opin Struct Biol 9: 374–382.
10. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA (2000) From structure to function: approaches and limitations. Nat Struct Biol 7 Suppl: 991–994.
11. Ouzounis CA, Coulson RM, Enright AJ, Kunin V, Pereira-Leal JB (2003) Classification schemes for protein structure and function. Nat Rev Genet 4: 508–519.
12. Lee D, Redfern O, Orengo C (2007) Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8: 995–1005.
13. Shakhnovich BE, Dokholyan NV, DeLisi C, Shakhnovich EI (2003) Functional fingerprints of folds: evidence for correlated structure-function evolution. J Mol Biol 326: 1–9.
14. Hegyi H, Gerstein M (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 288: 147–164.
15. Pazos F, Sternberg MJ (2004) Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A 101: 14754–14759.
16. Di Gennaro JA, Siew N, Hoffman BT, Zhang L, Skolnick J, et al. (2001) Enhanced functional annotation of protein sequences via the use of structural descriptors. J Struct Biol 134: 232–245.
17. Kasuya A, Thornton JM (1999) Three-dimensional structure analysis of PROSITE patterns. J Mol Biol 286: 1673–1691.
18. Jonassen I, Eidhammer I, Taylor WR (1999) Discovery of local packing motifs in protein structures. Proteins 34: 206–219.
19. Russell RB (1998) Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J Mol Biol 279: 1211–1227.
20. Ferre F, Ausiello G, Zanzoni A, Helmer-Citterich M (2004) SURFACE: a database of protein surface regions for functional annotation. Nucleic Acids Res 32: D240–244.
21. Polacco BJ, Babbitt PC (2006) Automated discovery of 3D motifs for protein function annotation. Bioinformatics 22: 723–730.
22. Laskowski RA, Watson JD, Thornton JM (2005) ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res 33: W89–93.
23. Pal D, Eisenberg D (2005) Inference of protein function from protein structure. Structure 13: 121–130.
24. Watson JD, Sanderson S, Ezersky A, Savchenko A, Edwards A, et al. (2007) Towards fully automated structure-based function prediction in structural genomics: a case study. J Mol Biol 367: 1511–1522.
25. Hvidsten TR, Kryshtafovych A, Fidelis K (2009) Local descriptors of protein structure: A systematic analysis of the sequence-structure relationship in proteins using short- and long-range interactions. Proteins 75: 870–884.
26. Hvidsten TR, Kryshtafovych A, Komorowski J, Fidelis K (2003) A novel approach to fold recognition using sequence-derived properties from sets of structurally similar local fragments of proteins. Bioinformatics 19 Suppl 2: II81–II91.
27. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Theory and decision library Series D, System theory, knowledge engineering, and problem solving. Dordrecht; Boston: Kluwer Academic Publishers. pp 229.
28. Komorowski J, Øhrn A, Skowron A (2002) The ROSETTA Rough Set Software System. In: Klösgen W, Zytkow J, eds. Handbook of Data Mining and Knowledge Discovery. Oxford University Press. pp 554–559.
29. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
30. Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 28: 254–256.
31. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
32. Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, et al. (2003) The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res 13: 662–672.
33. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. (1997) CATH–a hierarchic classification of protein domain structures. Structure 5: 1093–1108.
34. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36.
35. Lægreid A, Hvidsten TR, Midelfart H, Komorowski J, Sandvik AK (2003) Predicting gene ontology biological process from temporal gene expression patterns. Genome Res 13: 965–979.
36. Eisenmesser EZ, Millet O, Labeikovsky W, Korzhnev DM, Wolf-Watz M, et al. (2005) Intrinsic dynamics of an enzyme underlies catalysis. Nature 438: 117–121.
37. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z (2002) Intrinsic disorder and protein function. Biochemistry 41: 6573–6582.
38. Lobley A, Swindells MB, Orengo CA, Jones DT (2007) Inferring function using patterns of native disorder in proteins. PLoS Comput Biol 3: e162.
39. Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 57: 289–300.
40. Rarey M, Kramer B, Lengauer T (1999) Docking of hydrophobic ligands with interaction-based matching algorithms. Bioinformatics 15: 243–250.
41. Hvidsten TR, Lægreid A, Komorowski J (2003) Learning rule-based models of biological process from gene expression time profiles using gene ontology. Bioinformatics 19: 1116–1123.
42. Holte RC (1993) Very simple classification rules perform well on most commonly used datasets. Machine learning 11: 63–91.