Abstract—Protein-protein interaction (PPI) plays a key role in understanding cellular mechanisms in different organisms. Many supervised classifiers like Random Forest (RF) and Support Vector Machine (SVM) have been used for intra- or inter-species interaction prediction. To improve the prediction performance, in this paper we propose a novel set of features to represent a protein pair using their annotated Gene Ontology (GO) terms, including their ancestors. In our approach a protein pair is treated as a document (bag of words), where the terms annotating the two proteins represent the words. The feature value of each word is calculated using the information content of the corresponding term multiplied by a coefficient, which represents the weight of that term inside a document (i.e., a protein pair). We have tested the performance of the classifier using the proposed feature on different well known data sets of different species like S. cerevisiae, H. sapiens, E. coli and D. melanogaster. Comparison with other GO based feature representation techniques demonstrates its competitive performance.
Index Terms—Protein interaction prediction, GO based feature, Kernel methods for PPI prediction.
[28] and [29]. Many of them have shown that high GO similarity among proteins generally indicates interactions between them, with the GO similarity value being considered as a measure of confidence of the interaction. Therefore, interaction prediction based on a semantic similarity score is basically unsupervised in nature [30].

Ben-Hur et al. [14] used a non-sequence kernel derived from GO similarity to train an SVM classifier for PPI prediction (the term non-sequence indicates that the kernel does not use any sequence information). The GO similarity score between two proteins was considered as input to the kernel function. In their experiments [31] they used the non-sequence kernel as a mixture of kernels derived from the GO similarity score, the BLAST score of orthologous interacting protein pairs and the mutual clustering coefficient derived from the PPI network. The GO based non-sequence kernel was expressed by the following equation [14]:

K_nonseq((p1, p2), (p1', p2')) = K'(S_GO(p1, p2), S_GO(p1', p2')),   (1)

where S_GO is a GO similarity score between two proteins. Here, for computing the kernel for two pairs of proteins (p1, p2) and (p1', p2'), individual values of GO similarity between p1 (or p1') and p2 (or p2') are used. The kernel does not explicitly compare the similarity between the two protein pairs with respect to their annotated GO term space. Tastan et al. [32] have used GO similarity values as features to predict viral-host protein interactions.

Instead of using the GO similarity value of protein pairs directly as features, Maetschke et al. [30] used GO terms in a binary feature vector for supervised learning. Random Forest and Bayesian classifiers were used in their work. The idea is that if a protein pair (p1 and p2) has experimental evidence of interaction, then another protein pair (p1' and p2') having a GO term annotation profile similar to that of (p1 and p2) may also interact. Learning was done with feature vectors prepared from the GO terms annotating the two proteins, as well as their ancestors. For the purpose of feature representation, various combinations of ancestor terms were considered, viz., all ancestor terms of an annotated pair, terms that are common from the lowest common ancestor up to the root of an annotated pair, terms up to the lowest common ancestor of annotated term pairs of a protein pair, etc. Finally, the feature vector was expressed as a combined binary vector, based on the presence or absence of the terms selected by any of the above mentioned combination methods of induced terms for a protein pair (i.e., if the k-th term is present in the term set of a protein pair, then the k-th position of the feature vector is 1, otherwise 0). So if a particular proteome has a total of n unique GO term annotations (including all ancestor terms), then the dimension of the feature vector is n. An intuition behind this representation is to train a classifier using the associated GO terms of interacting and non-interacting protein pairs. However, the simple 0-1 based representation of features has a major problem: all the terms annotating a protein are represented by 1 regardless of where the terms are located. Note that terms appearing at different levels of a GO graph have different information content (IC) values [33]. IC is defined as the negative logarithm of the probability of occurrence of a GO term in a proteome [33]:

IC(t) = -log(p(t)),   (2)

p(t) = [annotation(t) + Σ_{d ∈ descendant(t)} annotation(d)] / Σ_{c ∈ descendant(root)} annotation(c),   (3)

where p(t) is the probability of occurrence of a term t, annotation(t) is the number of direct annotations of the term t in the considered proteome, and Σ_{d ∈ descendant(t)} annotation(d) is the number of indirect annotations contributed by the descendants of the term t in the GO graph. The sum of these two counts is divided by the total number of annotations in the proteome. GO terms with higher IC values indicate more specific biological concepts; representing all the terms by the same value of 1 may therefore be misleading. Another important issue is that a term may be annotated directly, or indirectly through any descendant of that term. Therefore, a good representation of features based on GO terms should consider the specificity of each term as well as the annotation depth of the indirectly annotated terms. In the approach proposed in this article, all annotated GO terms, including their ancestors, are treated as a bag of words, as applied in document classification [34]. The IC value of each indirectly annotated GO term is multiplied by a coefficient which decreases with its distance from the directly annotated term. This coefficient differentiates the value of a GO term on the basis of whether the term is directly or indirectly annotated by a protein pair.
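As a concrete reading of Eqns. (2)-(3), the following sketch computes IC values from per-term direct annotation counts and a precomputed descendant map. The function and dictionary names (information_content, direct_counts, descendants) are illustrative assumptions, not part of the original method, and the root's own direct annotations are counted in the denominator.

import math

def information_content(direct_counts, descendants, root):
    # p(t): fraction of all annotations in the proteome falling on t or on
    # any descendant of t (Eqn. 3); IC(t) = -log p(t) (Eqn. 2).
    total = sum(direct_counts.get(c, 0) for c in descendants[root] | {root})
    ic = {}
    for t in descendants:                      # assumes every term has an entry
        covered = descendants[t] | {t}
        p_t = sum(direct_counts.get(d, 0) for d in covered) / total
        ic[t] = -math.log(p_t) if p_t > 0 else float("inf")
    return ic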
The article is organized as follows. In Section 2, the feature vector construction method is discussed in detail. The different datasets used in this article are described in Section 3. For performance evaluation of the new feature, a Gaussian kernel SVM is used in this article. The classifier performance is evaluated using the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC), precision, etc. These results are discussed in Section 4. Section 5 concludes the article.

2 A NEW FEATURE VECTOR USING GO TERMS

Annotated GO terms of a protein are extracted from the GO annotation database www.geneontology.org. Let a protein P be annotated by np GO terms, termset(P) = (t1, t2, ..., tnp), and let ancs(ti) = (ti, ti^1, ti^2, ..., ti^root) be the set of all ancestor terms of a term ti. The actual term set of the protein P is the union of the ancestors of all terms in termset(P), expressed as termset_all(P) = ∪_{i ∈ termset(P)} ancs(ti). A common approach to represent a GO term based feature in the vector space model is to use termset_all(P) to prepare a binary encoded vector of GO terms on the basis of the presence or absence of each term. The following section discusses the adopted approach for assigning weights to the members of the feature vector.
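A minimal sketch of building termset_all(P), assuming the GO graph is available as a child-to-parents map restricted to the is a and part of relations; the helper names are illustrative only.

def ancestors(term, parents):
    # All ancestors of `term`, the term itself included, following the
    # parent edges given as a dict: term -> set of parent terms.
    seen, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def termset_all(annotated_terms, parents):
    # Union of ancs(ti) over every directly annotated term of the protein.
    out = set()
    for t in annotated_terms:
        out |= ancestors(t, parents)
    return out

For example, termset_all({"GO:0006351"}, parents) returns that term together with every ancestor reachable through is a and part of edges.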
2.1 Computation of weights of GO terms

In this work, two types of weights for each GO term present in the protein feature are considered. The first one is a global significance of the term, which is represented by the IC value of the term. The second is a local weight of that term inside the protein pair (treated as a document).
For efficient computation of the weights of terms, the following steps are followed as specified in the Algorithm.

K((P1, P2), (P1', P2')) = k'(F(P1), F(P1')) + k'(F(P1), F(P2')) + k'(F(P2), F(P1')) + k'(F(P2), F(P2')),   (7)
where k'(a, b) = a * b^T is a linear kernel. Note that this kernel is symmetric in nature, and operates on the features of protein pairs directly. This becomes possible because of the combination approach adopted. This kernel can be used on top of a polynomial or Gaussian kernel to map the data into a more complex non-linear space.
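The surviving fragment of Eqn. (7), together with the definition of k', suggests a kernel computed directly from combined pair features. The sketch below is only an interpretation under that assumption: the element-wise sum used in pair_feature stands in for the paper's combination step (Eqns. 4-6 are not recoverable here), and the Gaussian variant mirrors the remark that the kernel can be composed with a non-linear one.

import numpy as np

def pair_feature(f_p1, f_p2):
    # Assumed combination of the two protein vectors into one pair vector;
    # the paper's exact weighted combination is not reproduced here.
    return f_p1 + f_p2

def linear_pair_kernel(fa, fb):
    # k'(a, b) = a * b^T, the linear kernel quoted in the text.
    return float(np.dot(fa, fb))

def gaussian_pair_kernel(fa, fb, gamma=0.01):
    # The same pair features fed to a Gaussian kernel for a non-linear map.
    d = fa - fb
    return float(np.exp(-gamma * np.dot(d, d)))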
2.3 Selection of induced GO terms

In this article we consider two different categories of induced GO term based features that were found to perform well in [30]. For each category, all the features are assigned certain weights. The first category of features comprises all annotated GO terms and their ancestor terms for a protein pair. Weights are assigned to all these terms using Eqn. 6. This feature is referred to as weighted all ancestors (WAA) in this article.

Instead of considering all annotated GO terms in the feature vector, the second category includes all induced terms up to the lowest common ancestor (ULCA) [30]. The induced GO terms are selected by the following method. For each pair of annotated GO terms ((ti, tj) : ti ∈ termset(P1), tj ∈ termset(P2)) of a protein pair (P1, P2), the terms located up to the lowest common ancestor (LCA) of (ti, tj), including ti and tj, are chosen. Finally, the union of the ULCA induced term sets resulting from all the term pairs ((ti, tj) : ti ∈ termset(P1), tj ∈ termset(P2)) is used for characterizing the protein pair (P1, P2). As in the case of WAA, weights are assigned to all these ULCA induced terms using Eqn. 6, where instead of all descendants, only the ULCA induced descendants are considered. This feature is called weighted ULCA (WULCA) in this article.
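A sketch of the ULCA induction described above, reusing the ancestors() helper from the earlier sketch. Here depth is an assumed precomputed map from each term to its depth in the GO graph, the deepest common ancestor is taken as the LCA, and the weighting step of Eqn. 6 is omitted because that equation is not reproduced here.

def ulca_terms(ti, tj, parents, depth):
    # Terms located up to the LCA of (ti, tj), including ti and tj themselves.
    anc_i, anc_j = ancestors(ti, parents), ancestors(tj, parents)
    lca = max(anc_i & anc_j, key=lambda t: depth[t])      # deepest shared ancestor
    return {t for t in anc_i | anc_j if lca in ancestors(t, parents)}

def wulca_termset(termset_p1, termset_p2, parents, depth):
    # Union of the ULCA-induced term sets over all annotated term pairs of (P1, P2).
    induced = set()
    for ti in termset_p1:
        for tj in termset_p2:
            induced |= ulca_terms(ti, tj, parents, depth)
    return induced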
3 DATASETS

3.1 Gene Ontology Data

We have constructed the GO graph from Gene Ontology omnibus data collected from www.geneontology.org [36]. Two types of relations among GO terms are considered in this article, is a and part of. We have collected the GO annotation datasets for the different species from the UniProt database and the Gene Ontology omnibus data. Ancestors of annotated GO terms are retrieved from the GO graph.

3.2 PPI Datasets for Different Species

PPI datasets for various species were collected from existing databases. Only those proteins that have GO annotations with respect to all three GO aspects (BP, MF and CC) are considered. Descriptions of the datasets are provided in the following paragraphs.

Ben-Hur et al. [14] datasets: Three yeast PPI datasets from [14] have been used in this article. The first one, referred to as BIND, was collected from the BIND database (10517 positive interactions and the same number of negative interactions). The second, referred to as Reliable-BIND, consisted of 750 reliable interactions filtered from the BIND database. Finally, the third one, referred to as DIP-MIPS, was collected from the DIP/MIPS database (4837 positive interactions and around 10000 negative interactions). In all three datasets, randomly selected protein pairs with no known interaction were used as the negative set.

Park's [16] dataset: The yeast PPI data in [16] has been used in this article. The dataset was prepared from the DIP core interaction dataset. It had a total of 3734 positive interactions and around 300,000 randomly selected negative interactions, so the positive to negative data ratio was about 1:100.

Maetschke et al. [30] datasets: The PPI datasets of S. cerevisiae (SC), H. sapiens (HS), E. coli (EC) and D. melanogaster (DM) as used in [30] have been taken in this article. These data sets were collected from the STRING database [37]. There are 15238, 3490, 1167 and 321 positive interactions for yeast, human, E. coli and D. melanogaster, respectively. For negative data, of the same size as the positive data, they used random protein pairs.

Yu et al. [38] dataset: A high confidence (HC) PPI dataset of human (15408 positive pairs), comprising interactions present in both the BioGRID and HPRD datasets, was considered [38]. They used two different types of negative datasets, namely random protein pairs and balanced random protein pairs. In a balanced random protein pair set, the number of times a protein appears in the negative dataset is the same as that in the positive dataset. However, note that Park et al. [18] have shown that this kind of negative data selection fails to simulate the behavior of the global protein pair population, which is better simulated by randomly selected negative data.

4 RESULTS AND DISCUSSION

In this section the PPI prediction performance using the proposed weighted GO induced term features (WAA and WULCA) is compared with that of a recently developed approach containing binary features (AA and ULCA) [30]. An SVM classifier with a Gaussian kernel (C-SVC of libsvm [39]) is used for classification. Features generated from the three ontologies (BP, CC and MF) are combined into a single vector in all the experiments.

Preparation of negative data is an important issue for PPI prediction. Various approaches proposed in the literature include protein pairs constructed from different cellular locations [14], involved in different biological processes [40], etc. However, this type of negative data selection criterion can make the dataset biased, especially when GO based features are used for PPI prediction. Most of the data sets considered in this article used randomly selected negative data. Only the data of Yu et al. [38] used a balanced random negative dataset.

Performance of the different approaches is reported using the measures area under the ROC curve (AUC-ROC) and ROC50 (AUC-ROC50), sensitivity and specificity. The ROC curve plots sensitivity versus (1 − specificity) at different thresholds of the classifier prediction score, and AUC-ROC measures the accuracy of the complete set of predictions. AUC-ROC50 is the area under the ROC curve up to the first fifty false positives. It measures the high confidence prediction performance of a classifier.
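One way to read the AUC-ROC50 measure is sketched below: the ROC area accumulated over the first fifty false positives, normalized so that a perfect ranking scores 1. The normalization is an assumption for illustration, not necessarily the exact convention used in the experiments.

def auc_roc50(scores, labels, max_fp=50):
    # Rank by decreasing prediction score; add one ROC column of height `tp`
    # for every false positive seen, stopping after `max_fp` of them.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    tp = fp = area = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
            area += tp
            if fp == max_fp:
                break
    return area / (max_fp * n_pos) if n_pos else 0.0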
4.1 Estimation of Parameters and Classifier Details

Two parameters, α1 and α2, are used for weighting the is a and part of relations, respectively, in Eqn. 4. More preference is given to the is a relation than to the part of relation in this article, because the is a relation is considered more direct than the part of relation and the former is also more frequent than the latter. A smaller weight for the part of relation contributes lower weight values to the terms linked by that relation, and the reverse happens for an is a relation link. For identifying appropriate values of these parameters, we have run a grid search over α1 and α2 in the range [0, 1] with a step size of 0.1. The AUC-ROC of the SVM classification is used to find good parameter values. A contour plot over this parameter range was drawn (see Fig. 1) for the WULCA feature on the S. cerevisiae dataset of Maetschke et al. [30]. It is found that the performance is best with α1 ∈ [0.7, 0.8] and α2 ∈ [0.5, 0.6]. For the other datasets and features it is observed that the same ranges of α1 and α2 perform best.
TABLE 1
Parameter values used in our experiments with the SVM classifier, obtained by cross validation using the proposed feature WULCA.

C-SVC of libsvm with a Gaussian kernel is used as the classifier. We have tested many other kernels, but we found that the Gaussian kernel performs the same as, and often better than, the other kernels for all features and all data sets. For the sake of brevity we have omitted the results with other kernels. A critical issue in using an SVM classifier is selecting parameters for optimized performance, which is determined by cross validation. C is the regularization parameter, i.e., it is used as the penalty for misclassification. Here we have kept the misclassification penalty for positive data high, because the random negative dataset is less reliable than the positives. In all our experiments the values of C for positive and negative data are set with a ratio of 9:1. We have found that a value of C ≤ 20 is effective for data sets larger than 5000 instances with a positive to negative data ratio of 1:1. Table 1 reports the C and γ values for the SVM.
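In scikit-learn, whose SVC wraps the same libsvm C-SVC, the asymmetric misclassification penalty can be expressed through class_weight, which scales C per class. The C and gamma below are placeholders rather than the values of Table 1.

from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=10.0, gamma=0.01,        # placeholder values
          class_weight={1: 9.0, 0: 1.0})           # 9:1 penalty, positives vs. negatives
# clf.fit(X_train, y_train)
# scores = clf.decision_function(X_test)           # thresholded to trace the ROC curve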
4.2 Result on Ben-Hur et al.'s dataset [14]

TABLE 2
AUC-ROC values for the Ben-Hur et al. [14] data sets. Results on the DIP-MIPS dataset using GO features do not use any threshold based filtering of negative data.

Table 2 reports the results on the data sets from [14]. The second column shows the results obtained in the authors' own work [14], where they used 5-fold cross validation (to keep parity, all the other results were also obtained using 5-fold cross validation). For BIND they used GO features with the non-sequence kernel, whereas for reliable-BIND a mixture of non-sequence kernels prepared from GO features as well as BLAST score and MCC features was used in their work [31]. For DIP-MIPS they used sequence spectrum count based features with a pairwise kernel. As can be seen, the SVM classifier trained with the proposed WULCA and WAA features performed the best (0.89) as compared to the AA (0.83) and ULCA (0.88) features (see Table 2). The result reported in [14] is the lowest. Among the proposed features, the ROC50 score (0.65) of the WULCA feature is also the best with respect to the other features. For the case of reliable-BIND, an AUC-ROC score of 0.95 was reported in [14]. From Table 2 we see that the proposed GO term based WULCA shows a marginal improvement (AUC 0.96). Note that in this experiment they [14] used some other types of features in addition to the GO similarity score feature in the non-sequence kernel. However, the proposed weighted GO term based feature is found to be strong enough to be combined with a simple kernel to provide superior performance. Again, both the WAA and WULCA encodings performed better than the AA and ULCA based approaches, respectively. This indicates that the proposed weighted features may have an advantage over the unweighted ones.

For the DIP-MIPS PPI data set, a pairwise kernel SVM with sequence spectrum count was used in [14]. They prepared the negative interaction dataset by filtering with a GO CC similarity threshold, i.e., the similarity of each negative interaction pair is lower than the threshold. As reported in the table, the AUC-ROC score varied from 0.87 to 0.97 when the threshold on GO CC similarity was varied from 0.50 to 0.04. In the other ...
(Figure: parameter contour plot; recovered axis label: Alpha2.)
(Figure: ROC curves comparing the WULCA and ULCA features; x-axis: false positive ratio.)

... because of the negative data thresholding, since protein pairs taken from different cellular components are less likely to interact.

... to negative training data ratio. The results are shown in Table 4. As expected, with increasing proportion of negative data, the classifier becomes biased toward the negatives, ...

Here the performance of the proposed as well as existing features for PPI prediction on the datasets of different species mentioned in Section 3 are compared. Results for ULCA, ... is consistently poor. Ten-fold cross validation results are reported in Table 5. As can be seen, the proposed WAA feature provides the best AUC and AUC50 scores for the dataset ...
TABLE 3
AUC-ROC values for the Park et al. [16] data sets with a 1:10 positive to negative ratio.

TABLE 4
AUC-ROC on the dataset used by Park et al. [16], tested with different positive to negative ratios.

TABLE 5
AUC-ROC values for the Maetschke et al. [30] data sets of S. cerevisiae, H. sapiens, E. coli and D. melanogaster.

TABLE 6
AUC-ROC values for the Yu et al. [38] data sets.
... and WULCA features are trained with five-fold cross validation, the AUC-ROC scores obtained were 0.80, 0.82 and 0.83, respectively (see Table 6). The ROC plot is provided in Fig. 3. With the balanced random negative dataset, the classifier achieved AUC-ROC scores of 0.64, 0.67 and 0.68 with the ULCA, WAA and WULCA features respectively, whereas the TTPK8000 kernel provides an AUC score of 0.60. From the results, we see that for both datasets the WULCA feature performs the best, especially for the balanced random negative dataset.

4.6 Discussion

In this work, a new GO-based feature representation method has been proposed for supervised PPI prediction. Each protein pair is represented by a weighted feature vector, where the weights are derived from the term specificities and the topological structure of the GO graph. An SVM classifier is trained on biologically validated interacting protein pairs, while using a randomly generated negative set. The trained classifier is used to predict protein-protein interactions. From the results of the various experiments, it is found that the proposed weighted feature vector representation (WULCA) performs better than the unweighted version (ULCA), and the improvement is statistically significant (see Table 7). In Table 2, the result on the BIND dataset represents the GO-based features trained with the non-sequence GO kernel [14], as mentioned in Section 1. The proposed WULCA feature achieved the best performance over the non-sequence GO kernel and the other features. For the reliable-BIND dataset, a combination of different non-sequence kernels (a mixture of kernels prepared from GO similarity and sequence homology scores) was used. Note that the GO based WULCA feature (0.96) is more discriminative than the non-sequence kernels (0.95). As mentioned earlier, several authors [31], [16], [41] have reported PPI prediction using sequence based features. From the results shown in Section 4 (see the results obtained on the DIP-MIPS, Park and high confidence positive datasets in Table 2, Table 3 and Table 6, respectively) it can be concluded that the performance of the weighted GO based features is better than that of the sequence signature based features for PPI prediction. From the analysis of Ben-Hur et al. [14] it is revealed that sequence based features improve the performance of a classifier for PPI prediction if they are combined with GO based features. Both types of features have a large number of dimensions, and computation of the kernel matrix for the full dataset using these features is computationally expensive. In this context the GO term based feature is computationally simple and its performance is better.

All three GO ontologies combined produce around 6800 and 17000 features for yeast and human, respectively.
TABLE 7
McNemar's test P-values for the significant performance improvement of the WULCA feature compared to ULCA on some of the datasets used in this article.
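For reference, the McNemar comparison reported in Table 7 can be computed from the two classifiers' per-pair predictions as sketched below (chi-square form with continuity correction, one degree of freedom); b and c are the counts of discordant pairs on which only one of WULCA/ULCA is correct.

import math

def mcnemar_p_value(b, c):
    # b: pairs classified correctly by WULCA but not by ULCA; c: the reverse.
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(chi2 / 2))     # two-sided p-value, chi^2 with 1 dof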
TABLE 8
AUC-ROC values using the Random Forest classifier.

TABLE 9
AUC-ROC values using different combinations of features from the sub-ontologies (BP, CC and MF) of the S. cerevisiae dataset [30], using the SVM classifier and 10-fold cross validation.

TABLE 10
Results simulating the effect of including (IEA+) and excluding (IEA-) IEA annotated GO terms in the feature vector. AUC-ROC values are reported for the SVM classifier.
With this large dimensionality and small number of instances, the use of SVM as the underlying classifier is proposed in this article. Note that we could also have used the RF classifier, but in such cases it has a greater chance of overfitting. The results of the RF classifier in Table 8 demonstrate that the proposed weighted GO feature performed better than the unweighted version. For understanding the effect of overfitting, we have evaluated the performance of both the RF and SVM classifiers on the training set itself (training error). In this experiment, for the S. cerevisiae dataset shown in Table 8, the RF classifier (AUC-ROC = 0.99) demonstrated a higher accuracy than the SVM classifier (AUC-ROC = 0.97) with the WULCA feature. During 10-fold cross validation testing, however, the RF classifier (AUC-ROC = 0.935) performed worse than the SVM (AUC-ROC = 0.95) for the WULCA feature. Similar behavior was observed for the other GO based features. These observations indicate that the RF classifier is prone to overfitting for PPI prediction with GO based feature vectors. Again, with a huge number of instances and an unbalanced positive to negative data ratio (often observed in the domain of PPI prediction), the performance of the RF classifier becomes unreliable and very time consuming as well. We have seen that SVM performs better than the RF classifier in the above mentioned cases. However, selection of parameter values is an important issue with the SVM classifier.

A reduced GO term set (i.e., the GO-slim term set) can be used for reducing the high dimensionality of GO features. Notably, according to the analysis of Maetschke et al. [30], a reduced GO term set affects the performance of a classifier. A naive classifier with less overhead in determining its parameter values, or a linear SVM (viz., LIBLINEAR) [42], can be easily trained; this is the advantage of using reduced dimensional data. Since there are millions of protein pairs in the dataset,
it is difficult to use the standard feature extraction methods (like PCA or KPCA) that require eigen decomposition. Hence, feature extraction methods for such large data sets can be developed in the future. For the purpose of analyzing the performance of reduced dimension GO features, we have trained the classifier using different combinations of features prepared from the three ontologies (BP, CC and MF). From the results shown in Table 9 it is revealed that, among single ontologies, features from the BP ontology alone achieved the best accuracy. The best performance was achieved with the merged features of all three ontologies. Merged features of BP and CC showed the performance nearest to the best one.

Some GO terms are annotated by automatic electronic inference (known as the IEA evidence code). An analysis was done to assess the effect on classifier performance of excluding IEA annotated terms from the feature vector. It can be seen from the results of Table 10 that the performance of the SVM classifier is not affected much by the exclusion of IEA annotated GO terms.

5 CONCLUSION

In this work we have proposed a new feature representation technique based on the annotated GO terms of a protein pair. A supervised classifier (SVM) is used for the purpose of predicting novel PPIs. Unsupervised approaches for PPI prediction use GO semantic similarity as a confidence value, which is less effective than the supervised approach. Features from GO terms have been developed earlier with the help of various inducer term sets including the ancestor terms. However, usually a binary (0-1) representation based on the presence or absence of terms has been used. In this work we have considered weighted feature values instead of binary values, where the weight is proportional to the global annotation statistics and the topological position of terms in a GO subgraph. Results demonstrate that the proposed inducer based weighted feature (WULCA) performs better than the simple binary inducer based approach for all the benchmark PPI datasets. Computation of the WAA feature vector is easy compared to the WULCA feature and its performance is also competitive. From the analysis in the article it can be concluded that GO term based features perform better than sequence based spectrum count features.

In the future, all types of existing relations in a GO graph may be considered, which will add more terms to the feature vector and may thereby increase performance. This feature can also be applied to different types of classification problems where GO semantic similarity is used, namely viral host protein interaction prediction, drug target prediction and discovery of functional similarity of protein motifs.

ACKNOWLEDGMENTS

REFERENCES

[1] F. Pazos, M. Helmer-Citterich, G. Ausiello, and A. Valencia, "Correlated mutations contain information about protein-protein interaction," Journal of Molecular Biology, vol. 271, no. 4, pp. 511–523, 1997.
[2] R. A. Craig and L. Liao, "Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices," BMC Bioinformatics, vol. 8, no. 1, p. 1, 2007.
[3] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis, "Protein interaction maps for complete genomes based on gene fusion events," Nature, vol. 402, no. 6757, pp. 86–90, 1999.
[4] T. Dandekar, B. Snel, M. Huynen, and P. Bork, "Conservation of gene order: a fingerprint of proteins that physically interact," Trends in Biochemical Sciences, vol. 23, no. 9, pp. 324–328, 1998.
[5] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, "Comparative assessment of large-scale data sets of protein–protein interactions," Nature, vol. 417, no. 6887, pp. 399–403, 2002.
[6] C. Huang, F. Morcos, S. P. Kanaan, S. Wuchty, D. Z. Chen, and J. A. Izaguirre, "Predicting protein-protein interactions from protein domains using a set cover approach," IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 4, no. 1, pp. 78–87, 2007.
[7] A. Birlutiu, F. d'Alche Buc, and T. Heskes, "A Bayesian framework for combining protein and network topology information for predicting protein-protein interactions," 2014.
[8] S. Wuchty, "Topology and weights in a protein domain interaction network–a novel way to predict protein interactions," BMC Genomics, vol. 7, no. 1, p. 122, 2006.
[9] R. Singh, J. Xu, and B. Berger, "Struct2net: Integrating structure into protein-protein interaction prediction," in Pacific Symposium on Biocomputing, vol. 11. World Scientific, 2006, pp. 403–414.
[10] R. Hosur, J. Xu, J. Bienkowska, and B. Berger, "iWRAP: an interface threading approach with application to prediction of cancer-related protein–protein interactions," Journal of Molecular Biology, vol. 405, no. 5, pp. 1295–1310, 2011.
[11] S. M. Gomez, W. S. Noble, and A. Rzhetsky, "Learning to predict protein–protein interactions from protein sequences," Bioinformatics, vol. 19, no. 15, pp. 1875–1881, 2003.
[12] S. Pitre, M. Alamgir, J. R. Green, M. Dumontier, F. Dehne, and A. Golshani, "Computational methods for predicting protein–protein interactions," in Protein–Protein Interaction. Springer, 2008, pp. 247–267.
[13] S. Martin, D. Roe, and J.-L. Faulon, "Predicting protein–protein interactions using signature products," Bioinformatics, vol. 21, no. 2, pp. 218–226, 2005.
[14] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, no. 4, March 2005.
[15] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang, "Predicting protein–protein interactions based only on sequences information," Proceedings of the National Academy of Sciences, vol. 104, no. 11, pp. 4337–4341, 2007.
[16] Y. Park, "Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences," BMC Bioinformatics, vol. 10, no. 1, p. 419, 2009.
[17] A. Ben-Hur and W. S. Noble, "Choosing negative examples for the prediction of protein-protein interactions," BMC Bioinformatics, vol. 7, no. Suppl 1, p. S2, 2006.
[18] Y. Park and E. M. Marcotte, "Revisiting the negative example sampling problem for predicting protein–protein interactions," Bioinformatics, vol. 27, no. 21, pp. 3024–3028, 2011.
[19] S. Pitre, F. Dehne, A. Chan, J. Cheetham, A. Duong, A. Emili, M. Gebbia, J. Greenblatt, M. Jessulat, N. Krogan et al., "PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs," BMC Bioinformatics, vol. 7, no. 1, p. 365, 2006.
[20] S. Pitre, C. North, M. Alamgir, M. Jessulat, A. Chan, X. Luo, J. Green, M. Dumontier, F. Dehne, and A. Golshani, "Global investigation of protein–protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences," Nucleic Acids Research, vol. 36, no. 13, pp. 4286–4294, 2008.
[21] Y. Guo, L. Yu, Z. Wen, and M. Li, "Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences," Nucleic Acids Research, vol. 36, no. 9, pp. 3025–3030, 2008.
[22] G. O. Consortium et al., "The Gene Ontology (GO) project in 2006," Nucleic Acids Research, vol. 34, no. suppl 1, pp. D322–D326, 2006.
[23] X. Wu, L. Zhu, J. Guo, D.-Y. Zhang, and K. Lin, "Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations," Nucleic Acids Research, vol. 34, no. 7, pp. 2137–2150, 2006.
[24] J. P. Miller, R. S. Lo, A. Ben-Hur, C. Desmarais, I. Stagljar, W. S. Noble, and S. Fields, "Large-scale identification of yeast integral membrane protein interactions," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 34, pp. 12123–12128, 2005.
[25] S. Bandyopadhyay and K. Mallick, "A new path based hybrid measure for gene ontology similarity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 116–127, 2014.
[26] D. Lin, "An information-theoretic definition of similarity," in ICML, vol. 98, 1998, pp. 296–304.
[27] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," arXiv preprint cmp-lg/9709008, 1997.
[28] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," arXiv preprint cmp-lg/9511007, 1995.
[29] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer, "A new measure for functional similarity of gene products based on gene ontology," BMC Bioinformatics, vol. 7, no. 1, p. 302, 2006.
[30] S. R. Maetschke, M. Simonsen, M. J. Davis, and M. A. Ragan, "Gene ontology-driven inference of protein–protein interactions using inducers," Bioinformatics, vol. 28, no. 1, pp. 69–75, 2012.
[31] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, suppl. 1, pp. i38–i46, 2005.
[32] O. Tastan, Y. Qi, J. G. Carbonell, and J. Klein-Seetharaman, "Prediction of interactions between HIV-1 and human proteins by information integration," in Pacific Symposium on Biocomputing, vol. 14. World Scientific, 2009, pp. 516–527.
[33] P. Resnik, "Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research, vol. 11, pp. 99–130, February 1999.
[34] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," The Journal of Machine Learning Research, vol. 2, pp. 139–154, 2002.
[35] B. Trstenjak, S. Mikac, and D. Donko, "KNN with TF-IDF based framework for text categorization," Procedia Engineering, vol. 69, pp. 1356–1364, 2014.
[36] G. O. Consortium et al., "Gene ontology annotations and resources," Nucleic Acids Research, vol. 41, no. D1, pp. D530–D535, 2013.
[37] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. von Mering et al., "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. D1, pp. D808–D815, 2013.
[38] J. Yu, M. Guo, C. J. Needham, Y. Huang, L. Cai, and D. R. Westhead, "Simple sequence-based kernels do not predict protein–protein interactions," Bioinformatics, vol. 26, no. 20, pp. 2610–2614, 2010.
[39] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[40] M.-G. Shi, J.-F. Xia, X.-L. Li, and D.-S. Huang, "Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset," Amino Acids, vol. 38, no. 3, pp. 891–899, 2010.
[41] J. Yu, M. Guo, C. J. Needham, Y. Huang, L. Cai, and D. R. Westhead, "Simple sequence-based kernels do not predict protein-protein interactions," Bioinformatics, vol. 26, no. 20, pp. 2610–2614, 2010.
[42] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

Sanghamitra Bandyopadhyay received her PhD in Computer Science in 1998 from the Indian Statistical Institute, Kolkata, India, where she currently serves as a Professor. She has received the prestigious S. S. Bhatnagar award in 2010, the Humboldt fellowship for experienced researchers, and the Senior Associateship of ICTP, Italy. She is a Fellow of the Indian National Academy of Engineering and the National Academy of Sciences, India. Dr. Bandyopadhyay has co-authored six books and more than 250 research papers. Her research interests include pattern recognition, data mining, evolutionary computing and bioinformatics.

Koushik Mallick received his M.E. degree in Computer Science and Engineering from Jadavpur University in 2009. He is pursuing his Ph.D. at Calcutta University while working at the Indian Statistical Institute, Kolkata, India. At present, he is an Assistant Professor at RCCIIT, Kolkata.