
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2555304, IEEE/ACM Transactions on Computational Biology and Bioinformatics

JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

A new feature vector based on Gene Ontology terms for protein-protein interaction prediction

Sanghamitra Bandyopadhyay, Senior Member, IEEE, Koushik Mallick

Abstract—Protein-protein interaction (PPI) plays a key role in understanding cellular mechanisms in different organisms. Many supervised classifiers, such as Random Forest (RF) and Support Vector Machine (SVM), have been used for intra- or inter-species interaction prediction. To improve prediction performance, in this paper we propose a novel set of features to represent a protein pair using their annotated Gene Ontology (GO) terms, including their ancestors. In our approach a protein pair is treated as a document (bag of words), where the terms annotating the two proteins represent the words. The feature value of each word is calculated as the information content of the corresponding term multiplied by a coefficient that represents the weight of the term inside a document (i.e., a protein pair). We have tested the performance of the classifier using the proposed features on well known datasets of different species, namely S. cerevisiae, H. sapiens, E. coli and D. melanogaster. Comparison with other GO based feature representation techniques demonstrates its competitive performance.

Index Terms—Protein interaction prediction, GO based feature, Kernel methods for PPI prediction.

1 INTRODUCTION

UNDERSTANDING the protein-protein interaction network is a challenging problem in computational biology. It is an important task for understanding the biological functions of gene products and for discovering the involvement of new proteins in different pathways. This will help in discovering the relationships between diseases and genes, and may result in the identification of new drug targets. Though various advanced high-throughput experimental assays provide a lot of interactions, the total number of explored interactions is still very small compared to the whole proteome. It is believed that proteins exhibit their functions by interacting with other proteins and that none of them works in isolation. So the total number of interactions is expected to be very large, although only a small fraction is known. Discovering new interactions through laboratory experiments is expensive and time consuming. For this reason computational methods have been applied to predict PPI. Various kinds of genomic and proteomic knowledge are used to build computational models for PPI prediction. Some state-of-the-art methods have used information derived from phylogenetic conservation and co-evolution [1], phylogenetic distance [2], gene fusion [3], co-localization of genes in a chromosome [4], gene expression profiles [5], protein domain interactions [6], network topology parameters [7], [8], protein structures [9], [10], etc. Protein motif domain information is used in a probabilistic way to infer interactions by Gomez et al. [11] and Huang et al. [6]. A fine review can be found in [12].

In [13], [14], [15], [16], etc., PPI prediction has been viewed as a supervised classification problem with features prepared from various proteomic information. High confidence interacting protein pairs, taken from some well known interaction database, are used as the positive class. However, identifying negative data is difficult, since not much information is available about protein pairs that do not interact [17]. A general strategy to build negative protein pairs is to select random protein pairs, excluding those that are known to interact [18]. For classification, Martin et al. [13] used an SVM with a pairwise kernel on protein sequence spectrum features. Features prepared from the different types of sequence signatures proposed in [15], [19], [20] and [21] were used to make a consensus decision using an SVM classifier [16]. Ben-hur et al. [14] used a combination of pairwise kernels, where multiple kernels were derived from different features obtained from various information sources, namely GO similarity, k-mer sequence signatures, protein motif domain data from the PFAM and Emotif databases, BLAST similarity scores of interacting proteins from other species, etc. Finally they combined these derived kernels to train an SVM classifier. An SVM classifier has also been used with features derived from the auto covariance of physio-chemical properties of amino acids in a protein sequence [21] and with conjoint triad based features [15] from protein sequences for PPI prediction.

GO is a controlled and structured vocabulary of terms that describe information about a protein's localization within the cell (i.e., cellular component or CC), participation in biological processes (BP) and associated molecular functions (MF) [22]. GO terms are connected by a directed acyclic graph (DAG), where the nodes of the graph represent terms and the edges among nodes represent relations. Commonly used types of relations between GO terms are is a, part of and regulates. In the DAG, child terms represent more specific biological concepts while parent terms represent more general concepts. Interacting proteins generally participate in similar BP and/or exhibit similar MF and/or are co-localized in similar CC [23], [24], and hence exhibit high GO semantic similarity [23]. Many approaches to measure GO semantic similarity exist, e.g., [25], [26], [27], [23],

• S. Bandyopadhyay and K. Mallick are with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India. Email: [email protected], [email protected]

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

[28] and [29]. Many of them have shown that high GO similarity among proteins generally indicates interactions between them, with the GO similarity value being considered as a measure of confidence of interaction. Therefore, interaction prediction based on a semantic similarity score is basically unsupervised in nature [30].

Ben-hur et al. [14] used a non-sequence kernel derived from GO similarity to train an SVM classifier for PPI prediction (the term non-sequence indicates that the kernel does not use any sequence information). The GO similarity score between two proteins was considered as input to the kernel function. In their experiments [31] they used a non-sequence kernel as a mixture of kernels derived from the GO similarity score, the BLAST score of orthologous interacting protein pairs and the mutual clustering coefficient derived from the PPI network. The GO based non-sequence kernel was expressed by the following equation [14]:

K_{non_seq}((p_1, p_2), (p'_1, p'_2)) = K'(S_GO(p_1, p_2), S_GO(p'_1, p'_2)),   (1)

where S_GO is a GO similarity score between two proteins. Here, for computing the kernel for two pairs of proteins (p_1, p_2) and (p'_1, p'_2), individual values of the GO similarity between p_1 (or p'_1) and p_2 (or p'_2) are used. The kernel does not explicitly compare the similarity between the two protein pairs with respect to their annotated GO term space. Tastan et al. [32] have used GO similarity values as features to predict viral-host protein interactions.

Instead of using the GO similarity value of protein pairs directly as features, Maetschke et al. [30] used GO terms in a binary feature vector for supervised learning. Random Forest and Bayesian classifiers were used in their work. Here, if a protein pair (p_1 and p_2) has experimental evidence about their interaction, then another protein pair (p'_1 and p'_2) having a similar GO term annotation profile as (p_1 and p_2) may also interact. Learning was done with feature vectors prepared from the GO terms annotating the two proteins, as well as their ancestors. For the purpose of feature representation, various combinations of ancestor terms were considered, viz., all ancestor terms of an annotated pair; terms that are common from the lowest common ancestor up to the root of an annotated pair; terms up to the lowest common ancestor of annotated term pairs of a protein pair; etc. Finally, the feature vector was expressed as a combined binary vector, based on the presence or absence of the terms selected by any of the above mentioned combination methods of induced terms for a protein pair (i.e., if the k-th term is present in the term set of a protein pair, then the k-th position of the feature vector is 1, otherwise 0). So if a particular proteome has a total of n unique GO term annotations (including all ancestor terms), then the dimension of the feature vector is n. The intuition behind this representation is to train a classifier using the associated GO terms of interacting and non-interacting protein pairs. However, the simple 0-1 based representation of features has a major problem: all the terms annotating a protein are represented by 1 regardless of where the terms are located. Note that terms appearing at different levels in the GO graph have different information content (IC) values [33]. IC is defined as the negative logarithm of the probability of occurrence of a GO term in a proteome. It is defined as follows [33]:

IC(t) = -log(p(t)),   (2)

p(t) = (annotation(t) + Σ_{d ∈ descendant(t)} annotation(d)) / Σ_{c ∈ descendant(root)} annotation(c),   (3)

where p(t) is the probability of occurrence of a term t, annotation(t) is the number of direct annotations of the term t in the considered proteome, and Σ_{d ∈ descendant(t)} annotation(d) is the number of indirect annotations by the descendants of the term t in the GO graph. The sum of the former two counts is divided by the total number of annotated terms in the proteome. GO terms with higher IC values indicate more specific biological concepts; representing all the terms by the same value of 1 may therefore be misleading. Another important issue is that a term may be annotated directly, or indirectly through any descendant of that term. Therefore, a good representation of features based on GO terms should consider the specificity of a term as well as the annotation depth of the indirectly annotated terms. In the approach proposed in this article, all annotated GO terms, including their ancestors, are treated as a bag of words, as applied in document classification [34]. The IC value of each indirectly annotated GO term is multiplied by a coefficient which decreases with its distance from the directly annotated term. This coefficient differentiates the value of a GO term on the basis of whether the term is directly or indirectly annotated by a protein pair.

The article is organized as follows. In Section 2, the feature vector construction method is discussed in detail. The different datasets used in this article are described in Section 3. For performance evaluation of the new feature, a Gaussian kernel SVM is used in this article. The classifier performance is evaluated using the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC), precision, etc. These results are discussed in Section 4. Section 5 concludes the article.

2 A NEW FEATURE VECTOR USING GO TERMS

Annotated GO terms of a protein are extracted from the GO annotation database www.geneontology.org. Let a protein P be annotated by n_p GO terms, termset(P) = (t_1, t_2, ..., t_{n_p}), and let ancs(t_i) = (t_i, t_i^1, t_i^2, ..., t_i^{root}) be all ancestor terms of a term t_i. The actual term set of the protein P is the union of the ancestors of all terms in termset(P), expressed as termset_all(P) = ∪_{i ∈ termset(P)} ancs(t_i). A common approach to represent a GO term based feature in a vector space model is to use termset_all(P) to prepare a binary encoded vector of GO terms on the basis of the presence or absence of a term. The following subsection discusses the adopted approach for assigning weights to the members of the feature vector.

2.1 Computation of weights of GO terms

In this work, two types of weights for each GO term present in the protein feature are considered. The first one is the global significance of the term, which is represented by the IC value of the term. The second is a local weight of that


term, i.e., the topological weight of the term generated from the annotation list of the protein. Let t_i be an annotated term, t_i ∈ termset(P), for a protein P. Then all terms in ancs(t_i) are assigned their corresponding IC values as their global weights. The local weight of a term is computed as follows. Let SP_{i,k} = (t_i, t_i^1, t_i^2, ..., t_i^{k-1}, t_i^k) be the shortest path from t_i to an ancestor t_i^k. Then the coefficient value c_i^k for t_i^k is computed as:

c_i^k = d_{t_i, t_i^1} * d_{t_i^1, t_i^2} * ... * d_{t_i^{k-1}, t_i^k},   (4)

where

d_{i,j} =
  0,    if t_i and t_j are not neighbours in SP_{i,k};
  α_1,  if Rel(t_i, t_j) = "is a";
  α_2,  if Rel(t_i, t_j) = "part of";
  0,    otherwise.

In this article, the ranges of α_1 and α_2 can be written in a combined way as 0 ≤ α_2 ≤ α_1 ≤ 1. The coefficient value is 1 for t_i, which is a directly annotated term. The coefficient c_i^k is multiplied with the IC value of t_i^k, for k = 1, 2, ..., root. This computation makes the feature values of directly and indirectly annotated terms different, with indirectly annotated terms necessarily having smaller values. Note that a term in the set ancs(t_i) will be multiplied by a coefficient which decreases with increasing length of the shortest path from t_i. In this way the coefficient values of all ancestor terms in the set ancs(t_i) are calculated. The weight of a term t_{ia} is calculated by the following Eqn. 5:

w_{t_{ia}} = IC(t_{ia}) * c_{t_{ia}}   (5)

The above weighting is repeated for every term in termset(P). If a term t_a is indirectly annotated by multiple (m) descendant terms (t_{a1}, t_{a2}, ..., t_{am}) ∈ termset_all(P), the coefficient value of the term t_a in the feature of protein P is the summation of the individual coefficients (c_{t_{ai}}) generated by each of those descendant annotated terms of the term t_a for protein P. The final weight w_{t_a} for the term t_a is computed as:

w_{t_a} = w'_{t_{a1}} + w'_{t_{a2}} + ... + w'_{t_{am}} = IC(t_a) * Σ_{i=1}^{m} c_{t_{ai}}.   (6)

Finally the weight values of the individual terms are placed in an n-dimensional feature vector for a protein P, where n is the number of unique GO terms over all proteins of a proteome. Representation of the feature vector for a pair of proteins (P_1, P_2) from the individual features of P_1 and P_2 is discussed in Subsection 2.2.

A protein is considered as a bag of words (GO terms). Notably, this feature representation is motivated by the tf-idf (term frequency-inverse document frequency) representation used for document classification [35]. In this regard, the coefficient (shown in Eqn. 4), which represents the weighted frequency of a term with respect to the annotations in a protein, is analogous to tf, whereas the IC value (IC(t_{ia})), the specificity of a term (i.e., word) with respect to the overall annotations in a proteome, is analogous to idf.

For efficient computation of the weights of terms, the steps specified in Algorithm 1 are followed.

Algorithm 1 Steps to compute the weight of features for proteins
1: For every GO term t in the Gene Ontology Omnibus Database make an adjacency-list graph structure (AL_GO(t)) for that term, with its immediate descendants.
2: Extract the annotated term set of each characterized protein (i.e., termset_BP(P_i) for protein P_i) for the BP ontology from an annotation dataset. The same is done for the CC and MF ontologies. Build a separate list of uniquely annotated terms l_BP for the BP ontology from the termset_BP(P_i). The same is done for the other ontologies (l_CC, l_MF) individually.
3: Extract the ancestor subgraph up to the root term for every term t_i ∈ l_BP starting from AL_GO(t_i). Compute the shortest path from t_i to all ancestor nodes in that subgraph and calculate the coefficients of those terms using Eqn. 4. Keep the term set of the ancestor subgraph with their coefficient values for the term t_i in a list named Ancs_BP(t_i). Repeat this step for all other ontology term lists (l_CC, l_MF).
4: For all terms t ∈ Ancs_BP(t_i) calculate the weight by using the coefficient value from Step 3 and Eqn. 6, and do the same for the other lists (Ancs_CC(t_i), Ancs_MF(t_i)).

2.2 Feature construction of protein pairs

For feature representation of a protein pair (P_1, P_2) one of two approaches is usually followed: (i) combination (F(P_1) ⊕ F(P_2)), or (ii) concatenation [F(P_1), F(P_2)]. Here F(P) is the feature vector corresponding to protein P, and the ⊕ operation adds the feature values of the two proteins element by element. Let the length of F(P) be n. Then the combination and concatenation operations produce feature vectors of length n and 2n, respectively. For the prediction of PPI using GO features, a protein pair feature representation using concatenation is not preferred because with this approach [F(P_1), F(P_2)] and [F(P_2), F(P_1)] do not have the same representation in feature space. Consequently a different kernel value is computed for a different ordering of the concatenation of the protein features during SVM optimization. In contrast, a feature using the combination approach is independent of the ordering of a protein pair. Generally two interacting proteins share similar types of GO terms in the ontology (i.e., they generally participate in similar biological processes and molecular functions, and/or may be co-localized in the same cellular component). This is the key intuition for representing a pair of proteins as a document with a bag of words under the combination approach, which is the approach used in this article. Once the features of a protein pair are determined, an SVM classifier is trained on PPI data. The direct pairwise kernel K of two combined vectors (F(P_1) ⊕ F(P_2)) and (F(P'_1) ⊕ F(P'_2)) is given by the following equation:

K(F(P_1) ⊕ F(P_2), F(P'_1) ⊕ F(P'_2)) = k'(F(P_1), F(P'_1)) + k'(F(P_1), F(P'_2)) + k'(F(P_2), F(P'_1)) + k'(F(P_2), F(P'_2)),   (7)


where k'(a, b) = a * b^T is a linear kernel. Note that this kernel is symmetric in nature, and operates on the features of protein pairs directly. This becomes possible because of the combination approach adopted. This kernel can be used on top of a polynomial or Gaussian kernel to map the data into a more complex non-linear space.

2.3 Selection of induced GO terms

In this article we consider two different categories of induced GO term based features that were found to perform well in [30]. For each category, all the features are assigned weights. The first category comprises all annotated GO terms and their ancestor terms for a protein pair. Weights are assigned to all these terms using Eqn. 6. This feature is referred to as weighted all ancestors, or WAA, in this article.

Instead of considering all annotated GO terms in the feature vector, the second category includes all induced terms up to the lowest common ancestor (ULCA) [30]. The induced GO terms are selected by the following method. For each pair of annotated GO terms ((t_i, t_j): t_i ∈ termset(P_1), t_j ∈ termset(P_2)) for a protein pair (P_1, P_2), the terms located up to the lowest common ancestor (LCA) of (t_i, t_j), including t_i and t_j, are chosen. Finally the union of the ULCA induced term sets resulting from all the term pairs ((t_i, t_j): t_i ∈ termset(P_1), t_j ∈ termset(P_2)) is used for characterizing the protein pair (P_1, P_2). As in the case of WAA, weights are assigned to all these ULCA induced terms using Eqn. 6, where instead of all descendants, only the ULCA induced descendants are considered. This feature is called weighted ULCA, or WULCA, in this article.

3 DATASETS

3.1 Gene Ontology Data

We have constructed the GO graph from Gene Ontology omnibus data collected from www.geneontology.org [36]. Two types of relations among GO terms are considered in this article, is a and part of. We have collected GO annotation datasets for different species from the Uniprot database and gene ontology omnibus data. Ancestors of annotated GO terms are retrieved from the GO graph.

3.2 PPI datasets for Different Species

PPI datasets for various species were collected from existing databases. Only those proteins that have GO annotations with respect to all three GO aspects (BP, MF and CC) are considered. Descriptions of the datasets are provided in the following paragraphs.

Ben-hur et al. [14] datasets: Three yeast PPI datasets from [14] have been used in this article. The first one, referred to as BIND, was collected from the BIND database (10517 positive interactions and the same number of negative interactions). The second, referred to as Reliable-BIND, consisted of 750 reliable interactions filtered from the BIND database. Finally, the third one, referred to as DIP-MIPS, was collected from the DIP/MIPS database (4837 positive interactions and around 10000 negative interactions). In all three datasets, randomly selected protein pairs with no known interaction were used as the negative set.

Park's [16] dataset: The yeast PPI data in [16] has been used in this article. The dataset was prepared from the DIP core interaction dataset. It had a total of 3734 positive interactions and around 300,000 randomly selected negative interactions, so the positive to negative data ratio was about 1:100.

Maetschke et al. [30] datasets: The PPI datasets of S. cerevisiae (SC), H. sapiens (HS), E. coli (EC) and D. melanogaster (DM) as used in [30] have been taken in this article. These datasets were collected from the STRING [37] database. There are 15238, 3490, 1167 and 321 positive interactions for yeast, human, E. coli and D. melanogaster, respectively. For negative data, of the same size as the positive data, they used random protein pairs.

Yu et al. [38] dataset: A high confidence (HC) PPI dataset (15408 positive pairs) of human, comprising interactions present in both the BioGRID and HPRD datasets, was considered [38]. They used two different types of negative datasets, namely random pairs of proteins and balanced random pairs of proteins. In a balanced random protein pair set, the number of times a protein appears in the negative dataset is the same as in the positive dataset. However, note that Park et al. [18] have shown that this kind of negative data selection fails to simulate the behavior of the global protein pair population, which is better simulated by randomly selected negative data.

4 RESULTS AND DISCUSSION

In this section the PPI prediction performance using the proposed weighted GO induced term features (WAA and WULCA) is compared with that of a recently developed approach using binary features (AA and ULCA) [30]. An SVM classifier with Gaussian kernel (C-SVC of libsvm [39]) is used for classification. Features generated from the three ontologies (BP, CC and MF) are combined into a single vector in all the experiments.

Preparation of negative data is an important issue for PPI prediction. Various approaches proposed in the literature include protein pairs constructed from different cellular locations [14], or involved in different biological processes [40], etc. However, this type of negative data selection criterion can make the dataset biased, especially when GO based features are used for PPI prediction. Most of the datasets considered in this article use randomly selected negative data. Only the data of Yu et al. [38] used a balanced random negative dataset.

Performance of the different approaches is reported using the measures area under the ROC curve (AUC-ROC) and ROC50 (AUC-ROC50), sensitivity and specificity. The ROC curve plots the sensitivity versus (1-specificity) at different thresholds of the classifier prediction score; AUC-ROC measures the accuracy of the complete set of predictions. AUC-ROC50 is the area under the ROC curve up to the first fifty false positives. It measures the high confidence prediction performance of a classifier.

4.1 Estimation of Parameters and Classifier Details

Two parameters, α_1 and α_2, are used for weighting the is a and part of relations, respectively, in Eqn. 4. More preference is given to the is a relation than to the part of relation in this article, because the is a relation is considered more


TABLE 1
Parameter value used in our experiments using SVM classifier obtained by cross validation using proposed feature WULCA

Author name Data set name C Gamma


Yu et al. High confidence positive with random negative 20 0.02
Ben-hur DIP-MIPS 47 0.3
BIND 25 0.23
Park Yeast PPI 10 0.3
Meatche et al. S. Cerevisiae 10 0.1
H. Sapiens 10 0.03

TABLE 2
AUC-ROC values for Ben-hur et al [14] data sets. Results of DIP-MIPS dataset using GO features do not use any threshold based filtering of
negative data.

Dataset Ben-hur et al. [14] AA Proposed ULCA Proposed


name WAA WULCA
AUC/AUC50 AUC/AUC50 AUC/AUC50 AUC/AUC50 AUC/AUC50
BIND 0.68/- 0.83/0.57 0.89/0.64 0.88/0.63 0.89/0.65
Reliable 0.95/- 0.89/0.58 0.94/0.61 0.94/0.60 0.96/0.62
BIND
DIP- 0.87-0.97/0.08-0.46 0.88/0.46 0.93/0.51 0.92/0.49 0.93/0.52
MIPS (negative pair
threshold range
0.5-0.04)

direct than the part of relation and former is also highly own work [14], where they used 5-fold cross validation
frequent than the latter. Smaller weight value of part of (to keep parity all the other results were also obtained
relation will contribute lower weight values to the terms using 5-fold cross validation). For BIND they have used
those are linked by that relation and reverse will happen for GO features using non-sequence kernel whereas for reliable-
an is a relation link. For identifying the appropriate values BIND a mixture of non-sequence kernels prepared from GO
of these parameters, we have run the grid search for two features as well as blast score and MCC features are used
parameters α1 and α2 in the range [0,1] with a step size of 0.1. The ROC of the classification using SVM is used to find good values of the parameters. A contour plot (see Fig. 1) was drawn over this range of parameters for the WULCA feature on the S. Cerevisiae dataset of Maetschke et al. [30]. It is found that the performance is best with α1 ∈ [0.7,0.8] and α2 ∈ [0.5,0.6]. For the other datasets and features it is observed that the same range of values of α1 and α2 performed best.

The C-SVC of libsvm with a Gaussian kernel is used as the classifier. We have tested many other kernels, but we found that the Gaussian kernel performs the same as, and often better than, the other kernels for all features and all data sets. For the sake of brevity we have omitted the results with other kernels. A critical issue in using an SVM classifier is selecting the parameters for optimal performance, which is determined by cross validation. C is the regularization parameter, used as the penalty for misclassification. Here we have kept the misclassification penalty for positive data high, because the random negative dataset is less reliable than the positives. In all our experiments the values of C for positive and negative data are kept in a 9:1 ratio. We have found that a value of C ≤ 20 is effective for data sets larger than 5000 instances with a positive to negative data ratio of 1:1. Table 1 reports the C and γ values for the SVM.

4.2 Result on Ben-Hur et al.'s dataset [14]
Table 2 reports the results on the data set from [14]. The second column shows the results obtained by the authors in their work [31]. For DIP-MIPS they used sequence spectrum count based features with a pairwise kernel. As can be seen, the SVM classifier trained with the proposed WULCA and WAA features performed the best (0.89) compared to the AA (0.83) and ULCA (0.88) features (see Table 2). The result reported in [14] is the lowest. Among the proposed features, the ROC50 score (0.65) of the WULCA feature is also the best with respect to the other features. For the case of reliable-BIND, an AUC-ROC score of 0.95 was reported in [14]. From Table 2 we see that the proposed GO term based WULCA feature shows a marginal improvement (AUC 0.96). Note that in this experiment the authors of [14] used some other types of features in addition to the GO similarity score feature in the non-sequence kernel. However, the proposed weighted GO term based feature is found to be strong enough to be combined with a simple kernel to provide superior performance. Again, both the WAA and WULCA encodings performed better than the AA and ULCA based approaches, respectively. This indicates that the proposed weighted features may have an advantage over the unweighted ones.

For the DIP-MIPS PPI data set, a pairwise kernel SVM with sequence spectrum counts was used in [14]. They prepared the negative interaction dataset by filtering with a GO CC similarity threshold, i.e., the similarity of each negative interaction pair is lower than the threshold. As reported in the table, the AUC-ROC score varied from 0.87 to 0.97 as the threshold on GO CC similarity was varied from 0.50 to 0.04. In the other
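The class-weighted SVM configuration described above can be sketched as follows. This is an illustrative reconstruction using scikit-learn's SVC (which wraps libsvm); the toy data and the C and γ values are placeholders, not the paper's actual settings.

```python
# Sketch of the class-weighted C-SVC setup described above:
# a Gaussian (RBF) kernel SVM whose misclassification penalty
# for the positive class is 9x that of the negative class,
# since the random negative set is less reliable.
# Toy data and parameter values are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, scale=0.5, size=(50, 10))
X_neg = rng.normal(loc=-1.0, scale=0.5, size=(50, 10))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 50)

# class_weight scales C per class: C_pos : C_neg = 9 : 1
clf = SVC(kernel="rbf", C=10.0, gamma=0.1,
          class_weight={1: 9.0, 0: 1.0})
clf.fit(X, y)
print(clf.score(X, y))
```

In libsvm's command-line tools the same 9:1 penalty ratio would be set with the per-class weight (-wi) options; the dict passed to class_weight plays that role here.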

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2555304, IEEE/ACM
Transactions on Computational Biology and Bioinformatics
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 6

approaches, no such thresholding of the negative dataset was used. Both of the proposed WAA and WULCA features with the Gaussian SVM kernel achieved an AUC-ROC score of 0.93, but the proposed WULCA feature demonstrated superior performance with respect to the WAA feature in terms of the ROC50 score. Again, it was observed that the weighted features always outperformed the unweighted ones. As expected, the results reported in [14] were somewhat better because of the negative data thresholding, since protein pairs taken from different cellular components are less likely to interact.

Fig. 1. Contour plot of the cross validation AUC score for different values of the parameters α1 and α2, used as the weights of the is a and part of relations, on the S. Cerevisiae dataset of Maetschke et al. [30]. The horizontal axis shows α1 and the vertical axis shows α2.

4.3 Result on yeast dataset of Park [16]
Here, the effectiveness of the proposed weighted features compared to other unweighted GO based features for PPI prediction on the yeast dataset of Park [16] is discussed. In that work, the author combined the results of four SVMs, each using a different set of sequence based features, in an ensembling approach. Four-fold cross validation was used.

Table 3 shows the performance of the approach of Park [16] along with those of the other GO-based features using SVM as the underlying classifier. It is observed that the proposed WULCA feature (AUC-ROC = 0.93) again outperforms all others, including Park's approach. The WAA feature (AUC-ROC = 0.926) performed very similarly to WULCA. Moreover, the ULCA inducer based feature (AUC-ROC = 0.917) outperforms the AA inducer based feature (AUC-ROC = 0.85) by a large margin. This was also observed in [30], where a Random Forests classifier was used.

A further experiment was carried out on the yeast dataset using the WULCA features by varying the positive to negative training data ratio. The results are shown in Table 4. As expected, with an increasing proportion of negative data, the classifier becomes biased toward the negatives, with more positive data getting misclassified as negative. Consequently, the specificity (true negative rate) increases while the sensitivity (true positive rate) decreases. However, the AUC scores remain more or less stable, showing very little change.

Fig. 2. ROC plot using SVM classifier with WULCA and ULCA features on the S. Cerevisiae data set of Maetschke et al. [30].

4.4 Result on Maetschke et al.'s dataset [30]
Here the performance of the proposed as well as the existing features for PPI prediction on datasets of different species, as mentioned in Section 3, is compared. Results for the ULCA and the proposed WAA and WULCA features only are included; results for AA are not shown because its performance is consistently poor. Ten fold cross validation results are reported in Table 5. As can be seen, the proposed WAA feature provides the best AUC and AUC50 scores for the H. Sapiens dataset. For the rest of the cases the WULCA feature shows the best performance, followed by the WAA feature with a very close margin. For illustration, the ROC plot for the S. Cerevisiae dataset is shown in Fig. 2.

Fig. 3. ROC plot using SVM classifier with WULCA and ULCA features on the human PPI data set of Yu et al. [38] with random negative data.

4.5 Result on Yu et al.'s dataset [38]
This dataset contains high confidence human protein interaction pairs and two types of negative datasets, namely random and balanced random, as mentioned in Section 3. In their work [38], the authors used protein sequence based features by counting the amino acid trigrams, resulting in a set of 8000 features. A tensor product pairwise kernel (TPPK8000) [13] is used with an SVM classifier and trained with five-fold cross validation. With the random negative dataset, an AUC score of 0.82 was achieved with the TPPK8000 kernel. Using the SVM classifier with the ULCA, proposed WAA


TABLE 3
AUC-ROC values for the Park [16] data set with 1:10 positive to negative ratio

Park AA WAA ULCA WULCA


AUC/AUC50 AUC/AUC50 AUC/AUC50 AUC/AUC50 AUC/AUC50
0.85/- 0.85/0.54 0.926/0.58 0.917/0.58 0.93/0.59

TABLE 4
AUC-ROC on the dataset of Park [16] tested with different positive to negative ratios

Data Ratio WULCA WAA ULCA


Positive:negative AUC/std SN SP AUC/std SN SP AUC/std SN SP
1:1 0.930/0.02 0.83 0.85 0.926/0.02 0.82 0.85 0.917/0.01 0.80 0.82
1:5 0.932/0.03 0.82 0.89 0.928/0.01 0.83 0.89 0.909/0.03 0.79 0.85
1:10 0.931/0.01 0.78 0.95 0.926/0.02 0.76 0.94 0.903/0.01 0.76 0.89
1:100 0.928/0.02 0.76 0.98 0.919/0.03 0.74 0.97 0.894/0.02 0.75 0.95

TABLE 5
AUC-ROC values for Maetschke et al. [30] data sets of S. Cerevisiae, H. Sapiens, E. Coli and D. Melanogaster

Species Proposed WULCA Proposed WAA ULCA


AUC/ AUC50 AUC/ AUC50 AUC/ AUC50
S. Cerevisiae 0.95 / 0.64 0.95/0.63 0.92 / 0.630
H. Sapiens 0.93 /0.60 0.95/0.63 0.90 / 0.57
E. Coli 0.96 / 0.60 0.96/0.59 0.93 / 0.58
D. Melanogaster 0.86 / 0.52 0.85/0.51 0.82 / 0.50
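The AUC50 (ROC50) scores reported alongside AUC truncate the ROC curve at the first 50 false positives. Below is a minimal sketch of one common form of that computation; it is my own implementation, and the normalization is an assumption, since the paper does not give its exact formula.

```python
# Sketch of an AUC50 (ROC50)-style score: area under the ROC
# curve truncated at the first 50 false positives, normalized
# so that a perfect ranking scores 1.0. The exact normalization
# used in the paper is not given, so this is an assumption.
import numpy as np

def roc_n(y_true, scores, n_fp=50):
    order = np.argsort(-np.asarray(scores))  # rank by descending score
    y = np.asarray(y_true)[order]
    n_pos = int(np.sum(y == 1))
    tp = fp = 0
    area = 0.0
    for label in y:
        if label == 1:
            tp += 1
        else:
            fp += 1
            area += tp          # one column of height tp per false positive
            if fp == n_fp:
                break
    return area / (n_pos * min(n_fp, max(fp, 1)))

# all positives ranked above all negatives -> perfect score
print(roc_n([1] * 5 + [0] * 60, np.arange(65, 0, -1)))  # 1.0
```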

TABLE 6
AUC-ROC values for Yu et al. [38] data sets.

High confidence positive Yu et al. (TPPK8000) ULCA WAA WULCA


Random negative 0.82 0.80 0.82 0.83
Balanced Random negative 0.60 0.64 0.67 0.68

and WULCA features trained with five fold cross validation, the AUC-ROC scores obtained were 0.80, 0.82 and 0.83, respectively (see Table 6). The ROC plot is provided in Fig. 3. With the balanced random negative dataset, the classifier achieved AUC-ROC scores of 0.64, 0.67 and 0.68 with the ULCA, WAA and WULCA features respectively, whereas the TPPK8000 kernel provides an AUC score of 0.60. From the results, we see that for both datasets the WULCA feature performs the best, especially for the balanced random negative dataset.

4.6 Discussion
In this work, a new GO-based feature representation method has been proposed for supervised PPI prediction. Each protein pair is represented by a weighted feature vector, where the weights are derived from the term specificities and the topological structure of the GO graph. An SVM classifier is trained on biologically validated interacting protein pairs, while using a randomly generated negative set. The trained classifier is used to predict protein-protein interactions. From the results of the various experiments, it is found that the proposed weighted feature vector representation (WULCA) performs better than the unweighted version (ULCA), and the improvement is statistically significant (see Table 7).

In Table 2, the results on the BIND dataset compare the GO-based features with the non-sequence GO kernel of [14], as mentioned in Section 1. The proposed WULCA feature achieved the best performance over the non-sequence GO kernel and the other features. For the reliable-BIND dataset, a combination of different non-sequence kernels (a mixture of kernels prepared from GO similarity and sequence homology scores) was used. Note that the GO based WULCA feature (0.96) is more discriminative than the non-sequence kernels (0.95). As mentioned earlier, several authors [31], [16], [41] have reported PPI prediction using sequence based features. From the results shown in Section 4 (see the results obtained on the DIP-MIPS, Park and High confidence positive datasets in Table 2, Table 3 and Table 6, respectively) it can be concluded that the weighted GO based features perform better than the sequence signature based features for PPI prediction. The analysis of Ben-Hur et al. [14] reveals that sequence based features improve the performance of a classifier for PPI prediction when combined with GO based features. Both kinds of features have a large number of dimensions, and computation of the kernel matrix for the full dataset using them is computationally expensive. In this context the GO term based feature is computationally simple and performs better.

All three GO sub-ontologies combined produce around 6800 and 17000 features for yeast and human, respec-


TABLE 7
McNemar's test p-values showing the significance of the performance improvement of the WULCA feature over ULCA on some of the datasets used in this article.

Author name Data set name WULCA Vs ULCA


Yu et al. High confidence positive with random negative 2.4 × 10−9
Ben-hur DIP-MIPS 1.9 × 10−4
BIND 1.3 × 10−5
Park Yeast PPI 4.7 × 10−18
Maetschke et al. S. Cerevisiae 2.3 × 10−21
H. Sapiens 5.4 × 10−22
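The p-values in Table 7 come from McNemar's test, which compares two classifiers using only the examples that exactly one of them classifies correctly. Below is a small sketch using the exact two-sided binomial form; the prediction vectors are made up, and the paper's exact variant of the test may differ.

```python
# Sketch of the McNemar test behind Table 7: compare two
# classifiers via the counts of examples that exactly one of
# them gets right (the discordant pairs). Exact two-sided
# binomial form; the vectors below are a made-up example.
import math

def mcnemar_p(correct_a, correct_b):
    n01 = sum((not a) and b for a, b in zip(correct_a, correct_b))  # only B right
    n10 = sum(a and (not b) for a, b in zip(correct_a, correct_b))  # only A right
    n = n01 + n10
    if n == 0:
        return 1.0
    k = min(n01, n10)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for two sides
    cdf = sum(math.comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * cdf)

# classifier A correct on all 200 cases, B wrong on 40 of them
correct_a = [True] * 200
correct_b = [True] * 160 + [False] * 40
print(mcnemar_p(correct_a, correct_b))  # very small p-value
```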

TABLE 8
AUC-ROC values using the Random Forest classifier

Dataset AA WAA WULCA ULCA


S. Cerevisiae, Maetschke et al. [30] 0.88 0.93 0.935 0.90
Human HC positive with random negative, Yu et al. [38] 0.74 0.798 0.80 0.77
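The training-error check used alongside Table 8, comparing a classifier's AUC on its own training set with its cross-validated AUC, can be sketched as follows (toy data, not the PPI datasets; a large gap between the two scores suggests overfitting).

```python
# Sketch of the overfitting check applied to the RF classifier:
# compare AUC on the training set itself with cross-validated
# AUC. Toy data only; a large gap indicates overfitting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
cv_auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
print(train_auc, cv_auc)  # training AUC is typically much higher
```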

TABLE 9
AUC-ROC values using different combinations of features from the sub-ontologies (BP, CC and MF) on the S. Cerevisiae dataset [30], using an SVM classifier and 10 fold cross validation

Combination of ontologies WAA WULCA ULCA


BP 0.936 0.94 0.90
CC 0.927 0.928 0.88
MF 0.83 0.84 0.78
(BP, CC) 0.942 0.947 0.912
(BP, MF) 0.927 0.93 0.90
(MF, CC) 0.90 0.91 0.88
(BP, CC, MF) 0.95 0.95 0.92

TABLE 10
Results showing the effect of including (IEA+) and excluding (IEA-) IEA annotated GO terms in the feature vector. AUC-ROC values are reported for the SVM classifier.

Data set IEA+ IEA-


WAA WULCA ULCA WAA WULCA ULCA
Human HC positive with random negative, Yu et al. [38] 0.82 0.83 0.80 0.817 0.826 0.792
S. Cerevisiae, Maetschke et al. [30] 0.95 0.95 0.92 0.946 0.943 0.917
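The IEA+/IEA- comparison in Table 10 amounts to building each protein's term set with and without electronically inferred annotations. A minimal sketch follows; the protein IDs, GO terms and evidence codes are made-up examples.

```python
# Sketch of the IEA+/IEA- comparison in Table 10: build the
# GO term set for a protein with and without electronically
# inferred (IEA) annotations. Records below are made up.
annotations = [
    ("P12345", "GO:0005634", "IDA"),
    ("P12345", "GO:0003677", "IEA"),
    ("Q99999", "GO:0005737", "IEA"),
    ("Q99999", "GO:0006396", "EXP"),
]

def terms_for(protein, records, include_iea=True):
    """Annotated GO terms for one protein, optionally dropping IEA."""
    return {go for p, go, code in records
            if p == protein and (include_iea or code != "IEA")}

print(terms_for("P12345", annotations))                      # IEA+
print(terms_for("P12345", annotations, include_iea=False))   # IEA-
```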

tively. With this large dimensionality and small number of instances, the use of SVM as the underlying classifier is proposed in this article. Note that we could also have used the RF classifier here, but in such cases it has a greater chance of overfitting. The RF results in Table 8 demonstrate that the proposed weighted GO features again performed better than the unweighted versions. To understand the effect of overfitting, we evaluated the performance of both the RF and SVM classifiers on the training set itself (training error). In this experiment, for the S. Cerevisiae dataset shown in Table 8, the RF classifier (AUC-ROC = 0.99) demonstrated a higher accuracy than the SVM classifier (AUC-ROC = 0.97) with the WULCA feature. During the 10 fold cross validation testing, however, the RF classifier (AUC-ROC = 0.935) performed worse than the SVM (AUC-ROC = 0.95) for the WULCA feature. Similar behavior was observed for the other GO based features. These observations exhibit that the RF classifier is prone to overfitting for PPI prediction with GO based feature vectors. Again, with a huge number of instances and an unbalanced positive to negative data ratio (often observed in the domain of PPI prediction), the performance of the RF classifier becomes unreliable and very time consuming as well. We have seen that SVM performs better than the RF classifier for the above mentioned cases. However, selection of parameter values is an important issue with the SVM classifier.

A reduced GO term set (i.e., the GO-slim term set) can be used for reducing the high dimensionality of the GO features. Notably, according to the analysis of Maetschke et al. [30], a reduced GO term set affects the performance of a classifier. A naive classifier with less overhead in determining its parameter values, or a linear SVM (viz., liblinear) [42], can be easily trained. This is the advantage of using reduced dimensional data. Since there are millions of protein pairs


in the dataset, it is difficult to use standard feature extraction methods (like PCA or KPCA) that require eigen decomposition. Hence, feature extraction methods for such large data sets can be developed in future. To analyze the performance of reduced dimension GO features, we trained the classifier using different combinations of the features prepared from the three ontologies (BP, CC and MF). The results shown in Table 9 reveal that, of the single ontologies, features from the BP ontology alone achieve the best accuracy. The best overall performance was achieved with the merged features of all three ontologies, and the merged features of BP and CC come closest to it.

Some GO terms are annotated by automatic electronic inference (known as the IEA evidence code). An analysis was carried out to assess the effect on classifier performance of excluding IEA annotated terms from the feature vector. As can be seen from the results in Table 10, the performance of the SVM classifier is not affected much by the exclusion of IEA annotated GO terms.

5 CONCLUSION
In this work we have proposed a new feature representation technique based on the annotated GO terms of a protein pair. A supervised classifier (SVM) is used for predicting novel PPIs. Unsupervised approaches to PPI prediction use GO semantic similarity as a confidence value, which is less effective than the supervised approach. Features from GO terms have been developed earlier with the help of various inducer term sets, including the ancestor terms. However, usually a binary (0-1) representation based on the presence or absence of terms has been used. In this work we have instead considered weighted feature values, where the weight is proportional to the global annotation statistics and the topological position of the terms in a GO subgraph. Results demonstrate that the proposed inducer based weighted feature (WULCA) performs better than the simple binary inducer based approach on all the benchmark PPI datasets. Computation of the WAA feature vector is easier than that of the WULCA feature, and its performance is also competitive. From the analysis in the article it can be concluded that GO term based features perform better than sequence based spectrum count features.

In future, all types of existing relations in a GO graph may be considered, which will add more terms to the feature vector and may thereby improve performance. The feature can also be applied to other classification problems where GO semantic similarity is used, namely viral host protein interaction prediction, drug target prediction and discovery of functional similarity of protein motifs.

ACKNOWLEDGMENTS

REFERENCES
[1] F. Pazos, M. Helmer-Citterich, G. Ausiello, and A. Valencia, "Correlated mutations contain information about protein-protein interaction," Journal of Molecular Biology, vol. 271, no. 4, pp. 511–523, 1997.
[2] R. A. Craig and L. Liao, "Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices," BMC Bioinformatics, vol. 8, no. 1, p. 1, 2007.
[3] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis, "Protein interaction maps for complete genomes based on gene fusion events," Nature, vol. 402, no. 6757, pp. 86–90, 1999.
[4] T. Dandekar, B. Snel, M. Huynen, and P. Bork, "Conservation of gene order: a fingerprint of proteins that physically interact," Trends in Biochemical Sciences, vol. 23, no. 9, pp. 324–328, 1998.
[5] C. Von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, "Comparative assessment of large-scale data sets of protein–protein interactions," Nature, vol. 417, no. 6887, pp. 399–403, 2002.
[6] C. Huang, F. Morcos, S. P. Kanaan, S. Wuchty, D. Z. Chen, and J. A. Izaguirre, "Predicting protein-protein interactions from protein domains using a set cover approach," IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 4, no. 1, pp. 78–87, 2007.
[7] A. Birlutiu, F. d'Alche Buc, and T. Heskes, "A Bayesian framework for combining protein and network topology information for predicting protein-protein interactions," 2014.
[8] S. Wuchty, "Topology and weights in a protein domain interaction network–a novel way to predict protein interactions," BMC Genomics, vol. 7, no. 1, p. 122, 2006.
[9] R. Singh, J. Xu, and B. Berger, "Struct2net: Integrating structure into protein-protein interaction prediction," in Pacific Symposium on Biocomputing, vol. 11. World Scientific, 2006, pp. 403–414.
[10] R. Hosur, J. Xu, J. Bienkowska, and B. Berger, "iwrap: an interface threading approach with application to prediction of cancer-related protein–protein interactions," Journal of Molecular Biology, vol. 405, no. 5, pp. 1295–1310, 2011.
[11] S. M. Gomez, W. S. Noble, and A. Rzhetsky, "Learning to predict protein–protein interactions from protein sequences," Bioinformatics, vol. 19, no. 15, pp. 1875–1881, 2003.
[12] S. Pitre, M. Alamgir, J. R. Green, M. Dumontier, F. Dehne, and A. Golshani, "Computational methods for predicting protein–protein interactions," in Protein–Protein Interaction. Springer, 2008, pp. 247–267.
[13] S. Martin, D. Roe, and J.-L. Faulon, "Predicting protein–protein interactions using signature products," Bioinformatics, vol. 21, no. 2, pp. 218–226, 2005.
[14] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, no. 4, March 2005.
[15] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang, "Predicting protein–protein interactions based only on sequences information," Proceedings of the National Academy of Sciences, vol. 104, no. 11, pp. 4337–4341, 2007.
[16] Y. Park, "Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences," BMC Bioinformatics, vol. 10, no. 1, p. 419, 2009.
[17] A. Ben-Hur and W. S. Noble, "Choosing negative examples for the prediction of protein-protein interactions," BMC Bioinformatics, vol. 7, no. Suppl 1, p. S2, 2006.
[18] Y. Park and E. M. Marcotte, "Revisiting the negative example sampling problem for predicting protein–protein interactions," Bioinformatics, vol. 27, no. 21, pp. 3024–3028, 2011.
[19] S. Pitre, F. Dehne, A. Chan, J. Cheetham, A. Duong, A. Emili, M. Gebbia, J. Greenblatt, M. Jessulat, N. Krogan et al., "Pipe: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs," BMC Bioinformatics, vol. 7, no. 1, p. 365, 2006.
[20] S. Pitre, C. North, M. Alamgir, M. Jessulat, A. Chan, X. Luo, J. Green, M. Dumontier, F. Dehne, and A. Golshani, "Global investigation of protein–protein interactions in yeast saccharomyces cerevisiae using re-occurring short polypeptide sequences," Nucleic Acids Research, vol. 36, no. 13, pp. 4286–4294, 2008.
[21] Y. Guo, L. Yu, Z. Wen, and M. Li, "Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences," Nucleic Acids Research, vol. 36, no. 9, pp. 3025–3030, 2008.
[22] G. O. Consortium et al., "The gene ontology (go) project in 2006," Nucleic Acids Research, vol. 34, no. suppl 1, pp. D322–D326, 2006.
[23] X. Wu, L. Zhu, J. Guo, D.-Y. Zhang, and K. Lin, "Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations," Nucleic Acids Research, vol. 34, no. 7, pp. 2137–2150, 2006.
[24] J. P. Miller, R. S. Lo, A. Ben-Hur, C. Desmarais, I. Stagljar, W. S. Noble, and S. Fields, "Large-scale identification of yeast integral membrane protein interactions," Proceedings of the National


Academy of Sciences of the United States of America, vol. 102, no. 34, pp. 12123–12128, 2005.
[25] S. Bandyopadhyay and K. Mallick, "A new path based hybrid measure for gene ontology similarity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 116–127, 2014.
[26] D. Lin, "An information-theoretic definition of similarity," in ICML, vol. 98, 1998, pp. 296–304.
[27] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," arXiv preprint cmp-lg/9709008, 1997.
[28] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," arXiv preprint cmp-lg/9511007, 1995.
[29] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer, "A new measure for functional similarity of gene products based on gene ontology," BMC Bioinformatics, vol. 7, no. 1, p. 302, 2006.
[30] S. R. Maetschke, M. Simonsen, M. J. Davis, and M. A. Ragan, "Gene ontology-driven inference of protein–protein interactions using inducers," Bioinformatics, vol. 28, no. 1, pp. 69–75, 2012.
[31] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, no. 1, pp. i38–i46, 2005.
[32] O. Tastan, Y. Qi, J. G. Carbonell, and J. Klein-Seetharaman, "Prediction of interactions between hiv-1 and human proteins by information integration," in Pacific Symposium on Biocomputing, vol. 14. World Scientific, 2009, pp. 516–527.
[33] P. Resnik, "Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research, vol. 11, pp. 99–130, February 1999.
[34] L. M. Manevitz and M. Yousef, "One-class svms for document classification," The Journal of Machine Learning Research, vol. 2, pp. 139–154, 2002.
[35] B. Trstenjak, S. Mikac, and D. Donko, "Knn with tf-idf based framework for text categorization," Procedia Engineering, vol. 69, pp. 1356–1364, 2014.
[36] G. O. Consortium et al., "Gene ontology annotations and resources," Nucleic Acids Research, vol. 41, no. D1, pp. D530–D535, 2013.
[37] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. von Mering et al., "String v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic Acids Research, vol. 41, no. D1, pp. D808–D815, 2013.
[38] J. Yu, M. Guo, C. J. Needham, Y. Huang, L. Cai, and D. R. Westhead, "Simple sequence-based kernels do not predict protein–protein interactions," Bioinformatics, vol. 26, no. 20, pp. 2610–2614, 2010.
[39] C.-C. Chang and C.-J. Lin, "Libsvm: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[40] M.-G. Shi, J.-F. Xia, X.-L. Li, and D.-S. Huang, "Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset," Amino Acids, vol. 38, no. 3, pp. 891–899, 2010.
[41] J. Yu, M. Guo, C. J. Needham, Y. Huang, L. Cai, and D. R. Westhead, "Simple sequence-based kernels do not predict protein-protein interactions," Bioinformatics, vol. 26, no. 20, pp. 2610–2614, 2010.
[42] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "Liblinear: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

Sanghamitra Bandyopadhyay received her PhD in Computer Science in 1998 from Indian Statistical Institute, Kolkata, India, where she currently serves as a Professor. She received the prestigious S. S. Bhatnagar award in 2010, the Humboldt fellowship for experienced researchers and the Senior Associateship of ICTP, Italy. She is a Fellow of the Indian National Academy of Engineering and the National Academy of Science, India. Dr. Bandyopadhyay has co-authored six books and more than 250 research papers. Her research interests include Pattern Recognition, Data Mining, Evolutionary Computing and Bioinformatics.

Koushik Mallick received his M.E. degree in Computer Science and Engineering from Jadavpur University in 2009. He is pursuing a Ph.D. from Calcutta University while working at Indian Statistical Institute, Kolkata, India. At present, he is an Assistant Professor at RCCIIT, Kolkata.
