
Briefings in Bioinformatics, 00(00), 2021, 1–13

https://doi.org/10.1093/bib/bbab255
Problem Solving Protocol

Improved prediction of protein–protein interaction
using a hybrid of functional-link Siamese neural
network and gradient boosting machines
Satyajit Mahapatra and Sitanshu Sekhar Sahu
Corresponding author. Satyajit Mahapatra, Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra,
Ranchi-835215, India. E-mail: [email protected]

Abstract
In this paper, for accurate prediction of protein–protein interaction (PPI), a novel hybrid classifier is developed by combining
the functional-link Siamese neural network (FSNN) with the light gradient boosting machine (LGBM) classifier. The hybrid
classifier (FSNN-LGBM) uses the fusion of features derived using pseudo amino acid composition and conjoint triad
descriptors. The FSNN extracts the high-level abstraction features from the raw features and LGBM performs the PPI
prediction task using these abstraction features. On performing 5-fold cross-validation experiments, the proposed hybrid
classifier provides average accuracies of 98.70 and 98.38%, respectively, on the intraspecies PPI data sets of Saccharomyces
cerevisiae and Helicobacter pylori. Similarly, the average accuracies on the interspecies Human-Bacillus and Human-Yersinia PPI data sets are 98.52 and 97.40%, respectively. Compared with the existing methods, the hybrid classifier achieves higher prediction accuracy on the independent test sets and network data sets. The improved prediction performance obtained by the FSNN-LGBM makes it a flexible and effective PPI prediction model.

Key words: Protein–protein interaction; Siamese architecture; functional-link Siamese neural network; light gradient
boosting machine

Introduction

Identifying and characterizing protein–protein interactions (PPIs) is essential for understanding the biological processes in the cell. Knowledge from these studies facilitates the identification of therapeutic targets and the design of novel drugs [1, 2]. Analysis of intraspecies protein interactions (interactions between proteins present within the same organism) helps to understand different life processes such as hormone regulation and metabolism. The interspecies interaction, otherwise called host–pathogen interaction, is the protein interaction between two organisms. In this process, a pathogen protein binds with a host protein, altering biological activities inside the host cell (of humans, animals and plants). Infectious diseases such as COVID-19, Ebola, Anthrax, Plague, HIV and Cholera are caused by viral or bacterial pathogen proteins interacting with human host proteins, affecting the health of, or leading to the death of, millions of people. In plants, host interaction with fungal or bacterial pathogens such as Pseudomonas syringae or Magnaporthe grisea leads to colossal crop loss across the globe.

Satyajit Mahapatra is a PhD scholar in the Department of Electronics and Communication, Birla Institute of Technology Mesra, Ranchi, India. His research
interests include applied machine learning and genomic signal processing.
Sitanshu Sekhar Sahu is presently an Assistant Professor in the Department of Electronics and Communication Engineering at the Birla Institute of Technology Mesra, Ranchi, India. He received his PhD degree from NIT Rourkela, India, in 2011 and completed his postdoctoral research at Oklahoma State University, USA, during 2012–2014. He was a recipient of the DFAIT GSEF Fellowship from the Government of Canada in 2008. He has published more than 60 research papers in reputed refereed international journals and conferences. His research interests include signal and image processing, bioinformatics, machine learning and computer vision.
Submitted: 5 October 2020; Received (in revised form): 26 November 2020

© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]


Experimental approaches for PPI prediction remain expensive, laborious and time-consuming. In addition, they often have high levels of false-positive predictions [3, 4]. Therefore, computational methods have emerged as an alternative for high-throughput identification and characterization of PPIs. To develop a computational model for PPI prediction, protein information related to structure, domain and sequence is required. Since information related to the protein sequence is easily accessible, computational models for PPI prediction using sequence information have attracted researchers' attention.

In recent years, numerous computational methods have been developed that use several baseline classifiers [4–15] and deep neural network (DNN)-based classifiers [3, 16–23] to predict PPIs as a binary classification task. Support vector machines (SVM) [4–7], K nearest neighbor [8], forest classifiers [9, 10], extreme learning machines [11] and gradient boosting machines [12, 13] are some of the baseline classifiers used for PPI prediction. DeepPPI [16], DPPI [17], EsnDNN [18], DeepInteract [19] and RCNN [20] are some of the DNN-based classifiers used to predict protein interactions. These models are based on features derived from protein sequences using various descriptors, such as autocovariance (AC) [5], conjoint triad (CT) [6], multi-scale continuous and discontinuous (MCD) [7], local descriptors (LD) [8], local phase quantization [9], pseudo amino acid composition (PseAAC) [12], position-specific scoring matrix (PSSM) [17], etc. In general, these features encapsulate specific characteristics of the protein sequences, which include local pattern frequencies, physicochemical properties and the positional distribution of the amino acids. Many researchers have adopted a multi-feature fusion strategy for predicting PPIs. The combination of multiple features results in high-dimensional features. Therefore, many feature selection techniques, such as mRMR [7], elastic net [12], L1-regularized logistic regression [13], PCA [14] and the Chi-square test [15], have been used as a preprocessing step to obtain the optimal feature subset for use with the baseline classifiers. Secondly, these high-dimensional features are also used with DNNs [16–23] because of their ability to extract high-level abstraction features. The use of these abstraction features yields high prediction accuracy. A hybrid classifier scheme for predicting protein interactions has been developed by cascading a CNN with the FSRF [24]. In this scheme, the CNN is used to derive the abstraction features from the raw features, which the FSRF uses for prediction. Compared with other existing methods, the hybrid of CNN and FSRF yielded improved performance.

Motivation and contribution

Although good prediction accuracies have been achieved by the existing machine learning methods listed in the literature, most of them are focused on intraspecies PPI prediction. Fewer efforts have been made for interspecies PPI prediction [25–29]. Siamese neural networks (NNs) are efficacious for tasks involving understanding the dynamic relationship between two entities [30]. In PPI prediction, the Siamese DNN architecture is preferred for processing the input protein pairs [16, 17, 20]. However, due to the additional layers of abstraction, the amount of weight adjustment increases, making the training process of a DNN complicated. The functional-link artificial NN (FLANN) reported in [31] is a flat single-layer network that makes learning and weight adjustment simpler. In FLANN, the input feature dimensions are artificially expanded, and through the functional expansion mechanism (FEM), the enhanced feature space produces better discriminatory input patterns [32]. In this work, to mitigate the requirement of a large number of abstraction layers, an NN architecture suitable for the PPI prediction task is developed by integrating the FEM of FLANN with the Siamese NN architecture. The newly designed NN is termed the functional-link Siamese NN (FSNN). In several applications, such as character recognition [33], remote sensing [34], protein fold recognition [35] and speech separation [36], hybrid models built by cascading NNs with baseline classifiers such as SVM, RF and XGB (extreme gradient boosting machine) produce superior performance compared with a single classifier.

The recently proposed light gradient boosting machine (LGBM) classifier has the benefits of high training speed and lower memory consumption while still achieving approximately the same accuracy as XGB and pGBRT [37]. The LGBM classifier has shown superior performance over existing baseline classifiers in bioinformatics and computational biology tasks [12, 13, 38]. Therefore, in this paper, a hybrid classifier is developed by combining the FSNN with LGBM for intraspecies and interspecies PPI prediction. The proposed classifier uses the fusion of features obtained using the PseAAC and CT descriptors. The FSNN operates on the raw features to extract high-level abstraction features, and LGBM uses these abstraction features to predict the interaction between proteins.

Materials and methods

Data sets

In this paper, 10 benchmark data sets are employed for the evaluation of the proposed hybrid classifier. The details are as follows:

• The intraspecies PPI databases of Saccharomyces cerevisiae and Helicobacter pylori, collected from [12], contain 11 188 pairs of proteins (5594 positive interactions and 5594 negative interactions) and 2916 pairs of proteins (1458 positive interactions and 1458 negative interactions), respectively.
• The interspecies interaction databases of Human-Bacillus and Human-Yersinia are collected from [25]. There are 3094 interacting pairs and 9500 noninteracting pairs in the Human-Bacillus interaction data set, and the Human-Yersinia interaction data set comprises 4097 interacting pairs and 12 500 noninteracting pairs. The numbers of samples in the positive and negative classes are not equal in these PPI data sets, resulting in unbalanced data sets. Therefore, balanced data sets are obtained by randomly selecting negative samples equal in number to the positive samples.
• Furthermore, the model is validated using four independent test data sets and two network data sets obtained from [12]. The independent test data sets contain interacting pairs of Caenorhabditis elegans (4013 pairs), Escherichia coli (6954 pairs), Homo sapiens (1412 pairs) and Mus musculus (313 pairs). The network data sets contain a one-core network that has 16 pairs of PPIs and a crossover network that has 96 pairs of PPIs.

Feature extraction

From the literature of the PPI study, it is observed that the fusion of multiple types of sequence-based features shows better results compared with a single feature type [7, 12–15]. In this paper, pseudo amino acid composition (PseAAC) and conjoint triad (CT) descriptors are employed to transform the variable-length protein sequences, represented in alphabet form, into a numerical form of fixed length.
The dipoles and volumes of the side chains influence the electrostatic and hydrophobic properties of the amino acids, which control the interaction between proteins [6]. The CT descriptor extracts the sequence information by grouping the amino acids based on the dipoles and volumes of the side chains. The PseAAC descriptor extracts the correlation between residues separated by a certain distance, which is useful for interaction study [12]. Therefore, the fusion of features obtained using the PseAAC and CT descriptors is used to extract the patterns present in the protein sequences.

Pseudo amino acid composition

In the PseAAC descriptor [12], the amino acid correlation factor (λ) is integrated with the amino acid composition information. As a result, a (20 + λ)-dimensional feature vector Z is obtained for each protein sequence, as defined in Equation (1):

Z = {z_1, z_2, z_3, ..., z_20, z_{20+1}, ..., z_{20+λ}}ᵀ   (λ < L).   (1)

The (20 + λ) components are computed as

z_η = f_η / (Σ_{i=1}^{20} f_i + ω Σ_{κ=1}^{λ} τ_κ),   1 ≤ η ≤ 20,
z_η = ω τ_{η−20} / (Σ_{i=1}^{20} f_i + ω Σ_{κ=1}^{λ} τ_κ),   21 ≤ η ≤ 20 + λ,   (2)

where L represents the length of the protein sequence, ω represents the weight factor, f_η represents the occurrence frequency of the η-th amino acid in the protein sequence and τ_κ represents the κ-tier amino acid correlation information, computed as

τ_κ = (1 / (L − κ)) Σ_{i=1}^{L−κ} F_{i,i+κ}   (κ < L),   (3)

F_{i,i+κ} = (1/3) {[M(Re_i) − M(Re_{i+κ})]² + [H_A(Re_i) − H_A(Re_{i+κ})]² + [H_B(Re_i) − H_B(Re_{i+κ})]²},   (4)

where M(Re_i), H_A(Re_i) and H_B(Re_i) are the side chain mass, hydrophobicity value and hydrophilicity value of the amino acid residue Re_i, respectively. The maximum value of λ should be less than the length of the shortest sequence present in the data set.
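The PseAAC computation in Equations (1)-(4) can be sketched as follows. This is a minimal illustrative Python implementation, not the authors' code: the property tables for side chain mass, hydrophobicity and hydrophilicity must be supplied by the user, and the weight factor ω = 0.05 is only a commonly used default in the PseAAC literature, since its value is not reported here.

```python
import numpy as np

# Hypothetical property tables (values omitted): dictionaries mapping the 20
# one-letter amino acid codes to side chain mass (M), hydrophobicity (H_A)
# and hydrophilicity (H_B), as used in Equation (4).
# MASS, HYDROPHOBICITY, HYDROPHILICITY = {...}, {...}, {...}

def pseaac(seq, props, lam=3, w=0.05):
    """PseAAC sketch following Equations (1)-(4).

    seq   : protein sequence (string of one-letter amino acid codes)
    props : list of per-residue property dictionaries, e.g. [MASS, H_A, H_B]
    lam   : number of correlation tiers (lambda), must be < len(seq)
    w     : weight factor (omega); assumed default, not taken from the paper
    Returns a (20 + lam)-dimensional feature vector.
    """
    aas = "ACDEFGHIKLMNPQRSTVWY"
    L = len(seq)
    # occurrence frequencies f_eta of the 20 amino acids
    f = np.array([seq.count(a) for a in aas], dtype=float)
    # kappa-tier correlation factors tau_kappa (Equations 3 and 4)
    tau = []
    for kappa in range(1, lam + 1):
        corr = [
            np.mean([(p[seq[i]] - p[seq[i + kappa]]) ** 2 for p in props])
            for i in range(L - kappa)
        ]
        tau.append(np.mean(corr))
    tau = np.array(tau)
    denom = f.sum() + w * tau.sum()
    return np.concatenate([f / denom, w * tau / denom])
```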
Conjoint triad

The CT descriptor [6] is based on the relationship of one amino acid with its adjacent amino acids. At first, the 20 amino acids are separated into seven clusters ({A, G, V}; {I, L, F, P}; {Y, M, T, S}; {H, N, Q, W}; {R, K}; {D, E}; {C}), based on the volumes of the side chains and the dipoles. Each amino acid present in the protein sequence is then substituted by the number of its respective group, and every three amino acids in succession are considered as a unit. In total, a set of 343 combinations, which includes {1, 1, 1}; {1, 2, 1}; ... {1, 7, 1}; ... {1, 7, 7}; ... {7, 7, 7}, is formed. As a result, a 343-dimensional feature vector is obtained for each protein sequence.
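A compact sketch of the CT descriptor is shown below. The seven clusters are those listed above; the flat index g1·49 + g2·7 + g3 and the use of raw triad counts (some implementations additionally normalize the counts) are illustrative choices rather than details taken from the paper.

```python
import numpy as np

# Seven amino acid clusters used by the CT descriptor [6]
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: g for g, members in enumerate(CT_GROUPS) for aa in members}

def conjoint_triad(seq):
    """343-dimensional conjoint triad descriptor: counts of every
    3-mer of cluster indices along the protein sequence."""
    groups = [AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP]
    vec = np.zeros(343)
    for i in range(len(groups) - 2):
        g1, g2, g3 = groups[i], groups[i + 1], groups[i + 2]
        vec[g1 * 49 + g2 * 7 + g3] += 1
    return vec
```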
Functional-link Siamese NN

The FSNN takes the features of a pair of protein sequences P1 (protein 1) and P2 (protein 2) as inputs, as shown in Figure 1, and gives a binary value O(P1, P2) as output, which indicates whether the two proteins are interacting or noninteracting. It consists of three modules: an input module, a feature extraction module and a prediction module. The input module and the feature extraction module consist of two channels for processing the pair of inputs. The outputs of the two channels are combined and passed through the prediction module to obtain the probability score.

Input module

In this module, the features of the protein sequences are artificially expanded using an FEM. Consider a data set D_mn, where m and n represent the number of samples and features, respectively, and k is the order of the functional expansion. For each input sample, (2k + 1) functionally expanded blocks (each of dimension n) are generated:

O(D_mn) = {D_mn, sin(D_mn), cos(D_mn), sin(2D_mn), cos(2D_mn), ..., sin(kD_mn), cos(kD_mn)}.
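The functional expansion can be written directly from the definition of O(D_mn). The sketch below assumes NumPy; each returned block keeps the original feature dimension, and in the FSNN each block is routed to its own hidden unit.

```python
import numpy as np

def functional_expansion(X, k=1):
    """Trigonometric functional expansion of a feature matrix X (samples x features).

    Returns the (2k + 1) blocks [X, sin X, cos X, ..., sin kX, cos kX];
    each block is later fed to its own hidden unit (HU) in the FSNN.
    """
    blocks = [X]
    for j in range(1, k + 1):
        blocks += [np.sin(j * X), np.cos(j * X)]
    return blocks

# Example: a 366-dimensional protein feature vector expanded with order k = 1
# gives the three blocks x, sin(x) and cos(x).
# blocks = functional_expansion(np.random.rand(5, 366), k=1)
```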
Feature extraction module

In this module, each channel consists of two layers, as shown in Figure 1. The first layer contains three hidden units (HUs) connected in parallel to process the expanded inputs. The element-wise summation of the outputs of the first layer's HUs is passed through the HU of the second layer. In each HU, the weighted sum of the inputs is computed and passed through an activation function. In this paper, the rectified linear unit (ReLU) activation function is used: for positive values, the ReLU function outputs the input directly, and for negative values, it outputs zero. Some neurons are dropped out (assigned zero) along with their connections to prevent over-fitting.

The HU begins with a dense layer and ends with a dropout layer. The output of an HU is defined as HU = Dropout(ReLU(Dense_N(·))).

The computations of channel 1 are as follows:

C1_l1 = HU(P1) + HU(sin(P1)) + HU(cos(P1)) + ... + HU(sin(kP1)) + HU(cos(kP1));   C1_l2 = HU(C1_l1),

where C1_l1 and C1_l2 represent the outputs of layers 1 and 2 of channel 1, respectively.

The computations of channel 2 are as follows:

C2_l1 = HU(P2) + HU(sin(P2)) + HU(cos(P2)) + ... + HU(sin(kP2)) + HU(cos(kP2));   C2_l2 = HU(C2_l1),

where C2_l1 and C2_l2 represent the outputs of layers 1 and 2 of channel 2, respectively.

The number of neurons (N) in each HU of layer 1 is 256, and in layer 2 it is 128. The outputs of the two channels undergo an element-wise multiplication, i.e. C = C1_l2 ⊙ C2_l2, to infer the relationship between the pair of proteins [17, 20]. As a result, for each protein pair, a 128-dimensional feature vector is obtained. Before passing through the prediction module, the abstraction features are normalized using a min-max normalization operator:

F = (C − min(C)) / (max(C) − min(C)).   (5)
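As an illustration of the feature extraction module, the sketch below builds one Siamese channel in Keras (TensorFlow 2.x) and fuses the two channels by element-wise multiplication. The paper does not state its implementation framework; weight sharing between the two channels is assumed here because of the Siamese design, and names such as hidden_unit and build_channel are ours.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FEAT = 366   # PseAAC (23) + CT (343) features per protein
K = 1          # order of functional expansion -> 3 blocks: x, sin x, cos x

def hidden_unit(n, drop=0.2):
    # HU = Dropout(ReLU(Dense_N(.))) as defined above
    return tf.keras.Sequential([layers.Dense(n, activation="relu"),
                                layers.Dropout(drop)])

def build_channel():
    """One FSNN channel: parallel 256-unit HUs over the expansion blocks,
    element-wise summation, then a 128-unit second-layer HU."""
    blocks = [layers.Input(shape=(N_FEAT,)) for _ in range(2 * K + 1)]
    l1 = layers.add([hidden_unit(256)(b) for b in blocks])
    l2 = hidden_unit(128)(l1)
    return Model(blocks, l2)

channel = build_channel()                    # shared (Siamese) weights, assumed
p1 = [layers.Input(shape=(N_FEAT,)) for _ in range(2 * K + 1)]
p2 = [layers.Input(shape=(N_FEAT,)) for _ in range(2 * K + 1)]
pair_feat = layers.multiply([channel(p1), channel(p2)])   # 128-d abstraction feature
```

With k = 1 the expansion blocks are x, sin x and cos x, matching the three parallel HUs of layer 1; the min-max scaling of Equation (5) is applied to the extracted abstraction features outside the network.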
Figure 1. Proposed FSNNs for prediction of PPI (PseAAC, Pseudo amino acid composition features; CT, Conjoint triad features).

Prediction layer

This module consists of an HU with 64 neurons, followed by a single neuron with a sigmoid activation function that transforms the input vector Q (Q = Dropout(ReLU(Dense_64(F)))) of dimension d from the previous layer into an output score:

O(P1, P2) = Σ_{k=1}^{d} Q_k W_k + β_0,   (6)

where W_k are the weights corresponding to the inputs Q_k and β_0 is the bias.

Finally, the interaction probability between P1 and P2 is computed as 1 / (1 + e^{−O(P1, P2)}) and passed through the binary cross-entropy loss function given by

l(O(P1, P2), Y_{P1P2}) = −(Y_{P1P2} log(O(P1, P2)) + (1 − Y_{P1P2}) log(1 − O(P1, P2))),   (7)

where Y_{P1P2} = 1 and Y_{P1P2} = 0 are the respective class labels for interacting and noninteracting pairs.
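Continuing the Keras sketch from the feature extraction module, the prediction module and training loss can be expressed as follows, with optimizer settings taken from Table 1 (SGD, learning rate 0.01, momentum 0.9, batch size 64, 50 epochs). This remains an illustrative sketch rather than the authors' implementation.

```python
# Prediction module appended to `pair_feat` from the earlier sketch:
# a 64-unit HU followed by a single sigmoid neuron, trained with binary
# cross-entropy (Equation 7).
q = hidden_unit(64)(pair_feat)
out = layers.Dense(1, activation="sigmoid")(q)

fsnn = Model(p1 + p2, out)
fsnn.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
             loss="binary_crossentropy", metrics=["accuracy"])
# fsnn.fit([...expanded blocks of protein 1..., ...expanded blocks of protein 2...],
#          labels, batch_size=64, epochs=50)
```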
Light gradient boosting machine

LGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, developed by Microsoft in 2016 [37].
LGBM uses a histogram-based algorithm, which transforms continuous feature values into discrete bins; this speeds up the training process and minimizes memory consumption. In contrast to other boosting algorithms, instead of a level-wise splitting approach, LGBM uses a leaf-wise splitting approach. The leaf-wise algorithm reduces more loss than the level-wise algorithm, resulting in better accuracy. However, leaf-wise growth can lead to an unbalanced decision tree, which may cause over-fitting. To prevent over-fitting, LGBM restricts the maximum depth during tree growth.

Consider a data set D = {(d_1, y_1); (d_2, y_2); ...; (d_K, y_K)}, where d and y represent the features and class labels, respectively.

Step 1: Initialize the model f_0(d):

f_0(d) = argmin_h Σ_{k=1}^{K} L(y_k, h),   (8)

where k (k = 1, 2, ..., K) indexes the samples and L(y_k, h) denotes the loss function, computed as L(y, f(d)) = (y − f(d))².

Step 2: Compute for N iterations to generate N weak learners.
(a) The gradients, or pseudo-residuals, r_nk are computed as

r_nk = −[∂L(y_k, f(d_k)) / ∂f(d_k)]_{f(d) = f_{n−1}(d)},   (9)

where n (n = 1, 2, ..., N) indexes the weak learners.
(b) The residual r_nk is considered as the new target of the sample. Fit a decision tree to {(d_1, r_n1), ..., (d_K, r_nK)} to create the new tree f_n(d). The regions of the leaf nodes of f_n(d) are R_ln (for l = 1, 2, ..., L), where L is the number of leaf nodes in the tree f_n(d).
(c) Compute the best-fit value for each leaf region:

C_ln = argmin_c Σ_{d_k ∈ R_ln} L(y_k, f_{n−1}(d_k) + c),   (10)

where f_n(d) and f_{n−1}(d) are the new model and the current model, respectively, and I denotes the indicator function, with I = 1 if d ∈ R_ln and I = 0 if d ∉ R_ln.

Step 3: The final additive model F(d) is

F(d) = Σ_{n=1}^{N} Σ_{l=1}^{L} C_ln · I(d ∈ R_ln).   (11)
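A hedged sketch of the LGBM stage using the lightgbm Python package is given below. The hyperparameters follow Table 2 where a direct LightGBM equivalent exists (learning rate 0.2, maximum depth 10, 1000 trees); X_train, X_test and the labels are placeholders for the abstraction features produced by the FSNN.

```python
import lightgbm as lgb

# Hypothetical inputs: X_train / X_test hold the min-max normalized
# abstraction features from the trained FSNN; y_train / y_test are labels.
clf = lgb.LGBMClassifier(
    boosting_type="gbdt",
    learning_rate=0.2,   # optimal value in Table 2
    max_depth=10,        # optimal value in Table 2
    n_estimators=1000,   # "Num parallel tree" = 1000 in Table 2 (approximate mapping)
)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # interaction probability
```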
Hybrid classifier

The proposed hybrid classifier is constructed by replacing the final layer of the FSNN with the LGBM classifier. The activation function (sigmoid) in the final layer of the FSNN provides an estimated probability for the input; the input to this activation function is a linear combination, with trainable weights, of the outputs of the penultimate layer. For other classifiers, the output values of the penultimate layer can be used as input features. Figure 2 presents the architecture of the proposed hybrid classifier. First, in the input layer, the protein sequence features are expanded and used to train the FSNN. After the training is complete, the activation maps of the penultimate layer, otherwise called the abstraction features, are predicted and used as input to the LGBM classifier.
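In Keras terms, the hand-off from the trained FSNN to the LGBM can be sketched as below, reusing names from the earlier sketches; which penultimate tensor is exported (the 128-dimensional fused features or the 64-unit HU output) is not fully explicit in the text, so the choice here is illustrative.

```python
from tensorflow.keras import Model

# Hypothetical: `fsnn` is the trained FSNN from the earlier sketches and `q`
# names its penultimate (64-unit HU) tensor. The sub-model below exposes the
# abstraction features that are handed to the LGBM classifier.
feature_extractor = Model(fsnn.inputs, q)
Z_train = feature_extractor.predict(train_inputs)   # abstraction features
Z_test = feature_extractor.predict(test_inputs)
# lgbm_clf.fit(Z_train, y_train); lgbm_clf.predict(Z_test)
```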
Performance evaluation

For evaluating the performance of the FSNN-LGBM, the 5-fold cross-validation approach is employed. In this process, the whole data set is divided randomly into five separate nonoverlapping subsets of the same size. For testing, one subset is used, while the remaining four subsets are used for training. To ensure that each subset is used as a test set only once, this experiment is conducted five times, and the mean and standard deviation of these runs are taken as the final results. For the evaluation of the proposed method, the output metrics used are as follows:

Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (Sens) = TP / (TP + FN)
Specificity (Spec) = TN / (TN + FP)
Precision (Prec) = TP / (TP + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP and TN are the correctly predicted numbers of interacting and noninteracting pairs, FP is the number of noninteracting pairs predicted as interacting and FN is the number of interacting pairs predicted as noninteracting.
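This evaluation protocol can be reproduced with scikit-learn as sketched below; X, y and make_fsnn_lgbm are placeholders for the pair features, labels and a constructor of the hybrid model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score,
                             matthews_corrcoef, confusion_matrix)

# Hypothetical arrays: X holds protein-pair features, y the interaction labels.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = make_fsnn_lgbm()                 # hypothetical constructor of the hybrid
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
    scores.append({
        "acc": accuracy_score(y[test_idx], pred),
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "prec": precision_score(y[test_idx], pred),
        "mcc": matthews_corrcoef(y[test_idx], pred),
    })
# Report mean and standard deviation over the 5 folds
print({k: (np.mean([s[k] for s in scores]), np.std([s[k] for s in scores]))
       for k in scores[0]})
```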
Results and discussion

For each protein sequence, the PseAAC descriptor (with λ = 3 as the optimal value) gives a 23-dimensional feature vector, and the CT descriptor gives a 343-dimensional feature vector. The features obtained using the two descriptors are combined, yielding a 366-dimensional feature vector for each protein sequence. Thus, for a pair of protein sequences, a 732-dimensional feature vector is obtained. This feature vector is used with the proposed hybrid FSNN-LGBM classifier to predict the interaction. Grid search is used to obtain the optimum parameters of the FSNN and LGBM classifiers, outlined in Tables 1 and 2.

The 5-fold cross-validation results obtained on the benchmark intraspecies (S. cerevisiae, H. pylori) and interspecies (Human-Bacillus, Human-Yersinia) data sets are presented in Table 3. From the sensitivity and specificity values, it is observed that the proposed FSNN-LGBM can distinguish the positive and negative samples effectively.

The 5-fold cross-validation accuracy of the FSNN-LGBM is compared with that of the standard FSNN, LGBM and other baseline classifiers, such as SVM, random forest (RF) and Adaboost (AB), across the four data sets. The proposed model is also compared with the Siamese DNN architecture called DeepPPI reported in [16] alongside the baseline classifiers. From the results presented in Figure 3, it is observed that, with a smaller number of abstraction layers, the proposed FSNN produces equivalent or superior performance compared with DeepPPI when used on the fusion of PseAAC and CT features. Secondly, compared with the baseline classifiers such as SVM, RF and AB, the LGBM yields superior performance. Therefore, using the LGBM classifier on the high-level abstract features obtained using the FSNN leads to an improvement in prediction accuracy. On the S. cerevisiae and H. pylori data sets, the FSNN-LGBM achieves accuracies of 98.70 and 98.38%, which are about 5 and 3.5% higher than the individual accuracies of the FSNN and LGBM.
Figure 2. Schematic diagram of hybrid classifier (FSNN-LGBM).

Table 1. Parameters used for simulation of FSNN

Hyperparameter name | Range | Optimal value
Order of functional expansion (K) | 1, 2, 3 | 1
Learning rate | 1, 0.1, 0.01, 0.001 | 0.01
Batch size | 16, 32, 64, 128 | 64
Momentum rate | 0.8, 0.9 | 0.9
Weight initialization | uniform, normal, glorot_normal, glorot_uniform | glorot_normal
Weight regularization | L1, L2 | L2
Adaptive learning rate method | SGD, RMSprop, Adam | SGD
Activation | ReLU, Sigmoid |
Dropout rate | 0.1, 0.2, 0.5 | 0.2
Loss function | binary_crossentropy |
Epochs | 10, 20, 30, 40, 50, 100 | 50

Accuracies of 98.52 and 97.40% are obtained on the Human-Bacillus and Human-Yersinia data sets, which are about 7 and 5% higher than the individual accuracies of the FSNN and LGBM.

Furthermore, to assess the performance of the classifiers, the receiver operating characteristic (ROC) is analyzed. The ROC is the plot of the true positive rate (TPR) against the false positive rate (FPR). The comparison of ROC curves between the proposed hybrid classifier and the other baseline classifiers on the S. cerevisiae and Human-Yersinia data sets is presented in Figures 4 and 5. From the ROC curves, it is observed that the proposed hybrid classifier has a high TPR and a low FPR, which indicates its high discrimination capability. The area under the curve (AUC) is 0.997 for the S. cerevisiae data set and 0.993 for the Human-Yersinia data set.
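The ROC analysis can be generated with scikit-learn as follows, where y_test and y_score are placeholders for the held-out labels and the predicted interaction probabilities of a classifier.

```python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Hypothetical: y_test are true labels, y_score the interaction probabilities
# produced by a classifier on the held-out fold.
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label=f"FSNN-LGBM (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--", linewidth=0.8)   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```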
Table 2. Parameters used for simulation of LGBM

Hyperparameter name | Range | Optimal value
Booster | gbtree |
Learning rate | 0.05, 0.1, 0.15, 0.2 | 0.2
Gamma | 0 |
Max_depth | 6, 8, 10, 15 | 10
Tree method | Auto |
Num parallel trees | 500, 1000 | 1000

Table 3. Five-fold cross-validation results of the FSNN-LGBM on the intraspecies and interspecies data sets

Data set | Acc (%) | Sens (%) | Spec (%) | Prec (%) | MCC (%) | AUC
S. cerevisiae | 98.70 ± 0.22 | 98.28 ± 0.39 | 99.12 ± 0.30 | 99.11 ± 0.26 | 97.41 ± 0.45 | 0.997
H. pylori | 98.38 ± 0.48 | 98.26 ± 0.65 | 98.50 ± 0.47 | 98.30 ± 0.64 | 96.78 ± 0.96 | 0.992
Human-Bacillus | 98.52 ± 0.37 | 98.51 ± 0.31 | 98.54 ± 0.44 | 98.55 ± 0.35 | 97.06 ± 0.73 | 0.995
Human-Yersinia | 97.40 ± 0.36 | 97.73 ± 0.40 | 97.07 ± 0.51 | 97.09 ± 0.49 | 94.80 ± 0.92 | 0.993

Note: Results in the table are in the form of mean ± standard deviation over the 5 folds.

Figure 3. Comparison of accuracy (%) of the hybrid classifier with baseline and DNN-based classifiers.

Comparison with existing prediction methods

For evaluation of the performance of the hybrid model, it is compared with the existing prediction methods on the four data sets. Using the S. cerevisiae data set, the proposed method is compared with several classical machine learning methods (MLD + RF [4], MCD + SVM [7], LightGBM [12], PSSM + RoF [39], GE + WSRC [40]) and deep learning-based methods (DeepPPI [16], DPPI [17], EsnDNN [18], DeepInteract [19], RCNN [20], DNN-LCTD [22], CNN-FSRF [24]). The comparison results are presented in Table 4.
Figure 4. Comparison of ROC curves of different classifiers using the S. cerevisiae data set.

Figure 5. Comparison of ROC curves of different classifiers using the Human-Yersinia data set.

Among these methods, PSSM + RoF [39] provides the best accuracy among the classical methods, and CNN-FSRF [24] provides the best accuracy among the deep learning-based methods. The proposed method achieves an enhancement of 1.64% compared with PSSM + RoF and 0.95% compared with CNN-FSRF. The performance comparison of the proposed method with existing methods on the H. pylori data set is presented in Table 5. The best among the classical methods is GE + WSRC [40], and the best accuracy among the deep learning-based approaches is achieved by CNN-FSRF [24]. Again, the proposed hybrid classifier shows superior results compared with these methods: it achieves an improvement in accuracy of 5.55% compared with GE + WSRC and 9.42% compared with CNN-FSRF. The performance comparison of the proposed method with the existing methods on the Human-Bacillus and Human-Yersinia data sets is presented in Tables 6 and 7, respectively. The proposed method achieves improvements in accuracy of 13.12 and 6.82% on the Human-Bacillus data set and of 12.8 and 10.1% on the Human-Yersinia data set, compared with LBE + RF [25] and LD + DNN [28].

Comparison of prediction performance on independent test sets

The independent test is based on the presumption that if a large number of interacting proteins have evolved in a correlated manner in one organism, then it is probable that their respective orthologs will interact. C. elegans, E. coli, H. sapiens and M. musculus have shown orthologs to S. cerevisiae [4]. Therefore, the entire S. cerevisiae data set is used as a training set, and the C. elegans, E. coli, H. sapiens and M. musculus data sets are employed as independent test sets. The independent test sets comprise only interacting pairs; thus, only accuracy (ACC %) is calculated, and the comparison with existing methods is reported in Table 8. The proposed hybrid classifier is compared with five deep learning approaches (CNN-FSRF [24], DPPI [17], DeepPPI [16], EsnDNN [18], DNN-LCTD [22]) and four classical approaches (GcForest-PPI [10], GTB-PPI [13], LightGBM [12], MLD-RF [4]). It is observed that, relative to the existing methods, the proposed method produces better performance, demonstrating that it has a better generalization capability.

Comparison of prediction performance on PPI network data sets

A computational model for PPI prediction is considered suitable for practical applications if it is capable of predicting PPI networks [6]. Therefore, to evaluate the efficacy of the proposed method, two PPI network data sets, i.e. the one-core network and the crossover network data sets, collected from [12], are employed. In the one-core network data set, the human CD9 is the core protein that interacts with 16 other satellite proteins. The crossover network consists of several multi-core and/or one-core networks with dynamic interactions between them; this data set contains 96 interacting protein pairs that are associated with the formation and growth of tumors in humans. The proteins of S. cerevisiae have orthologs with human proteins [4]. Therefore, the S. cerevisiae data set is used for training the model, while the one-core and crossover network data sets are used as test sets. A graphical representation of the prediction results achieved by the FSNN-LGBM on the one-core and the crossover network data sets is presented in Figures 6 and 7, respectively. Each node in the graph represents a protein. Protein pairs connected with a solid line are predicted as interacting by the model, and connections with a dotted line represent pairs that the model predicts as noninteracting.

As shown in Figure 6, the proposed FSNN-LGBM classifier correctly predicts all the PPIs in the one-core network; 95 out of 96 interactions are correctly predicted in the crossover network data set, as shown in Figure 7. A comparative analysis of the prediction results obtained by the proposed method and the existing methods on the two network data sets is provided in Table 9. It is observed that the proposed method achieves equivalent or better results compared with the existing methods.

It is inferred from the comparative analysis that the proposed method achieves better results on both intraspecies and interspecies data sets than the existing classical machine learning and deep learning methods. Furthermore, the assessment on the independent test sets and network data sets shows its generalization ability in interaction prediction.

Prediction of Human-SARS-CoV-2 PPI

COVID-19 (Coronavirus Disease 2019), a disease caused by the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization on 11 March 2020. The FSNN-LGBM is used to construct a Human-SARS-CoV-2 PPI prediction model that can be used to predict new interacting protein pairs between humans and SARS-CoV-2.
Table 4. Performance comparison of FSNN-LGBM with existing methods on the S. cerevisiae data set

Method | Acc (%) | Sens (%) | Prec (%) | MCC (%) | AUC (%)
MLD + RF [4] | 94.72 | 94.34 | 98.91 | 85.99 | NA
MCD + SVM [7] | 91.36 | 90.37 | 91.94 | 84.21 | 97.07
LightGBM [12] | 95.07 | 92.21 | 97.82 | 90.30 | 98.75
DeepPPI [16] | 94.43 | 92.06 | 96.65 | 88.97 | 97.00
DPPI [17] | 94.55 | 92.24 | 96.68 | NA | NA
EsnDNN [18] | 95.29 | 95.12 | 95.45 | 90.59 | 97.00
DeepInteract [19] | 92.67 | 86.85 | 98.31 | 85.96 | NA
RCNN [20] | 97.09 | 97.17 | 97.00 | 94.17 | NA
DNN-LCTD [22] | 93.11 | 92.40 | 93.75 | 86.24 | 97.95
CNN-FSRF [24] | 97.75 | 99.61 | 95.89 | 96.04 | 97.54
PSSM + RoF [39] | 97.06 | 95.23 | 98.85 | 94.18 | 97.11
GE + WSRC [40] | 96.82 | 93.63 | 100 | 93.83 | 96.88
Proposed method (FSNN-LGBM) | 98.70 | 98.28 | 99.11 | 97.41 | 99.70

Note: NA means not available.

Table 5. Performance comparison of FSNN-LGBM with existing methods on the H. pylori data set

Method | Acc (%) | Sens (%) | Prec (%) | MCC (%) | AUC (%)
MLD + RF [4] | 88.30 | 92.47 | NA | 79.19 | NA
LightGBM [12] | 89.03 | 89.99 | 88.36 | 78.14 | 95.34
DeepPPI [16] | 86.23 | 89.44 | 84.32 | 72.63 | NA
CNN-FSRF [24] | 88.96 | 91.86 | 86.86 | 78.09 | 89.08
PSSM + RoF [39] | 89.69 | 88.53 | 90.66 | 79.42 | 90.07
GE + WSRC [40] | 92.83 | 89.32 | 96.13 | 86.65 | 93.75
Proposed method (FSNN-LGBM) | 98.38 | 98.26 | 98.30 | 96.78 | 99.20

Note: NA means not available.

Table 6. Performance comparison of FSNN-LGBM with existing methods on the Human-Bacillus data set

Method | Acc (%) | Sens (%) | Prec (%) | MCC (%) | AUC (%)
LBE + BN [25] | 78.7 | 73.0 | 42.0 | 43.4 | 83.70
LBE + NB [25] | 82.5 | 52.8 | 47.8 | 39.7 | 82.10
LBE + RF [25] | 85.4 | 24.0 | 67.0 | 34.0 | 86.80
AAC + BN [25] | 77.4 | 51.7 | 37.3 | 30.3 | 79.00
LBE + j48 [25] | 80.06 | 31.2 | 39.6 | 23.9 | 54.10
LD + DNN [28] | 91.7 | 89.5 | 93.9 | 83.5 | 96.37
Proposed method (FSNN-LGBM) | 98.52 | 98.51 | 98.55 | 97.06 | 99.50

Table 7. Performance comparison of FSNN-LGBM with existing methods on the Human-Yersinia data set

Method | Acc (%) | Sens (%) | Prec (%) | MCC (%) | AUC (%)
LBE-BN [25] | 76.10 | 73.5 | 38.6 | 40.1 | 81.30
LBE-NB [25] | 80.9 | 45.5 | 43.2 | 32.8 | 78.60
LBE-RF [25] | 84.6 | 16.0 | 66.3 | 27.3 | 83.50
AAC-BN [25] | 80.0 | 52.4 | 42.1 | 34.9 | 75.6
LBE-j48 [25] | 80.1 | 27.9 | 37.1 | 20.8 | 51.70
LD-DNN [28] | 87.3 | 84.2 | 90.4 | 74.9 | 94.99
Proposed method (FSNN-LGBM) | 97.40 | 97.73 | 97.09 | 94.80 | 99.30

Table 8. Performance comparison of FSNN-LGBM with existing methods on the independent test sets

Algorithm | C. elegans (Acc %) | E. coli (Acc %) | H. sapiens (Acc %) | M. musculus (Acc %)
Proposed method (FSNN-LGBM) | 97.06 | 96.98 | 98.44 | 98.40
MLD-RF [4] | 87.71 | 89.30 | 94.19 | 91.96
GcForest-PPI [10] | 96.01 | 96.30 | 98.58 | 99.04
LightGBM [12] | 90.16 | 92.16 | 94.83 | 94.57
GTB-PPI [13] | 92.42 | 94.06 | 97.38 | 98.08
DeepPPI [16] | 93.77 | 91.37 | 94.84 | 92.19
DPPI [17] | 95.51 | 96.66 | 96.24 | 95.84
EsnDNN [18] | 93.22 | 95.10 | 95.00 | 94.06
DNN-LCTD [22] | 93.17 | 94.62 | 94.18 | 92.65
CNN-FSRF [24] | 96.41 | 95.47 | 98.65 | 93.27

Table 9. Performance comparison of FSNN-LGBM with existing methods on the PPI network data sets

Algorithm | One-core network | Crossover network
Proposed method (FSNN-LGBM) | 16/16 | 95/96
SVM-CT [6] | 13/16 | 73/96
GcForest-PPI [10] | 16/16 | 94/96
LightGBM-PPI [12] | 15/16 | 89/96
GTB-PPI [13] | 15/16 | 92/96

Note: The numerator represents the number of correctly predicted interactions, and the denominator represents the total number of interactions.

The Human-SARS-CoV-2 positive PPI data set is obtained from the IntAct database [41], which contains human protein interactions with SARS-CoV-2 and SARS-CoV, as well as some interactions with other members of the Coronaviridae family. A total of 4658 interacting pairs (UniProt ID pairs) are extracted, of which about 85% (4000 samples) are used as a positive data set and the remaining 15% (658 samples) are used as an independent test set. Due to the unavailability of an experimentally confirmed noninteracting data set (i.e. negative samples), a negative data set is prepared according to the procedure specified in [25]: protein sequences are randomly paired, one from the host (human) and the other from the pathogen organism (SARS-CoV-2), such that there is no evidence of an interaction between them. The FSNN-LGBM classifier is trained using 4000 positive and 4000 negative samples, utilizing the 5-fold cross-validation approach, and then evaluated using the independent test set. The accuracy of about 98% obtained in both training and testing, as listed in Table 10, demonstrates the effectiveness of the proposed method in analyzing Human-SARS-CoV-2 protein pairs that are suspected to be interacting. It is important to note that the samples suspected to be interacting do not appear in the negative data set.

Figure 6. Graphical representation of the prediction result achieved by the FSNN-LGBM on the one-core network data set. Core and satellite proteins are colored yellow and green, respectively.

Feature visualization

The main advantage of the NN is its ability to extract discriminative features (abstraction features) from the raw features. The t-SNE (t-distributed stochastic neighbor embedding) [42] is a powerful tool commonly used with NNs to visualize and compare the abstraction features and the raw features. In the current study, the abstraction features derived by the FSNN are used as input to the LGBM classifier to enhance the accuracy of PPI prediction. Therefore, visualizing the distributions of the raw and abstraction features provides an insight into the reason for the enhancement in accuracy. A comparison of the t-SNE plots of the original features and the abstraction features of the S. cerevisiae data set and the Human-Yersinia data set is given in Figures 8 and 9. From the figures, it can be seen that the original features of the positive and negative samples overlap, whereas the abstraction features are discriminative. Thus, it is inferred that the proposed FSNN architecture efficiently extracts meaningful information, pertinent to the interaction, from the raw features. The t-SNE plots are implemented in Python using the scikit-learn library.
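A minimal version of the t-SNE comparison, assuming the raw pair features and the FSNN abstraction features are available as arrays, is sketched below.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical arrays: X_raw are the raw 732-d pair features, X_abs the
# abstraction features extracted by the FSNN, and y the interaction labels.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feats, title in [(axes[0], X_raw, "Raw features"),
                         (axes[1], X_abs, "Abstraction features")]:
    emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
    ax.scatter(emb[y == 1, 0], emb[y == 1, 1], s=3, label="positive")
    ax.scatter(emb[y == 0, 0], emb[y == 0, 1], s=3, label="negative")
    ax.set_title(title)
    ax.legend()
plt.show()
```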
Figure 7. Graphical representation of the prediction result achieved by the FSNN-LGBM on the crossover network data set. The solid lines are the true prediction, and
the dotted lines are the false prediction.

Table 10. Performance of the FSNN-LGBM on Human-SARS-CoV-2 PPI

Data set (Human-SARS-CoV-2) | Acc (%) | Sens (%) | Spec (%) | Prec (%) | MCC (%) | AUC
Five-fold CV | 98.86 ± 0.22 | 98.60 ± 0.35 | 99.12 ± 0.25 | 99.13 ± 0.24 | 97.73 ± 0.45 | 0.998
Independent test | Accuracy: 98.50%

Figure 8. t-SNE plots of S. cerevisiae data set (A) raw input features, (B) abstracted features present in the penultimate layer. Blue color dots represent positive interactions,
and orange dots represent negative interactions.
Figure 9. t-SNE plots of Human-Yersinia data set (A) raw input features, (B) abstracted features present in the penultimate layer. Blue color dots represent positive interactions, and orange dots represent negative interactions.

Conclusion

In this paper, a new hybrid approach that combines the FSNN with the LGBM is presented to predict PPIs efficiently. The fusion of sequence information features from the conjoint triad (CT) and pseudo amino acid composition (PseAAC) descriptors is used as input to the hybrid classifier. The FSNN utilizes nonlinear transformation techniques to extract abstraction features from the raw features of the protein sequences. The extracted features used with the LGBM classifier enhance the prediction accuracy. When assessed on several standard intraspecies and interspecies PPI data sets, the FSNN-LGBM performs substantially well compared with existing methods. In addition, on independent test sets, the hybrid classifier achieves higher accuracy than existing methods. The prediction results on network data sets indicate that it can provide new insights into signaling pathway analysis, the prediction of drug targets and the understanding of disease pathogenesis. Although the proposed method provides better results than existing methods, this comes at the expense of an increased computational burden.

Key Points

• A hybrid classifier termed FSNN-LGBM is developed by combining the functional-link Siamese neural network (FSNN) with light gradient boosting machines (LGBM) for PPI prediction.
• FSNN extracts high-level abstraction features from raw protein sequence features.
• The LGBM classifier uses the abstraction features for predicting PPIs.
• FSNN-LGBM achieved improved accuracy on interspecies and intraspecies PPI data sets.

Availability of data and codes

The data sets and codes used for this study are available on request to the corresponding author.

Acknowledgements

This work has been carried out in the Signal Processing Lab, Department of Electronics and Communication Engineering of Birla Institute of Technology, Mesra, Ranchi.

Funding

Science and Engineering Research Board, Department of Science and Technology, Government of India (Grant No. ECR/2017/000345).

References

1. Petta I, Lievens S, Libert C, et al. Modulation of protein–protein interactions for the development of novel therapeutics. Mol Ther 2016;24(4):707–18.
2. Skrabanek L, Saini HK, Bader GD, et al. Computational prediction of protein–protein interactions. Mol Biotechnol 2008;38(1):1–17.
3. Sun T, Zhou B, Lai L, et al. Sequence-based prediction of protein–protein interaction using a deep-learning algorithm. BMC Bioinformatics 2017;18(1):277.
4. You ZH, Chan KC, Hu P. Predicting protein–protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015;10(5):e0125811.
5. Guo Y, Yu L, Wen Z, et al. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res 2008;36(9):3025–30.
6. Shen J, Zhang J, Luo X, et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci 2007;104(11):4337–41.
7. You ZH, Zhu L, Zheng CH, et al. Prediction of protein–protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014;15(S15):S9.
8. Yang L, Xia JF, Gui J. Prediction of protein–protein interactions from protein sequence using local descriptors. Protein Pept Lett 2010;17(9):1085–90.
9. Wong L, You ZH, Li S, et al. Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In: International Conference on Intelligent Computing, 2015, pp. 713–20. Springer, Cham.
10. Yu B, Chen C, Wang X, et al. Prediction of protein–protein interactions based on elastic net and deep forest. Expert Syst Appl 2019;176:114876. doi: 10.1016/j.eswa.2021.114876.
11. You ZH, Lei YK, Zhu L, et al. Prediction of protein–protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics 2013;14(S8):S10.
12. Chen C, Zhang Q, Ma Q, et al. LightGBM-PPI: predicting protein–protein interactions through LightGBM with multi-information fusion. Chemom Intel Lab Syst 2019;191:54–64.
13. Yu B, Chen C, Zhou H, et al. GTB-PPI: predict protein–protein interactions based on L1-regularized logistic regression and gradient tree boosting. Genomics Proteomics Bioinformatics 2021. doi: 10.1016/j.gpb.2021.01.001.
14. Göktepe YE, Kodaz H. Prediction of protein–protein interactions using an effective sequence based combined method. Neurocomputing 2018;303:68–74.
15. Wang L, You ZH, Xia SX, et al. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Computing 2018;22(10):3373–81.
16. Du X, Sun S, Hu C, et al. DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. J Chem Inf Model 2017;57(6):1499–510.
17. Hashemifar S, Neyshabur B, Khan AA, et al. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 2018;34(17):i802–10.
18. Zhang L, Yu G, Xia D, et al. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019;324:10–9.
19. Patel S, Tripathi R, Kumari V, et al. DeepInteract: deep neural network based protein–protein interaction prediction tool. Current Bioinformatics 2017;12(6):551–7.
20. Chen M, Ju CJT, Zhou G, et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019;35(14):i305–14.
21. Wang X, Wang R, Wei Y, et al. A novel conjoint triad auto covariance (CTAC) coding method for predicting protein–protein interaction based on amino acid sequence. Math Biosci 2019;313:41–7.
22. Wang J, Zhang L, Jia L, et al. Protein–protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. Int J Mol Sci 2017;18(11):2373.
23. Yao Y, Du X, Diao Y, et al. An integration of deep learning with feature embedding for protein–protein interaction prediction. PeerJ 2019;7:e7126. http://doi.org/10.7717/peerj.7126.
24. Wang L, Wang HF, Liu SR, et al. Predicting protein–protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest. Sci Rep 2019;9(1):1–12.
25. Kösesoy İ, Gök M, Öz C. A new sequence based encoding for prediction of host–pathogen protein interactions. Comput Biol Chem 2019;78:170–7.
26. Barman RK, Saha S, Das S. Prediction of interactions between viral and host proteins using supervised machine learning methods. PLoS One 2014;9(11):e112034. https://doi.org/10.1371/journal.pone.0112034.
27. Zhou X, Park B, Choi D, et al. A generalized approach to predicting protein–protein interactions between virus and host. BMC Genomics 2018;19(6):568.
28. Mahapatra S, Sahu SS. Boosting predictions of host–pathogen protein interactions using deep neural networks. In: 2020 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2020, pp. 1–4. doi: 10.1109/SCEECS48394.2020.150.
29. Chen H, Li F, Wang L, et al. Systematic evaluation of machine learning methods for identifying human–pathogen protein–protein interactions. Brief Bioinform 2021;22(3):bbaa068. https://doi.org/10.1093/bib/bbaa068.
30. Bromley J, Bentz JW, Bottou L, et al. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 1993;7(4):669–88.
31. Pao YH. Adaptive Pattern Recognition and Neural Networks. 1989, Chapter 8, pp. 197–22. Addison-Wesley, Reading, MA.
32. Naik B, Obaidat MS, Nayak J, et al. Intelligent secure ecosystem based on metaheuristic and functional link neural network for edge of things. IEEE Transactions on Industrial Informatics 2019;16(3):1947–56.
33. Weldegebriel HT, Liu H, Haq AU, et al. A new hybrid convolutional neural network and eXtreme gradient boosting classifier for recognizing handwritten Ethiopian characters. IEEE Access 2019;8:17804–18.
34. Dong L, Du H, Mao F, et al. Very high resolution remote sensing imagery classification using a fusion of random forest and deep learning technique—subtropical area for example. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2019;13:113–28.
35. Liu B, Li CC, Yan K. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief Bioinform 2020;21(5):1733–41. https://doi.org/10.1093/bib/bbz098.
36. Wang Y, Wang D. Towards scaling up classification-based speech separation. IEEE Trans Audio Speech Lang Process 2013;21(7):1381–90.
37. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 2017, pp. 3149–57. Curran Associates Inc., Red Hook, NY, USA.
38. Zhang Y, Xie R, Wang J, et al. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Brief Bioinform 2019;20(6):2185–99.
39. Zhu HJ, You ZH, Shi WL, et al. Improved prediction of protein–protein interactions using descriptors derived from PSSM via gray level co-occurrence matrix. IEEE Access 2019;7:49456–65.
40. Huang YA, You ZH, Chen X, et al. Sequence-based prediction of protein–protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinformatics 2016;17(1):184.
41. Orchard S, Ammari M, Aranda B, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014;42(D1):D358–63.
42. Maaten LVD, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research 2008;9(Nov):2579–605.
