https://doi.org/10.1093/bib/bbab255
Problem Solving Protocol
Improved prediction of protein–protein interaction
using a hybrid of functional-link Siamese neural
network and gradient boosting machines
Satyajit Mahapatra and Sitanshu Sekhar Sahu
Corresponding author. Satyajit Mahapatra, Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra,
Ranchi-835215, India. E-mail: [email protected]
Abstract
In this paper, for accurate prediction of protein–protein interaction (PPI), a novel hybrid classifier is developed by combining
the functional-link Siamese neural network (FSNN) with the light gradient boosting machine (LGBM) classifier. The hybrid
classifier (FSNN-LGBM) uses the fusion of features derived using pseudo amino acid composition and conjoint triad
descriptors. The FSNN extracts the high-level abstraction features from the raw features and LGBM performs the PPI
prediction task using these abstraction features. On performing 5-fold cross-validation experiments, the proposed hybrid
classifier provides average accuracies of 98.70 and 98.38%, respectively, on the intraspecies PPI data sets of Saccharomyces
cerevisiae and Helicobacter pylori. Similarly, the average accuracies for the interspecies PPI data sets of the Human-Bacillus and
Human-Yersinia data sets are 98.52 and 97.40%, respectively. Compared with the existing methods, the hybrid classifier
achieves higher prediction accuracy on the independent test sets and network data sets. The improved prediction
performance obtained by the FSNN-LGBM makes it a flexible and effective PPI prediction model.
Key words: Protein–protein interaction; Siamese architecture; functional-link Siamese neural network; light gradient
boosting machine
Satyajit Mahapatra is a PhD scholar in the Department of Electronics and Communication, Birla Institute of Technology Mesra, Ranchi, India. His research
interests include applied machine learning and genomic signal processing.
Sitanshu Sekhar Sahu is an Assistant Professor in the Department of Electronics and Communication Engineering at the Birla Institute of Technology Mesra, Ranchi, India. He received his PhD degree from NIT Rourkela, India, in 2011 and completed his postdoctoral research at Oklahoma State University, USA, during 2012–2014. He was a recipient of the DFAIT GSEF Fellowship from the Government of Canada in 2008. He has published more than 60
research papers in reputed refereed international journals and conferences. His research interests include signal and image processing, bioinformatics,
machine learning and computer vision.
Submitted: 5 October 2020; Received (in revised form): 26 November 2020
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Experimental methods for PPI prediction remain expensive, laborious and time-consuming. In addition, they often have high levels of false-positive predictions [3, 4]. Therefore, computational methods have emerged as an alternative for high-throughput identification and characterization of PPIs. To develop a computational model for PPI prediction, protein information related to structure, domain and sequence is required. Since sequence information is easily accessible, computational models for PPI prediction using sequence information have attracted researchers' attention.
In recent years, numerous computational methods have been developed that use several baseline classifiers [4–15] and deep neural network (DNN)-based classifiers [3, 16–23] to predict PPIs as a binary classification task. Support vector machines (SVM) [4–7], K-nearest neighbors [8], forest classifiers [9, 10], extreme learning machines [11] and gradient boosting machines [12, 13] are some of the baseline classifiers used for PPI prediction. DeepPPI [16], DPPI [17], EsnDNN [18], DeepInteract [19] and RCNN [20] are some of the DNN-based classifiers used to predict protein interactions. These models are based on features derived from protein sequences using various descriptors such as autocovariance (AC) [5], conjoint triad (CT) [6], multi-scale continuous and discontinuous (MCD) [7], local descriptors (LD) [8], local phase quantization [9], pseudo amino acid composition (PseAAC) [12] and position-specific scoring matrix (PSSM) [17]. In general, these features encapsulate specific characteristics of the protein sequences, which include local pattern frequencies, physicochemical properties and the positional distribution of the amino acids. Many researchers have adopted a multi-feature fusion strategy for predicting PPIs. The combination of multiple features results in high-dimensional features. Therefore, many feature selection techniques, such as mRMR [7], elastic net [12], L1-regularized logistic regression [13], PCA [14] and the Chi-square test [15], have been used as a preprocessing step to obtain the optimal feature subset for use with the baseline classifiers. Secondly, these high-dimensional features are also used with DNNs [16–23] because of their ability to extract high-level abstraction features. The use of these abstraction features yields high prediction accuracy. A hybrid classifier scheme for predicting protein interactions was developed by cascading a CNN with the FSRF [24]. In this scheme, the CNN derives abstraction features from the raw features, which the FSRF then uses for prediction. Compared with other existing methods, the hybrid of CNN and FSRF yielded improved performance.
Motivation and contribution
Although good prediction accuracies have been achieved by the existing machine learning methods listed in the literature, most of them focus on intraspecies PPI prediction. Fewer efforts have been made for interspecies PPI prediction [25–29]. Siamese neural networks (NNs) are efficacious for tasks involving understanding the dynamic relationship between two entities [30]. In PPI prediction, a Siamese DNN architecture is preferred for processing the input protein pairs [16, 17, 20]. Due to the additional layers of abstraction, the amount of weight adjustment increases, making the training process of a DNN complicated. The functional-link artificial NN (FLANN) reported in [31] is a flat single-layer network that makes learning and weight adjustment simpler. In FLANN, the input feature dimensions are artificially expanded, and through the functional expansion mechanism (FEM), the enhanced feature space produces better discriminatory input patterns [32]. In this work, to mitigate the requirement of a large number of abstraction layers, an NN architecture suitable for the PPI prediction task is developed by integrating the FEM of FLANN with the Siamese NN architecture. The newly designed NN is termed the functional-link Siamese NN (FSNN). In several applications, such as character recognition [33], remote sensing [34], protein fold recognition [35] and speech separation [36], hybrid models built by cascading NNs with baseline classifiers such as SVM, RF and XGB (extreme gradient boosting machine) produce superior performance compared with a single classifier.
The recently proposed light gradient boosting machine (LGBM) classifier has the benefits of high training speed and lower memory consumption while still achieving approximately the same accuracy as XGB and pGBRT [37]. The LGBM classifier has shown superior performance to existing baseline classifiers in bioinformatics and computational biology tasks [12, 13, 38]. Therefore, in this paper, a hybrid classifier is developed by combining the FSNN with the LGBM for intraspecies and interspecies PPI prediction. The proposed classifier uses the fusion of features obtained using the PseAAC and CT descriptors. The FSNN operates on the raw features to extract high-level abstraction features, and the LGBM uses these abstraction features to predict the interaction between proteins.
Materials and methods
Data sets
In this paper, 10 benchmark data sets are employed for the evaluation of the proposed hybrid classifier. The details are as follows:
• The intraspecies PPI data sets of Saccharomyces cerevisiae and Helicobacter pylori, collected from [12], contain 11 888 pairs of proteins (5594 positive interactions and 5594 negative interactions) and 2916 pairs of proteins (1458 positive interactions and 1458 negative interactions), respectively.
• The interspecies interaction data sets of Human-Bacillus and Human-Yersinia are collected from [25]. There are 3094 interacting pairs and 9500 noninteracting pairs in the Human-Bacillus interaction data set, while the Human-Yersinia data set comprises 4097 interacting pairs and 12 500 noninteracting pairs. The numbers of samples in the positive and negative classes are not equal in these PPI data sets, resulting in unbalanced data sets. Therefore, a balanced data set is obtained by randomly selecting negative samples equal in number to the positive samples.
• Furthermore, the model is validated using four independent test data sets and two network data sets obtained from [12]. The independent test data sets contain interacting pairs from Caenorhabditis elegans (4013 pairs), Escherichia coli (6954 pairs), Homo sapiens (1412 pairs) and Mus musculus (313 pairs). The network data sets contain a one-core network with 16 pairs of PPIs and a crossover network with 96 pairs of PPIs.
Feature extraction
From the PPI literature, it is observed that the fusion of multiple types of sequence-based features gives better results than a single feature type [7, 12–15]. In this paper, pseudo amino acid composition (PseAAC) and conjoint triad (CT) descriptors are employed to transform the variable-length protein sequences, represented in alphabet form, into a numerical form of fixed length. The dipoles and volume of side
chains influence the electrostatic and hydrophobic properties of amino acids, which control the interaction between proteins [6]. The CT descriptor extracts the sequence information by grouping the amino acids based on the dipoles and volumes of their side chains. The PseAAC descriptor extracts the correlation between residues a certain distance apart, which is useful for interaction studies [12]. Therefore, the fusion of features obtained using the PseAAC and CT descriptors is used to extract the patterns present in the protein sequences.
Pseudo amino acid composition
In the PseAAC descriptor [12], the amino acid correlation factor (λ) is integrated with the amino acid composition information. As a result, a (20 + λ)-dimensional feature vector (Z) is obtained for each protein sequence, as defined in Equation 1.
Conjoint triad
The CT descriptor [6] is based on the relationship between an amino acid and its adjacent amino acids. First, the 20 amino acids are separated into seven clusters ({A, G, V}; {I, L, F, P}; {Y, M, T, S}; {H, N, Q, W}; {R, K}; {D, E}; {C}), based on the volumes of the side chains and the dipoles. Each amino acid in the protein sequence is then substituted by the number of its respective group, and three amino acids in succession are considered as a unit. In total, a set of combinations {1, 1, 1}; {1, 2, 1}; · · · {1, 7, 1}; · · · {1, 7, 7}; · · · {7, 7, 7} is formed. As a result, a 343-dimensional feature vector is obtained for each protein sequence.
Functional-link Siamese NN
The FSNN takes the features of a pair of protein sequences P1 (protein 1) and P2 (protein 2) as inputs, as shown in Figure 1, and gives a binary value O(P1, P2) as output, which indicates whether the two proteins are interacting or noninteracting. It consists of three modules: an input module, a feature extraction module and a prediction module. The input and feature extraction modules consist of two channels for processing the pair of inputs. The outputs of the two channels are combined and passed through the prediction module to obtain the probability score.
Input module
In this module, the features of the protein sequences are artificially expanded using the FEM. Consider a data set $D_{mn}$, where $m$ and $n$ represent the numbers of samples and features, respectively, and $k$ is the order of the functional expansion. For each input sample, $(2n + 1)$ functionally expanded values are generated. For channel 2, the outputs of the two hidden layers are
$C_2^{l_1} = HU(P_2) + HU(\sin(P_2)) + HU(\cos(P_2)) + \cdots + HU(\sin(kP_2)) + HU(\cos(kP_2)); \quad C_2^{l_2} = HU(C_2^{l_1}),$
where $C_2^{l_1}$ and $C_2^{l_2}$ represent the outputs of layers 1 and 2 of channel 2, respectively.
The number of neurons (N) present in each HU of layer 1 is 256, and that of layer 2 is 128. The outputs of the two channels undergo an element-wise multiplication, i.e. $C = C_1^{l_2} \cdot C_2^{l_2}$, to infer the relationship between the pair of proteins [17, 20]. As a result, a 128-dimensional feature vector is obtained for each protein pair. Before passing through the prediction module, the abstraction features are normalized using a min-max normalization operator:
$F = \dfrac{C - \min(C)}{\max(C) - \min(C)}. \quad (5)$
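To make the forward computation concrete, the following minimal NumPy sketch traces one pass through the architecture described above: trigonometric functional expansion of each input, two shared (Siamese) hidden layers of 256 and 128 neurons per channel, element-wise fusion of the channels and the min-max scaling of Equation (5). The weight values, the tanh activation inside each hidden unit and the expansion order are illustrative assumptions rather than the trained model.

```python
# A minimal numpy sketch of one FSNN forward pass as described above.
import numpy as np

rng = np.random.default_rng(0)

def functional_expansion(x, order=2):
    """Trigonometric functional expansion (FEM) of a feature vector."""
    expanded = [x]
    for k in range(1, order + 1):
        expanded += [np.sin(k * x), np.cos(k * x)]
    return expanded

def hidden_unit(x, W, b):
    return np.tanh(W @ x + b)          # activation choice is an assumption

def channel(x, W1, b1, W2, b2, order=2):
    # Layer 1: one hidden unit per expanded term, outputs summed (C^l1),
    # followed by the second hidden unit (C^l2), which is 128-dimensional.
    l1 = sum(hidden_unit(t, W1, b1) for t in functional_expansion(x, order))
    return hidden_unit(l1, W2, b2)

n = 366                                   # fused PseAAC + CT length per protein
W1, b1 = rng.normal(size=(256, n)), np.zeros(256)   # shared by both channels
W2, b2 = rng.normal(size=(128, 256)), np.zeros(128)

p1, p2 = rng.random(n), rng.random(n)     # a hypothetical protein pair
C = channel(p1, W1, b1, W2, b2) * channel(p2, W1, b1, W2, b2)  # element-wise fusion
F = (C - C.min()) / (C.max() - C.min() + 1e-12)                # Equation (5)
```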
Figure 1. Proposed FSNNs for prediction of PPI (PseAAC, Pseudo amino acid composition features; CT, Conjoint triad features).
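As the figure indicates, each channel of the FSNN receives the fused PseAAC and CT features of one protein. A minimal sketch of the conjoint triad encoding described in the previous section is given below; the seven clusters follow the grouping in the text, while the frequency normalization and the hypothetical pseaac() helper mentioned in the final comment are assumptions for illustration.

```python
# A sketch of the 343-dimensional conjoint triad (CT) encoding: amino acids
# are mapped to their seven clusters and every triple of consecutive cluster
# codes is counted. Normalizing counts to frequencies is an illustrative choice.
import numpy as np

CLUSTERS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLUSTER = {aa: i for i, group in enumerate(CLUSTERS) for aa in group}

def conjoint_triad(seq):
    """Return the 343-dimensional CT vector of a protein sequence."""
    vec = np.zeros(7 * 7 * 7)
    codes = [AA_TO_CLUSTER[aa] for aa in seq if aa in AA_TO_CLUSTER]
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        vec[a * 49 + b * 7 + c] += 1
    return vec / max(vec.sum(), 1.0)

# Fusing CT with a (20 + lambda)-dimensional PseAAC vector gives the
# 366-dimensional per-protein feature used in this work, e.g.
# protein_feature = np.concatenate([pseaac(seq, lam=3), conjoint_triad(seq)])
# where pseaac() is a hypothetical helper assumed to be available elsewhere.
```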
Light gradient boosting machine
LGBM grows decision trees leaf-wise, which may produce an unbalanced decision tree and lead to over-fitting. To prevent over-fitting, LGBM restricts the maximum depth during tree growth.
Consider a data set $D = \{(d_1, y_1), (d_2, y_2), \ldots, (d_K, y_K)\}$, where $d$ and $y$ represent the features and class labels, respectively.
Step 1: Initialize the model $f_0(d)$:
$f_0(d) = \arg\min_h \sum_{k=1}^{K} L(y_k, h), \quad (8)$
where $k$ ($k = 1, 2, \ldots, K$) indexes the samples and $L(y_k, h)$ denotes the loss function, computed as $L(y, f(d)) = (y - f(d))^2$.
Step 2: Compute for $N$ iterations to generate $N$ weak learning models.
(a) The gradients, or pseudo-residuals ($r_{nk}$), are computed as
$r_{nk} = -\left[\dfrac{\partial L(y_k, f(d_k))}{\partial f(d_k)}\right]_{f(d) = f_{n-1}(d)}, \quad (9)$
where $n$ ($n = 1, 2, \ldots, N$) indexes the weak learning models.
(b) The residual $r_{nk}$ is taken as the new target of each sample. A decision tree is fitted to $\{(d_1, r_{n1}), \ldots, (d_K, r_{nK})\}$ to create the new tree $f_n(d)$. The leaf regions of $f_n(d)$ are $R_{ln}$ (for $l = 1, 2, \ldots, L$), where $L$ is the number of leaf nodes in the classification tree $f_n(d)$.
(c) Compute the best-fit value for each leaf region:
$C_{ln} = \arg\min_{c} \sum_{d_k \in R_{ln}} L\big(y_k, f_{n-1}(d_k) + c\big), \quad (10)$
where $f_n(d)$ and $f_{n-1}(d)$ are the new model and the current model, respectively, and $I$ denotes the indicator function, with $I = 1$ if $d \in R_{ln}$ and $I = 0$ if $d \notin R_{ln}$.
Step 3: The final additive model $F(d)$ is
$F(d) = \sum_{n=1}^{N} \sum_{l=1}^{L} C_{ln}\, I(d \in R_{ln}). \quad (11)$
Hybrid classifier
The proposed hybrid classifier is constructed by replacing the final layer of the FSNN with the LGBM classifier. The activation function (sigmoid) in the final layer of the FSNN provides an estimated probability for the input; the linear combination of the outputs of the penultimate layer with trainable weights is the input to this activation function. For other classifiers, the output values of the penultimate layer can be used as input features. Figure 2 presents the architecture of the proposed hybrid classifier. First, in the input layer, the protein sequence features are expanded and used to train the FSNN. After the training is complete, the activation maps, otherwise called the abstraction features, present in the penultimate layer are predicted and used as input to the LGBM classifier.
Figure 2. Schematic diagram of the hybrid classifier (FSNN-LGBM).
For performance evaluation, 5-fold cross-validation is carried out: the data set is divided into five subsets, one subset is used as the test set, while the remaining four subsets are used for training. To ensure that each subset is used as a test set only once, this experiment is conducted five times, and the mean and standard deviation over these runs are taken as the final results. For the evaluation of the proposed method, the output metrics used are as follows:
$\mathrm{Accuracy\ (ACC)} = \dfrac{TP + TN}{TP + TN + FP + FN}$
$\mathrm{Sensitivity\ (Sens)} = \dfrac{TP}{TP + FN}$
$\mathrm{Specificity\ (Spec)} = \dfrac{TN}{TN + FP}$
$\mathrm{Precision\ (Prec)} = \dfrac{TP}{TP + FP}$
$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$
where TP and TN are the correctly predicted numbers of interacting and noninteracting pairs, FP is the number of noninteracting pairs predicted as interacting and FN is the number of interacting pairs predicted as noninteracting.
Results and discussion
For each protein sequence, the PseAAC descriptor (with λ = 3 as the optimal value) gives a 23-dimensional feature vector, and the CT descriptor gives a 343-dimensional feature vector. The features obtained using the two descriptors are combined, yielding a 366-dimensional feature vector for each protein sequence. Thus, for a pair of protein sequences, a 732-dimensional feature vector is obtained. This feature vector is used with the proposed hybrid FSNN-LGBM classifier to predict the interaction. A grid search is used to obtain the optimum parameters of the FSNN and LGBM classifiers, outlined in Tables 1 and 2.
The 5-fold cross-validation results obtained on the benchmark intraspecies (S. cerevisiae, H. pylori) and interspecies (Human-Bacillus, Human-Yersinia) data sets are presented in Table 3. From the sensitivity and specificity values, it is observed that the proposed FSNN-LGBM can distinguish the positive and negative samples effectively.
The 5-fold cross-validation accuracy of the FSNN-LGBM is compared with that of the standard FSNN, the LGBM and other baseline classifiers, such as SVM, random forest (RF) and AdaBoost (AB), across the four data sets. The proposed model is also compared with the Siamese DNN architecture DeepPPI reported in [16] alongside the baseline classifiers. From the results presented in Figure 3, it is observed that, with fewer abstraction layers, the proposed FSNN produces equivalent or superior performance compared with DeepPPI when used on the fusion of PseAAC and CT features. Secondly, compared with baseline classifiers such as SVM, RF and AB, the LGBM yields superior performance. Therefore, using the LGBM classifier on the high-level abstraction features obtained using the FSNN leads to an improvement in prediction accuracy. On the S. cerevisiae and H. pylori data sets, the FSNN-LGBM achieves accuracies of 98.70 and 98.38%, which are about 5 and 3.5% higher than the individual accuracies of the FSNN and LGBM. Accuracies of 98.52 and 97.40% are obtained on the Human-Bacillus and Human-Yersinia data sets, which are about 7 and 5% higher than the individual accuracies of the FSNN and LGBM.
Furthermore, receiver operating characteristic (ROC) curves are analyzed to assess the performance of the classifiers. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR). The comparison of ROC curves between the proposed hybrid classifier and the other baseline classifiers on the S. cerevisiae and Human-Yersinia data sets is presented in Figures 4 and 5. From the ROC curves, it is observed that the proposed hybrid classifier has a high TPR and a low FPR, which indicates its high discrimination capability. The area under the curve (AUC) is 0.997 for the S. cerevisiae data set and 0.993 for the Human-Yersinia data set.
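The measures defined above can be computed directly from the confusion matrix; the short scikit-learn sketch below shows one way to do so, with toy labels and scores used purely for illustration.

```python
# A sketch of the evaluation metrics used in this work, computed from binary
# labels, predicted labels and prediction scores of the classifier.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def ppi_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC":  (tp + tn) / (tp + tn + fp + fn),
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "Prec": tp / (tp + fp),
        "MCC":  matthews_corrcoef(y_true, y_pred),
        "AUC":  roc_auc_score(y_true, y_score),   # area under the ROC curve
    }

# Toy example (illustrative values only):
y_true  = np.array([1, 0, 1, 1, 0, 0])
y_pred  = np.array([1, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3])
print(ppi_metrics(y_true, y_pred, y_score))
```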
Parameter Search values Selected value
Booster gbtree gbtree
Learning rate 0.05, 0.1, 0.15, 0.2 0.2
Gamma 0 0
Max_depth 6, 8, 10, 15 10
Tree method Auto Auto
Num parallel trees 500, 1000 1000
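The grid search mentioned in the text could be set up along the following lines; the candidate values below merely mirror the kind of search space listed in the table, and the feature and label variable names are placeholders.

```python
# A sketch of a grid search over gradient boosting parameters with 5-fold
# cross-validation. Grid values are illustrative, not the exact Tables 1 and 2.
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.05, 0.1, 0.15, 0.2],
    "max_depth": [6, 8, 10, 15],
    "n_estimators": [500, 1000],
}
search = GridSearchCV(LGBMClassifier(), param_grid, cv=5, scoring="accuracy")
# search.fit(abstraction_features, labels)  # placeholders for FSNN features/labels
# print(search.best_params_)
```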
Table 3. Five-fold cross-validation results of the FSNN-LGBM on the intraspecies and interspecies data sets
Data set Acc (%) Sens(%) Spec (%) Prec (%) MCC (%) AUC
S. cerevisiae 98.70 ± 0.22 98.28 ± 0.39 99.12 ± 0.30 99.11 ± 0.26 97.41 ± 0.45 0.997
H. pylori 98.38 ± 0.48 98.26 ± 0.65 98.50 ± 0.47 98.30 ± 0.64 96.78 ± 0.96 0.992
Human-Bacillus 98.52 ± 0.37 98.51 ± 0.31 98.54 ± 0.44 98.55 ± 0.35 97.06 ± 0.73 0.995
Human-Yersinia 97.40 ± 0.36 97.73 ± 0.40 97.07 ± 0.51 97.09 ± 0.49 94.80 ± 0.92 0.993
Note: Results in the table are in the form of Mean ± Standard Deviation from the 5-folds.
Figure 3. Comparison of accuracy (%) of the hybrid classifier with baseline and DNN-based classifiers.
Comparison with existing prediction methods
For evaluation of the performance of the hybrid model, it is compared with existing prediction methods on the four data sets. Using the S. cerevisiae data set, the proposed method is compared with several classical machine learning methods (MLD + RF [4], MCD + SVM [7], LightGBM [12], PSSM+RoF [39], GE + WSRC [40]) and deep learning-based methods (DeepPPI [16], DPPI [17], EsnDNN [18], DeepInteract [19], RCNN [20], DNN-LCTD [22], CNN-FSRF [24]). The comparison results are presented in Table 4. Among these methods, PSSM+RoF [39] provides the
The independent test data sets contain only interacting pairs. Thus, only the accuracy (ACC, %) is calculated and compared with existing methods; the comparison is reported in Table 8. The proposed hybrid classifier is compared with five deep learning methods (CNN-FSRF [24], DPPI [17], DeepPPI [16], EsnDNN [18], DNN-LCTD [22]) and four classical approaches (GcForest-PPI [10], GTB-PPI [13], LightGBM [12], MLD-RF [4]) for prediction. It is observed that, relative to the existing methods, the proposed method produces better performance, demonstrating that it has a better generalization capability.
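For completeness, a compact sketch of the hybrid workflow evaluated here is given below: the trained FSNN supplies 128-dimensional abstraction features for each protein pair, LightGBM is fitted on them, and the fitted model is scored on an independent test set. The extract_abstraction_features callable and the LGBM settings are assumptions standing in for the actual trained network and the tuned parameters.

```python
# A sketch of the FSNN-LGBM pipeline: abstraction features from the trained
# FSNN are used to fit LightGBM, which is then applied to an independent set.
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

def train_hybrid(extract_abstraction_features, train_pairs, y_train):
    X_train = extract_abstraction_features(train_pairs)   # shape (n_pairs, 128)
    clf = LGBMClassifier(n_estimators=1000, learning_rate=0.2, max_depth=10)
    clf.fit(X_train, y_train)
    return clf

def evaluate_independent(clf, extract_abstraction_features, test_pairs, y_test):
    X_test = extract_abstraction_features(test_pairs)
    return accuracy_score(y_test, clf.predict(X_test))
```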
Table 4. Performance comparison of FSNN-LGBM with existing methods on S. cerevisiae data set
Method Acc (%) Sens (%) Prec (%) MCC (%) AUC(%)
MCD + SVM [7] 91.36 90.37 91.94 84.21 97.07
LightGBM [12] 95.07 92.21 97.82 90.30 98.75
DeepPPI [16] 94.43 92.06 96.65 88.97 97.00
DPPI [17] 94.55 92.24 96.68 NA NA
EsnDNN [18] 95.29 95.12 95.45 90.59 97.00
DeepInteract [19] 92.67 86.85 98.31 85.96 NA
RCNN [20] 97.09 97.17 97.00 94.17 NA
DNN-LCTD [22] 93.11 92.40 93.75 86.24 97.95
CNN-FSRF [24] 97.75 99.61 95.89 96.04 97.54
PSSM+RoF [39] 97.06 95.23 98.85 94.18 97.11
GE + WSRC [40] 96.82 93.63 100 93.83 96.88
Proposed method (FSNN-LGBM) 98.70 98.28 99.11 97.41 99.70
Table 5. Performance comparison of FSNN-LGBM with existing methods on H. pylori data set
Method Acc (%) Sens (%) Prec (%) MCC (%) AUC(%)
Table 6. Performance comparison of FSNN-LGBM with existing methods on Human-Bacillus data set
Method Acc (%) Sens (%) Prec (%) MCC (%) AUC(%)
Table 7. Performance comparison of FSNN-LGBM with existing methods on Human-Yersinia data set
Method Acc (%) Sens (%) Prec (%) MCC (%) AUC(%)
Table 8. Performance comparison of FSNN-LGBM with existing methods on the Independent test set
Data set→ C. elegans (Acc %) E. coli (Acc %) H. sapiens (Acc %) M. musculus (Acc %)
Algorithm↓
(FSNN-LGBM)
MLD-RF [4] 87.71 89.30 94.19 91.96
GcForest-PPI [10] 96.01 96.30 98.58 99.04
LightGBM [12] 90.16 92.16 94.83 94.57
GTB-PPI [13] 92.42 94.06 97.38 98.08
DeepPPI [16] 93.77 91.37 94.84 92.19
DPPI [17] 95.51 96.66 96.24 95.84
EsnDNN [18] 93.22 95.10 95.00 94.06
DNN-LCTD [22] 93.17 94.62 94.18 92.65
CNN-FSRF [24] 96.41 95.47 98.65 93.27
Table 9. Performance comparison of FSNN-LGBM with existing methods on PPI network data sets
Note: The numerator represents the number of correctly predicted interactions, and the denominator represents the total number of interactions.
Figure 6. Graphical representation of the prediction result achieved by the FSNN-LGBM on the one-core network data set. Core and satellite proteins are colored yellow and green, respectively.
Prediction of Human-SARS-CoV-2 interactions
The Human-SARS-CoV-2 interaction data are collected from the IntAct database [41], which contains human protein interactions with SARS-CoV2 and SARS-CoV, as well as some interactions with other members of the Coronaviridae community. A total of 4658 interacting pairs (UniProt ID pairs) are extracted, of which about 85% (4000 samples) are used as a positive data set and the remaining 15% (658 samples) are used as an independent test set. Due to the unavailability of an experimentally confirmed noninteracting data set (i.e. negative samples), a negative data set is prepared according to the procedure specified in [25]. The protein sequences are randomly paired, one from the host (human) and the other from the pathogen organism (SARS-CoV2), for which there is no evidence of an interaction. The FSNN-LGBM classifier is trained using 4000 positive and 4000 negative samples, utilizing the 5-fold cross-validation approach, and then evaluated using the independent test set. The accuracy of about 98% obtained in both training and testing, as listed in Table 10, demonstrates the effectiveness of the proposed method in analyzing the Human-SARS-CoV-2 protein pairs which are suspected to be interacting. It is important to note that the samples suspected to be interacting do not appear in the negative data set.
Feature visualization
The main advantage of the NN is its ability to extract discriminative features (abstraction features) from the raw features. t-SNE [42] (t-distributed stochastic neighbor embedding) is a powerful tool commonly used with NNs to visualize and compare abstraction features and raw features. In the current study, the abstraction features derived by the FSNN are used as the input to the LGBM classifier to enhance the accuracy of PPI prediction. Therefore, visualizing the distributions of the raw and abstraction features provides insight into the reason for the enhancement of accuracy. A comparison of the t-SNE plots of the original features and the abstraction features of the S. cerevisiae and Human-Yersinia data sets is given in Figures 8 and 9. From the figures, it can be seen that the original features of the positive and negative samples overlap, whereas the abstraction features are discriminative. Thus, it is inferred that the proposed FSNN architecture efficiently extracts meaningful information from the raw features pertinent to the interaction. The t-SNE plot is implemented in Python using the scikit-learn library.
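A minimal version of such a t-SNE comparison, assuming the raw and abstraction feature matrices and the binary labels are available as NumPy arrays, could look as follows.

```python
# A sketch of the t-SNE comparison of raw versus FSNN abstraction features,
# implemented with scikit-learn as noted above. All inputs are placeholders.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_comparison(raw_features, abstraction_features, labels):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = [("Raw input features", raw_features),
              ("FSNN abstraction features", abstraction_features)]
    for ax, (title, X) in zip(axes, panels):
        emb = TSNE(n_components=2, random_state=0).fit_transform(X)
        ax.scatter(emb[labels == 1, 0], emb[labels == 1, 1], s=5, label="positive")
        ax.scatter(emb[labels == 0, 0], emb[labels == 0, 1], s=5, label="negative")
        ax.set_title(title)
        ax.legend()
    plt.tight_layout()
    plt.show()
```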
Figure 7. Graphical representation of the prediction result achieved by the FSNN-LGBM on the crossover network data set. The solid lines are the true prediction, and
the dotted lines are the false prediction.
Table 10. Five-fold cross-validation and independent test results of the FSNN-LGBM on the Human-SARS-CoV-2 data set
Data set (Human-SARS-CoV-2) Acc (%) Sens (%) Spec (%) Prec (%) MCC (%) AUC
Five-fold CV 98.86 ± 0.22 98.60 ± 0.35 99.12 ± 0.25 99.13 ± 0.24 97.73 ± 0.45 0.998
Independent test Accuracy: 98.50%
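Returning to the construction of this case-study data set, the random host-pathogen pairing used to build the negative samples could be sketched as below; all identifiers are placeholders, and pairs with any reported or suspected evidence of interaction are excluded, in the spirit of the procedure cited from [25].

```python
# A sketch of building negative host-pathogen pairs by random pairing while
# excluding pairs with known or suspected evidence of interaction.
# Assumes len(human_ids) * len(virus_ids) comfortably exceeds n_pairs.
import random

def random_negative_pairs(human_ids, virus_ids, excluded_pairs, n_pairs, seed=0):
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_pairs:
        pair = (rng.choice(human_ids), rng.choice(virus_ids))
        if pair not in excluded_pairs:
            negatives.add(pair)
    return sorted(negatives)
```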
Figure 8. t-SNE plots of S. cerevisiae data set (A) raw input features, (B) abstracted features present in the penultimate layer. Blue color dots represent positive interactions,
and orange dots represent negative interactions.
Figure 9. t-SNE plots of Human-Yersinia data set (A) raw input features, (B) abstracted features present in the penultimate layer. Blue color dots represent positive
interactions, and orange dots represent negative interactions.
Conclusion
In this paper, a new hybrid approach that combines the FSNN with the LGBM is presented to predict PPIs efficiently. The fusion of sequence-information features from the conjoint triad (CT) and pseudo amino acid composition (PseAAC) descriptors is used as input to the hybrid classifier. The FSNN utilizes nonlinear transformation techniques to extract abstraction features from the raw features of the protein sequences, and these extracted features used with the LGBM classifier enhance the prediction accuracy. When assessed on several standard intraspecies and interspecies PPI data sets, the FSNN-LGBM performs substantially well compared with existing methods. In addition, on independent test sets, the hybrid classifier achieved higher accuracy than existing methods. The prediction results on network data sets indicate that it can provide new insights into signaling pathway analysis, the prediction of drug targets and the understanding of disease pathogenesis. Although the proposed method provided better results than existing methods, it comes at the expense of an increased computational burden.
Key Points
• A hybrid classifier termed FSNN-LGBM is developed by combining the functional-link Siamese neural network (FSNN) with the light gradient boosting machine (LGBM) for PPI prediction.
• The FSNN extracts high-level abstraction features from raw protein sequence features.
• The LGBM classifier uses the abstraction features for predicting PPIs.
• The FSNN-LGBM achieved improved accuracy on interspecies and intraspecies PPI data sets.
Availability of data and codes
The data sets and codes used for this study are available on request to the corresponding author.
Acknowledgements
This work has been carried out in the Signal Processing Lab, Department of Electronics and Communication Engineering of Birla Institute of Technology, Mesra, Ranchi.
Funding
Science and Engineering Research Board, Department of Science and Technology, Government of India (Grant No. ECR/2017/000345).
References
1. Petta I, Lievens S, Libert C, et al. Modulation of protein–protein interactions for the development of novel therapeutics. Mol Ther 2016;24(4):707–18.
2. Skrabanek L, Saini HK, Bader GD, et al. Computational prediction of protein–protein interactions. Mol Biotechnol 2008;38(1):1–17.
3. Sun T, Zhou B, Lai L, et al. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 2017;18(1):277.
4. You ZH, Chan KC, Hu P. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015;10(5):e0125811.
5. Guo Y, Yu L, Wen Z, et al. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res 2008;36(9):3025–30.
6. Shen J, Zhang J, Luo X, et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci 2007;104(11):4337–41.
7. You ZH, Zhu L, Zheng CH, et al. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014;15(S15):S9.
8. Yang L, Xia JF, Gui J. Prediction of protein-protein interactions from protein sequence using local descriptors. Protein Pept Lett 2010;17(9):1085–90.
9. Wong L, You ZH, Li S, et al. Detection of protein-protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In: International Conference on Intelligent Computing, 2015, pp. 713–20. Springer, Cham.
10. Yu B, Chen C, Wang X, et al. Prediction of protein-protein interactions based on elastic net and deep forest. Expert Syst Appl 2021;176:114876. doi: 10.1016/j.eswa.2021.114876.
11. You ZH, Lei YK, Zhu L, et al. Prediction of protein-protein interactions from amino acid sequences with ensemble
extreme learning machines and principal component analysis. BMC Bioinformatics 2013;14(S8):S10.
12. Chen C, Zhang Q, Ma Q, et al. LightGBM-PPI: predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom Intel Lab Syst 2019;191:54–64.
13. Yu B, Chen C, Zhou H, et al. GTB-PPI: predict protein-protein interactions based on L1-regularized logistic regression and gradient tree boosting. Genomics Proteomics Bioinformatics 2021. doi: 10.1016/j.gpb.2021.01.001.
14. Göktepe YE, Kodaz H. Prediction of protein-protein interactions using an effective sequence based combined method. Neurocomputing 2018;303:68–74.
15. Wang L, You ZH, Xia SX, et al. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Computing 2018;22(10):3373–81.
16. Du X, Sun S, Hu C, et al. DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. J Chem Inf Model 2017;57(6):1499–510.
17. Hashemifar S, Neyshabur B, Khan AA, et al. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 2018;34(17):i802–10.
18. Zhang L, Yu G, Xia D, et al. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019;324:10–9.
19. Patel S, Tripathi R, Kumari V, et al. DeepInteract: deep neural network based protein-protein interaction prediction tool. Current Bioinformatics 2017;12(6):551–7.
20. Chen M, Ju CJT, Zhou G, et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019;35(14):i305–14.
21. Wang X, Wang R, Wei Y, et al. A novel conjoint triad auto covariance (CTAC) coding method for predicting protein-protein interaction based on amino acid sequence. Math Biosci 2019;313:41–7.
22. Wang J, Zhang L, Jia L, et al. Protein-protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. Int J Mol Sci 2017;18(11):2373.
23. Yao Y, Du X, Diao Y, et al. An integration of deep learning with feature embedding for protein-protein interaction prediction. PeerJ 2019;7:e7126. http://doi.org/10.7717/peerj.7126.
24. Wang L, Wang HF, Liu SR, et al. Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest. Sci Rep 2019;9(1):1–12.
25. Kösesoy İ, Gök M, Öz C. A new sequence based encoding for prediction of host–pathogen protein interactions. Comput Biol Chem 2019;78:170–7.
26. Barman RK, Saha S, Das S. Prediction of interactions between viral and host proteins using supervised machine learning methods. PLoS One 2014;9(11):e112034. https://doi.org/10.1371/journal.pone.0112034.
27. Zhou X, Park B, Choi D, et al. A generalized approach to predicting protein-protein interactions between virus and host. BMC Genomics 2018;19(6):568.
28. Mahapatra S, Sahu SS. Boosting predictions of host-pathogen protein interactions using deep neural networks. In: 2020 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2020, pp. 1–4. doi: 10.1109/SCEECS48394.2020.150.
29. Chen H, Li F, Wang L, et al. Systematic evaluation of machine learning methods for identifying human-pathogen protein-protein interactions. Brief Bioinform 2021;22(3):bbaa068. https://doi.org/10.1093/bib/bbaa068.
30. Bromley J, Bentz JW, Bottou L, et al. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 1993;7(4):669–88.
31. Pao YH. Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989, Chapter 8, pp. 197–222.
32. Naik B, Obaidat MS, Nayak J, et al. Intelligent secure ecosystem based on metaheuristic and functional link neural network for edge of things. IEEE Transactions on Industrial Informatics 2019;16(3):1947–56.
33. Weldegebriel HT, Liu H, Haq AU, et al. A new hybrid convolutional neural network and eXtreme gradient boosting classifier for recognizing handwritten Ethiopian characters. IEEE Access 2019;8:17804–18.
34. Dong L, Du H, Mao F, et al. Very high resolution remote sensing imagery classification using a fusion of random forest and deep learning technique—subtropical area for example. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2019;13:113–28.
35. Liu B, Li CC, Yan K. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief Bioinform 2020;21(5):1733–41. https://doi.org/10.1093/bib/bbz098.
36. Wang Y, Wang D. Towards scaling up classification-based speech separation. IEEE Trans Audio Speech Lang Process 2013;21(7):1381–90.
37. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 2017, pp. 3149–57. Curran Associates Inc., Red Hook, NY, USA.
38. Zhang Y, Xie R, Wang J, et al. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Brief Bioinform 2019;20(6):2185–99.
39. Zhu HJ, You ZH, Shi WL, et al. Improved prediction of protein-protein interactions using descriptors derived from PSSM via gray level co-occurrence matrix. IEEE Access 2019;7:49456–65.
40. Huang YA, You ZH, Chen X, et al. Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinformatics 2016;17(1):184.
41. Orchard S, Ammari M, Aranda B, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014;42(D1):D358–63.
42. Maaten LVD, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research 2008;9(Nov):2579–605.