An Ensemble of Support Vector Machines For Predicting Virulent Proteins
Abstract
It is important to develop a reliable system for predicting bacterial virulent proteins, both for finding novel drugs and vaccines and for understanding virulence mechanisms in pathogens. In this work we propose a bacterial virulent protein prediction method based on an ensemble of classifiers in which the features are extracted directly from the amino acid sequence of a given protein. It is well known in the literature that features extracted from the evolutionary information of a protein perform better than features extracted from the amino acid sequence alone. Our method tries to fill the gap between the amino acid sequence based approaches and the evolutionary information based approaches. An extensive evaluation according to a blind testing protocol, in which the parameters of the system are calculated using the training set and the system is validated on three independent datasets, demonstrates the validity of the proposed method. © 2008 Elsevier Ltd. All rights reserved.
Keywords: Virulent proteins; Machine learning; Ensemble of classifiers; Support vector machines
1. Introduction

The aim of this paper is to propose a novel ensemble of classifiers for predicting virulent proteins (Hastings, Paget-McNicol, & Saul, 2004; Weiss, 2002). The virulence factors of bacteria are typically proteins that are coded by genes in the chromosomal DNA. An example of a virulent protein is shown in Fig. 1 (from Lilic, Vujanac, and Stebbins (2006)). In recent years there has been an increasing threat from drug-resistant strains of infectious agents (Morens, Folkers, & Fauci, 2004). The first pathogen genome to be sequenced, in 1995, was that of Haemophilus influenzae (Fleischmann et al., 1995). Nowadays 532 microbial genomes have been completely sequenced (Liolios, Tavernarakis, Hugenholtz, & Kyrpides, 2006) and a large number of virulent proteins have been discovered.

Several methods for predicting virulent proteins have been proposed in the literature. The first methods developed were similarity search methods such as BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) and PSI-BLAST (Altschul et al., 1997). More recently, machine learning algorithms have been proposed: Sachdeva, Kumar, Jain, and Ramachandran (2005) propose a neural network based prediction of virulence factors; Garg and Gupta (2008) propose an ensemble of support vector machines (SVMs) in which the different SVM classifiers are trained with sequence features of bacterial virulent proteins such as amino acid composition, 2-gram composition, higher order dipeptide composition and evolutionary information.

Several methods have been proposed in the literature to extract a feature vector from the primary sequence of a protein. Most of these methods were proposed for subcellular location prediction (Chou & Shen, 2007a). Garg and Gupta (2008) show that prediction methods for virulent proteins based on features extracted directly from the amino acid sequence do not perform as well as feature extraction methods based on evolutionary information.

In this paper, we deal with the virulent protein prediction problem using an ensemble of support vector machines trained on features extracted directly from the amino acid sequence. We show that the ensemble of classifiers boosts the performance of a system based on the amino acid sequence. The good performance of ensembles of classifiers with respect to stand-alone methods is well known, and several examples have been published in the bioinformatics literature: ensemble methods have been applied to protein secondary structure prediction (Riis & Krogh, 1996), protein fold pattern prediction (Shen & Chou, 2006), protein subcellular localization prediction (Chou & Shen, 2007d), membrane protein type prediction (Chou & Shen, 2007b), and signal peptide prediction (Chou & Shen, 2007c). To build the ensemble of classifiers we select a set of amino acid indices, and each amino acid index is used to train a random subspace of radial basis function support vector machines.

* Corresponding author. E-mail addresses: [email protected], [email protected] (L. Nanni). 0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.09.036
Fig. 1. The InvB molecule in Salmonella (blue and yellow) binds (purple arrow) the Salmonella invasion protein (red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
2. Proposed algorithm

The ensemble proposed here is based on a perturbation of the AAIndexLoc system proposed in Tantoso and Li (in press). AAIndexLoc extracts for each protein P a feature vector obtained by concatenating the following features:

- Amino acid (AA) composition: the fraction of each amino acid y in P.
- Weighted AA composition: for amino acid y, this is defined as (amino acid composition of y) × (value of index a for amino acid y).
- Five-level grouping composition: the amino acids are classified into five groups, based on their amino acid index values, by k-means clustering; the five-level dipeptide composition is then computed. The five-level dipeptide composition is defined as the composition of the occurrences of two consecutive groups (see Tantoso and Li (in press) for more details). Since there are five groups, there are 25 combinations of two consecutive groups.

Each protein is thus represented by 70 features: 20 input features for the AA composition; 20 input features for the weighted AA composition; 25 input features for the five-level grouping composition. Moreover, in the original work each protein sequence is divided into three parts and a feature vector is extracted from each part. In the problem studied in this paper we obtained the same performance with the features extracted from the whole protein sequence as with the features extracted from the three parts of the protein.

We propose the following modifications to the original method of Tantoso and Li (in press) for building an ensemble of classifiers:

- A different classifier is trained for each amino acid index used to calculate the weighted AA composition. In this work 494 amino acid indices and 83 substitution matrices from Kawashima and Kanehisa (2000) have been encoded.
  In order to reduce the number of amino acid indices/substitution matrices, a sequential forward floating selection (SFFS) (Pudil, Novovicova, & Kittler, 1994) feature selection approach has been adopted (as in Nanni and Lumini (2006)), where the objective function is the maximization of the area under the receiver operating characteristic (ROC) curve (Fawcett, 2004) on the training set; in this way the number of features used for classification is reduced to K (values of K ranging from 1 to 15 have been tested in the experiments). Each feature (i.e. an amino acid index or a substitution matrix) selected by SFFS is used to build a different weighted AA composition. Each weighted AA composition (concatenated with the AA composition and the five-level grouping composition) is used to train a different classifier.
- A genetic algorithm (see footnote 1) is used for clustering the amino acids into five groups. The scheme of a genetic algorithm is shown in Fig. 2. Genetic algorithms (implemented as in the GAOT MATLAB toolbox) are a class of optimization methods inspired by the process of natural evolution (Goldberg, 1989; Goldberg, 2002). The objective function of the genetic algorithm is the maximization of the area under the ROC curve on the training set. In the encoding scheme, the chromosome is a string of length 20 (the number of amino acids). Each value in the chromosome specifies the group to which a given amino acid belongs (as in Nanni and Lumini (2008)).
- A random subspace (Ho, 1998) of radial basis function support vector machines (Cristianini & Shawe-Taylor, 2000) is used in the classification step. The random subspace (Sachdeva et al., 2005) is a method for creating ensembles that selects a different subset of the features to train each classifier of the ensemble. Given a training set T containing patterns represented by Q features, the random subspace method generates M new training sets T1, ..., TM (M = 50 in this paper), each containing Q × K features (0 < K < 1; K = 0.5 in this paper). Then M classifiers are trained on these generated training sets and combined by the sum rule (Kittler, 1998). The classifier is the radial basis function support vector machine (see footnote 2) with the default parameters (C = 1, Gamma = 1).

SVMs are widely considered state-of-the-art among machine learning classifiers. The goal of an SVM is to establish the equation of a hyperplane that divides the feature space (as shown in Fig. 3), leaving all the points of the same class on the same side, while maximizing the distance between the two classes and the hyperplane.
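As an illustration, the three feature groups described at the start of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the AAIndexLoc implementation: the function names, the `index` dictionary and the `group_of` mapping are hypothetical, and the five-group assignment is assumed to be given (for example by k-means on the index values or by the genetic algorithm). The sketch produces only the three itemised blocks, i.e. 20 + 20 + 25 = 65 values.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(residues):
    """Fraction of each of the 20 amino acids in the sequence."""
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def weighted_aa_composition(residues, index):
    """Composition of each amino acid multiplied by its index value."""
    composition = aa_composition(residues)
    return [c * index[a] for c, a in zip(composition, AMINO_ACIDS)]


def group_dipeptide_composition(residues, group_of, n_groups=5):
    """Fraction of consecutive residue pairs in each (group, group) pair."""
    pairs = Counter(
        (group_of[a], group_of[b]) for a, b in zip(residues, residues[1:])
    )
    total = max(len(residues) - 1, 1)
    return [pairs[(g, h)] / total
            for g in range(n_groups) for h in range(n_groups)]


def extract_features(sequence, index, group_of):
    """Concatenate the three feature blocks for one protein sequence."""
    # Non-standard residue codes are dropped (an assumption of this sketch).
    residues = [a for a in sequence.upper() if a in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return (aa_composition(residues)
            + weighted_aa_composition(residues, index)
            + group_dipeptide_composition(residues, group_of))


# Toy usage with made-up index values and groups (illustrative only).
toy_index = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
toy_groups = {a: i % 5 for i, a in enumerate(AMINO_ACIDS)}
features = extract_features("MKTAYIAKQR", toy_index, toy_groups)
print(len(features))  # 65
```

In the ensemble, one such feature vector would be built per selected amino acid index, changing only the `index` argument (and hence the weighted block) from one classifier to the next.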
1 The initial population is a randomly generated set of chromosomes; a fixed number E of generation steps (E = 5 in this paper) is then performed by applying the following basic operators: selection, crossover and mutation. Selection: the selection strategy is cross-generational. Assuming a population of size D (D = 10 in this paper), the offspring double the size of the population, and the best D individuals from the combined parent-offspring population are retained. Crossover: uniform crossover is used, with the crossover probability fixed to 0.96 in the experiments. Mutation: the mutation probability is 0.02.
2 Implemented as in the OSU SVM Matlab Toolbox.
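The genetic algorithm of footnote 1 can be sketched as below. This is a hedged illustration rather than the GAOT code used by the authors: the function name `evolve_grouping` is hypothetical, parents are drawn uniformly at random, and the mutation probability is applied per gene, both of which are assumptions the text does not settle. In the paper the fitness is the area under the ROC curve on the training set; here `fitness` is any callable, and the toy fitness at the end is made up purely to exercise the code.

```python
import random

N_AMINO_ACIDS = 20  # chromosome length: one gene per amino acid
N_GROUPS = 5        # each gene is the group (0..4) of that amino acid


def evolve_grouping(fitness, pop_size=10, generations=5,
                    p_crossover=0.96, p_mutation=0.02, seed=0):
    """Return the fittest chromosome found (a list of 20 group labels)."""
    rng = random.Random(seed)
    population = [[rng.randrange(N_GROUPS) for _ in range(N_AMINO_ACIDS)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:  # offspring double the population
            mother, father = rng.sample(population, 2)
            if rng.random() < p_crossover:
                # Uniform crossover: each gene comes from either parent.
                child = [rng.choice(pair) for pair in zip(mother, father)]
            else:
                child = list(mother)
            # Mutation: reassign a gene to a random group (per-gene rate).
            child = [rng.randrange(N_GROUPS) if rng.random() < p_mutation
                     else gene for gene in child]
            offspring.append(child)
        # Cross-generational selection: keep the best D of parents + offspring.
        combined = population + offspring
        population = sorted(combined, key=fitness, reverse=True)[:pop_size]
    return population[0]


def toy_fitness(chromosome):
    """Made-up objective: number of amino acids assigned to group 0."""
    return sum(1 for gene in chromosome if gene == 0)


best = evolve_grouping(toy_fitness)
print(best, toy_fitness(best))
```

Because the best D individuals of the combined population are always retained, the best fitness can never decrease from one generation to the next.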
L. Nanni, A. Lumini / Expert Systems with Applications 36 (2009) 7458-7462

Table 1
Characteristics of the datasets used in the experimentation.

Dataset                  Virulent proteins   Non-virulent proteins
Training set             1025                1030
Independent dataset 1    469                 703
Independent dataset 2    40                  43
Independent dataset 3    141                 143
Fig. 3. Scheme of the SVM (from http://www.cac.science.ru.nl/people/ustun/SVM.JPG).
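The classification step of Section 2 (M = 50 classifiers, each trained on a random half of the Q features and combined by the sum rule) can be sketched as follows. To keep the example self-contained, the base learner is a simple nearest-centroid scorer, which is a stand-in and not the radial basis function SVM used in the paper; any function that takes `(X, y)` and returns a scoring function could be plugged in instead. All names are hypothetical.

```python
import random


def train_centroid_scorer(X, y):
    """Stand-in base learner (NOT an SVM): compares distances to centroids."""
    def centroid(rows):
        return [sum(column) / len(rows) for column in zip(*rows)]

    positive = centroid([x for x, label in zip(X, y) if label == 1])
    negative = centroid([x for x, label in zip(X, y) if label == 0])

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, positive))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, negative))
        return d_neg - d_pos  # larger means closer to the virulent class

    return score


def train_random_subspace(X, y, train_base, n_classifiers=50,
                          fraction=0.5, seed=0):
    """Train n_classifiers base learners, each on a random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    subset_size = max(1, round(fraction * n_features))
    ensemble = []
    for _ in range(n_classifiers):
        subset = sorted(rng.sample(range(n_features), subset_size))
        X_sub = [[x[j] for j in subset] for x in X]
        ensemble.append((subset, train_base(X_sub, y)))
    return ensemble


def sum_rule_score(ensemble, x):
    """Sum rule: add up the scores of all classifiers in the ensemble."""
    return sum(score([x[j] for j in subset]) for subset, score in ensemble)


# Toy usage: four patterns with four features each (illustrative data).
X = [[1, 1, 1, 1], [0.9, 1.1, 1, 0.8], [0, 0, 0, 0], [0.1, -0.1, 0, 0.2]]
y = [1, 1, 0, 0]
ensemble = train_random_subspace(X, y, train_centroid_scorer)
print(sum_rule_score(ensemble, [1, 1, 1, 1]) > 0)  # True
```

Passing the base learner as an argument keeps the subspace sampling and the fusion rule independent of the classifier, so an SVM trainer can replace the stand-in without touching the rest.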
Fig. 5. EUC in the independent dataset 1 varying the number K of classifiers selected by SFFS.
Bordetella (27 virulent and 27 non-virulent sequences); Haemophilus (35 virulent and 35 non-virulent); Listeria (15 virulent and 17 non-virulent).

A summary of the characteristics of these datasets is reported in Table 1. As performance indicator we have used the error under the ROC curve (EUC) (Fawcett, 2004). The EUC is a scalar performance measure which can be interpreted as the probability that the classifier will assign a lower score to a randomly picked virulent protein sample than to a randomly picked non-virulent protein sample.

In Fig. 4 we plot the performance on the training set, varying the number K of classifiers selected by SFFS; in this test a stand-alone support vector machine is used as the classifier and the amino acids are clustered by k-means. Moreover, we want to stress that SFFS is run considering only the training data; the features selected on the training data are then used to classify the test data.

In Fig. 5 we report, for the independent dataset 1, the EUC obtained by:

- Two-gram, the best method used in Garg and Gupta (2008) for extracting features from the amino acid sequence;
- SA, varying the number K of features selected by SFFS; in this test a stand-alone support vector machine is used as the classifier and the amino acids are clustered by k-means;
- RS, varying the number K of features selected by SFFS; in this test a random subspace of support vector machines is used as the classifier and the amino acids are clustered by k-means;
- GA, varying the number K of features selected by SFFS; in this test a random subspace of support vector machines is used as the classifier and the amino acids are clustered by the genetic algorithm.
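The probabilistic interpretation of the EUC given above can be turned directly into a small computation. The sketch below assumes that the EUC equals one minus the area under the ROC curve, which is consistent with that interpretation but is not stated explicitly in the text; counting ties as half an error is a further assumption (the usual rank-statistic convention). The function name is hypothetical.

```python
def error_under_roc(virulent_scores, non_virulent_scores):
    """Fraction of (virulent, non-virulent) pairs that are ranked wrongly.

    A pair is wrong when the virulent sample gets the lower score;
    ties count as half an error (an assumption of this sketch).
    """
    wrong = 0.0
    for v in virulent_scores:
        for n in non_virulent_scores:
            if v < n:
                wrong += 1.0
            elif v == n:
                wrong += 0.5
    return wrong / (len(virulent_scores) * len(non_virulent_scores))


# Toy scores: one of the six pairs (0.3 versus 0.5) is ranked wrongly.
print(error_under_roc([0.9, 0.8, 0.3], [0.5, 0.2]))
```

An EUC of 0 therefore corresponds to a perfect ranking, and 0.5 to scores that carry no information about the class.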
Fig. 4. EUC in the training set varying the number K of classifiers selected by SFFS.
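The SFFS step that selects the K amino acid indices can be sketched as follows. This is a simplified reading of sequential forward floating selection, not the authors' implementation: the function name and the toy score table are hypothetical, and `objective` stands for any function that scores a subset (in the paper, the area under the ROC curve on the training set).

```python
def sffs(candidates, objective, k):
    """Select k candidates by sequential forward floating selection."""
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and len(candidates)")
    selected = []
    best = {}  # subset size -> (score, subset) of the best subset seen

    def record(score, subset):
        size = len(subset)
        if size not in best or score > best[size][0]:
            best[size] = (score, list(subset))

    while len(selected) < k:
        # Forward step: add the candidate that gives the highest score.
        remaining = [c for c in candidates if c not in selected]
        score, item = max(
            ((objective(selected + [c]), c) for c in remaining),
            key=lambda pair: pair[0],
        )
        selected = selected + [item]
        record(score, selected)
        # Floating step: drop items while that beats the best smaller subset.
        while len(selected) > 2:
            score, item = max(
                ((objective([s for s in selected if s != r]), r)
                 for r in selected),
                key=lambda pair: pair[0],
            )
            if score <= best[len(selected) - 1][0]:
                break
            selected = [s for s in selected if s != item]
            record(score, selected)
    return best[k][1]


# Toy objective: a made-up score table over subsets of four candidates.
table = {
    frozenset("a"): 0.6, frozenset("b"): 0.5,
    frozenset("c"): 0.4, frozenset("d"): 0.1,
    frozenset("ab"): 0.7, frozenset("ac"): 0.65, frozenset("ad"): 0.3,
    frozenset("bc"): 0.9, frozenset("bd"): 0.2, frozenset("cd"): 0.3,
    frozenset("abc"): 0.8, frozenset("abd"): 0.75, frozenset("bcd"): 0.95,
}
print(sffs(list("abcd"), lambda s: table.get(frozenset(s), 0.0), 3))
```

In this toy table a purely greedy forward search would stop at the subset a, b, c, whereas the floating step backtracks to b, c and then reaches the higher-scoring subset b, c, d.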
3. Experiments

We have used the same datasets used in Garg and Gupta (2008).

- Training set: the bacterial virulent protein sequences were retrieved from SWISS-PROT (Bairoch & Apweiler, 2000) and VFDB, an integrated and comprehensive database of virulence factors of bacterial pathogens (Chen et al., 2005). The training set consists of 1025 virulent and 1030 non-virulent sequences and is freely available at the VirulentPred web server at http://bioinfo.icgeb.res.in/virulent. This dataset was used for training the SVM classifiers and for building the ensemble of classifiers.
- Independent dataset 1: the SPAAN dataset (Sachdeva et al., 2005), which consists of 469 adhesin and 703 non-adhesin proteins.
- Independent dataset 2: this dataset consists of 83 SWISS-PROT sequences (40 virulent and 43 non-virulent protein sequences); no two sequences in this dataset are more than 40% similar.
- Independent dataset 3: this dataset consists of 141 virulent and 143 non-virulent sequences from bacterial pathogens: Campylobacter (39 virulent and 40 non-virulent protein sequences); Neisseria (25 virulent and 24 non-virulent);
Notice that the performance obtained by SA with K = 1 is the performance obtained using the stand-alone AAIndexLoc (without the multi-classifier method proposed in this paper). It is clear that the ensembles proposed here outperform the stand-alone method. In Figs. 6 and 7, the EUC obtained by the baseline 2-gram and by GA is also reported for the independent dataset 2 and the independent dataset 3.

The following conclusions can be drawn from the results reported in this section:

- The best proposed system (named GA in the previous figures) performs better than 2-gram (the best system based on the amino acid sequence used in Garg and Gupta (2008)).
- Our method tries to fill the performance gap between the amino acid sequence based features and the evolutionary information based features. Garg and Gupta (2008) show that the fusion of sequence based methods with the evolutionary information based method outperforms the evolutionary information based method alone. Since the proposed method outperforms 2-gram, we hope that the fusion of our best proposed system with the evolutionary information based method will outperform the fusion of the 2-gram method with the evolutionary information based method.
4. Conclusions

In this paper, we have presented methods based on ensembles of classifiers for virulent protein prediction. An extensive evaluation on a large dataset according to a blind testing protocol has demonstrated the superiority of these ensembles with respect to the stand-alone approaches. We have demonstrated that our system, based on features extracted from the amino acid sequence, efficiently classifies sequences not used in training, including those from organisms not present in the training set. Please note that all the reported results have been obtained without any parameter optimization for the SVMs used in the ensemble; we have simply used the default parameters.

Acknowledgment

The authors would like to thank Aarti Garg and Dinesh Gupta for sharing the datasets used in this paper.

References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28, 45-48.
Chen, L., Yang, J., Yu, J., Yao, Z., Sun, L., Shen, Y., et al. (2005). VFDB: A reference database for bacterial virulence factors. Nucleic Acids Research, 33, D325-D328.
Chou, K. C., & Shen, H. B. (2007a). Review, recent progresses in protein subcellular location prediction. Analytical Biochemistry, 370, 1-16.
Chou, K. C., & Shen, H. B. (2007b). MemType-2L, a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications, 360, 339-345.
Chou, K. C., & Shen, H. B. (2007c). Signal-CF, a subsite-coupled and window-fusing approach for predicting signal peptides. Biochemical and Biophysical Research Communications, 357, 633-640.
Chou, K. C., & Shen, H. B. (2007d). Euk-mPLoc: A fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of Proteome Research, 6, 1728-1734.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Technical report. Palo Alto, USA: HP Laboratories.
Garg, A., & Gupta, D. (2008). VirulentPred: A SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics, 9, 62. doi:10.1186/1471-2105-9-62.
Goldberg, David E. (1989). Genetic algorithms in search, optimization and machine learning. Boston, MA: Kluwer Academic Publishers.
Goldberg, David E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Reading, MA: Addison-Wesley.
Hastings, I. M., Paget-McNicol, S., & Saul, A. (2004). Can mutation and selection explain virulence in human P. falciparum infections? Malaria Journal, 2, 3.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.
Kawashima, S., & Kanehisa, M. (2000). AAindex: Amino acid index database. Nucleic Acids Research, 28, 374.
Kittler, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226-239.
Lilic, M., Vujanac, M., & Stebbins, C. E. (2006). A common structural motif in the binding of virulence factors to bacterial secretion chaperones. Molecular Cell, 21, 653-664.
Liolios, K., Tavernarakis, N., Hugenholtz, P., & Kyrpides, N. C. (2006). The Genomes on line database (GOLD) v.2: A monitor of genome projects worldwide. Nucleic Acids Research, 34, D332-D334.
Morens, D. M., Folkers, G. K., & Fauci, A. S. (2004). The challenge of emerging and reemerging infectious diseases. Nature, 430, 242-249.
Nanni, L., & Lumini, A. (2006). An ensemble of K-local hyperplane for predicting protein-protein interactions. Bioinformatics, 10(22), 1207-1210.
Nanni, L., & Lumini, A. (2008). A genetic approach for building different alphabets for peptide and protein classification. BMC Bioinformatics, 9, 45.
Pudil, P., Novovicova, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119-1125.
Fig. 6. EUC in the independent dataset 2 varying the number K of classifiers selected by SFFS.
Fig. 7. EUC in the independent dataset 3 varying the number K of classifiers selected by SFFS.
Riis, S. K., & Krogh, A. (1996). Improving prediction of protein secondary structure using neural networks and multiple sequence alignments. Journal of Computational Biology, 3, 163-183.
Sachdeva, G., Kumar, K., Jain, P., & Ramachandran, S. (2005). SPAAN: A software for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics, 21, 483-491.
Shen, H. B., & Chou, K. C. (2006). Ensemble classifier for protein fold pattern recognition. Bioinformatics, 22, 1717-1722.
Tantoso, E., & Li, K.-B. (in press). AAIndexLoc: Predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices. Amino Acids. Online version, doi:10.1007/s00726-007-0616-y.
Weiss, R. A. (2002). Virulence and pathogenesis. Trends in Microbiology, 10, 314-317.