Metapocket: A Meta Approach To Improve Protein Ligand Binding Site Prediction
Bingding Huang
Abstract
The identification of ligand-binding sites is often the starting point for protein function annotation and structure-
based drug design. Many computational methods for the prediction of ligand-binding sites have been developed
in recent decades. Here we present a consensus method metaPocket, in which the predicted sites from four
methods: LIGSITEcs, PASS, Q-SiteFinder, and SURFNET are combined together to improve the prediction
success rate. All these methods are evaluated on two datasets of 48 unbound/bound structures and 210 bound
structures. The comparison results show that metaPocket improves the success rate from ~70 to 75% at the top 1
prediction. MetaPocket is available at http://metapocket.eml.org.
the key idea is that a sphere that separates two atoms and does not contain any atoms defines a
pocket (see Fig. 1b). First, a sphere is placed so that the two given atoms are on opposite sides
of the sphere's surface. If the sphere contains any other atoms, it is reduced in size until no
more atoms are contained. Only spheres with a radius of 1 to 4 Å are kept. The result of this
procedure is a number of separate groups of interpenetrating spheres, called gap regions, both
inside the protein and on its surface, which correspond to the protein's cavities and clefts.
CAST (Binkowski et al., 2003) computes a triangulation (see Fig. 1c) of the protein's surface
atoms using alpha shapes (Edelsbrunner et al., 1995). In the next step, triangles are grouped by
letting small triangles flow toward neighboring larger triangles, which act as sinks. The pocket
is then defined as a collection of empty triangles. PASS (Brady and Stouten, 2000) uses probe
spheres to fill cavities layer by layer (see Fig. 1d). First, an initial coating of the protein
with probe spheres is calculated. Each probe has a burial count that counts the number of atoms
within an 8 Å distance. Only probes with a count above a threshold are retained. This procedure
is iterated until a layer produces no more new buried probe spheres. Then each probe is assigned
a probe weight, which is proportional to the number of probe spheres in the vicinity and the
extent to which they are buried. A small number of active site points (ASP) are then selected by
identifying the central probes in regions that contain many spheres with a high burial count.
Finally, the retained active site points are ranked by the probe weight.

Besides the purely geometric methods mentioned above, there are other energetic methods. In
Q-SiteFinder (Laurie and Jackson, 2005), the protein surface is coated with a layer of methyl
(CH3) probes to calculate van der Waals interaction energies between the protein and probes.
Probes with favorable interaction energies are retained, and clusters of these probes are ranked
based on the number of probes in a cluster. The largest or energetically most favorable cluster
is then ranked first and considered as a potential ligand-binding site. Morita et al. (2008)
refined Q-SiteFinder to achieve a higher success rate by using a better probe distribution
technique and more suitable force field parameters to calculate interaction energies.

Among the above methods, some are freely available for academic users. The source codes of
LIGSITEcs, PocketPicker, and SURFNET are also freely available. For CAST, PocketFinder, and
Q-SiteFinder, a Web server is available through which users can submit a protein structure and
visualize the predicted ligand-binding sites. PASS provides executable binaries for various
operating systems. Therefore, it is of great interest to put all those available methods together
to check whether they identify the same pocket sites for the same protein. In this work, we
follow the idea of metaPPI, in which five protein-protein binding site predictors were combined
to improve the prediction success rate (Huang and Schroeder, 2008), and propose a meta method
called metaPocket that includes four protein-ligand binding site predictors: LIGSITEcs, PASS,
Q-SiteFinder, and SURFNET. In all four methods, the probes around the pocket sites on a protein
surface are identified and predicted as potential ligand-binding sites. PocketFinder,
PocketPicker, and LIGSITEcsc are discarded to avoid biasing because of their similarity to
LIGSITEcs. CAST is not taken into account in metaPocket because it identifies the protein atoms
forming a pocket rather than the probes around the pocket sites. We will describe the metaPocket
approach in detail in the following section.

Materials and Methods

MetaPocket algorithm

For each protein structure, we first use LIGSITEcs, PASS, Q-SiteFinder, and SURFNET to identify
pocket sites. For LIGSITEcs, PASS, and SURFNET, we use the executable program to search for the
pocket sites on a protein surface. Each identified pocket site is represented as a single probe
and has a ranking score. A Python script is implemented to submit the protein structures to the
Q-SiteFinder server and retrieve the predicted binding sites (probes) automatically. These
predicted pocket sites from Q-SiteFinder are represented by probes and are already clustered. For
each cluster, the mass center of the probes within it is calculated and is represented as a
pocket site; these sites are ranked by cluster size. The pocket sites identified by these four
methods have different ranking scoring functions. Therefore, it is hard to compare and evaluate
the predicted pocket sites directly. To make the ranking scores comparable, a z-score is
calculated separately for each site in different methods. Afterward, only the top three pocket
sites in each method are taken into further consideration. Therefore, we have a total of 12
pocket sites, which are clustered using a simple hierarchical clustering algorithm, according to
their spatial similarity (distance based). Probes within a certain distance threshold (8 Å used
here) are grouped together as a cluster. Then each cluster is ranked by a scoring function
metaZScore, which is the sum of the z-scores of the pocket sites in a cluster.

Test dataset

In this study, we use the same datasets as those in our previous work. One is a dataset of 48
unbound/bound structures in which both ligand-bound and unbound structures are present. The other
one is a nonredundant dataset of 210 ligand-bound only structures, which is derived from the PLD
database (Puvanendrampillai and Mitchell, 2003). For a detailed description of these two
datasets, see our previous work (Huang and Schroeder, 2006). For the first dataset, the
predictions are made for the unbound (apo) structures and checked against the bound structures.
In the case of the 210 bound proteins, the ligands are taken away when making predictions and
then put back for the evaluation. For a realistic evaluation, we should use the same criteria for
all the methods. Each pocket site identified by different methods is represented

Table 1. Success Rate (%) of the Top Three Predictions by Different Methods
for 48 Bound/Unbound Structures

                       Unbound                  Bound
Method          Top 1  Top 2  Top 3     Top 1  Top 2  Top 3
MetaPocket        75     85     90        83     94     96
LIGSITEcs         71     79     85        81     90     92
PASS              58     67     75        58     81     85
Q-SiteFinder      52     60     75        75     83     90
SURFNET           42     58     62        42     56     60
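
The combination step described under "MetaPocket algorithm" (per-method z-scores, top three sites per method, distance-based clustering, and the summed metaZScore) can be sketched roughly as follows. This is a minimal illustration and not the published implementation: the function names, the use of the population standard deviation, the assumption that a higher raw score means a better site, and the choice of single linkage are all assumptions, since the text only specifies "a simple hierarchical clustering algorithm" with an 8 Å threshold.

```python
import math
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Standardize the raw ranking scores of one method."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0] * len(raw_scores)
    return [(s - mu) / sigma for s in raw_scores]


def top_sites(method, sites, n=3):
    """Return the n best sites of one method as (method, xyz, z) tuples.

    `sites` is a list of (xyz, raw_score); higher raw scores are assumed better.
    """
    zs = z_scores([score for _, score in sites])
    ranked = sorted(zip(sites, zs), key=lambda pair: pair[1], reverse=True)
    return [(method, xyz, z) for (xyz, _), z in ranked[:n]]


def cluster_sites(sites, threshold=8.0):
    """Single-linkage grouping (an assumption): a site joins, and thereby
    merges, every cluster that has a member within `threshold` angstroms."""
    clusters = []
    for site in sites:
        near, far = [], []
        for cluster in clusters:
            close = any(math.dist(site[1], other[1]) <= threshold
                        for other in cluster)
            (near if close else far).append(cluster)
        merged = [site] + [s for cluster in near for s in cluster]
        clusters = far + [merged]
    return clusters


def meta_pocket(predictions, threshold=8.0):
    """Rank clusters of the pooled top-three sites by their summed z-scores."""
    pooled = []
    for method, sites in predictions.items():
        pooled.extend(top_sites(method, sites))
    clusters = cluster_sites(pooled, threshold)
    scored = [(sum(z for _, _, z in cluster), cluster) for cluster in clusters]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Here `predictions` is a hypothetical dictionary mapping a method name to its list of `(xyz, raw_score)` pocket sites. With the default threshold, two probes that are more than 8 Å apart are only placed in the same cluster if a third probe links them.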
FIG. 1. Illustration of different pocket identification methods, taken from Huang and Schroeder (2006). (a) POCKET,
LIGSITE, and LIGSITEcs scan the grid for protein-solvent-protein and surface-solvent-surface events, respectively. POCKET
uses three, LIGSITE and LIGSITEcs seven directions. POCKET and LIGSITE use atom coordinates, while LIGSITEcs uses the
Connolly surface. (b) SURFNET places a sphere, which must not contain any atoms, between two atoms. The spheres with
maximal volume define the largest pocket. (c) CAST triangulates the surface atoms and clusters triangles by merging small
triangles to neighboring large triangles. (d) PASS coats the protein with probe spheres, selects the probes with many atom
contacts, and then repeats coating until no new probes are kept. The pockets, or active site points, are the probes with the
largest number of atom contacts. Q-SiteFinder is similar to LIGSITE, but the ranking of the pocket sites is the sum of the van
der Waals interaction energy between the probes and protein atoms.
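
The probe selection that PASS performs in panel (d) relies on a burial count, that is, the number of protein atoms within 8 Å of a probe. A minimal sketch of that filter is given below. The function names and the `threshold` parameter are hypothetical; the text does not state the threshold value PASS uses, and the real program additionally iterates the coating layer by layer.

```python
import math


def burial_count(probe, atoms, radius=8.0):
    """Count the atoms lying within `radius` angstroms of a probe center."""
    return sum(1 for atom in atoms if math.dist(probe, atom) <= radius)


def retain_buried(probes, atoms, threshold, radius=8.0):
    """Keep only the probes whose burial count is above `threshold`."""
    return [p for p in probes if burial_count(p, atoms, radius) > threshold]
```

Probes and atoms are plain `(x, y, z)` tuples here; a practical implementation would use a spatial index rather than the quadratic scan shown.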
FIG. 2. Ligand binding site on protein 1a6u (unbound)/1a6w (bound). The ligand NIP (red) is bound to a pocket site,
which is predicted as the top one by Q-SiteFinder (green sphere) and top five by LIGSITEcs (cyan). LIGSITE, PASS,
and SURFNET fail to identify this ligand binding site within the top three predictions.
FIG. 3. The ligand (in red) binding site and identified pockets on the surface of the protein structure 1aec. The
sites predicted by LIGSITEcs (cyan sphere), PASS (yellow), Q-SiteFinder (green), and SURFNET (magenta) are all in the
top one prediction. These four top one sites are spatially similar and identify the same ligand binding site.
Table 2. Success Rate (%) of the Top Three Predictions
by Different Methods for 210 Bound Structures

Method          Top 1  Top 2  Top 3
MetaPocket        75     88     93
LIGSITEcs         70     80     86
PASS              51     71     80
Q-SiteFinder      70     85     90
SURFNET           42     52     57

Table 4. Number of Proteins with Different Cluster Sizes [Number of Pocket Site (ps)]
in MetaPocket for the Top Three Predictions in the Case of 210 Bound Structures

        First prediction  Second prediction  Third prediction
4 ps          119                 5                  1
3 ps           31                11                  2
2 ps            6                10                  7
1 ps            2                 0                  1
Table 3. Number of Proteins in Each Pocket Prediction Class for 210 Bound Structures

Class                                       MetaPocket  LIGSITEcs  PASS  Q-SiteFinder  SURFNET
C1: Binding site (bs) in the first pocket       158        146      108       152         88
C2: Bs in the second pocket                      26         22       42        26         21
C3: Bs in the third pocket                       11         12       17        10         11
C4: Bs in none of above                          15         20       43        22         90
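
The class counts in Table 3 relate to the cumulative success rates in Table 2 by simple arithmetic: the top-k rate is the number of proteins whose binding site falls in any of the first k pockets, divided by 210. The short sketch below reproduces this for the first numeric column, which is assumed here to be the metaPocket column; the variable and function names are illustrative only.

```python
TOTAL_PROTEINS = 210

# First numeric column of Table 3 (assumed to be metaPocket).
metapocket_classes = {"C1": 158, "C2": 26, "C3": 11, "C4": 15}


def cumulative_success(classes, total):
    """Return the top-1, top-2, and top-3 success rates in rounded percent."""
    hits = 0
    rates = []
    for key in ("C1", "C2", "C3"):
        hits += classes[key]
        rates.append(round(100 * hits / total))
    return rates


print(cumulative_success(metapocket_classes, TOTAL_PROTEINS))  # [75, 88, 93]
```

The result matches the MetaPocket row of Table 2 (75, 88, and 93%).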
FIG. 4. Distribution of the number of proteins in different number of clusters after clustering the 12 pocket sites in the case
of 210 bound structures. There are two cases (1ai5 and 2yhx) where the 12 pocket sites are clustered into 10 clusters, 78 cases
for 6 clusters, and only 1 case (1ppi) for 3 clusters, that is, all the four methods identify the same three pocket sites in their top
three predictions.
top three pocket sites from the four methods overlap somehow, that is, they form five to eight
clusters from the 12 sites (probes). There are two cases where the 12 probes are clustered into
10 clusters. One case is the protein structure 2yhx (Fig. 5), in which LIGSITEcs, PASS, and
SURFNET identify the same pocket site in their top two predictions, and thus the three top two
probes are clustered into the same cluster. However, the rest of the probes occupy different
pocket sites. In this case, LIGSITEcs and SURFNET detect the ligand-binding site correctly at
their top one prediction. However, the distance between the top one probes from LIGSITEcs and
SURFNET is 8.4 Å, and the distance threshold we use for hierarchical clustering is 8 Å. Thus, if
we increase the clustering distance threshold, metaPocket will identify this ligand-binding site
at its top one prediction. We try different distance thresholds (5 to 10 Å) in the hierarchical
clustering, and 8 Å returns the best performance for metaPocket (data not shown). Furthermore,
there is only one case (1ppi) where metaPocket has only three sites after clustering, that is,
all four methods identify the same three pocket sites in their top three predictions. As shown
in Figure 6, all four methods pick up the ligand-binding site at their top one, and thus
metaPocket does so at its top one as well. In the dataset of 210 bound proteins, there are 80
cases that have more than one ligand-binding site, for all of which metaPocket can pick up at
least one binding site correctly in the top three predictions. In 23 of these cases, metaPocket
can identify more than one ligand-binding site in the top three predictions.

Conclusions

In recent decades, many computational efforts have been made to predict protein functional sites
based on protein structures. These efforts include methods for prediction of protein-protein
interaction sites and protein-ligand binding sites. A number of tools are available to identify
pockets on protein surfaces and predict ligand-binding sites from the pockets. In this article,
we propose a method called metaPocket, which combines the predictions made by LIGSITEcs, PASS,
Q-SiteFinder, and SURFNET. We compare metaPocket to the individual methods on a dataset of 48
unbound/bound and 210 bound-only protein-ligand complexes using the same evaluation criteria. The
comparison results show that metaPocket performs slightly better than the other approaches and
correctly predicts the ligand-binding site in 75% of the cases at top one, and 93% at top three
predictions.

MetaPocket is online at http://metapocket.eml.org. Users can submit PDB files or enter a PDB ID
and specify the chain ID. It returns the pocket sites identified by different methods in a
standard PDB file format as well as a Python script for visualizing the pockets using PyMOL
(Delano, 2002).

Acknowledgments

We thank the authors of PASS, Q-SiteFinder, and SURFNET for making their tools publicly
available. We thank Rebecca Wade for discussions and Outi Salo-Ahen for reading the manuscript.

Author Disclosure Statement

The authors declare that no conflicting financial interests exist.

References

Binkowski, T., Naghibzadeh, S., and Liang, J. (2003). CASTp: computed atlas of surface topography of proteins. Nucleic Acids Res 31, 3352-3355.
Brady, G., and Stouten, P. (2000). Fast prediction and visualization of protein binding pockets with PASS. J Comput Aided Mol Des 14, 383-401.
Connolly, M. (1983). Analytical molecular surface calculation. J Appl Crystallogr 16, 548-558.
Delano, W. (2002). The PyMOL Molecular Graphics System.
Edelsbrunner, H., Facello, M., Fu, P., and Liang, J. (1995). Measuring proteins and voids in proteins. Proc 28th Annu Hawaii Int Conf Syst Sci 5, 256-264.
Glaser, F., Morris, R., Najmanovich, R., Laskowski, R., and Thornton, J. (2006). A method for localizing ligand binding pockets in protein structures. Proteins 62, 479-488.
Hendlich, M., Rippmann, F., and Barnickel, G. (1997). LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 15, 359-363.
Huang, B., and Schroeder, M. (2006). LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol 6, 19.
Huang, B., and Schroeder, M. (2008). Using protein binding site prediction to improve protein docking. Gene 422, 14-21.
Laskowski, R. (1995). SURFNET: a program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 13, 323-330.
Laurie, A., and Jackson, R. (2005). Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21, 1908-1916.
Levitt, D., and Banaszak, L. (1992). POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10, 229-234.
Liang, J., Edelsbrunner, H., and Woodward, C. (1998). Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7, 1884-1897.
Morita, M., Nakamura, S., and Shimizu, K. (2008). Highly accurate method for ligand-binding site prediction in unbound state (apo) protein structures. Proteins 73, 468-479.
Puvanendrampillai, D., and Mitchell, J. (2003). Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics 19, 1856-1857.
Weisel, M., Proschak, E., and Schneider, G. (2007). PocketPicker: analysis of ligand binding-sites with shape descriptors. Chem Cent J 1, 7.

Address correspondence to:
Dr. Bingding Huang
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118, Heidelberg, Germany
E-mail: [email protected]