Metapocket: A Meta Approach To Improve Protein Ligand Binding Site Prediction

The document describes MetaPocket, a new consensus method for predicting protein ligand binding sites. MetaPocket combines the top predicted binding sites from four existing methods - LIGSITEcs, PASS, Q-SiteFinder, and SURFNET. It then clusters the predicted sites and ranks the clusters based on a metaZScore, which sums the z-scores of the sites within each cluster. The document suggests that MetaPocket improves the success rate of top-1 binding site prediction over the individual methods, from around 70% to 75%. It provides details on the MetaPocket algorithm and evaluation using two protein structure datasets.


OMICS A Journal of Integrative Biology

Volume 13, Number 4, 2009


Research Article
Mary Ann Liebert, Inc.
DOI: 10.1089/omi.2009.0045

MetaPocket: A Meta Approach to Improve Protein Ligand Binding Site Prediction

Bingding Huang

Abstract

The identification of ligand-binding sites is often the starting point for protein function annotation and structure-based drug design. Many computational methods for the prediction of ligand-binding sites have been developed in recent decades. Here we present a consensus method, metaPocket, in which the predicted sites from four methods (LIGSITEcs, PASS, Q-SiteFinder, and SURFNET) are combined to improve the prediction success rate. All these methods are evaluated on two datasets of 48 unbound/bound structures and 210 bound structures. The comparison results show that metaPocket improves the success rate from about 70% to 75% at the top 1 prediction. MetaPocket is available at http://metapocket.eml.org.

EML Research gGmbH, Schloss-Wolfsbrunnenweg 33, 69118, Heidelberg, Germany.

Introduction

In most cellular processes, proteins interact with other molecules to perform their biological functions. Therefore, knowledge about these interaction sites helps us to understand protein functions. Knowing the location of the functional sites (e.g., substrate or ligand-binding sites of enzymes or receptor proteins) on protein surfaces makes it possible to design inhibitors or antagonists and to introduce targeted mutations aimed at improving the protein function. The protein surface can form pockets that are binding sites of small molecule ligands. Therefore, the identification of pocket sites on the protein surface is often the starting point for protein function annotation and structure-based drug design. Also, proper ligand-binding site detection is a prerequisite for protein-ligand docking. In recent decades, many computational methods have been developed to predict protein-ligand binding sites based on detection of cavities on protein surface. These methods include POCKET (Levitt and Banaszak, 1992), LIGSITE (Hendlich et al., 1997), LIGSITEcs (Huang and Schroeder, 2006), SURFNET (Laskowski, 1995), CAST (Liang et al., 1998), PASS (Brady and Stouten, 2000), and PocketPicker (Weisel et al., 2007), all of which use pure geometric characteristics and do not require any knowledge of the ligands.

One of the first geometric methods, POCKET (Levitt and Banaszak, 1992), introduced the idea of protein-solvent-protein events as the key concept for the identification (see Fig. 1a). The protein is mapped onto a 3D grid. A grid point is part of the protein if it is within 3 Å of an atom coordinate; otherwise, it is solvent. Next, the x-, y-, and z-axes are scanned for pockets that are characterized as a sequence of grid points, starting and ending with the label protein and having a period of solvent grid points in between. These sequences are called protein-solvent-protein events. Only grid points that exceed a threshold of protein-solvent-protein events are retained for the final pocket prediction. Since the definition of a pocket in POCKET is dependent on the angle of rotation of the protein relative to the axes, LIGSITE extends POCKET by scanning along the four cubic diagonals in addition to the x, y, and z directions. PocketFinder is another implementation of LIGSITE (Laurie and Jackson, 2005). In our previous work (Huang and Schroeder, 2006), we made two extensions to LIGSITE. The first extension is LIGSITEcs, in which we capture more accurate surface-solvent-surface events using the protein's Connolly surface (Connolly, 1983), instead of capturing protein-solvent-protein events. The second extension is LIGSITEcsc (LIGSITEcs Conservation), in which we rerank the pockets identified by the surface-solvent-surface events by the degree of conservation of the involved surface residues. PocketPicker (Weisel et al., 2007) is another extension of LIGSITE using a finer scanning approach to calculate the buriedness-index of grid probes. The buriedness-index is calculated by scanning the protein surroundings along 30 search rays having length of 10 Å and width of 0.9 Å. Then the clustering of grid probes for pocket identification is restricted to those probes with buriedness-indices ranging from 16 to 26. However, the performance of PocketPicker is not much better than that of LIGSITEcs, although it scans more directions (Weisel et al., 2007).

The other geometric approaches for pocket detection are SURFNET, CAST, and PASS. In SURFNET (Laskowski, 1995),

the key idea is that a sphere that separates two atoms and does not contain any atoms defines a pocket (see Fig. 1b). First, a sphere is placed so that the two given atoms are on opposite sides of the sphere's surface. If the sphere contains any other atoms, it is reduced in size until no more atoms are contained. Only spheres with a radius of 1 to 4 Å are kept. The result of this procedure is a number of separate groups of interpenetrating spheres, called gap regions, both inside the protein and on its surface, which correspond to the protein's cavities and clefts. CAST (Binkowski et al., 2003) computes a triangulation (see Fig. 1c) of the protein's surface atoms using alpha shapes (Edelsbrunner et al., 1995). In the next step, triangles are grouped by letting small triangles flow toward neighboring larger triangles, which act as sinks. The pocket is then defined as a collection of empty triangles. PASS (Brady and Stouten, 2000) uses probe spheres to fill cavities layer by layer (see Fig. 1d). First, an initial coating of the protein with probe spheres is calculated. Each probe has a burial count that counts the number of atoms within an 8 Å distance. Only probes with a count above a threshold are retained. This procedure is iterated until a layer produces no more new buried probe spheres. Then each probe is assigned a probe weight, which is proportional to the number of probe spheres in the vicinity and the extent to which they are buried. A small number of active site points (ASP) are then selected by identifying the central probes in regions that contain many spheres with a high burial count. Finally, the retained active site points are ranked by the probe weight.

Besides the purely geometric methods mentioned above, there are other energetic methods. In Q-SiteFinder (Laurie and Jackson, 2005), the protein surface is coated with a layer of methyl (CH3) probes to calculate van der Waals interaction energies between the protein and probes. Probes with favorable interaction energies are retained, and clusters of these probes are ranked based on the number of probes in a cluster. The largest or energetically most favorable cluster is then ranked first and considered as a potential ligand-binding site. Morita et al. (2008) refined Q-SiteFinder to achieve a higher success rate by using a better probe distribution technique and more suitable force field parameters to calculate interaction energies.

Among the above methods, some are freely available for academic users. The source codes of LIGSITEcs, PocketPicker, and SURFNET are also freely available. For CAST, PocketFinder, and Q-SiteFinder, a Web server is available through which users can submit a protein structure and visualize the predicted ligand binding sites. PASS provides executable binaries for various operating systems. Therefore, it is of great interest to put all those available methods together to check whether they identify the same pocket sites for the same protein. In this work, we follow the idea of metaPPI, in which five protein-protein binding site predictors were combined together to improve the prediction success rate (Huang and Schroeder, 2008), and propose a meta method called metaPocket that includes four protein-ligand binding site predictors: LIGSITEcs, PASS, Q-SiteFinder, and SURFNET. In all four methods, the probes around the pocket sites on a protein surface are identified and predicted as potential ligand-binding sites. PocketFinder, PocketPicker, and LIGSITEcsc are discarded to avoid biasing because of their similarity to LIGSITEcs. CAST is not taken into account in metaPocket because it identifies the protein atoms forming a pocket rather than the probes around the pocket sites. We will describe the metaPocket approach in detail in the following section.

Materials and Methods

MetaPocket algorithm

For each protein structure, we first use LIGSITEcs, PASS, Q-SiteFinder, and SURFNET to identify pocket sites. For LIGSITEcs, PASS, and SURFNET, we use the executable program to search for the pocket sites on a protein surface. Each identified pocket site is represented as a single probe and has a ranking score. A python script is implemented to submit the protein structures to the Q-SiteFinder server and retrieve the predicted binding sites (probes) automatically. These predicted pocket sites from Q-SiteFinder are represented by probes and are already clustered. For each cluster, the mass center of the probes within it is calculated and is represented as a pocket site ranked by their size. The pocket sites identified by these four methods have different ranking scoring functions. Therefore, it is hard to compare and evaluate the predicted pocket sites directly. To make the ranking scores comparable, a z-score is calculated separately for each site in different methods. Afterward, only the top three pocket sites in each method are taken into further consideration. Therefore, we have a total of 12 pocket sites, which are clustered using a simple hierarchical clustering algorithm, according to their spatial similarity (distance based). Probes within a certain distance threshold (8 Å used here) are grouped together as a cluster. Then each cluster is ranked by a scoring function metaZScore, which is the sum of the z-scores of the pocket sites in a cluster.

Table 1. Success Rate (%) of the Top Three Predictions by Different Methods for 48 Bound/Unbound Structures

                     Unbound                 Bound
Method          Top 1  Top 2  Top 3    Top 1  Top 2  Top 3
MetaPocket        75     85     90       83     94     96
LIGSITEcs         71     79     85       81     90     92
PASS              58     67     75       58     81     85
Q-SiteFinder      52     60     75       75     83     90
SURFNET           42     58     62       42     56     60

Test dataset

In this study, we use the same datasets as those in our previous work. One is a dataset of 48 unbound/bound structures in which both ligand-bound and unbound structures are present. The other one is a nonredundant dataset of 210 ligand-bound only structures, which is derived from the PLD database (Puvanendrampillai and Mitchell, 2003). For a detailed description of these two datasets, see our previous work (Huang and Schroeder, 2006). For the first dataset, the predictions are made for the unbound (apo) structures and checked against the bound structures. In the case of the 210 bound proteins, the ligands are taken away when making predictions and then put back for the evaluation. For a realistic evaluation, we should use the same criteria for all the methods. Each pocket site identified by different methods is represented
FIG. 1. Illustration of different pocket identification methods, taken from Huang and Schroeder (2006). (a) POCKET, LIGSITE, and LIGSITEcs scan the grid for protein-solvent-protein and surface-solvent-surface events, respectively. POCKET uses three, LIGSITE and LIGSITEcs seven directions. POCKET and LIGSITE use atom coordinates, while LIGSITEcs uses the Connolly surface. (b) SURFNET places a sphere, which must not contain any atoms, between two atoms. The spheres with maximal volume define the largest pocket. (c) CAST triangulates the surface atoms and clusters triangles by merging small triangles to neighboring large triangles. (d) PASS coats the protein with probe spheres, selects the probes with many atom contacts, and then repeats coating until no new probes are kept. The pockets, or active site points, are the probes with the largest number of atom contacts. Q-SiteFinder is similar to LIGSITE, but the ranking of the pocket sites is the sum of the van der Waals interaction energy between the probes and protein atoms.
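
The protein-solvent-protein scan of panel (a) can be illustrated with a short Python sketch. It is not the POCKET or LIGSITE code: the function names (`label_grid`, `psp_event_counts`, `pocket_mask`), the 1 Å grid spacing, the grid origin, and the `min_events` cutoff are assumptions made for illustration, and only the three axis directions are scanned (the four cubic diagonals used by LIGSITE are left out). Only the 3 Å labeling distance is taken from the text.

```python
import numpy as np


def label_grid(atoms, shape, spacing=1.0, radius=3.0):
    """Boolean grid: True where a grid point is within `radius` of any atom
    (protein), False otherwise (solvent).

    Grid point (i, j, k) sits at (i, j, k) * spacing; origin and spacing are
    illustrative assumptions.
    """
    atoms = np.asarray(atoms, dtype=float)
    points = np.indices(shape).reshape(3, -1).T * spacing
    dists = np.linalg.norm(points[:, None, :] - atoms[None, :, :], axis=2)
    return (dists.min(axis=1) <= radius).reshape(shape)


def psp_event_counts(is_protein):
    """For each solvent grid point, count the axis directions (x, y, z) along
    which it has protein on both sides, i.e. its protein-solvent-protein
    events. Protein grid points always get a count of zero."""
    counts = np.zeros(is_protein.shape, dtype=int)
    for axis in range(3):
        # True once a protein point has been seen at or before this index.
        before = np.logical_or.accumulate(is_protein, axis=axis)
        # The same scan started from the opposite end of each grid line.
        after = np.flip(
            np.logical_or.accumulate(np.flip(is_protein, axis=axis), axis=axis),
            axis=axis,
        )
        counts += (~is_protein) & before & after
    return counts


def pocket_mask(is_protein, min_events=2):
    """Keep solvent points with at least `min_events` events (hypothetical
    cutoff; the text only says that a threshold is applied)."""
    return psp_event_counts(is_protein) >= min_events


# Toy example: a protein shell around a closed 3 x 3 x 3 cavity.
shell = np.ones((5, 5, 5), dtype=bool)
shell[1:4, 1:4, 1:4] = False
cavity_counts = psp_event_counts(shell)
```

In the toy shell every cavity point is enclosed along all three axes. Opening the shell on one face removes the event along that axis for the grid points on the opened line, which is why the number of scan directions matters for buried versus exposed pockets.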

FIG. 2. Ligand binding site on protein 1a6u (unbound)/1a6w (bound). The ligand NIP (red) is bound to a pocket site, which is predicted as the top one by Q-SiteFinder (green sphere) and top five by LIGSITEcs (cyan). LIGSITE, PASS, and SURFNET fail to identify this ligand binding site within the top three predictions.

FIG. 3. The ligand (in red) binding site and identified pockets on the surface of the protein structure 1aec. The sites predicted by LIGSITEcs (cyan sphere), PASS (yellow), Q-SiteFinder (green), and SURFNET (magenta) are all in the top one prediction. These four top one sites are spatially similar and identify the same ligand binding site.


Table 2. Success Rate (%) of the Top Three Predictions by Different Methods for 210 Bound Structures

Method          Top 1  Top 2  Top 3
MetaPocket        75     88     93
LIGSITEcs         70     80     86
PASS              51     71     80
Q-SiteFinder      70     85     90
SURFNET           42     52     57

Table 4. Number of Proteins with Different Cluster Sizes [Number of Pocket Site (ps)] in MetaPocket for the Top Three Predictions in the Case of 210 Bound Structures

        First prediction  Second prediction  Third prediction
4 ps          119                 5                  1
3 ps           31                11                  2
2 ps            6                10                  7
1 ps            2                 0                  1

as a single probe in the center of the pocket. One way to decide whether the identified pocket site is the real ligand-binding site is to check whether it is within 4 Å of any atom of the ligand. If there are multiple ligands bound to the proteins, the best hit is picked up. This is how we evaluated the prediction method in our previous work (Huang and Schroeder, 2006), and here we simply adapt it for this work.

Results

Table 1 shows the success rates using these five methods on the 48 bound/unbound structures. For unbound structures, metaPocket achieves the best overall success rate for all the top three predictions. Among the four single methods, LIGSITEcs is the best, and can identify ligand-binding sites at 71 and 85% accuracy for the top one and top three pocket sites, respectively. By taking all the top three sites from these single methods into account, metaPocket improves the success rate from 71 to 75% and from 85 to 90% (43 cases) for the top one and top three predictions, respectively. Among the five cases where metaPocket fails, there are three cases (5cpa, 3app, and 6ins) where none of the four single methods can identify the real ligand binding sites correctly within the top three predictions. In the remaining two cases (1a6u and 2tga), only Q-SiteFinder predicts correctly the binding site for 1a6u at top 1. LIGSITEcs can also identify the same pocket site but the ranking is top five (see Fig. 2). The reason is that the bound ligand NIP is rather small and does not bind to the largest pocket site. Q-SiteFinder succeeds in this case because it uses the probe energy as ranking schema rather than the size of the pocket. In 2tga, the loops near the binding site stretch significantly to allow ligand binding. LIGSITEcs predicts the site at top three, but the other three methods fail. However, this ligand binding site is the biggest pocket on the bound structure 1mtw.

Table 2 shows the success rates of the five methods for the dataset of 210 bound structures. Overall, metaPocket achieves a slightly better success rate: a three percentage point improvement for the top two and top three predictions, and a five percentage point improvement for the top one prediction. In this larger dataset, LIGSITEcs and Q-SiteFinder both get a 70% success rate for the top one prediction. Our method metaPocket improves it to 75%. The success rate is comparable to that of the LIGSITEcsc method, in which the top three predictions were reranked using the degree of conservation score of the residues around the pocket site (Huang and Schroeder, 2006). One can see that the success rates presented in this work are slightly different from those in our previous work (Huang and Schroeder, 2006). The reason for this small difference is the different parameters used here. The predicted pocket sites are classified into four classes: the actual ligand binding site, the second and third pocket, or none of these. Table 3 shows the number of proteins in these four classes for all the methods. In the top three pocket sites identified by metaPocket, there are 158 cases (75%) where the ligand binds into the first pocket site, and 26 and 11 cases where the ligand binds to the second or third pocket site, respectively.

Discussion

As described above, the top three pocket sites identified by the four single methods are retained to be further clustered by metaPocket. Thus, we get a total of 12 pocket sites of which each is represented as a single probe. During the clustering procedure, only different pocket sites from different methods can be clustered into the same cluster. Therefore, each cluster contains one to four pocket sites (ps). Table 4 shows the number of proteins with different cluster sizes (1-4 ps) within the top three predictions of the metaPocket approach for 210 bound structures. As shown in the table, among the 158 cases where the ligand binds to the first pocket site, there are 119 cases (75%) where all the four single methods detect this ligand binding site correctly within their top three predictions, and 31 cases (20%) where three single methods predict the site correctly. Figure 3 shows one case (protein structure 1ace) where all the top one pocket sites predicted by these four methods are spatially similar and identify the same ligand binding site. Figure 4 shows the number of proteins with different number of clusters (number of pocket sites in metaPocket) for 210 bound structures after clustering the 12 pocket sites from the four different methods. In most of the cases, the

Table 3. Number of Proteins in Each Pocket Prediction Class for 210 Bound Structures

Class                                      metaPocket  LIGSITEcs  PASS  Q-SiteFinder  SURFNET
C1: Binding site (bs) in the first pocket     158         146      108      152          88
C2: Bs in the second pocket                    26          22       42       26          21
C3: Bs in the third pocket                     11          12       17       10          11
C4: Bs in none of above                        15          20       43       22          90
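
The class counts in Table 3 and the percentages in Table 2 are linked by simple arithmetic: the top-n success rate is the cumulative count of classes C1 to Cn divided by the 210 structures. The following Python sketch works this out; the names `CLASS_COUNTS` and `cumulative_success` are illustrative, and rounding to whole percentages is an assumption about how the table values were reported.

```python
N_STRUCTURES = 210

# (C1, C2, C3) counts per method, copied from Table 3.
CLASS_COUNTS = {
    "metaPocket": (158, 26, 11),
    "LIGSITEcs": (146, 22, 12),
    "PASS": (108, 42, 17),
    "Q-SiteFinder": (152, 26, 10),
    "SURFNET": (88, 21, 11),
}


def cumulative_success(counts, n=N_STRUCTURES):
    """Return the top-1, top-2, top-3 success rates in whole percent."""
    rates = []
    running = 0
    for count in counts:
        running += count  # a hit in class Ck also counts for every top-n with n >= k
        rates.append(round(100 * running / n))
    return rates


rates = {method: cumulative_success(c) for method, c in CLASS_COUNTS.items()}
```

For metaPocket this gives 158/210, 184/210, and 195/210, that is 75, 88, and 93%, as listed in Table 2. Under the rounding assumption, the same calculation reproduces the Table 2 entries of the other methods except the Q-SiteFinder top one value, where 152/210 rounds to 72% rather than the 70% listed.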

FIG. 4. Distribution of the number of proteins in different number of clusters after clustering the 12 pocket sites in the case
of 210 bound structures. There are two cases (1ai5 and 2yhx) where the 12 pocket sites are clustered into 10 clusters, 78 cases
for 6 clusters, and only 1 case (1ppi) for 3 clusters, that is, all the four methods identify the same three pocket sites in their top
three predictions.
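
The clustering summarized in Fig. 4, together with the z-score normalization and metaZScore ranking from Materials and Methods, can be sketched in Python as follows. This is an illustration under stated assumptions, not the metaPocket source: the names (`Site`, `top_sites_with_z`, `cluster_sites`, `rank_clusters`) are hypothetical, raw scores are assumed to be "higher is better" for every method, the z-score uses the population standard deviation over all sites of a method, and the "simple hierarchical clustering" is taken to be single-linkage agglomeration that never merges two sites of the same method. Only the top-three selection, the 8 Å threshold, and the sum of z-scores come from the text.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist
from statistics import mean, pstdev


@dataclass(frozen=True)
class Site:
    method: str   # e.g. "LIGSITEcs"
    xyz: tuple    # probe coordinates in angstroms
    score: float  # raw ranking score, higher = better (assumption)


def top_sites_with_z(sites, top_n=3):
    """z-score each site within its own method, then keep the best top_n."""
    by_method = {}
    for site in sites:
        by_method.setdefault(site.method, []).append(site)
    kept = []
    for group in by_method.values():
        scores = [s.score for s in group]
        mu = mean(scores)
        sigma = pstdev(scores) or 1.0  # guard against identical scores
        best = sorted(group, key=lambda s: s.score, reverse=True)[:top_n]
        kept += [(s, (s.score - mu) / sigma) for s in best]
    return kept


def cluster_sites(scored, threshold=8.0):
    """Repeatedly merge the two closest clusters while their nearest probes
    are within `threshold` and they share no method."""
    clusters = [[item] for item in scored]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            methods_a = {s.method for s, _ in clusters[a]}
            methods_b = {s.method for s, _ in clusters[b]}
            if methods_a & methods_b:
                continue
            d = min(dist(s.xyz, t.xyz)
                    for s, _ in clusters[a] for t, _ in clusters[b])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid


def rank_clusters(clusters):
    """Sort clusters by metaZScore, the sum of their members' z-scores."""
    scored = [(sum(z for _, z in c), c) for c in clusters]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A call such as `rank_clusters(cluster_sites(top_sites_with_z(all_sites)))` returns the clusters with the highest metaZScore first. With four methods the 12 retained probes collapse to between 3 and 12 clusters, which is the quantity plotted in Fig. 4. Two probes from different methods that are 8.4 Å apart stay separate at `threshold=8.0` and merge at `threshold=9.0`, so the result is sensitive to the chosen threshold.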

top three pocket sites from the four methods overlap somehow, that is, they form five to eight clusters from the 12 sites (probes). There are two cases where the 12 probes are clustered into 10 clusters. One case is the protein structure 2yhx (Fig. 5), in which LIGSITEcs, PASS, and SURFNET identify the same pocket site in their top two prediction, and thus the three top two probes are clustered into the same cluster. However, the rest of the probes occupy different pocket sites. In this case, LIGSITEcs and SURFNET detect the ligand-binding site correctly at their top one prediction. However, the distance between the top one probes from LIGSITEcs and SURFNET is 8.4 Å and the distance threshold we use for hierarchical clustering is 8 Å. Thus, if we increase the clustering distance threshold, metaPocket will identify this ligand-binding site at its top one prediction. We tried different distance thresholds (5 to 10 Å) in the hierarchical clustering, and 8 Å returns the best performance for metaPocket (data not shown). Furthermore, there is only one case (1ppi) where metaPocket has only three sites after clustering, that is, all the four methods identify the same three pocket sites in their top three predictions. As shown in Figure 6, all the four methods pick up the ligand-binding site at their top one, and thus, metaPocket at its top one as well. In the dataset of 210 bound proteins, there are 80 cases that have more than one ligand binding site, for all of which

FIG. 5. The identified 12 pocket sites for protein 2yhx. These 12 pocket sites form 10 clusters. The three top two probes from LIGSITEcs (cyan), PASS (yellow), and SURFNET (magenta) are in the same pocket site, as shown in the top left region. The rest of the nine probes occupy different pocket sites. In this case, LIGSITEcs and SURFNET detect the ligand binding site correctly at their top one prediction.

FIG. 6. The 12 identified pocket sites for protein 1ppi. These 12 probes occupy three pocket sites. All the four methods detect the ligand (red) binding site correctly at their top one prediction.
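
Whether a probe detects the ligand binding site "correctly", as in Figs. 5 and 6, is decided by the distance criterion described under Test dataset: the probe must lie within 4 Å of any ligand atom. A minimal Python sketch of that check and of the resulting top-n success rate is given below. The function names are hypothetical, and treating "the best hit is picked up" as "the best-ranked probe that hits any of the bound ligands" is an interpretation, not something the text spells out.

```python
from math import dist


def is_hit(probe, ligands, cutoff=4.0):
    """True if the probe is within `cutoff` angstroms of any atom of any ligand.

    `probe` is an (x, y, z) tuple; `ligands` is a list of ligands, each a
    list of (x, y, z) atom coordinates.
    """
    return any(dist(probe, atom) <= cutoff
               for ligand in ligands for atom in ligand)


def first_hit_rank(ranked_probes, ligands, cutoff=4.0):
    """1-based rank of the first probe that hits a ligand, or None."""
    for rank, probe in enumerate(ranked_probes, start=1):
        if is_hit(probe, ligands, cutoff):
            return rank
    return None


def success_rate(first_ranks, top_n):
    """Percentage of proteins whose first hit is ranked within top_n."""
    hits = sum(1 for r in first_ranks if r is not None and r <= top_n)
    return 100.0 * hits / len(first_ranks)
```

Collecting `first_hit_rank` over all proteins of a dataset and calling `success_rate` with `top_n` set to 1, 2, and 3 yields one row of a table such as Table 1 or Table 2.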

metaPocket can pick up at least one binding site correctly in the top three predictions. There are also 23 cases in which metaPocket can identify more than one ligand binding site in the top three predictions.

Conclusions

In recent decades, many computational efforts have been made to predict protein functional sites based on protein structures. These efforts include methods for prediction of protein-protein interaction sites and protein-ligand binding sites. A number of tools are available to identify pockets on protein surfaces and predict ligand-binding sites from the pockets. In this article, we propose a method called metaPocket, which combines the predictions made by LIGSITEcs, PASS, Q-SiteFinder, and SURFNET. We compare metaPocket to the individual methods on a dataset of 48 unbound/bound and 210 bound-only protein-ligand complexes using the same evaluation criteria. The comparison results show that metaPocket performs slightly better than the other approaches and correctly predicts the ligand-binding site in 75% of the cases at top one, and 93% at top three predictions.

MetaPocket is online at http://metapocket.eml.org. Users can submit PDB files or enter a PDB ID and specify the chain ID. It returns the pocket sites identified by different methods in a standard PDB file format as well as a python script for visualizing the pockets using PyMol (Delano, 2002).

Acknowledgments

We thank the authors of PASS, Q-SiteFinder, and SURFNET for making their tools publicly available. We thank Rebecca Wade for discussions and Outi Salo-Ahen for reading the manuscript.

Author Disclosure Statement

The authors declare that no conflicting financial interests exist.

References

Binkowski, T., Naghibzadeh, S., and Liang, J. (2003). CASTp: computed atlas of surface topography of proteins. Nucleic Acids Res 31, 3352-3355.
Brady, G., and Stouten, P. (2000). Fast prediction and visualization of protein binding pockets with PASS. J Comput Aided Mol Des 14, 383-401.
Connolly, M. (1983). Analytical molecular surface calculation. J Appl Crystallogr 16, 548-558.
Delano, W. (2002). The PyMOL Molecular Graphics System.
Edelsbrunner, H., Facello, M., Fu, P., and Liang, J. (1995). Measuring proteins and voids in proteins. Proc 28th Annu Hawaii Int Conf Syst Sci 5, 256-264.
Glaser, F., Morris, R., Najmanovich, R., Laskowski, R., and Thornton, J. (2006). A method for localizing ligand binding pockets in protein structures. Proteins 62, 479-488.
Hendlich, M., Rippmann, F., and Barnickel, G. (1997). LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 15, 359-363.
Huang, B., and Schroeder, M. (2006). LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol 6, 19.
Huang, B., and Schroeder, M. (2008). Using protein binding site prediction to improve protein docking. Gene 422, 14-21.
Laskowski, R. (1995). SURFNET: a program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 13, 323-330.
Laurie, A., and Jackson, R. (2005). Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21, 1908-1916.
Levitt, D., and Banaszak, L. (1992). POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10, 229-234.
Liang, J., Edelsbrunner, H., and Woodward, C. (1998). Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7, 1884-1897.
Morita, M., Nakamura, S., and Shimizu, K. (2008). Highly accurate method for ligand-binding site prediction in unbound state (apo) protein structures. Proteins 73, 468-479.
Puvanendrampillai, D., and Mitchell, J. (2003). Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics 19, 1856-1857.
Weisel, M., Proschak, E., and Schneider, G. (2007). PocketPicker: analysis of ligand binding-sites with shape descriptors. Chem Cent J 1, 7.

Address correspondence to:
Dr. Bingding Huang
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118, Heidelberg, Germany

E-mail: [email protected]
