Protein Ligand Docking
Matthias Rarey
GMD - German National Research Center for Information Technology
Institute for Algorithms and Scientific Computing (SCAI)
53754 Sankt Augustin, Germany
[email protected]
Matthias Rarey, GMD-SCAI
01/01 2
Outline of this lecture
n Introduction
n The docking problem
n Applications
n Scoring functions
n Rigid-body protein-ligand docking
n Clique-search-based methods
n The CLIX approach
n Geometric-hashing-based methods
n Flexible protein-ligand docking
n Docking by simulation
n Incremental construction algorithms
n Genetic algorithms
n Protein-protein docking
n next lecture by T.Lengauer
Introduction
n The molecular docking problem:
n Given two molecules with 3D conformations in atomic detail
n Do the molecules bind to each other? If yes:
n How strong is the binding affinity?
n What does the molecule-molecule complex look like?
n Docking problems in biochemistry:
n Protein-Ligand docking
urigid-body docking
uflexible docking
n Protein-Protein docking
n Protein-DNA docking
n DNA-Ligand docking
Some basic principles...
n The association of molecules is based on interactions:
n hydrogen bonds, salt bridges, hydrophobic contacts
n electrostatics
n very strong repulsive interactions (van der Waals) at short distances
n The attractive interactions are weak and short-range
=> tight binding implies surface complementarity
n Most molecules are flexible:
n stiffness decreases in the order bond lengths > bond angles > torsion angles / ring conformations
n macromolecules are restricted in conformational space in a complicated way
More basic principles...
n The binding affinity is the energy difference relative to the uncomplexed state:
n the surrounding medium (water in most cases) plays an important role
n entropy can have a significant impact on the binding energy
n The binding affinity describes an ensemble of complex structures, not a single one
n tight binders often have a dominating binding mode ...
n ... and weak binders?
Energetic Contributions
[Figure: ligand and protein in solution <-> protein-ligand complex in solution. Labelled contributions: ligand conformation, ligand orientation, protein conformational change, bound water, bulk water]
n weak short-range interactions imply complementarity
n ligand (and protein) are conformationally flexible
n energy estimation is difficult (solvent, electrostatics, entropic effects, etc.)
Binding affinities
Equilibrium constant:
$$K_i = \frac{[P][L]}{[PL]}$$
Free energy of binding (with $K_A = 1/K_i$ the association constant):
$$\Delta G = -RT \ln K_A = RT \ln K_i$$
$$\Delta G = \Delta H - T \Delta S$$
[Figure: $\Delta G$ in kJ/mol (0 to -100) plotted against $\log_{10} K_i$ (-18 to 0) at T = 37 °C; the millimolar, nanomolar, picomolar and femtomolar ranges are marked]
n ~6 kJ/mol corresponds to one order of magnitude in $K_i$, roughly "1-2 hydrogen bonds"
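The rule of thumb of about 6 kJ/mol per order of magnitude can be checked numerically. The sketch below assumes that K_i is the dissociation-type constant [P][L]/[PL] in mol/l, so that the free energy of binding is RT ln K_i (negative for tight binders), and that T = 37 °C as on the plot. The function name binding_free_energy is made up for this illustration.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)

def binding_free_energy(k_i: float, temperature_k: float = 310.15) -> float:
    """Free energy of binding in kJ/mol for a dissociation-type K_i in mol/l.

    Assumes Delta G = R T ln K_i, i.e. K_i = [P][L]/[PL].
    """
    return R * temperature_k * math.log(k_i)

# Typical affinity ranges marked on the plot
for label, k_i in [("milli", 1e-3), ("nano", 1e-9), ("pico", 1e-12)]:
    print(f"{label}molar: {binding_free_energy(k_i):7.1f} kJ/mol")

# Change in Delta G when K_i improves by one order of magnitude
per_decade = binding_free_energy(1e-1) - binding_free_energy(1.0)
print(f"per order of magnitude: {per_decade:.2f} kJ/mol")
```

A nanomolar inhibitor therefore corresponds to a bit more than 50 kJ/mol of binding free energy, which is why small errors in an estimated energy translate into large errors in the predicted affinity.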
Applications
n Estimating the binding affinity
n Searching for lead structures for protein targets
n Comparing a set of inhibitors
n Estimating the influence of modifications in lead structures
n De Novo Ligand Design
n Design of targeted combinatorial libraries
n Predicting the molecule complex
n Understanding the binding mode / principle
n Optimizing lead structures
Scoring functions
n Input: 3D structure of a protein-ligand complex
n Output: estimated binding energy $\Delta G$ (free enthalpy, i.e. Gibbs free energy)
n Comments:
n measured $\Delta G$ describes the energy difference between bound and unbound state based on a structure ensemble.
n Assumption: measured $\Delta G$ is dominated by a single structure of minimal energy
n $\Delta G = \Delta H - T \Delta S$, with $\Delta H$: enthalpic contributions, $\Delta S$: entropic contributions. $\Delta S$ is very difficult to approximate!
n more about energy: Atkins, (Kurzlehrbuch) Physikalische Chemie, Spektrum Akademischer Verlag, 1992
Scoring functions
n Force field:
n describes only enthalpic contributions $\Delta H$, no estimate for $\Delta G$
n conformation terms (bond lengths and angles) have a steep rise (sometimes not used in docking calculations)
n time-consuming calculations (electrostatics)
n Potentials of mean force / knowledge-based scoring
n Analysis of known low-energy complexes: frequent occurrence => energetically favorable
n Pair potentials: f(a,b,d) = relative frequency with which an atom of type a and an atom of type b occur at distance d in the database
n Conversion into an energy term $g_{ab}(d)$ (inverse Boltzmann law)
n Total energy:
$$E(R, L) = \sum_{r \in R,\ l \in L} g_{a(r),\,a(l)}\bigl(d(r, l)\bigr)$$
where the sum runs over all atoms r of R and l of L, $d(r, l)$ is the distance between r and l, and $a(r)$ is the atom type of r.
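The conversion from observed frequencies to pair energies can be sketched as follows. This is a minimal illustration, not the parametrization of any published potential: the reference distribution, the value of kT, the distance binning and the toy counts are all assumptions, and the names pair_potential and total_energy are hypothetical.

```python
import math
from collections import Counter

KT = 2.58  # kJ/mol at roughly 310 K (assumed temperature)

def pair_potential(observed: Counter, reference: Counter) -> dict:
    """Inverse Boltzmann law: g = -kT * ln(f_observed / f_reference).

    Keys are (type_a, type_b, distance_bin); values are raw counts.
    The reference counts stand for an assumed 'no interaction' state.
    """
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    potential = {}
    for key, count in observed.items():
        f_obs = count / n_obs
        f_ref = reference[key] / n_ref
        potential[key] = -KT * math.log(f_obs / f_ref)
    return potential

def total_energy(potential: dict, contacts: list) -> float:
    """Sum g over all receptor-ligand atom pairs; unseen pairs count as 0."""
    return sum(potential.get(contact, 0.0) for contact in contacts)

# Toy database: N...O contacts in distance bin 3 are over-represented
observed = Counter({("N", "O", 3): 60, ("C", "C", 4): 30, ("N", "N", 3): 10})
reference = Counter({("N", "O", 3): 1, ("C", "C", 4): 1, ("N", "N", 3): 1})
g = pair_potential(observed, reference)
print(total_energy(g, [("N", "O", 3), ("N", "O", 3), ("N", "N", 3)]))
```

Contacts that occur more often than in the reference state receive a negative (favorable) energy, rare contacts a positive one. The choice of reference state is the critical modelling decision in practice.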
Scoring functions
n Empirical scoring functions
n calibration of microscopic observations with measured macroscopic $\Delta G$ values
n data: set of protein-ligand complexes with known 3D structure and binding affinity $\Delta G$
n Example: Böhm function (Böhm, J.Comput.-Aided Mol. Design, Vol. 8 (1994), pp 243)
n Scoring function:
$$\Delta G = \Delta G_0 + \Delta G_{rot} N_{rot} + \Delta G_{hb} \sum_{\text{neutral H bonds}} f(\Delta R)\, f(\Delta \alpha) + \Delta G_{io} \sum_{\text{ionic interactions}} f(\Delta R)\, f(\Delta \alpha) + \Delta G_{lipo}\, |A_{lipo}|$$
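An empirical scoring function of this additive form can be sketched in a few lines. The coefficient values and the thresholds of the penalty function below are illustrative placeholders, not the fitted values from the cited paper, and the piecewise-linear shape of f is an assumption made for this example.

```python
from dataclasses import dataclass

def ramp(deviation: float, full: float, zero: float) -> float:
    """Penalty function f: 1 up to `full`, linear decay to 0 at `zero`."""
    if deviation <= full:
        return 1.0
    if deviation >= zero:
        return 0.0
    return 1.0 - (deviation - full) / (zero - full)

@dataclass(frozen=True)
class Coefficients:
    # Placeholder values in kJ/mol (g_lipo in kJ/mol per square Angstrom);
    # real values come from regression against measured affinities.
    g0: float = 5.0
    g_rot: float = 1.5
    g_hb: float = -5.0
    g_io: float = -8.0
    g_lipo: float = -0.2

def score(hbonds, ionic, lipo_area, n_rot, c=Coefficients()):
    """Estimate Delta G from interaction geometry.

    hbonds, ionic: lists of (delta_r in Angstrom, delta_alpha in degrees),
    the deviations from ideal distance and angle for each interaction.
    """
    def geometry_sum(interactions):
        # Assumed tolerances: 0.2-0.6 Angstrom and 30-80 degrees
        return sum(ramp(dr, 0.2, 0.6) * ramp(da, 30.0, 80.0)
                   for dr, da in interactions)

    return (c.g0
            + c.g_rot * n_rot
            + c.g_hb * geometry_sum(hbonds)
            + c.g_io * geometry_sum(ionic)
            + c.g_lipo * abs(lipo_area))

# One ideal hydrogen bond, 100 square Angstrom lipophilic contact, 2 rotors
print(score(hbonds=[(0.0, 0.0)], ionic=[], lipo_area=100.0, n_rot=2))
```

Each term is cheap to evaluate, which is what makes functions of this type usable inside a docking search.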
Root-mean-square deviation:
$$\mathrm{RMSD}_t(X, Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( t(x_i) - y_i \bigr)^2}$$
CLIX
n Lawrence et al, PROTEINS: Struct., Func., and Gen., Vol. 12 (1992), pp 31
n based on interaction maps calculated with GRID
n Algorithm:
n identification of interaction target points in the maps
n enumeration of all pairs of distance-compatible matches
n superposition of the two matched groups, sampling of the rotation around their common axis:
usearching for additional matches
uoverlap test, scoring
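The enumeration step can be illustrated with a small sketch. It assumes that a match pairs a ligand group with a target point of the same interaction type, and that two matches are distance-compatible when the ligand-side and target-side distances agree within a tolerance. The tolerance value, the data layout and the function name are assumptions for this example, not details taken from CLIX.

```python
import math
from itertools import combinations

def compatible_match_pairs(ligand_groups, target_points, tol=0.5):
    """Enumerate pairs of matches whose distances agree within `tol`.

    ligand_groups / target_points: lists of (interaction_type, (x, y, z)).
    A match is a tuple (ligand group index, target point index).
    """
    matches = [(i, p)
               for i, (lig_type, _) in enumerate(ligand_groups)
               for p, (tgt_type, _) in enumerate(target_points)
               if lig_type == tgt_type]
    pairs = []
    for (i, p), (j, q) in combinations(matches, 2):
        if i == j or p == q:
            continue  # a group or point cannot be used twice
        d_ligand = math.dist(ligand_groups[i][1], ligand_groups[j][1])
        d_target = math.dist(target_points[p][1], target_points[q][1])
        if abs(d_ligand - d_target) <= tol:
            pairs.append(((i, p), (j, q)))
    return pairs

ligand = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0))]
targets = [("donor", (1.0, 1.0, 1.0)),
           ("acceptor", (1.0, 4.0, 1.0)),
           ("acceptor", (1.0, 1.0, 9.0))]
print(compatible_match_pairs(ligand, targets))
```

Two matched pairs fix the ligand up to a rotation around the axis through the two target points, which is why the algorithm then samples that rotation.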
Geometric hashing
n Fischer et al, J.Mol.Biol., Vol. 248 (1995), pp. 459
n Key features
n method from pattern recognition applied to docking
n based on the DOCK sphere representation
n allows direct application to database search
n Constructing the hash table for ligand atom triplets (a,b,c):
n entries have an address based on atom-atom distances
n information stored: ligand id, basis (a,b)
n Basic search algorithm:
n search for matching (two spheres, basis) pairs that allow a large number of third-atom matches
n extension and evaluation of matches
[Figure: ligand atom triplet a, b, c]
Geometric hashing
n Search for seed matchings (voting scheme):
n for all pairs of spheres (A,B): // search for matching bases
n for all spheres C: // sphere that casts the vote
ufor all entries (ligand,basis) from the hash table with matching distances:
uincrease vote for (ligand,basis)
uinsert (C,c) into matchlist of (ligand,basis)
n for all (ligand,basis) with > T votes:
ucheck all pairwise distances
uenter into seed matching list
n Method in pattern recognition:
n basis is d-dimensional and defines a coordinate reference frame
n here:
n due to complexity, the basis is only 2-dimensional
n => matched spheres/atoms may not be superimposable
[Figure: receptor spheres A, B, C matched to ligand atoms a, b, c]
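The hash table construction and the voting scheme can be sketched as follows. This is a simplified illustration under several assumptions: distances are binned with a fixed width and looked up exactly (so near-boundary matches can be missed), votes are collected per combination of ligand basis and sphere basis, and the final check of all pairwise distances is left out. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

BIN_WIDTH = 0.5  # Angstrom; assumed resolution of the hash address

def address(p, q, r):
    """Hash address of a triplet: binned distances d(p,q), d(p,r), d(q,r)."""
    return tuple(int(round(math.dist(u, v) / BIN_WIDTH))
                 for u, v in ((p, q), (p, r), (q, r)))

def build_hash_table(ligands):
    """Store (ligand id, basis (a,b), third atom c) for every atom triplet."""
    table = defaultdict(list)
    for ligand_id, atoms in ligands.items():
        for a, b, c in permutations(range(len(atoms)), 3):
            key = address(atoms[a], atoms[b], atoms[c])
            table[key].append((ligand_id, (a, b), c))
    return table

def seed_matchings(table, spheres, threshold):
    """Voting: every sphere C votes for the bases it is compatible with."""
    matchlists = defaultdict(list)
    for A, B, C in permutations(range(len(spheres)), 3):
        key = address(spheres[A], spheres[B], spheres[C])
        for ligand_id, basis, c in table.get(key, ()):
            matchlists[(ligand_id, basis, (A, B))].append((C, c))
    # Keep only bases with more than `threshold` votes
    return {k: v for k, v in matchlists.items() if len(v) > threshold}

# Toy example: the spheres are a translated copy of the ligand atoms
ligands = {"lig1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0),
                    (0.0, 2.5, 0.0), (0.0, 0.0, 4.0)]}
spheres = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in ligands["lig1"]]
seeds = seed_matchings(build_hash_table(ligands), spheres, threshold=1)
print(seeds[("lig1", (0, 1), (0, 1))])
```

Because the table is built once per ligand database, the same lookup serves every ligand at query time, which is what makes the method attractive for database search.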
Pose clustering
n Rarey et al, J.Comput.-Aided Mol. Des., Vol 10 (1996), pp. 41
n Method from pattern recognition applied to ligand
orientation based on physico-chemical interactions
n Interaction model:
n compatible interaction types
n interaction center of first group lies
approximately on interaction surface of
second group ...
n ... and vice versa
Pose clustering
n Interaction surfaces are approximated by discrete
points:
Pose clustering
Pose clustering:
Searching for compatible triangles
Clustering of transformations
Pose clustering
n Preprocessing: construct a hash table for all interaction type pairs a,b:
n store all pairs of interaction points p,q with address d(p,q)
n chain each list twice, sorted by point id of p and by point id of q
n Search for initial ligand orientations:
n for all triplets (a,b,c) of ligand interaction centers:
ugenerate a list of all type- and distance-compatible pairs of interaction points for (a,b) and for (a,c)
uconstruct all distance-compatible triangles (p,q,r) by list merging
ufor all triangles (p,q,r): generate ligand transformation, overlap test
n Cluster orientations by pairwise RMSD
n for all remaining ligand orientations:
uextend matching, overlap test, scoring
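The clustering step can be illustrated with a small sketch. The slides do not say which clustering scheme is used, so the greedy "leader" clustering below is an assumption chosen for brevity, as are the cutoff value and the function names. Each orientation is represented by the transformed ligand coordinates.

```python
import math

def rmsd(x, y):
    """Root-mean-square deviation between two equally ordered point lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(x, y)) / len(x))

def cluster_orientations(poses, cutoff=1.0):
    """Greedy leader clustering of ligand orientations.

    A pose joins the first cluster whose representative (first member)
    lies within `cutoff` RMSD; otherwise it starts a new cluster.
    """
    clusters = []
    for pose in poses:
        for cluster in clusters:
            if rmsd(pose, cluster[0]) <= cutoff:
                cluster.append(pose)
                break
        else:
            clusters.append([pose])
    return clusters

base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
near = [(x + 0.1, y, z) for x, y, z in base]   # almost the same orientation
far = [(x + 5.0, y, z) for x, y, z in base]    # a different orientation
clusters = cluster_orientations([base, near, far])
print([len(c) for c in clusters])
```

Only one representative per cluster needs to be extended and scored, which removes the many near-duplicate orientations generated by overlapping triangles.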
Flexible protein-ligand docking
n Main assumptions (not valid for simulations)
n ligand flexibility is limited to torsion angles (+ ring conformations)
n protein is considered as (nearly) rigid
n discrete models for conformations and interactions
n binding-pathway is not considered
n Application
n Analyzing complexes, searching for possible binding modes
n Virtual screening of small molecule databases
n History
[Figure: timeline from 1972 to 1998 (marked years: 72, 86, 90, 92, 94, 95, 96, 98) with the entries Simul., DesJarlais, Ludi, LeachK, Hhead, AutoDock, GOLD, FlexX, DOCK 4.0]
Docking by simulation
n Method:
n generate (random) start orientations
n MD simulation / energy minimization for all start orientations
n Pros/Cons:
n can handle protein flexibility to an arbitrary extent
n very time-consuming
n acts more like a local minimization (large structural changes are difficult)
n Applications:
n Di Nola et al, PROTEINS: Struct., Func., and Gen., Vol. 19 (1994), pp 174
n Luty et al, J. Comp. Chem., Vol. 16 (1995), pp 454
Hybrid methods
n Method:
n use fast algorithms for placement, MD for refinement
n Applications:
n Wang et al, PROTEINS: Struct., Func., and Gen., Vol. 36 (1999), pp 1
n Hoffmann et al, J. Med. Chem., in press
n Wang's procedure:
n generate low-energy conformations
n rigid-body docking (soft van der Waals potentials)
n minimization in the active site (AMBER force field, rigid protein)
n torsion angle refinement routine (scanning alternative torsions)
n simulated annealing (minimization, all degrees of freedom)
Simulated annealing: AutoDock
n Goodsell et al., PROTEINS: Struct., Func., and Gen. Vol. 8 (1990), pp. 195
n Simulated annealing:
n a random change in configuration is accepted with probability
$$P(\Delta E) = e^{-\Delta E / (k_B T)}$$
where $\Delta E$ is the energy difference of the change, $k_B$ is Boltzmann's constant, and $T$ is the user-defined temperature
n the cooling schedule reduces T over time (for example $T \leftarrow cT$), which makes energetically unfavorable moves less likely
n Application specific:
n move: small random displacement of all degrees of freedom
n calculation of E: affinity potentials as in GRID
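The acceptance rule and the cooling schedule fit into a short generic sketch. This is not AutoDock's implementation: the energy function is a toy quadratic rather than grid-based affinity potentials, the temperature is expressed directly in energy units (k_B is folded into T), moves that lower the energy are always accepted, and all parameter values are assumptions.

```python
import math
import random

def anneal(energy, start, step=0.5, t_start=10.0, cooling=0.9,
           cycles=60, moves_per_cycle=50, seed=0):
    """Simulated annealing over a list of continuous degrees of freedom."""
    rng = random.Random(seed)
    state, e_state = list(start), energy(start)
    best, e_best = list(state), e_state
    t = t_start
    for _ in range(cycles):
        for _ in range(moves_per_cycle):
            # Move: small random displacement of all degrees of freedom
            candidate = [x + rng.uniform(-step, step) for x in state]
            delta = energy(candidate) - e_state
            # Metropolis criterion: accept with probability exp(-delta / T)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state, e_state = candidate, e_state + delta
                if e_state < e_best:
                    best, e_best = list(state), e_state
        t *= cooling  # cooling schedule T <- cT
    return best, e_best

def toy_energy(v):
    """Quadratic bowl with its minimum at (1, -2); stands in for a real score."""
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

best, e_best = anneal(toy_energy, start=[4.0, 3.0])
print(best, e_best)
```

At high T almost every move is accepted, so the search can leave local minima; as T falls the procedure turns into a local downhill search.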
Place & join algorithms
n DesJarlais et al, J.Med.Chem. Vol. 29 (1986), pp 2149
n Algorithm:
n cut the ligand into a few fragments that share one overlapping atom (the linker)
n place all fragments with the DOCK algorithm
n for a specific sequence of fragments:
ujoin two fragments in all placement combinations with close location
of the linker atom
n clustering and energy minimization (AMBER force field)
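The join step can be sketched as a filter over all placement combinations of two fragments. The data layout, the distance tolerance and the additive score are assumptions for this illustration; the original method follows the join with clustering and force-field minimization, which is not shown.

```python
import math
from itertools import product

def join_placements(placements_1, placements_2, tol=0.5):
    """Combine placements of two fragments whose linker atoms nearly coincide.

    Each placement is a dict with the linker atom position under 'linker'
    and a placement score under 'score' (lower is better, assumed additive).
    """
    joined = []
    for p1, p2 in product(placements_1, placements_2):
        if math.dist(p1["linker"], p2["linker"]) <= tol:
            joined.append((p1, p2, p1["score"] + p2["score"]))
    return sorted(joined, key=lambda entry: entry[2])

fragment_1 = [{"linker": (0.0, 0.0, 0.0), "score": -3.0},
              {"linker": (6.0, 0.0, 0.0), "score": -5.0}]
fragment_2 = [{"linker": (0.2, 0.1, 0.0), "score": -2.0},
              {"linker": (9.0, 9.0, 9.0), "score": -4.0}]
joined = join_placements(fragment_1, fragment_2)
print(len(joined), joined[0][2])
```

Note that the best-scoring individual placements need not be joinable at all, which is a known weakness of placing fragments independently.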
Place & join algorithms
n Sandak et al., CABIOS Vol. 11 (1995), pp. 87
n Hinge Bending: extending geometric hashing
n Hinge: ligand with two adjacent, flexible bonds, or protein domain movement
n Hash table for ligand data set:
ustore ligand fragment, hinge location
n Matching phase: for all receptor sphere triplets:
usearch for ligand atom triplets in hash table
uperform a voting for a hinge location
n Join phase: for all hinges with high votes:
ucombine collision-free placements of fragments
uscoring and selection
Incremental construction algorithms
n Overall strategy:
n divide the molecule into fragments
n place one (or several) fragment(s) into the active site, disregarding the rest of the molecule
n add remaining fragments incrementally:
uexplore conformation space, clash test
usearch for new interactions, scoring
uselect new set of extended placements
n Application to the docking problem:
n Moon et al., PROTEINS: Struct., Func., and Gen., Vol. 11 (1991), pp 314
n Leach et al., J. Comp. Chem., Vol. 13 (1992), pp 730
n Rarey et al., J. Mol. Biol., Vol. 261 (1996), pp 470
n Welch et al., Chem. & Biol., Vol. 3 (1996), pp 449
n Makino et al., J. Comp. Chem., Vol. 18 (1997), pp 1812
Incremental construction algorithms
n Additional steps:
n Score estimation
n Placement optimization
n Solution clustering
n Search Strategies:
n GREEDY: after adding a fragment, select the highest-scoring placements and reject the rest (GROW, FlexX, Hammerhead)
uscales linearly with the number of fragments
uthe partial placement that leads to the optimal solution may score sub-optimally during build-up and be rejected (the larger the considered set and the lower the number of fragments, the lower the risk of missing the optimal placement)
n BACKTRACKING: performs a recursive (depth-first) search through the whole configuration tree (Leach)
uscales exponentially with the number of fragments
uno risk of losing the optimal solution due to tree pruning
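The difference between the two search strategies shows up even on a toy problem. In the sketch below each fragment has a small set of discrete options (for example torsion settings), a partial solution is a tuple of chosen options, and lower scores are better. The score function is contrived so that the optimum looks bad during build-up; it and the function names are assumptions for illustration only.

```python
def greedy_build(n_fragments, options, score, keep=1):
    """Keep only the `keep` best partial solutions after each added fragment."""
    partial = [()]
    for _ in range(n_fragments):
        extended = [p + (o,) for p in partial for o in options]
        extended.sort(key=score)
        partial = extended[:keep]
    return partial[0]

def backtracking_build(n_fragments, options, score):
    """Depth-first search through the whole configuration tree."""
    best = None

    def recurse(partial):
        nonlocal best
        if len(partial) == n_fragments:
            if best is None or score(partial) < score(best):
                best = partial
            return
        for o in options:
            recurse(partial + (o,))

    recurse(())
    return best

def toy_score(partial):
    """Option 1 costs 1 per fragment, but three 1s in a row earn a bonus of 5."""
    bonus = -5 if partial == (1, 1, 1) else 0
    return sum(partial) + bonus

print(greedy_build(3, (0, 1), toy_score, keep=1))   # commits to 0 early
print(backtracking_build(3, (0, 1), toy_score))     # finds the true optimum
```

The greedy search evaluates keep times len(options) candidates per fragment, the backtracking search len(options) to the power of n_fragments leaves. Raising keep trades run time against the risk of discarding the optimum.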
Genetic algorithms: GOLD and others
n Genetic Algorithms:
n general purpose discrete optimization algorithm
n mimics the process of evolution
n The overall model:
n possible solution (configuration) <-> individual
n its representation <-> chromosome
n objective function <-> fitness of individual
n modifying solutions (moves) <-> genetic operators (crossover, mutation)
n Applications to the docking problem:
n Jones, et al., J.Mol.Biol., Vol. 245 (1995), pp 43
n Oshiro, et al., J.Comput.-Aided Mol.Design, Vol. 9 (1995), pp 113
n Gehlhaar, et al., Proc. 4th Annual Conference on Evolutionary
Programming (1995), pp 615
GOLD
n Molecule representation (N rotatable bonds)
n conformation string (N bytes), one byte each coding a torsion angle
n a matching string (integers) that defines the mapping between hydrogen bond donors/acceptors: M(k)=l if the k-th interaction group of the ligand forms an interaction with the l-th group of the protein
n Fitness evaluation of individual with chromosome c:
n build conformation according to c
n superimpose matched interacting groups
n calculate docking score: $-E_{\text{hydrogen bond}} - (E_{\text{internal}} + E_{\text{complex}})$
[Figure: example structure with OH, NH2 and NH2 groups]
GOLD
n Population:
n 5 sub-populations of 100 individuals each
n about 20-50 runs, each up to 100000 genetic operations
n Genetic Operators:
n crossover: two-point crossover between two parent individuals
n mutation: one-point mutation
n migration: one individual moves between sub-populations
n operators are randomly selected
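The two basic operators on a byte-string chromosome can be sketched as follows. This is a generic illustration and not GOLD's code: the linear mapping from a byte to a torsion angle is an assumption, and only the conformation string is shown (the matching string would be handled analogously with integer values).

```python
import random

def torsion_degrees(byte: int) -> float:
    """Decode one byte (0..255) into a torsion angle in [-180, 180) degrees.

    The linear encoding is an assumption made for this example.
    """
    return byte * 360.0 / 256.0 - 180.0

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points of the parents."""
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def one_point_mutation(chromosome, rng):
    """Replace the value at one random position by a random byte."""
    k = rng.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[k] = rng.randrange(256)
    return mutated

rng = random.Random(42)
parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [255, 255, 255, 255, 255, 255]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
mutant = one_point_mutation(parent_a, rng)
print(child_a, child_b, mutant)
print([torsion_degrees(b) for b in (0, 128, 255)])
```

Crossover recombines torsion settings that work well in different parents, while mutation introduces values that no individual in the population carries yet.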
Concluding remarks
n docking performance
n the correct structure can be predicted in about 70% of the test cases
n prediction of binding affinity is very difficult:
1. ranking protein-ligand complex geometries: good, not perfect
2. ranking different ligands with respect to binding: weak correlations
3. free energy estimation of protein-ligand complexes: more or less unsolved
n challenges
n handling protein flexibility
n improving reliability of structure and affinity prediction