Improved Protein Structure Prediction Using Potentials From Deep Learning
https://doi.org/10.1038/s41586-019-1923-7
Received: 2 April 2019
Accepted: 10 December 2019
Andrew W. Senior1,4*, Richard Evans1,4, John Jumper1,4, James Kirkpatrick1,4, Laurent Sifre1,4, Tim Green1, Chongli Qin1, Augustin Žídek1, Alexander W. R. Nelson1, Alex Bridgland1, Hugo Penedones1, Stig Petersen1, Karen Simonyan1, Steve Crossan1, Pushmeet Kohli1, David T. Jones2,3, David Silver1, Koray Kavukcuoglu1 & Demis Hassabis1
Proteins are at the core of most biological processes. As the function of a protein is dependent on its structure, understanding protein structures has been a grand challenge in biology for decades. Although several experimental structure determination techniques have been developed and improved in accuracy, they remain difficult and time-consuming [2]. As a result, decades of theoretical work has attempted to predict protein structures from amino acid sequences.
CASP [5] is a biennial blind protein structure prediction assessment run by the structure prediction community to benchmark progress in accuracy. In 2018, AlphaFold joined 97 groups from around the world in entering CASP13 [8]. Each group submitted up to 5 structure predictions for each of 84 protein sequences for which experimentally determined structures were sequestered. Assessors divided the proteins into 104 domains for scoring and classified each as being amenable to template-based modelling (TBM, in which a protein with a similar sequence has a known structure, and that homologous structure is modified in accordance with the sequence differences) or requiring free modelling (FM, in cases in which no homologous structure is available), with an intermediate (FM/TBM) category. Figure 1a shows that AlphaFold predicts more FM domains with high accuracy than any other system, particularly in the 0.6–0.7 TM-score range. The TM score, which ranges between 0 and 1, measures the degree of match of the overall (backbone) shape of a proposed structure to a native structure. The assessors ranked the 98 participating groups by the summed, capped z-scores of the structures, separated according to category. AlphaFold achieved a summed z-score of 52.8 in the FM category (best-of-five) compared with 36.6 for the next closest group (322). Combining FM and TBM/FM categories, AlphaFold scored 68.3 compared with 48.2. AlphaFold is able to predict previously unknown folds to high accuracy (Fig. 1b). Despite using only FM techniques and not using templates, AlphaFold also scored well in the TBM category according to the assessors’ formula 0-capped z-score, ranking fourth for the top-one model or first for the best-of-five models. Much of the accuracy of AlphaFold is due to the accuracy of the distance predictions, which is evident from the high precision of the corresponding contact predictions (Fig. 1c and Extended Data Fig. 2a).
1DeepMind, London, UK. 2The Francis Crick Institute, London, UK. 3University College London, London, UK. 4These authors contributed equally: Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre. *e-mail: [email protected]
[Fig. 1 graphic. a, x axis: TM-score cut-off. b, TM score per target (T0953s2-D3, T0968s2-D1, T0990-D1, T0990-D2, T0990-D3, T1017s2-D1). c, precision (%) versus number of contacts (L/1, L/2, L/5).]
Fig. 1 | The performance of AlphaFold in the CASP13 assessment. a, Number of FM (FM + FM/TBM) domains predicted for a given TM-score threshold for AlphaFold and the other 97 groups. b, For the six new folds identified by the CASP13 assessors, the TM score of AlphaFold was compared with the other groups, together with the native structures. The structure of T1017s2-D1 is not available for publication. c, Precisions for long-range contact prediction in CASP13 for the most probable L, L/2 or L/5 contacts, where L is the length of the domain. The distance distributions used by AlphaFold in CASP13, thresholded to contact predictions, are compared with the submissions by the two best-ranked contact prediction methods in CASP13: 498 (RaptorX-Contact [26]) and 032 (TripletRes [32]) on ‘all groups’ targets, with updated domain definitions for T0953s2.
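The precision measure in panel c can be sketched as follows: each predicted distance distribution is thresholded to a contact probability by summing the probability mass below 8 Å, the most probable L/k long-range pairs are selected, and the fraction that are true contacts is reported. This is a minimal sketch under stated assumptions: the 64 bins are assumed to span 2–22 Å, "long-range" is assumed to mean a sequence separation of at least 24 residues, and all function names and the toy data are hypothetical.

```python
N_BINS = 64
# Assumption: equal-width bins spanning 2-22 Å.
BIN_EDGES = [2.0 + 20.0 * b / N_BINS for b in range(N_BINS + 1)]

def contact_probability(probs, threshold=8.0):
    """Probability mass in bins that lie entirely below the contact threshold."""
    return sum(p for p, upper in zip(probs, BIN_EDGES[1:]) if upper <= threshold)

def top_k_precision(distogram, true_dist, k_div=1, min_sep=24, threshold=8.0):
    """Precision of the L // k_div most probable long-range contacts."""
    length = len(true_dist)
    scored = [
        (contact_probability(distogram[i][j], threshold), i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    scored.sort(reverse=True)  # most probable contacts first
    top = scored[: max(1, length // k_div)]
    hits = sum(1 for _, i, j in top if true_dist[i][j] < threshold)
    return hits / len(top)

def one_hot(distance):
    """Toy 'prediction': all probability in the bin containing the true distance."""
    b = min(N_BINS - 1, max(0, int((distance - 2.0) * N_BINS / 20.0)))
    return [1.0 if k == b else 0.0 for k in range(N_BINS)]

# Toy 30-residue domain: pairs with an even index sum are 6 Å apart, others 15 Å.
L = 30
true_dist = [[6.0 if (i + j) % 2 == 0 else 15.0 for j in range(L)] for i in range(L)]
distogram = [[one_hot(true_dist[i][j]) for j in range(L)] for i in range(L)]

precision_l5 = top_k_precision(distogram, true_dist, k_div=5)
precision_l2 = top_k_precision(distogram, true_dist, k_div=2)
```

The toy domain has only 12 true long-range contacts, so even a perfect predictor cannot reach full precision once more than 12 pairs are requested, which is why precision is usually reported at several list lengths.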
The most-successful FM approaches thus far [9–11] have relied on fragment assembly. In these approaches, a structure is created through a stochastic sampling process (such as simulated annealing [12]) that minimizes a statistical potential that is derived from summary statistics extracted from structures in the Protein Data Bank (PDB) [13]. In fragment assembly, a structure hypothesis is repeatedly modified, typically by changing the shape of a short section while retaining changes that lower the potential, ultimately leading to low potential structures. Simulated annealing requires many thousands of such moves and must be repeated many times to have good coverage of low-potential structures.
In recent years, the accuracy of structure predictions has improved through the use of evolutionary covariation data [14] that are found in sets of related sequences. Sequences that are similar to the target sequence are found by searching large datasets of protein sequences derived from DNA sequencing and aligned to the target sequence to generate a multiple sequence alignment (MSA). Correlated changes in the positions of two amino acid residues across the sequences of the MSA can be used to infer which residues might be in contact. Contacts are typically defined to occur when the β-carbon atoms of 2 residues are within 8 Å of one another. Several methods [15–18], including neural networks [19–22], have been used to predict the probability that a pair of residues is in contact based on features computed from MSAs. Contact predictions are incorporated in structure predictions by modifying the statistical potential to guide the folding process to structures that satisfy more of the predicted contacts [11,23]. Other studies [24,25] have used predictions of the distance between residues, particularly for distance geometry approaches [26–28]. Neural network distance predictions without covariation features were used to make the evolutionary pairwise distance-dependent statistical potential [25], which was used to rank structure hypotheses. In addition, the QUARK pipeline [11] used a template-based distance-profile restraint for TBM.
In this study, we present a deep-learning approach to protein structure prediction, the stages of which are illustrated in Fig. 2a. We show that it is possible to construct a learned, protein-specific potential by training a neural network (Fig. 2b) to make accurate predictions about the structure of the protein given its sequence, and to predict the structure itself accurately by minimizing the potential by gradient descent (Fig. 2c). The neural network predictions include backbone torsion angles and pairwise distances between residues. Distance predictions provide more specific information about the structure than contact predictions and provide a richer training signal for the neural network. By jointly predicting many distances, the network can propagate distance information that respects covariation, local structure and residue identities of nearby residues. The predicted probability distributions can be combined to form a simple, principled protein-specific potential. We show that with gradient descent, it is simple to find a set of torsion angles that minimizes this protein-specific potential using only limited sampling. We also show that whole chains can be optimized simultaneously, avoiding the need to segment long proteins into hypothesized domains that are modelled independently as is common practice (see Methods).
The central component of AlphaFold is a convolutional neural network that is trained on PDB structures to predict the distances $d_{ij}$ between the Cβ atoms of pairs, $ij$, of residues of a protein. On the basis of a representation of the amino acid sequence, $S$, of a protein and features derived from the MSA($S$) of that sequence, the network, which is similar in structure to those used for image-recognition tasks [29], predicts a discrete probability distribution $P(d_{ij} \mid S, \mathrm{MSA}(S))$ for every $ij$ pair in any 64 × 64 region of the L × L distance matrix, as shown in Fig. 2b. The full set of distance distribution predictions constructed by combining such predictions that covers the entire distance map is termed a distogram (from distance histogram). Example distogram predictions for one CASP protein, T0955, are shown in Fig. 3c, d. The modes of the distribution (Fig. 3c) can be seen to closely match the true distances (Fig. 3b). Example distributions for all distances to one residue (residue 29) are shown in Fig. 3d. We found that the predictions of the distance correlate well with the true distance between residues (Fig. 3e). Furthermore, the network also models the uncertainty in its predictions (Fig. 3f). When the s.d. of the predicted distribution is low, the predictions are more accurate. This is also evident in Fig. 3d, in which more confident predictions of the distance distribution (higher peak and lower s.d. of the distribution) tend to be more accurate, with the true distance close to the peak. Broader, less-confidently predicted distributions still assign probability to the correct value even when it is not close to the peak. The high accuracy of the distance predictions and consequently the contact predictions (Fig. 1c) comes from a combination of factors in the design of the neural network and its training, data augmentation, feature representation, auxiliary losses, cropping and data curation (see Methods).
To generate structures that conform to the distance predictions, we constructed a smooth potential $V_\mathrm{distance}$ by fitting a spline to the negative log probabilities, and summing across all of the residue pairs
[Fig. 2 graphic. a, Sequence and MSA features → deep neural network → distance and torsion distribution predictions → gradient descent on protein-specific potential. b, L × L 2D covariation features; 220 residual convolution blocks; 64 × 64 region indexed by i and j, 64 bins deep. c, TM score and r.m.s.d. (Å) versus gradient descent steps (0 to 1,200), with per-residue tracks for the native (Nat.) and predicted secondary structure. e, TM score versus iteration (log scale), with noisy restarts.]
Fig. 2 | The folding process illustrated for CASP13 target T0986s2. CASP target T0986s2, L = 155, PDB: 6N9V. a, Steps of structure prediction. b, The neural network predicts the entire L × L distogram based on MSA features, accumulating separate predictions for 64 × 64-residue regions. c, One iteration of gradient descent (1,200 steps) is shown, with the TM score and root mean square deviation (r.m.s.d.) plotted against step number with five snapshots of the structure. The secondary structure (from SST [33]) is also shown (helix in blue, strand in red) along with the native secondary structure (Nat.), the secondary structure prediction probabilities of the network and the uncertainty in torsion angle predictions (as $\kappa^{-1}$ of the von Mises distributions fitted to the predictions for φ and ψ). While each step of gradient descent greedily lowers the potential, large global conformation changes are effected, resulting in a well-packed chain. d, The final first submission overlaid on the native structure (in grey). e, The average (across the test set, n = 377) TM score of the lowest-potential structure against the number of repeats of gradient descent per target (log scale).
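The accumulation of 64 × 64-residue predictions into a full L × L distogram (panel b) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stride, the rule of averaging overlapping crops, and the `predict_crop` callback (a stand-in for the network) are all assumptions.

```python
def crop_starts(length, crop=64, stride=32):
    """Offsets of crops that together cover the range [0, length)."""
    if length <= crop:
        return [0]
    starts = list(range(0, length - crop, stride))
    starts.append(length - crop)  # final crop is aligned with the end of the chain
    return starts

def assemble_distogram(length, predict_crop, n_bins, crop=64, stride=32):
    """Average overlapping crop predictions into a length x length x n_bins array."""
    sums = [[[0.0] * n_bins for _ in range(length)] for _ in range(length)]
    counts = [[0] * length for _ in range(length)]
    size = min(crop, length)
    for i0 in crop_starts(length, crop, stride):
        for j0 in crop_starts(length, crop, stride):
            tile = predict_crop(i0, j0, size)  # tile[a][b] is a list of n_bins probabilities
            for a in range(size):
                for b in range(size):
                    cell = sums[i0 + a][j0 + b]
                    for k, p in enumerate(tile[a][b]):
                        cell[k] += p
                    counts[i0 + a][j0 + b] += 1
    return [
        [[v / counts[i][j] for v in sums[i][j]] for j in range(length)]
        for i in range(length)
    ]

def toy_predict_crop(i0, j0, size, n_bins=4):
    """Hypothetical stand-in for the network: one-hot bin chosen from (i + j)."""
    return [
        [[1.0 if k == (i0 + a + j0 + b) % n_bins else 0.0 for k in range(n_bins)]
         for b in range(size)]
        for a in range(size)
    ]

distogram = assemble_distogram(100, toy_predict_crop, n_bins=4)
```

Because the toy predictor depends only on the absolute residue indices, overlapping crops agree with each other and the averaged result equals any single crop's prediction; a real network would generally give slightly different predictions for the same pair in different crops.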
(see Methods). We parameterized protein structures by the backbone torsion angles $(\phi, \psi)$ of all residues and built a differentiable model of protein geometry $x = G(\phi, \psi)$ to compute the Cβ coordinates, $x_i$, for all residues $i$ and thus the inter-residue distances, $d_{ij} = \lVert x_i - x_j \rVert$, for each structure, and express $V_\mathrm{distance}$ as a function of $\phi$ and $\psi$. For a protein with L residues, this potential accumulates $L^2$ terms from marginal distribution predictions. To correct for the overrepresentation of the prior, we subtract a reference distribution [30] from the distance potential in the log domain. The reference distribution models the distance distributions $P(d_{ij} \mid \mathrm{length})$ independent of the protein sequence and is computed by training a small version of the distance prediction neural network on the same structures, without sequence or MSA input features. A separate output head of the contact prediction network is trained to predict discrete probability distributions of backbone torsion angles $P(\phi_i, \psi_i \mid S, \mathrm{MSA}(S))$. After fitting a von Mises distribution, this is used to add a smooth torsion modelling term, $V_\mathrm{torsion}$, to the potential. Finally, to prevent steric clashes, we add the $V_\mathrm{score2\_smooth}$ score of Rosetta [9] to the potential, as this incorporates a van der Waals term. We used multiplicative weights for each of the three terms in the potential; however, no combination of weights noticeably outperformed equal weighting.
As all of the terms in the combined potential $V_\mathrm{total}(\phi, \psi)$ are differentiable functions of $(\phi, \psi)$, it can be optimized with respect to these variables by gradient descent. Here we use L-BFGS [31]. Structures are initialized by sampling torsion values from $P(\phi_i, \psi_i \mid S, \mathrm{MSA}(S))$. Figure 2c illustrates a single gradient descent trajectory that minimizes the potential, showing how this greedy optimization process leads to increasing accuracy and large-scale conformation changes. The secondary structure is partly set by the initialization from the predicted torsion angle distributions. The overall accuracy (TM score) improves quickly and after a few hundred steps of gradient descent the accuracy of the structure has converged to a local optimum of the potential.
We repeated the optimization from sampled initializations, leading to a pool of low-potential structures from which further structure initializations are sampled, with added backbone torsion noise (‘noisy restarts’), leading to more structures to be added to the pool. After only a few hundred cycles, the optimization converges and the lowest potential structure is chosen as the best candidate structure. Figure 2e shows the progress in the accuracy of the best-scoring structures over multiple restarts of the gradient descent process, showing that after a few iterations the optimization has converged. Noisy restarts enable structures with a slightly higher TM score to be found than when continuing to sample from the predicted torsion distributions (average of 0.641 versus 0.636 on our test set, shown in Extended Data Fig. 4).
Figure 4a shows that the distogram accuracy (measured using the local distance difference test (lDDT$_{12}$) of the distogram; see Methods) correlates well with the TM score of the final realized structures. Figure 4b shows the effect of changing the construction of the potential. Removing the distance potential entirely gives a TM score of 0.266. Reducing the resolution of the distogram representation below six bins by averaging adjacent bins causes the TM score to degrade. Removing the torsion potential, reference correction or $V_\mathrm{score2\_smooth}$ degrades the accuracy only slightly. A final ‘relaxation’ (side-chain packing interleaved with gradient descent) with Rosetta [9], using a combination of the Talaris2014 potential and a spline fit of our reference-corrected distance potential, adds side-chain atom coordinates and yields a small average improvement of 0.007 TM score.
We show that a carefully designed deep-learning system can provide accurate predictions of inter-residue distances and can be used to construct a protein-specific potential that represents the protein structure. Furthermore, we show that this potential can be optimized with gradient descent to achieve accurate structure predictions.
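The construction of the reference-corrected distance potential can be sketched as follows. For each residue pair, the negative log probability of the predicted distribution is taken per bin, the log of the reference distribution is subtracted in the log domain, and the resulting values are interpolated so that the potential can be evaluated at any distance; the per-pair terms are then summed over all pairs. This is a minimal sketch under stated assumptions: linear interpolation stands in for the spline fit, the small constant `eps` and all function names are hypothetical, and the toy distributions are invented for illustration.

```python
import math

def make_pair_potential(probs, ref_probs, bin_centres, eps=1e-8):
    """Return V(d) = -log P(d) + log P_ref(d), interpolated between bin centres."""
    values = [-math.log(p + eps) + math.log(q + eps) for p, q in zip(probs, ref_probs)]
    step = bin_centres[1] - bin_centres[0]

    def potential(d):
        # Constant extrapolation outside the range covered by the bin centres.
        if d <= bin_centres[0]:
            return values[0]
        if d >= bin_centres[-1]:
            return values[-1]
        k = min(int((d - bin_centres[0]) / step), len(values) - 2)
        t = (d - bin_centres[k]) / step
        return (1.0 - t) * values[k] + t * values[k + 1]  # assumption: linear, not spline

    return potential

def distance_potential(coords, pair_potentials):
    """Sum the pair terms over all ordered pairs i != j (of the order of L^2 terms)."""
    total = 0.0
    for i, xi in enumerate(coords):
        for j, xj in enumerate(coords):
            if i != j:
                total += pair_potentials[i][j](math.dist(xi, xj))
    return total

# Toy example: three residues, every pair predicted to be about 8 Å apart.
centres = [4.0, 8.0, 12.0, 16.0]
predicted = [0.1, 0.7, 0.1, 0.1]
reference = [0.25, 0.25, 0.25, 0.25]
pair = make_pair_potential(predicted, reference, centres)
pair_potentials = [[pair] * 3 for _ in range(3)]

triangle = [(0.0, 0.0, 0.0), (8.0, 0.0, 0.0), (4.0, 48.0 ** 0.5, 0.0)]  # all sides 8 Å
line = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]              # sides 4, 4, 8 Å

v_triangle = distance_potential(triangle, pair_potentials)
v_line = distance_potential(line, pair_potentials)
```

The structure whose distances match the predicted peaks (`triangle`) receives the lower potential; subtracting the reference term means that a distance is rewarded only to the extent that the prediction favours it more than the sequence-independent prior does.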
[Fig. 3 graphic. Axis labels: distance (Å) (d); true distance (Å) (e); σ prediction (Å) (f).]
Fig. 3 | Predicted distance distributions compared with true distances. a–d, CASP target T0955, L = 41, PDB 5W9F. a, Native structure showing distances under 8 Å from the Cβ of residue 29. b, c, Native inter-residue distances (b) and the mode of the distance predictions (c), highlighting residue 29. d, The predicted probability distributions for distances of residue 29 to all other residues. The bin corresponding to the native distance is highlighted in red, 8 Å is drawn in black. The distributions of the true contacts are plotted in green, non-contacts in blue. e, f, CASP target T0990, L = 552, PDB 6N9V. e, The mode of the predicted distance plotted against the true distance for all residue pairs with distances ≤22 Å, excluding distributions with s.d. > 3.5 Å (n = 28,678). Data are mean ± s.d. calculated for 1 Å bins. f, The error of the mode distance prediction versus the s.d. of the distance distributions, excluding pairs with native distances >22 Å (n = 61,872). Data are mean ± s.d. shown for 0.25 Å bins. The true distance matrix and distogram for T0990 are shown in Extended Data Fig. 2b, c.
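The two summary statistics used in panels e and f, the mode and the s.d. of a predicted distance distribution, can be computed from the bin probabilities as in the following minimal sketch. The 64 bins are assumed to span 2–22 Å, and the two toy distributions (discretized Gaussians) are invented for illustration.

```python
import math

N_BINS = 64
# Assumption: bin centres of 64 equal-width bins spanning 2-22 Å.
CENTRES = [2.0 + 20.0 * (b + 0.5) / N_BINS for b in range(N_BINS)]

def mode_and_sd(probs, centres=CENTRES):
    """Mode (centre of the most probable bin) and s.d. of a binned distribution."""
    total = sum(probs)
    mode = centres[max(range(len(probs)), key=lambda b: probs[b])]
    mean = sum(p * c for p, c in zip(probs, centres)) / total
    variance = sum(p * (c - mean) ** 2 for p, c in zip(probs, centres)) / total
    return mode, math.sqrt(variance)

def gaussian_bins(mu, sigma, centres=CENTRES):
    """Toy distribution: a Gaussian evaluated at the bin centres and normalized."""
    weights = [math.exp(-0.5 * ((c - mu) / sigma) ** 2) for c in centres]
    norm = sum(weights)
    return [w / norm for w in weights]

confident_mode, confident_sd = mode_and_sd(gaussian_bins(8.0, 0.5))
broad_mode, broad_sd = mode_and_sd(gaussian_bins(8.0, 4.0))
```

Both toy distributions have their mode near 8 Å, but only the s.d. distinguishes the sharply peaked prediction from the broad one, which is why the s.d. serves as a confidence measure in panel f.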
Whereas FM predictions only rarely approach the accuracy of experimental structures, the CASP13 assessment shows that the AlphaFold system achieves unprecedented FM accuracy and that this FM method can match the performance of template-modelling approaches without using templates and is starting to reach the accuracy needed to provide biological insights (see Methods). We hope that the methods we have
[Fig. 4 graphic. a, TM score versus distogram lDDT$_{12}$. b, TM score versus number of bins (log scale).]
Fig. 4 | TM scores versus the accuracy of the distogram, and the dependency of the TM score on different components of the potential. a, TM score versus distogram lDDT$_{12}$ with Pearson’s correlation coefficients, for both CASP13 (n = 500: 5 decoys for all domains, excluding T0999) and test (n = 377) datasets. b, Average TM score over the test set (n = 377) versus the number of histogram bins used when downsampling the distogram, compared with removing different components of the potential, or adding Rosetta relaxation.
Extended Data Fig. 2 | CASP13 contact precisions. a, Precisions (as shown in Fig. 1c) for long-range contact prediction in CASP13 for the most probable L, L/2 or L/5 contacts, where L is the length of the domain. The distance distributions used by AlphaFold (AF) in CASP13, thresholded to contact predictions, are compared with submissions by the two best-ranked contact prediction methods in CASP13: 498 (RaptorX-Contact [26]) and 032 (TripletRes [32]), on ‘all groups’ targets, with updated domain definitions for T0953s2. b, c, True distances (b) and modes of the predicted distogram (c) for CASP13 target T0990. CASP divides this chain into three domains as shown (D3 is inserted in D2) for which there are 39, 36 and 42 HHblits alignments, respectively (from the CASP website).
Extended Data Fig. 3 | Analysis of structure accuracies. a, lDDT$_{12}$ versus distogram lDDT$_{12}$ (see Methods, ‘Accuracy’). The distogram accuracy predicts the lDDT of the realized structure well (particularly for medium- and long-range residue pairs, as well as the TM score as shown in Fig. 4a) for both CASP13 (n = 500: 5 decoys for domains excluding T0999) and test (n = 377) datasets. Data are shown with Pearson’s correlation coefficients. b, DLDDT$_{12}$ against the effective number of sequences in the MSA ($N_\mathrm{eff}$) normalized by sequence length (n = 377). The number of effective sequences correlates with this measure of distogram accuracy (r = 0.634). c, Structure accuracy measures, computed on the test set (n = 377), for gradient descent optimization of different forms of the potential. Top, removing terms in the potential, and showing the effect of following optimization with Rosetta relax. ‘P’ shows the significance of the potential giving different results from ‘Full’, for a two-tailed paired data t-test. ‘Bins’ shows the number of bins fitted by the spline before extrapolation and the number in the full distribution. In CASP13, splines were fitted to the first 51 of 64 bins. Bottom, reducing the resolution of the distogram distributions. The original 64-bin distogram predictions are repeatedly downsampled by a factor of 2 by summing adjacent bins, in each case with constant extrapolation beyond 18 Å (the last quarter of the bins). The two-level potential in the final row, which was designed to compare with contact predictions, is constructed by summing the probability mass below 8 Å and between 8 and 14 Å, with constant extrapolation beyond 14 Å. The TM scores in this table are plotted in Fig. 4b.
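The repeated downsampling of the distogram described in panel c (bottom) can be sketched as follows: each step halves the resolution by summing pairs of adjacent bins, so the total probability is preserved. This is a minimal sketch; the function names are hypothetical, a uniform toy distribution is used, and the spline fit and constant extrapolation are omitted.

```python
def downsample(probs):
    """Halve the resolution by summing each pair of adjacent bins."""
    return [probs[k] + probs[k + 1] for k in range(0, len(probs), 2)]

def resolutions(probs, min_bins=2):
    """Return the distribution at every resolution from len(probs) down to min_bins."""
    levels = [probs]
    while len(levels[-1]) > min_bins:
        levels.append(downsample(levels[-1]))
    return levels

uniform = [1.0 / 64] * 64          # toy 64-bin distribution
levels = resolutions(uniform)
sizes = [len(level) for level in levels]
```

Summing rather than averaging keeps each downsampled histogram a valid probability distribution, so the negative log probabilities at different resolutions remain directly comparable.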
Article
Extended Data Fig. 4 | TM score versus per-target computation time computed as an average over the test set. Structure realization requires a modest computation budget, which can be parallelized over multiple machines. Full optimization with noisy restarts (orange) is compared with initialization from sampled torsions (blue). Computation is measured as the product of the number of (CPU-based) machines and time elapsed and can be largely parallelized. Longer targets take longer to optimize. Figure 2e shows how the TM score increases with the number of repeats of gradient descent. n = 377.
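The noisy-restart scheme compared in this figure can be sketched on a toy problem: a pool of structures is built by gradient descent from sampled initializations, and further runs are started from low-potential members of the pool with noise added to the torsion angles. This is an illustrative sketch only. Plain gradient descent is swapped in for L-BFGS, the multi-minimum potential is invented, initial angles are sampled uniformly rather than from predicted torsion distributions, and the rule of restarting from the better half of the pool is an assumption.

```python
import math
import random

def potential(angles, targets):
    """Toy torsion potential with several local minima per angle."""
    return sum((1 - math.cos(a - t)) + 0.5 * (1 - math.cos(3 * (a - t)))
               for a, t in zip(angles, targets))

def gradient(angles, targets):
    """Analytic gradient of the toy potential."""
    return [math.sin(a - t) + 1.5 * math.sin(3 * (a - t))
            for a, t in zip(angles, targets)]

def descend(angles, targets, steps=200, lr=0.1):
    """Plain gradient descent (stand-in for L-BFGS)."""
    angles = list(angles)
    for _ in range(steps):
        grad = gradient(angles, targets)
        angles = [a - lr * g for a, g in zip(angles, grad)]
    return angles

def noisy_restarts(targets, n_init=5, n_restarts=20, noise=0.5, seed=0):
    rng = random.Random(seed)
    pool = []
    for _ in range(n_init):  # initial pool from sampled initializations
        start = [rng.uniform(-math.pi, math.pi) for _ in targets]
        end = descend(start, targets)
        pool.append((potential(end, targets), end))
    best_after_init = min(value for value, _ in pool)
    for _ in range(n_restarts):  # restart from a perturbed low-potential structure
        pool.sort(key=lambda item: item[0])
        _, parent = rng.choice(pool[: max(1, len(pool) // 2)])
        start = [a + rng.gauss(0.0, noise) for a in parent]
        end = descend(start, targets)
        pool.append((potential(end, targets), end))
    best_value, best_angles = min(pool, key=lambda item: item[0])
    return best_after_init, best_value, best_angles, len(pool)

TARGETS = [0.3, -1.2, 2.0, 0.8]  # hypothetical optimum of the toy potential
init_best, final_best, best_angles, pool_size = noisy_restarts(TARGETS)
```

Because the pool only grows and the lowest-potential member is selected at the end, the restarts can never make the selected structure worse; whether they improve it depends on the noise scale relative to the spacing of the local minima.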
Extended Data Fig. 5 | AlphaFold CASP13 results. a, The TM score for each of the five AlphaFold CASP13 submissions is shown. Simulated annealing with fragment assembly entries are shown in blue. Gradient-descent entries are shown in yellow. Gradient descent was only used for targets T0975 and later, so to the left of the black line we also show the results for a single ‘back-fill’ run of gradient descent for each earlier target using the deployed system. T0999 (1,589 residues) was manually segmented based on HHpred [51] homology matching. b, Average TM scores of the AlphaFold CASP13 submissions (n = 104 domains), comparing the first model submitted, the best-of-five model (submission with highest GDT_TS), a single run of full-chain gradient descent (a CASP13 run for T0975 and later, back-fill for earlier targets) and a single CASP13 run of fragment assembly with domain segmentation (using a gradient descent submission for T0999). c, The formula-standardized (z) scores of the assessors for GDT_TS + QCS [52], best-of-five for CASP FM (n = 31) and FM/TBM (n = 12) domains comparing AlphaFold with the closest competitor (group 322), coloured by domain category. AlphaFold performs better (P = 0.0032, one-tailed paired statistic t-test).
Extended Data Fig. 6 | Correct fold identification by structural search in CATH. Often protein function can be inferred by finding homologous proteins of known function. Here we show that the FM predictions of AlphaFold give greater accuracy in a structure-based search for homologous domains in the CATH database. For each of the FM or TBM/FM domains, the top-one submission and ground truth are compared to all 30,744 CATH S40 non-redundant domains with TM-align [53]. For the 36 domains for which there is a good ground-truth match (score > 0.5), we show the percentage of decoys for which a domain with the same CATH code (CATH in red, CA in green; CAT results are close to CATH results) as the top ground-truth match is in the top-k matches with score > 0.5. Curves are shown for AlphaFold and the next-best group (322). AlphaFold predictions determine the matching fold more accurately. Determination of the matching CATH domain can provide insights into the function of a new protein.
Extended Data Fig. 7 | Accuracy of predictions for interfaces. Protein–protein interaction is an important domain for understanding protein function that has hitherto largely been limited to template-based models because of the need for high-accuracy predictions, although there has been moderate success [54] in docking with predicted structures up to 6 Å r.m.s.d. This figure shows that the predictions by AlphaFold improve accuracy in the interface regions of chains in heterodimer structures and are probably better candidates for docking, although docking did not form part of the AlphaFold system and all submissions were for isolated chains rather than complexes. For the five all-groups heterodimer CASP13 targets, the full-atom r.m.s.d. values of the interface residues (residues with a ground-truth inter-chain heavy-atom distance <10 Å) are computed for the chain submissions of all groups (green), relative to the target complex. Results >8 Å are not shown. AlphaFold (blue) achieves consistently high accuracy in the interface regions and, for 4 out of 5 targets, predicts interfaces below 5 Å for both chains.
Extended Data Fig. 8 | Ligand pocket visualizations for T1011. T1011 (PDB 6M9T) is the EP3 receptor bound to misoprostol-FA [55]. a, The native structure showing the ligand in a pocket. b, c, Submission 5 (78.0 GDT_TS) by AlphaFold (b), made without knowledge of the ligand, shows a pocket more similar to the true pocket than that of the best other submission (322, model 3, 68.7 GDT_TS) (c). Both submissions are aligned to the native protein using the same subset of residues from the helices close to the ligand pocket and visualized with the interior pocket together with the native ligand position.
Extended Data Fig. 9 | Attribution map of distogram network. The contact probability map of T0986s2, and the summed absolute value of the Integrated Gradient, $\sum_c \lvert S^{I,J}_{i,j,c} \rvert$, of the input two-dimensional features with respect to the expected distance between five different pairs of residues (I,J): (1) a helix self-contact, (2) a long-range strand–strand contact, (3) a medium-range strand–strand contact, (4) a non-contact and (5) a very long-range strand–strand contact. Each pair is shown as two red dots on the diagrams. Darker colours indicate a higher attribution weight.
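The attribution measure named in this caption can be illustrated with a generic sketch of Integrated Gradients: the gradient of an output with respect to each input feature is averaged along a straight path from a baseline to the actual input and multiplied by the difference between input and baseline. This is not the authors' implementation. The toy function stands in for the network's expected-distance output, the zero baseline, the midpoint Riemann sum and the finite-difference gradient are all assumptions, and the three inputs play the role of the channels c at a single position (i, j).

```python
def integrated_gradients(f, x, baseline, steps=50, h=1e-4):
    """Approximate Integrated Gradients of scalar f at x relative to baseline."""
    n = len(x)
    attributions = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for k in range(n):
            up = list(point)
            down = list(point)
            up[k] += h
            down[k] -= h
            grad_k = (f(up) - f(down)) / (2 * h)  # central finite difference
            attributions[k] += grad_k * (x[k] - baseline[k]) / steps
    return attributions

def toy_output(v):
    """Hypothetical stand-in for the network output being attributed."""
    return v[0] * v[1] + v[2] ** 2

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_output, x, baseline)
summed_abs = sum(abs(a) for a in attributions)  # analogue of summing |S| over channels c
```

A useful sanity check is the completeness property: the attributions sum to the difference between the output at the input and at the baseline, which holds here up to numerical error.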
Extended Data Fig. 10 | Attribution shown on predicted structure. For T0986s2 (TM score 0.8), the top 10 input pairs, including self-pairs, with the highest attribution weight for each of the five output pairs shown in Extended Data Fig. 9, are shown as lines (or spheres for self-pairs) coloured by sensitivity, with lighter green colours indicating greater sensitivity; the output pair is shown as a blue line.