Accurate structure prediction of biomolecular interactions with AlphaFold 3
https://doi.org/10.1038/s41586-024-07487-w
Received: 19 December 2023
Accepted: 29 April 2024
Published online: 8 May 2024
Open access

Josh Abramson1,7, Jonas Adler1,7, Jack Dunger1,7, Richard Evans1,7, Tim Green1,7, Alexander Pritzel1,7, Olaf Ronneberger1,7, Lindsay Willmore1,7, Andrew J. Ballard1, Joshua Bambrick2, Sebastian W. Bodenstein1, David A. Evans1, Chia-Chun Hung2, Michael O’Neill1, David Reiman1, Kathryn Tunyasuvunakool1, Zachary Wu1, Akvilė Žemgulytė1, Eirini Arvaniti3, Charles Beattie3, Ottavia Bertolli3, Alex Bridgland3, Alexey Cherepanov4, Miles Congreve4, Alexander I. Cowen-Rivers3, Andrew Cowie3, Michael Figurnov3, Fabian B. Fuchs3, Hannah Gladman3, Rishub Jain3, Yousuf A. Khan3,5, Caroline M. R. Low4, Kuba Perlin3, Anna Potapenko3, Pascal Savy4, Sukhdeep Singh3, Adrian Stecula4, Ashok Thillaisundaram3, Catherine Tong4, Sergei Yakneen4, Ellen D. Zhong3,6, Michal Zielinski3, Augustin Žídek3, Victor Bapst1,8, Pushmeet Kohli1,8, Max Jaderberg2,8 ✉, Demis Hassabis1,2,8 ✉ & John M. Jumper1,8 ✉
Accurate models of biological complexes are critical to our understanding of cellular functions and for the rational design of therapeutics2–4,9. Enormous progress has been achieved in protein structure prediction with the development of AlphaFold1, and the field has grown tremendously with a number of later methods that build on the ideas and techniques of AlphaFold 2 (AF2)10–12. Almost immediately after AlphaFold became available, it was shown that simple input modifications would enable surprisingly accurate protein interaction predictions13–15 and that training AF2 specifically for protein interaction prediction yielded a highly accurate system7.

These successes lead to the question of whether it is possible to accurately predict the structure of complexes containing a much wider range of biomolecules, including ligands, ions, nucleic acids and modified residues, within a deep-learning framework. A wide range of predictors for various specific interaction types has been developed16–28, as well as one generalist method developed concurrently with the present work29, but the accuracy of such deep-learning attempts has been mixed and often below that of physics-inspired methods30,31. Almost all of these methods are also highly specialized to particular interaction types and cannot predict the structure of general biomolecular complexes containing many types of entities.

Here we present AlphaFold 3 (AF3)—a model that is capable of high-accuracy prediction of complexes containing nearly all molecular types present in the Protein Data Bank32 (PDB) (Fig. 1a,b). In all but one category, it achieves a substantially higher performance than strong methods that specialize in just the given task (Fig. 1c and Extended Data Table 1), including higher accuracy at protein structure and the structure of protein–protein interactions.

This is achieved by a substantial evolution of the AF2 architecture and training procedure (Fig. 1d) both to accommodate more general chemical structures and to improve the data efficiency of learning. The system reduces the amount of multiple-sequence alignment (MSA) processing by replacing the AF2 evoformer with the simpler pairformer module (Fig. 2a). Furthermore, it directly predicts the raw atom coordinates with a diffusion module, replacing the AF2 structure module that operated on amino-acid-specific frames and side-chain torsion angles (Fig. 2b). The multiscale nature of the diffusion process (low noise levels induce the network to improve local structure) also enables us to eliminate stereochemical losses and most special handling of bonding patterns in the network, easily accommodating arbitrary chemical components.
1Core Contributor, Google DeepMind, London, UK. 2Core Contributor, Isomorphic Labs, London, UK. 3Google DeepMind, London, UK. 4Isomorphic Labs, London, UK. 5Department of Molecular and Cellular Physiology, Stanford University, Stanford, CA, USA. 6Department of Computer Science, Princeton University, Princeton, NJ, USA. 7These authors contributed equally: Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore. 8These authors jointly supervised this work: Victor Bapst, Pushmeet Kohli, Max Jaderberg, Demis Hassabis, John M. Jumper. ✉e-mail: [email protected]; [email protected]; [email protected]
Fig. 1 | AF3 accurately predicts structures across biomolecular complexes. a,b, Example structures predicted using AF3. a, Bacterial CRP/FNR family transcriptional regulator protein bound to DNA and cGMP (PDB 7PZB; full-complex LDDT47, 82.8; global distance test (GDT)48, 90.1). b, Human coronavirus OC43 spike protein, 4,665 residues, heavily glycosylated and bound by neutralizing antibodies (PDB 7PNM; full-complex LDDT, 83.0; GDT, 83.1). c, AF3 performance on PoseBusters (v.1, August 2023 release), our recent PDB evaluation set and CASP15 RNA. Metrics are as follows: percentage of pocket-aligned ligand r.m.s.d. < 2 Å for ligands and covalent modifications; interface LDDT for protein–nucleic acid complexes; LDDT for nucleic acid and protein monomers; and percentage DockQ > 0.23 for protein–protein and protein–antibody interfaces. All scores are reported from the top confidence-ranked sample out of five model seeds (each with five diffusion samples), except for protein–antibody scores, which were ranked across 1,000 model seeds for both models (each AF3 seed with five diffusion samples). Sampling and ranking details are provided in the Methods. For ligands, n indicates the number of targets; for nucleic acids, n indicates the number of structures; for modifications, n indicates the number of clusters; and for proteins, n indicates the number of clusters. The bar height indicates the mean; error bars indicate exact binomial distribution 95% confidence intervals for PoseBusters and 95% confidence intervals from 10,000 bootstrap resamples for all others. Significance levels were calculated using two-sided Fisher’s exact tests for PoseBusters and using two-sided Wilcoxon signed-rank tests for all others; ***P < 0.001, **P < 0.01. Exact P values (from left to right) are as follows: 2.27 × 10−13, 2.57 × 10−3, 2.78 × 10−3, 7.28 × 10−12, 1.81 × 10−18, 6.54 × 10−5 and 1.74 × 10−34. AF-M 2.3, AlphaFold-Multimer v.2.3; dsDNA, double-stranded DNA. d, AF3 architecture for inference. The rectangles represent processing modules and the arrows show the data flow. Yellow, input data; blue, abstract network activations; green, output data. The coloured balls represent physical atom coordinates.
Network architecture and training
The overall structure of AF3 (Fig. 1d and Supplementary Methods 3) echoes that of AF2, with a large trunk evolving a pairwise representation of the chemical complex followed by a structure module that uses the pairwise representation to generate explicit atomic positions, but there are large differences in each major component. These modifications were driven both by the need to accommodate a wide range of chemical entities without excessive special casing and by observations of AF2 performance with different modifications. Within the trunk, MSA processing is substantially de-emphasized, with a much smaller and simpler MSA embedding block (Supplementary Methods 3.3). Compared with the original evoformer from AF2, the number of blocks is reduced to four, the processing of the MSA representation uses an inexpensive pair-weighted averaging and only the pair representation is used for later processing steps. The ‘pairformer’ (Fig. 2a and Supplementary Methods 3.6) replaces the evoformer of AF2 as the dominant processing block. It operates only on the pair representation and the single representation; the MSA representation is not retained and all information passes through the pair representation. The pair processing and the number of blocks (48) are largely unchanged from AF2. The resulting pair and single representation together with the input representation are passed to the new diffusion module (Fig. 2b) that replaces the structure module of AF2.
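As an orientation to the tensor shapes involved, the data flow above can be caricatured in numpy. This is a schematic sketch only: the real pairformer uses triangle updates and attention (Supplementary Methods 3.6), and the toy update here is an invented placeholder.

```python
import numpy as np

# Schematic shapes from Fig. 2a: n tokens, 128 pair channels, 384 single channels.
n, c_pair, c_single = 16, 128, 384
rng = np.random.default_rng(0)

pair = rng.standard_normal((n, n, c_pair))    # pair representation (n, n, 128)
single = rng.standard_normal((n, c_single))   # single representation (n, 384)

def toy_pairformer_block(pair, single):
    """Placeholder update that only mimics the interface: pair and single
    in, pair and single out, with no MSA representation retained. The real
    block applies triangle updates and pair-biased attention instead."""
    pair = pair + 0.1 * np.tanh(pair)
    bias = pair.mean(axis=-1)                 # (n, n) bias derived from the pair rep
    attn = np.exp(bias - bias.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    single = single + 0.1 * (attn @ single)   # damped residual update
    return pair, single

# 48 blocks, each with its own trainable parameters in the real model.
for _ in range(48):
    pair, single = toy_pairformer_block(pair, single)

print(pair.shape, single.shape)  # (16, 16, 128) (16, 384)
```

The point of the sketch is the information flow: all cross-token communication passes through the pair representation, and no MSA tensor survives past the embedding block.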
Fig. 2 | Architectural and training details. a, The pairformer module. Input and output: pair representation with dimension (n, n, c) and single representation with dimension (n, c). n is the number of tokens (polymer residues and atoms); c is the number of channels (128 for the pair representation, 384 for the single representation). Each of the 48 blocks has an independent set of trainable parameters. b, The diffusion module. Input: coarse arrays depict per-token representations (green, inputs; blue, pair; red, single). Fine arrays depict per-atom representations. The coloured balls represent physical atom coordinates. Cond., conditioning; rand. rot. trans., random rotation and translation; seq., sequence. c, The training set-up (distogram head omitted) starting from the end of the network trunk. The coloured arrays show activations from the network trunk (green, inputs; blue, pair; red, single). The blue arrows show abstract activation arrays. The yellow arrows show ground-truth data. The green arrows show predicted data. The stop sign represents stopping of the gradient. Both depicted diffusion modules share weights. d, Training curves for initial training and fine-tuning stages, showing the LDDT on our evaluation set as a function of optimizer steps. The scatter plot shows the raw datapoints and the lines show the smoothed performance using a median filter with a kernel width of nine datapoints. The crosses mark the point at which the smoothed performance reaches 97% of its initial training maximum.
The diffusion module (Fig. 2b and Supplementary Methods 3.7) operates directly on raw atom coordinates, and on a coarse abstract token representation, without rotational frames or any equivariant processing. We had observed in AF2 that removing most of the complexity of the structure module had only a modest effect on the prediction accuracy, and maintaining the backbone frame and side-chain torsion representation adds quite a bit of complexity for general molecular graphs. Similarly, AF2 required carefully tuned stereochemical violation penalties during training to enforce chemical plausibility of the resulting structures. We use a relatively standard diffusion approach33 in which the diffusion model is trained to receive ‘noised’ atomic coordinates and then predict the true coordinates. This task requires the network to learn protein structure at a variety of length scales, whereby the denoising task at small noise emphasizes understanding very local stereochemistry and the denoising task at high noise emphasizes the large-scale structure of the system. At inference time, random noise is sampled and then recurrently denoised to produce a final structure. Importantly, this is a generative training procedure that produces a distribution of answers. This means that, for each answer, the local structure will be sharply defined (for example, side-chain bond geometry) even when the network is uncertain about the positions. For this reason, we are able to avoid both torsion-based parametrizations of the residues and violation losses on the structure, while handling the full complexity of general ligands. Similarly to some recent work34,
Fig. 3 | Examples of predicted complexes. Selected structure predictions from AF3. Predicted protein chains are shown in blue (predicted antibody in green), predicted ligands and glycans in orange, predicted RNA in purple and the ground truth is shown in grey. a, Human 40S small ribosomal subunit (7,663 residues) including 18S ribosomal RNA and Met-tRNAiMet (opaque purple) in a complex with translation initiation factors eIF1A and eIF5B (opaque blue; PDB 7TQL; full-complex LDDT, 87.7; GDT, 86.9). b, The glycosylated globular portion of an EXTL3 homodimer (PDB 7AU2; mean pocket-aligned r.m.s.d., 1.10 Å). c, Mesothelin C-terminal peptide bound to the monoclonal antibody 15B6 (PDB 7U8C; DockQ, 0.85). d, LGK974, a clinical-stage inhibitor, bound to PORCN in a complex with the WNT3A peptide (PDB 7URD; ligand r.m.s.d., 1.00 Å). e, (5S,6S)-O7-sulfo DADH bound to the AziU3/U2 complex with a novel fold (PDB 7WUX; ligand r.m.s.d., 1.92 Å). f, Analogue of NIH-12848 bound to an allosteric site of PI5P4Kγ (PDB 7QIE; ligand r.m.s.d., 0.37 Å).
we find that no invariance or equivariance with respect to global rotations and translation of the molecule are required in the architecture and we therefore omit them to simplify the machine learning architecture.

The use of a generative diffusion approach comes with some technical challenges that we needed to address. The biggest issue is that generative models are prone to hallucination35, whereby the model may invent plausible-looking structure even in unstructured regions. To counteract this effect, we use a cross-distillation method in which we enrich the training data with structures predicted by AlphaFold-Multimer (v.2.3)7,8. In these structures, unstructured regions are typically represented by long extended loops instead of compact structures, and training on them ‘teaches’ AF3 to mimic this behaviour. This cross-distillation greatly reduced the hallucination behaviour of AF3 (see Extended Data Fig. 1 for disorder prediction results on the CAID 2 (ref. 36) benchmark set).

We also developed confidence measures that predict the atom-level and pairwise errors in our final structures. In AF2, this was done directly by regressing the error in the output of the structure module during training. However, this procedure is not applicable to diffusion training, as only a single step of the diffusion is trained instead of a full-structure generation (Fig. 2c). To remedy this, we developed a diffusion ‘rollout’ procedure for the full-structure prediction generation during training (using a larger step size than normal; Fig. 2c (mini-rollout)). This predicted structure is then used to permute the symmetric ground-truth chains and ligands, and to compute the performance metrics to train the confidence head. The confidence head uses the pairwise representation to predict a modified local distance difference test (pLDDT) and a predicted aligned error (PAE) matrix as in AF2, as well as a distance error matrix (PDE), which is the error in the distance matrix of the predicted structure as compared to the true structure (details are provided in Supplementary Methods 4.3).

Figure 2d shows that, during initial training, the model learns quickly to predict the local structures (all intrachain metrics go up quickly and reach 97% of the maximum performance within the first 20,000 training steps), while the model needs considerably longer to learn the global constellation (the interface metrics go up slowly and protein–protein interface LDDT passes the 97% bar only after 60,000 steps). During AF3 development, we observed that some model abilities topped out relatively early and started to decline (most likely due to overfitting to the limited number of training samples for this capability), while other abilities were still undertrained. We addressed this by increasing or decreasing the sampling probability for the corresponding training sets (Supplementary Methods 2.5.1) and by performing early stopping using a weighted average of all of the above metrics and some additional metrics to select the best model checkpoint (Supplementary Table 7). The fine-tuning stages with the larger crop sizes improve the model on all metrics with an especially high uplift on protein–protein interfaces (Extended Data Fig. 2).

Accuracy across complex types
AF3 can predict structures from input polymer sequences, residue modifications and ligand SMILES (simplified molecular-input line-entry system). In Fig. 3 we show a selection of examples highlighting the ability of the model to generalize to a number of biologically important and therapeutically relevant modalities. In selecting these examples, we considered novelty in terms of the similarity of individual chains and interfaces to the training set (additional information is provided in Supplementary Methods 8.1).

We evaluated the performance of the system on recent interface-specific benchmarks for each complex type (Fig. 1c and Extended Data Table 1). Performance on protein–ligand interfaces was evaluated on the PoseBusters benchmark set, which is composed of 428 protein–ligand structures released to the PDB in 2021 or later. As our standard training cut-off date is in 2021, we trained a separate AF3 model with an earlier training-set cutoff (Methods). Accuracy on the PoseBusters set is reported as the percentage of protein–ligand pairs with pocket-aligned ligand root mean squared deviation (r.m.s.d.) of less than 2 Å. The baseline models come in two categories: those that use only protein sequence and ligand SMILES as an input and those that additionally leak information from the solved protein–ligand test structure. Traditional docking methods use the latter privileged information, even though that information would not be available in real-world use cases. Even so, AF3 greatly outperforms classical docking tools such as Vina37,38 even while not using any structural inputs (Fisher’s exact test, P = 2.27 × 10−13) and greatly outperforms all other true blind docking
Fig. 4 | AF3 confidences track accuracy. a, The accuracy of protein-containing interfaces as a function of chain pair ipTM (top). Bottom, the LDDT-to-polymer accuracy was evaluated for various chain types as a function of chain-averaged pLDDT. The box plots show the 25–75% confidence intervals (box limits), the median (centre line) and the 5–95% confidence intervals (whiskers). n values report the number of clusters in each band. b, The predicted structure of PDB 7T82 coloured by pLDDT (orange, 0–50; yellow, 50–70; cyan, 70–90; and blue, 90–100). c, The same prediction coloured by chain. d, DockQ scores for protein–protein interfaces. e, PAE matrix of the same prediction (darker is more confident), with the chain colouring of c on the side bars. The dashed black lines indicate the chain boundaries.
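The distance-error matrix (PDE) defined in the main text is the error in the distance matrix of the predicted structure relative to the true structure. A minimal version of that target is easy to write down; the variable names are ours and the toy coordinates are synthetic.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between representative atoms, shape (n, n)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

rng = np.random.default_rng(0)
true_xyz = rng.standard_normal((10, 3)) * 10.0             # toy ground truth
pred_xyz = true_xyz + 0.5 * rng.standard_normal((10, 3))   # toy prediction

# Regression target for a PDE-style head: absolute error between the
# predicted and true distance matrices.
pde_target = np.abs(distance_matrix(pred_xyz) - distance_matrix(true_xyz))

print(pde_target.shape)  # (10, 10)
```

By construction the target is symmetric with a zero diagonal, so, like PAE, it is naturally displayed as a per-token-pair heatmap.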
that AF2 produces in disordered regions. To encourage ribbon-like predictions in AF3, we use distillation training from AF2 predictions, and we add a ranking term to encourage results with more solvent accessible surface area36.

A key limitation of protein structure prediction models is that they typically predict static structures as seen in the PDB, not the dynamical behaviour of biomolecular systems in solution. This limitation persists for AF3, in which multiple random seeds for either the diffusion head or the overall network do not produce an approximation of the solution ensemble.

In some cases, the modelled conformational state may not be correct or comprehensive given the specified ligands and other inputs. For example, E3 ubiquitin ligases natively adopt an open conformation in an apo state and have been observed only in a closed state when bound to ligands, but AF3 exclusively predicts the closed state for both holo and apo systems42 (Fig. 5c). Many methods have been developed, particularly around MSA resampling, that assist in generating diversity from previous AlphaFold models43–45 and may also assist in multistate prediction with AF3.

Despite the large advance in modelling accuracy in AF3, there are still many targets for which accurate modelling can be challenging. To obtain the highest accuracy, it may be necessary to generate a large number of predictions and rank them, which incurs an extra computational cost. A class of targets in which we observe this effect
Fig. 5 | Model limitations. a, Antibody prediction quality increases with the number of model seeds. The quality of top-ranked, low-homology antibody–antigen interface predictions as a function of the number of seeds. Each datapoint shows the mean over 1,000 random samples (with replacement) of seeds to rank over, out of 1,200 seeds. Confidence intervals are 95% bootstraps over 10,000 resamples of cluster scores at each datapoint. Samples per interface are ranked by protein–protein ipTM. Significance tests were performed using two-sided Wilcoxon signed-rank tests. n = 65 clusters. Exact P values were as follows: 2.0 × 10−5 (percentage correct) and P = 0.009 (percentage very high accuracy). b, Prediction (coloured) and ground-truth (grey) structures of Thermotoga maritima α-glucuronidase and beta-d-glucuronic acid—a target from the PoseBusters set (PDB: 7CTM). AF3 predicts alpha-d-glucuronic acid; the differing chiral centre is indicated by an asterisk. The prediction shown is top-ranked by ligand–protein ipTM and with a chirality and clash penalty. c, Conformation coverage is limited. Ground-truth structures (grey) of cereblon in open (apo, PDB: 8CVP; left) and closed (holo mezigdomide-bound, PDB: 8D7U; right) conformations. Predictions (blue) of both apo (with 10 overlaid samples) and holo structures are in the closed conformation. The dashed lines indicate the distance between the N-terminal Lon protease-like and C-terminal thalidomide-binding domain. d, A nuclear pore complex with 1,854 unresolved residues (PDB: 7F60). The ground truth (left) and predictions from AlphaFold-Multimer v.2.3 (middle) and AF3 (right) are shown. e, Prediction of a trinucleosome with overlapping DNA (pink) and protein (blue) chains (PDB: 7PEU); highlighted are overlapping protein chains B and J and self-overlapping DNA chain AA. Unless otherwise stated, predictions are top-ranked by our global complex ranking metric with chiral mismatch and steric clash penalties (Supplementary Methods 5.9.1).
Extended Data Fig. 1 | Disordered region prediction. a, Example prediction for a disordered protein from AlphaFold-Multimer v2.3, AlphaFold 3, and AlphaFold 3 trained without the disordered protein PDB cross distillation set. Protein is DP02376 from the CAID 2 (Critical Assessment of protein Intrinsic Disorder prediction) set. Predictions coloured by pLDDT (orange: pLDDT <= 50, yellow: 50 < pLDDT <= 70, light blue: 70 < pLDDT <= 90, and dark blue: 90 <= pLDDT < 100). b, Predictions of disorder across residues in proteins in the CAID 2 set, which are also low homology to the AF3 training set. Prediction methods include RASA (relative accessible surface area) and pLDDT (N = 151 proteins; 46,093 residues).
Extended Data Fig. 2 | Accuracy across training. Training curves for initial training and fine tuning showing LDDT (local distance difference test) on our evaluation set as a function of optimizer steps. One optimizer step uses a mini-batch of 256 trunk samples and, during initial training, 256 × 48 = 12,288 diffusion samples. For fine tuning the number of diffusion samples is reduced to 256 × 32 = 8,192. The scatter plot shows the raw data points and the lines show the smoothed performance using a median filter with a kernel width of 9 data points. The dashed lines mark the points where the smoothed performance passes 90% and 97% of the initial training maximum for the first time.
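The smoothing and threshold bookkeeping described in this caption (running median of width nine, first crossing of a fraction of the maximum) is simple to reproduce. A sketch on a synthetic learning curve; the saturating-exponential curve and noise level are invented for illustration.

```python
import numpy as np

def median_filter(y, width=9):
    """Running median with edge replication, as a simple curve smoother."""
    half = width // 2
    padded = np.concatenate([np.repeat(y[0], half), y, np.repeat(y[-1], half)])
    return np.array([np.median(padded[i:i + width]) for i in range(len(y))])

rng = np.random.default_rng(0)
steps = np.arange(200)  # optimizer steps (arbitrary units)
curve = 100 * (1 - np.exp(-steps / 40.0)) + rng.normal(0, 1.0, steps.size)

smoothed = median_filter(curve, width=9)
maximum = smoothed.max()
# Index of the first step where the smoothed curve reaches 97% of its maximum.
first_97 = int(np.argmax(smoothed >= 0.97 * maximum))

print(first_97)
```

Smoothing before thresholding matters here: on the raw noisy curve a single upward fluctuation could register a spuriously early crossing.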
Article
Extended Data Fig. 3 | AlphaFold 3 predictions of PoseBusters examples for which Vina and Gold were inaccurate. Predicted protein chains are shown in blue, predicted ligands in orange, and ground truth in grey. a, Human Notum bound to inhibitor ARUK3004556 (PDB ID 8BTI, ligand RMSD: 0.65 Å). b, Pseudomonas sp. PDC86 Aapf bound to HEHEAA (PDB ID 7KZ9, ligand RMSD: 1.3 Å). c, Human Galectin-3 carbohydrate-recognition domain in complex with compound 22 (PDB ID 7XFA, ligand RMSD: 0.44 Å).
Extended Data Fig. 4 | PoseBusters analysis. a, Comparison of AlphaFold 3 and baseline method protein-ligand binding success on the PoseBusters Version 1 benchmark set (V1, August 2023 release). Methods classified by the extent of ground truth information used to make predictions. Note all methods that use pocket residue information except for UMol and AF3 also use ground truth holo protein structures. b, PoseBusters Version 2 (V2, November 2023 release) comparison between the leading docking method Vina and AF3 2019 (two-sided Fisher exact test, N = 308 targets, P = 2.3 × 10−8). c, PoseBusters V2 results of AF3 2019 on targets with low, moderate, and high protein sequence homology (integer ranges indicate maximum sequence identity with proteins in the training set). d, PoseBusters V2 results of AF3 2019 with ligands split by those characterized as “common natural” ligands and others. “Common natural” ligands are defined as those which occur greater than 100 times in the PDB and which are not non-natural (by visual inspection). A full list may be found in Supplementary Table 15. Dark bar indicates RMSD < 2 Å and passing PoseBusters validity checks (PB-valid). e, PoseBusters V2 structural accuracy and validity. Dark bar indicates RMSD < 2 Å and passing PoseBusters validity checks (PB-valid). Light hashed bar indicates RMSD < 2 Å but not PB valid. f, PoseBusters V2 detailed validity check comparison. Error bars indicate exact binomial distribution 95% confidence intervals. N = 427 targets for RoseTTAFold All-Atom and 428 targets for all others in Version 1; 308 targets in Version 2.
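The pocket-aligned ligand RMSD criterion used throughout these comparisons can be sketched as a Kabsch superposition on pocket atoms followed by an RMSD over the carried-along ligand. This is our own minimal implementation on toy coordinates; the benchmark's exact pocket definition and atom matching differ.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t that best superpose points P onto Q."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def pocket_aligned_rmsd(pred_pocket, true_pocket, pred_lig, true_lig):
    R, t = kabsch(pred_pocket, true_pocket)  # align on pocket atoms only
    moved = pred_lig @ R.T + t               # carry the ligand along
    return float(np.sqrt(np.mean(np.sum((moved - true_lig) ** 2, axis=1))))

rng = np.random.default_rng(0)
true_pocket = rng.standard_normal((30, 3)) * 5
true_lig = rng.standard_normal((12, 3)) * 2
# A prediction that equals the truth under an arbitrary rigid motion:
theta = 0.3
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred_pocket = true_pocket @ R0.T + 1.0
pred_lig = true_lig @ R0.T + 1.0

rmsd = pocket_aligned_rmsd(pred_pocket, true_pocket, pred_lig, true_lig)
print(rmsd < 2.0)  # a rigidly moved but otherwise exact pose counts as a success
```

Aligning on the pocket rather than on the ligand itself is what makes the metric sensitive to pose errors: a ligand predicted in the wrong orientation within a correctly placed pocket still scores a large RMSD.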
Extended Data Fig. 5 | Nucleic acid prediction accuracy and confidences. a, CASP15 RNA prediction accuracy from AIChemy_RNA (the top AI-based submission), RoseTTAFold2NA (the AI-based method capable of predicting protein–RNA complexes), and AlphaFold 3. Ten of the 13 targets are available in the PDB or via the CASP15 website for evaluation. Predictions are downloaded from the CASP website for external models. b, Accuracy on structures containing low homology RNA-only or DNA-only complexes from the recent PDB evaluation set. Comparison between AlphaFold 3 and RoseTTAFold2NA (RF2NA) (RNA: N = 29 structures, paired Wilcoxon signed-rank test, P = 1.6 × 10−7; DNA: N = 63 structures, paired two-sided Wilcoxon signed-rank test, P = 5.2 × 10−12). Note RF2NA was only trained and evaluated on duplexes (chains forming at least 10 hydrogen bonds), but some DNA structures in this set may not be duplexes. Box, centerline, and whiskers boundaries are at (25%, 75%) intervals, median, and (5%, 95%) intervals. c, Predicted structure of a mycobacteriophage immunity repressor protein bound to double stranded DNA (PDB ID 7R6R), coloured by pLDDT (left; orange: 0–50, yellow: 50–70, cyan 70–90, and blue 90–100) and chain id (right). Note the disordered N-terminus not entirely shown. d, Predicted aligned error (PAE) per token-pair for the prediction in c with rows and columns labelled by chain id and green gradient indicating PAE.
Extended Data Fig. 6 | Analysis and examples for modified proteins and nucleic acids. a, Accuracy on structures containing common phosphorylation residues (SEP, TPO, PTR, NEP, HIP) from the recent PDB evaluation set. Comparison between AlphaFold 3 with phosphorylation modelled, and AlphaFold 3 without modelling phosphorylation (N = 76 clusters, paired two-sided Wilcoxon signed-rank test, P = 1.6 × 10−4). Note, to predict a structure without modelling phosphorylation, we predict the parent (standard) residue in place of the modification. AlphaFold 3 generally achieves better backbone accuracy when modelling phosphorylation. Error bars indicate exact binomial distribution 95% confidence intervals. b, SPOC domain of human SHARP in complex with phosphorylated RNA polymerase II C-terminal domain (PDB ID 7Z1K), predictions coloured by pLDDT (orange: 0–50, yellow: 50–70, cyan 70–90, and blue 90–100). Left: Phosphorylation modelled (mean pocket-aligned RMSDCα 2.104 Å). Right: Without modelling phosphorylation (mean pocket-aligned RMSDCα 10.261 Å). When excluding phosphorylation, AlphaFold 3 provides lower pLDDT confidence on the phosphopeptide. c, Structure of parkin bound to two phospho-ubiquitin molecules (PDB ID 7US1), predictions similarly coloured by pLDDT. Left: Phosphorylation modelled (mean pocket-aligned RMSDCα 0.424 Å). Right: Without modelling phosphorylation (mean pocket-aligned RMSDCα 9.706 Å). When excluding phosphorylation, AlphaFold 3 provides lower pLDDT confidence on the interface residues of the incorrectly predicted ubiquitin. d, Example structures with modified nucleic acids. Left: Guanosine monophosphate in RNA (PDB ID 7TNZ, mean pocket-aligned modified residue RMSD 0.840 Å). Right: Methylated DNA cytosines (PDB ID 7SDW, mean pocket-aligned modified residue RMSD 0.502 Å). We label residues of the predicted structure for reference. Ground truth structure in grey; predicted protein in blue, predicted RNA in purple, predicted DNA in magenta, predicted ions in orange, with predicted modifications highlighted via spheres.
Extended Data Fig. 7 | Model accuracy with MSA size and number of seeds. a, Effect of MSA depth on protein prediction accuracy. Accuracy is given as single chain LDDT score and MSA depth is computed by counting the number of non-gap residues for each position in the MSA using the Neff weighting scheme and taking the median across residues (see Methods for details on Neff). MSA used for AF-M 2.3 differs slightly from AF3; the data uses the AF3 MSA depth for both to make the comparison clearer. The analysis uses every protein chain in the low homology Recent PDB set, restricted to chains in complexes with fewer than 20 protein chains and fewer than 2,560 tokens (see Methods for details on Recent PDB set and comparisons to AF-M 2.3). The curves are obtained through Gaussian kernel average smoothing (window size is 0.2 units in log10(Neff)); the shaded area is the 95% confidence interval estimated using bootstrap of 10,000 samples. b, Increase in ranked accuracy with number of seeds for different molecule types. Predictions are ranked by confidence, and only the most confident per interface is scored. Evaluated on the low homology recent PDB set, filtered to less than 1,536 tokens. Number of clusters evaluated: dna-intra = 386, protein-intra = 875, rna-intra = 78, protein-dna = 307, protein-rna = 102, protein-protein (antibody = False) = 697, protein-protein (antibody = True) = 58. Confidence intervals are 95% bootstraps over 1,000 samples.
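The MSA-depth statistic in panel a (a weighted non-gap count per alignment column, with the median taken across columns) can be sketched on a toy alignment. The per-sequence weights below are uniform stand-ins; the actual Neff weighting scheme is defined in the paper's Methods.

```python
import numpy as np

# Toy MSA: rows are aligned sequences, '-' marks a gap.
msa = np.array([list("MKV-LI"),
                list("MKVALI"),
                list("MR--LI"),
                list("MKVALV")])

# Invented per-sequence weights standing in for the Neff weighting scheme
# (uniform here; the real weights downweight redundant sequences).
weights = np.ones(msa.shape[0])

non_gap = (msa != "-").astype(float)   # 1 where a residue is present
per_position = non_gap.T @ weights     # weighted non-gap count per column
depth = float(np.median(per_position)) # median across residues

print(depth)  # 4.0 for this toy alignment
```

With non-uniform weights the same two lines compute an effective depth rather than a raw sequence count, which is the point of the Neff scheme.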
Extended Data Fig. 8 | Relationship between confidence and accuracy for protein interactions with ions, bonded ligands and bonded glycans. Accuracy is given as the percentage of interface clusters under various pocket-aligned RMSD thresholds, as a function of the chain pair ipTM of the interface. The ions group includes both metals and nonmetals. N values report the number of clusters in each band. For a similar analysis on general ligand-protein interfaces, see Fig. 4 of the main text.
Extended Data Fig. 9 | Correlation of DockQ and iLDDT for protein-protein interfaces. One data point per cluster, 4,182 clusters shown. Line of best fit with a Huber regressor with epsilon 1. DockQ categories correct (>0.23) and very high accuracy (>0.8) correspond to iLDDTs of 23.6 and 77.6 respectively.
Extended Data Table 1 | Prediction accuracy across biomolecular complexes
AlphaFold 3 performance on PoseBusters V1 (August 2023 release), PoseBusters V2 (November 6th 2023 release), and our Recent PDB evaluation set. For ligands and nucleic acids, N indicates the number of structures; for covalent modifications and proteins, N indicates the number of clusters.