CombFold predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2

The article presents CombFold, a novel algorithm for predicting the structures of large protein assemblies by combining a combinatorial assembly approach with AlphaFold2's predictions of pairwise interactions. CombFold demonstrated a top-10 success rate of 72% in accurately predicting complex structures, significantly improving structural coverage compared to existing Protein Data Bank entries. This method enhances the ability to model large protein complexes, which is crucial for understanding their functions and applications in drug discovery.


nature methods

Article https://doi.org/10.1038/s41592-024-02174-0

CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2

Ben Shor & Dina Schneidman-Duhovny

Received: 17 May 2023
Accepted: 9 January 2024
Published online: 7 February 2024

Deep learning models, such as AlphaFold2 and RosettaFold, enable high-accuracy protein structure prediction. However, large protein complexes are still challenging to predict due to their size and the complexity of interactions between multiple subunits. Here we present CombFold, a combinatorial and hierarchical assembly algorithm for predicting structures of large protein complexes utilizing pairwise interactions between subunits predicted by AlphaFold2. CombFold accurately predicted (TM-score >0.7) 72% of the complexes among the top-10 predictions in two datasets of 60 large, asymmetric assemblies. Moreover, the structural coverage of predicted complexes was 20% higher compared to corresponding Protein Data Bank entries. We applied the method on complexes from Complex Portal with known stoichiometry but without known structure and obtained high-confidence predictions. CombFold supports the integration of distance restraints based on crosslinking mass spectrometry and fast enumeration of possible complex stoichiometries. CombFold’s high accuracy makes it a promising tool for expanding structural coverage beyond monomeric proteins.

Most proteins function as multimolecular assemblies in the cells. There are on average a few dozen interactions per protein1–3. These assemblies perform important functions, such as energy transduction4, transport5 and signal transduction6. The determination of the 3D structures of these assemblies is critical for understanding their function and evolution, interpreting the effects of mutations, and potential applications in drug discovery. The large size of some assemblies and conformational heterogeneity pose challenges for traditional structural characterization techniques, such as X-ray crystallography and nuclear magnetic resonance spectroscopy. While progress has been made using cryo-electron microscopy (cryo-EM), high-throughput structure determination of large assemblies is still challenging.

Recently deep learning techniques greatly advanced our ability to predict high-accuracy protein structures. One of the most notable advancements was the release of AlphaFold2 (ref. 7) and RosettaFold8. While AlphaFold2 was designed to predict single-chain proteins, it can also apply to predict protein complexes using the same architecture. Soon after its release, several techniques were developed to use AlphaFold2 to predict multichain protein complexes, first by using a linker9 and later by offsetting the residue index10. Similar techniques were used for the training of AlphaFold-Multimer (AFM)11 which is able to predict multimeric complexes with high accuracy using paired and padded multiple sequence alignment. On several pairwise protein–protein docking benchmarks AFM achieves a success rate of 40–70% for complexes consisting of two to nine chains up to 1,536 in total length11–13.

However, AFM application for predicting structures of large assemblies is still challenging12,13. The first difficulty is the requirement for substantial resources, such as a graphical processing unit (GPU) with a large memory size. Currently, common GPUs have ~20 GB of memory, enabling the prediction of complexes up to 1,800 and 3,000 amino acids for AFM version 2.2 (AFMv2) and AFM version 2.3

The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
e-mail: [email protected]

Nature Methods | Volume 21 | March 2024 | 477–487 477


(AFMv3), respectively. We estimate that in a few years GPU cards with sufficient memory will become widely available. However, as AFM memory usage increases roughly quadratically with the number of amino acids7, this currently limits the practical capability of many researchers to predict structures of large size, leaving many macromolecular complexes without a structure prediction. The second difficulty is sampling with a large number of restraints: as the number of chains and amino acids increases, the number of residue–residue contacts and distance restraints to optimize increases as well, making it harder for the model to converge to accurate structures. Large, multimolecular complex prediction is an out-of-domain inference setup for AFM since it was trained only on cropped regions and thus is not expected to perform well. The third difficulty is that AFM converges to a single (sometimes incorrect) structure (for each of the five available trained models) and it is highly challenging to obtain a diverse set of predictions for the same target14.

Prior to the deep learning revolution, methods developed for the assembly of multiprotein complexes could be divided into two main categories. The first category is integrative modeling methods that mainly rely on experimental data15,16, and the second is docking-based methods that rely on pairwise protein–protein docking17–19. Integrative modeling methods rely on information from multiple sources, such as crosslinking mass spectrometry, Förster resonance energy transfer (FRET), co-evolution, cryo-EM and small-angle X-ray scattering to compute models. This information is converted into spatial restraints and combined into an integrative modeling approach20,21, using specialized software packages22–24 to generate a set of structural models that are consistent with it. The integrative modeling workflow iterates through four stages that convert input information into an output model: (1) gathering data, (2) scoring (representing and translating the data into spatial restraints), (3) sampling, and (4) validating the model15,22. The sampling of candidate models is often performed by global data-driven optimization algorithms, such as Monte Carlo or genetic algorithms. The input information contributes to a scoring function, either for ranking or filtering generated structural models or for directly guiding the sampling process. Integrative structure modeling is applicable to large and heterogeneous systems25, such as the ~52 MDa nuclear pore complex26. AlphaLink27 was developed recently to support such sampling with distance restraints using AlphaFold2.

The second category of docking-based methods predominantly rely on pairwise protein–protein docking for the prediction of complexes28–31 and do not require additional input information. In pairwise docking, the two input proteins are docked to one another using geometric shape and physicochemical complementarity. The main problem is that they sample thousands of docked configurations. While the correct ones are usually sampled, it is difficult to rank them as top-scoring. Typically, pairwise docking methods succeed in ranking a correct model among the top-10 best scoring in 25–40% of the cases32,33. This low accuracy further complicates the multiprotein assembly stage, where methods have to consider a large number of pairwise protein–protein docking models. For example, Multi-LZerD18 builds the multimolecular assembly by applying a stochastic search driven by a genetic algorithm. Kuzu et al.19 construct the multimolecular complex iteratively, where a single subunit is added to the subassembly in each iteration. The CombDock method is hierarchical and combinatorial17,34. The complexes are constructed hierarchically by generating subassemblies of two or more subunits. At each stage, subassemblies are connected using pairwise docking configurations between subunits. Due to multiple possible hierarchical assembly pathways, the algorithm combinatorially enumerates assembly trees. Since the algorithms used for docking and scoring pairwise interactions have low accuracy, it is difficult to reach high accuracy in multisubunit docking.

The recently developed MoLPC method relies on AlphaFold2 to produce configurations for pairs and triplets of chains and assemble them using Monte Carlo Tree Search35. However, the approach is applicable mainly to homomeric complexes with a success rate of ~30%. In this Article, inspired by this work, we combine AlphaFold2 with a deterministic combinatorial assembly algorithm17,34. Our new method, CombFold, uses a small number of pairwise subunit interactions generated by AlphaFold2 for assembly instead of thousands generated by docking. The hierarchical and combinatorial assembly stage exhaustively enumerates possible assembly trees, maximizing the probability of correctly assembling the complex based on pairwise AlphaFold2 interactions. We validate our approach on two benchmarks of large heteromeric assemblies (up to 30 chains and 18,000 amino acids) and obtain a top-1 success rate of 62% and top-10 success rate of 72% (TM-score >0.7). Moreover, CombFold is able to increase the structural coverage by 20% relative to experimental structures in our benchmarks. Integration of distance restraints based on crosslinking mass spectrometry further increases the success rate. We also test the method on the benchmark of homomeric complexes used for MoLPC validation and obtain a top-1 success rate of 57%. CombFold successfully assembles six out of seven CASP15 targets with over 3,000 amino acids (Supplementary Note 1 and Supplementary Fig. 1). We apply the method on a set of complexes with known stoichiometry and without known structure from Complex Portal36 and obtain confident predictions.

Results

Overview of CombFold

The input to CombFold is the subunit sequences and optionally distance restraints, the output is a set of assembled structures. A subunit can be a single chain or a domain. The approach is based on combinatorial and hierarchical assembly via pairwise interactions. In principle, there is no limitation on complex size, as the complex can be divided into subunits suited for the GPU memory limit, and our current implementation supports up to 128 subunits. CombFold works in three major stages: (1) generation of pairwise subunit interactions by AFM, (2) creation of a unified representation of subunits and interactions, and (3) combinatorial assembly of subunits (Fig. 1).

In the first stage, we apply AFM to all possible subunit pairings. Following this, we create three additional AFM models for each subunit, ranging in size from three to five subunits, that include subunits with which the given subunit had the highest confidence-scored predicted pairwise interactions (Methods). The underlying concept is that some groups of more than two subunits form intertwined structures, and therefore all of them should be predicted as a single model by AFM (Methods).

In the second stage, to prepare input for the third assembly stage, a single representative structure for each subunit is selected and the transformations between representative subunits are calculated. This is required since there are multiple AFM structures for each subunit from pairwise AFM runs and their enumeration during the assembly stage is intractable. The representative subunit structures are extracted from the predicted modeled subcomplexes according to the maximal average predicted local Distance Difference Test (plDDT) score for this subunit. Next, we use all interacting subunit pairs (Cα–Cα distance <8 Å) from AFM models to extract pairwise transformations (rotation and translation in 3D) between their representative structures in the global reference frame. The representation of the input by representative subunit structures and transformations between them enables us to apply the combinatorial assembly algorithm with AFM interactions instead of docking-based ones. Each transformation is coupled with a score based on AFM’s predicted aligned error (PAE) score (Methods).

In the third stage, we use N representative subunit structures, the pairwise transformations between them and, optionally, distance restraints for the hierarchical and combinatorial assembly of the entire complex. Distance restraints can originate from crosslinking mass spectrometry, FRET or other sources of information37–40. If a protein chain is divided into subunits (for example, domains), distance constraints are


[Figure 1 panels: input subunit sequences; (1) generation of pairwise subunit interactions (all pairs; larger subsets of 3–5 subunits, <1,800 amino acids); (2) unified representation (extract representative structures; compute pairwise transformations, $T = T_2 T_1^{-1}$); (3) combinatorial assembly of subunits.]

Fig. 1 | The three stages of the CombFold assembly algorithm. The input is the sequences of the subunits in the complex. (1) Structure prediction of all pairwise and some larger subunit subsets using AFM. (2) Selection of representative subunit structures out of all predicted structures, followed by computation of all pairwise transformations present in predicted structures relative to the representative structures. (3) Combinatorial and hierarchical assembly of subunit structures using the computed pairwise transformations. In each iteration, new subcomplexes are assembled using a pairwise transformation to join two previously created subcomplexes.
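
The relation T = T2 T1^-1 shown in panel (2) of the figure can be illustrated with a minimal Python sketch. The convention used here is an assumption that the text does not spell out: T1 and T2 are taken to be rigid transforms (rotation plus translation) that superpose the copies of two interacting subunits in one predicted model onto their representative structures. All function names are hypothetical, and the rotations and translations are arbitrary toy values.

```python
import math


def mat_vec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(T, p):
    """Apply the rigid transform T = (R, t) to point p: R p + t."""
    R, t = T
    Rp = mat_vec(R, p)
    return [Rp[i] + t[i] for i in range(3)]


def inverse(T):
    """Inverse of a rigid transform: (R^T, -R^T t)."""
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return (Rt, [-x for x in mat_vec(Rt, t)])


def compose(T_b, T_a):
    """Return the transform that applies T_a first and T_b second."""
    Rb, tb = T_b
    Ra, ta = T_a
    Rb_ta = mat_vec(Rb, ta)
    return (mat_mul(Rb, Ra), [Rb_ta[i] + tb[i] for i in range(3)])


def rot_z(degrees):
    """Rotation about the z axis."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Assumed convention: T1 maps model coordinates onto representative A,
# T2 maps model coordinates onto representative B (toy values).
T1 = (rot_z(90.0), [1.0, 0.0, 0.0])
T2 = (rot_z(-30.0), [0.0, 2.0, 5.0])

# Pairwise transformation between the two representative frames.
T = compose(T2, inverse(T1))

# A point given in model coordinates, seen in each representative frame.
p_model = [3.0, -1.0, 2.0]
p_in_frame_a = apply(T1, p_model)
p_in_frame_b = apply(T2, p_model)
```

Under this convention, applying `T` to a point expressed in the frame of representative A yields the same point expressed in the frame of representative B, so a single stored transformation is enough to place one representative relative to the other.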

added to enforce sequence connectivity. This combinatorial assembly stage consists of N iterations, where in the ith iteration we construct K subcomplexes of size i. The value of K has to be large enough to contain a variety of subcomplexes. Subcomplexes of size i are constructed from pairs of previously computed subcomplexes of size 1 to i − 1. For example, a subcomplex of size i can be computed by merging subcomplexes of size 3 and i − 3. We attempt to merge a pair of subcomplexes if they do not have any shared subunit and the joint number of subunits is i. During the merge, new subcomplexes are generated by iterating all subunit pairs (one from each subcomplex) and applying known transformations between those two subunits on the entire subcomplexes. Next, we discard generated subcomplexes with major steric clashes or chain connectivity violations. Distance restraints satisfaction is calculated, and low-scoring subcomplexes are also discarded (Extended Data Fig. 1). The remaining subcomplexes are clustered and scored on the basis of the score of transformations that were used, and the top K subcomplexes are saved for the next iterations.

The model confidence score produced by our method is based on the AFM PAE score. Each pairwise interaction (represented by a transformation) has a PAE-based score (Methods). The confidence of an assembled structure is a weighted score of the transformations that were used for assembly, where the weight is proportional to the sizes of the subunit subsets that were merged by each transformation.

Benchmark datasets. We tested the method on four benchmark datasets (Table 1 and Supplementary Note 2). We generated a Benchmark 1 dataset aimed to test the method on large heteromeric complexes. Structures with many unique chains usually do not contain notable symmetry which makes them more challenging for assembly, since many different pairwise interactions need to be found and combined. Benchmark 1 contains 35 structures with 5 to 20 chains and at least 5 unique chains per complex, consisting of 1,300 to 8,000 amino acids (Extended Data Fig. 2a). This dataset includes only complexes released after April 2018, which AFMv2 was not trained on. Benchmark 2 dataset was generated similarly to Benchmark 1 to test the recently released AFMv3. It contains 25 complexes with 5–30 chains and 2,000–18,000 amino acids (Extended Data Fig. 2b) that were not in the training set of AFMv3 (released after September 2021). Benchmark 3 dataset was used for benchmarking the MoLPC approach35. It contains 153 complexes ranging between 500 and 10,000 amino acids with 10–30 chains per complex. This dataset contains mainly symmetric homomers (98 complexes consisting of one unique chain and 27 consisting of two unique chains). Finally, Benchmark 4 dataset contains seven CASP15 targets with more than 3,000 amino acids.

Accuracy assessment. To evaluate the accuracy of the modeled structures we rely on the TM-score41 which assesses the global accuracy of the complex, similar to CASP and MoLPC35. Similarly to CAPRI assessment42, a model is considered acceptable quality if the TM-score is above 0.7 and high quality if the TM-score is above 0.8. The success rate is measured as a fraction of the benchmark complexes with acceptable- or high-quality models among the top-N best-scoring predictions.

Accuracy on Benchmark 1 (heteromers). We obtain a top-1 success rate of 60% for CombFold on this benchmark, accurately modeling 21 out of 35 complexes (Fig. 2a) with TM-score >0.7. High-quality top-1 models are produced for 14 complexes (40%). When considering the top-10 models, the success rate is 74%. Importantly, the predicted confidence correlates with the TM-score (Pearson r = 0.57, Fig. 2b), indicating that it can be used to estimate model accuracy. To determine to which extent the success rate depends on the ability of AFM to produce accurate models for pairwise interactions, we calculate the pairwise connectivity (Methods). As expected, the pairwise connectivity correlates with the TM-score (Pearson r = 0.48, Fig. 2c).
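
The hierarchical enumeration described in the Overview (building subcomplexes of size i from pairs of smaller, disjoint subcomplexes and keeping the top K) can be sketched in a few lines of Python. This is a simplified illustration and not the CombFold implementation: it tracks only which subunits are joined, omits all geometry (clash, connectivity and restraint filtering, clustering), and assumes a plain additive score in place of the weighted PAE-based score. The names `assemble` and `pair_scores` and the toy scores are hypothetical.

```python
from itertools import product


def assemble(n_subunits, pair_scores, k=10):
    """Enumerate assembly trees bottom-up, keeping the top-k subcomplexes per size.

    pair_scores maps frozenset({i, j}) to the score of a known pairwise
    transformation between subunits i and j (higher is better; assumption).
    Each subcomplex is stored as (score, member_subunits, pairs_used).
    """
    by_size = {1: [(0.0, frozenset([i]), ()) for i in range(n_subunits)]}
    for size in range(2, n_subunits + 1):
        candidates = {}
        # A subcomplex of `size` is merged from subcomplexes of sizes a and size - a.
        for a in range(1, size // 2 + 1):
            for left, right in product(by_size[a], by_size[size - a]):
                s1, m1, u1 = left
                s2, m2, u2 = right
                if m1 & m2:
                    continue  # the two subcomplexes share a subunit
                # Try every subunit pair (one from each side) with a known transformation.
                for i, j in product(m1, m2):
                    pair = frozenset((i, j))
                    if pair not in pair_scores:
                        continue
                    used = tuple(sorted(u1 + u2 + (tuple(sorted(pair)),)))
                    key = (m1 | m2, used)  # same subunits joined by the same pairs
                    score = s1 + s2 + pair_scores[pair]
                    if score > candidates.get(key, float("-inf")):
                        candidates[key] = score
        ranked = sorted(
            ((score, members, used) for (members, used), score in candidates.items()),
            key=lambda item: item[0],
            reverse=True,
        )
        by_size[size] = ranked[:k]  # only the top-k survive to later iterations
    return by_size[n_subunits]


# Toy input: four subunits with four scored pairwise interactions (made-up values).
toy_scores = {
    frozenset((0, 1)): 0.9,
    frozenset((1, 2)): 0.8,
    frozenset((2, 3)): 0.7,
    frozenset((0, 3)): 0.2,
}
best_score, best_members, best_pairs = assemble(4, toy_scores, k=10)[0]
```

In this toy case the best full assembly joins the subunits through the three highest-scoring pairs and leaves out the weak 0–3 interaction. A real implementation would place atoms with each transformation and reject clashing or restraint-violating subcomplexes before ranking.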


Table 1 | CombFold evaluation benchmarks

Benchmark  Complex type                                          Number of complexes  Number of chains  Number of amino acids  Top-1 success rate of CombFold  Top-10 success rate of CombFold  Top-1 success rate of AFM or MoLPC
1          Asymmetric complexes (released after AFMv2 training)  35                   5–20              1,300–8,000            60%                             74%                              26% (AFMv2)
2          Asymmetric complexes (released after AFMv3 training)  25                   5–30              2,000–18,000           64%                             68%                              36% (AFMv3)
3          Mostly homomers and symmetric complexes               153                  10–30             600–10,000             57%                             58%                              28% (MoLPC)
4          CASP15 targets (>3,000 amino acids)                   7                    1–27              3,000–8,000            57–86% (a)                      57–86%                           43% (AFM, MoLPC) (b)

The success rate is defined as the fraction of benchmark cases with a model with a TM-score above 0.7 among the top-N best-scoring predictions.
(a) For CASP15 targets the fully automated CombFold had a success rate of 57%. Manual subdivision of proteins into domains led to an increased success rate of 86%.
(b) We compared CombFold to CASP15 submissions of the Elofsson group that used AFM and MoLPC.
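
The top-N success rate defined in the table note can be written out as a short Python sketch. The function name and the input layout (one list of TM-scores per complex, ordered by the method's own ranking) are assumptions made for illustration, and the scores below are made up and are not benchmark data.

```python
def top_n_success_rate(ranked_tm_scores, n, threshold=0.7):
    """Fraction of complexes with at least one of the top-n models above threshold.

    ranked_tm_scores: one list per complex, holding the TM-scores of its
    predictions ordered from best- to worst-scoring by predicted confidence.
    """
    if not ranked_tm_scores:
        raise ValueError("no complexes given")
    hits = sum(
        1 for scores in ranked_tm_scores
        if any(tm > threshold for tm in scores[:n])
    )
    return hits / len(ranked_tm_scores)


# Made-up TM-scores for three hypothetical complexes.
toy = [
    [0.82, 0.50, 0.45],  # acceptable (and high quality) at rank 1
    [0.60, 0.75, 0.40],  # acceptable only at rank 2
    [0.40, 0.30, 0.35],  # never acceptable
]
top1 = top_n_success_rate(toy, n=1)
top3 = top_n_success_rate(toy, n=3)
top1_high = top_n_success_rate(toy, n=1, threshold=0.8)  # high-quality cutoff
```

Applied to the counts reported for Benchmark 1, 21 acceptable top-1 models out of 35 complexes give the 60% listed in the table.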

We compare CombFold to an end-to-end AFM on all the Benchmark 1 complexes using the A100 GPU card with 40-GB memory. AFM succeeded in producing at least one result for 17 out of 35 complexes with up to 3,700 amino acids. Of these, ten complexes were modeled with acceptable or high quality, resulting in success rates of 26% and 29% for top-1 and top-5 results, respectively (Fig. 2a,d).

The largest complex assembled by CombFold was eIF2B:eIF2 (Protein Data Bank (PDB) 6I3M, Fig. 2e), which could not be assembled directly with AFM. The CombFold model contains a structural coverage for 6,114 amino acids with plDDT above 50 out of a total of 7,486. In comparison, the experimental cryo-EM structure covers only 4,680 amino acids. The addition of over 1,500 amino acids contains six well-folded domains. This example demonstrates the ability of CombFold to complete unresolved fragments in experimental structures. On average, each assembled complex in this Benchmark contained 20% more amino acids compared to the corresponding PDB entry. GID E3 ubiquitin ligase complex is another example where an additional domain is missing in the experimental structure (PDB 6SWY, Fig. 2f) and is predicted by CombFold with high plDDT. The complex is assembled with a TM-score of 0.83 compared to AFM, which produces a model with a TM-score of 0.53. In contrast, the multiple resistance and pH adaptation (Mrp) complex (PDB 7D3U, Fig. 2g) is assembled with higher accuracy by AFM (TM-score 0.97 versus 0.67 for CombFold). This is due to the fact that the orientation between the two domains in the largest subunit was not accurately predicted in the representative structure chosen for assembly (Fig. 2g, light blue).

Accuracy on Benchmark 2 (heteromers). This benchmark was generated to test CombFold against the recently released AFMv3. We also used AFMv3 to predict the pairwise subunit interactions for CombFold (instead of AFMv2 in Benchmark 1). The performance on this dataset is comparable to Benchmark 1 (Extended Data Fig. 3), with top-1 and top-5 success rates of 64% and 68%, respectively. In comparison, the top-1 success rate of AFMv3 is 36%. The fraction of high-quality top-1 models is higher on this Benchmark (52% versus 40% for Benchmark 1), indicating that AFMv3 produces pairwise interactions with higher accuracy (Extended Data Fig. 4), perhaps due to the higher number of recycles and larger training set. To further validate CombFold, we used this Benchmark for comparison to RosettaFold2 (ref. 43). RosettaFold2 was not able to assemble most complexes (21/25), and among the assembled four complexes, only one had an acceptable-quality model among the ten predicted structures, which translates to a success rate of 4% (Extended Data Fig. 3a).

While TM-score is a measure of global accuracy, to assess the accuracy of subunit interfaces, we calculate the interface contact similarity (ICS) score44 that is also used in the CASP/CAPRI complex assessment. Similarly to the TM-score, ICS values are in the range of 0–1; however, the ICS scores are usually lower compared to TM-scores, indicating that a model with high global accuracy may still have low-quality interfaces and contacts. We find that CombFold top-1 models have variable ICS scores (Extended Data Fig. 3e). Moreover, AFM models have higher scores compared to CombFold. The lower ICS scores of CombFold can be attributed to the usage of representative subunit structures instead of the ones produced by pairwise AFM. In addition, some of the interfaces in the CombFold models are not a result of pairwise AFM prediction, but a by-product of the assembly process, and therefore have lower quality.

We examine whether the interface quality of CombFold models is sufficient for predicting dissociation constants (kD) between subunits. Because experimentally measured kD values are not available for the whole Benchmark, we compare the kD values predicted by PRODIGY45 from the interfaces in experimental structures to the kD values predicted from the interfaces in the top-1 model of CombFold. We find a strong correlation (Spearman r = 0.55, Extended Data Fig. 3f), indicating that despite lower ICS scores, CombFold models are sufficiently accurate for estimating kD.

Integration of experimental data. Integrative structure modeling is often used to determine the structures of large macromolecular assemblies using information from a variety of sources, such as crosslinking mass spectrometry, cryo-EM or bioinformatics analysis22,26,46–48. The information is used for scoring and sampling models to produce structures that are consistent with the available data. Here we add to CombFold support for integrating information about known physical interactions between subunits and distance restraints that originate from crosslinking mass spectrometry. This type of information can be obtained for individual complexes in vitro or for multiple assemblies identified from in situ experiments49–52. AFM does not currently support the integration of this type of data. Recently, AlphaLink27 was developed to add distance restraints support to AlphaFold2/OpenFold as a bias to residue–residue contacts, similar to template support in AlphaFold2. This method requires subsampling of multiple sequence alignment to give more weight to distance restraints and is currently applicable for complexes with less than 3,000 amino acids53. The advantage of CombFold is that it can integrate additional information during the assembly stage (Methods).

We apply CombFold with distance restraints for human mitochondrial translocase TIM22 (PDB 7CGP), a Benchmark 1 case, for which both CombFold and AFM failed to produce an accurate prediction (TM-score of 0.57 and 0.67, respectively). We used crosslinking mass spectrometry experiment for this complex54 to compile a set of 12 distance restraints. We also divided the chains into two groups for assembly (Methods), based on a known structure of a subcomplex of TIM9 and TIM10 (PDB 2BSK). The resulting model is of high quality with a TM-score of 0.85 (Fig. 2h).

To further examine the contribution of crosslinking mass spectrometry data, we simulated crosslinks for Benchmark 2 (Methods) and compared the performance of CombFold with and without input


[Figure 2 panels: a, success rate for top 1, top 5 and top 10 predictions (legend: CombFold high, CombFold acceptable, AFMv2 high, AFMv2 acceptable); b, TM-score versus predicted confidence (ρ = 0.57); c, TM-score versus pairwise connectivity (ρ = 0.48); d, AFMv2 TM-score versus CombFold TM-score; e, CombFold (TM-score 0.79; 6,114 amino acids) and PDB 6I3M (4,680 amino acids); f, CombFold (TM-score 0.83; 2,134 amino acids), PDB 6SWY (1,875 amino acids) and AFM (TM-score 0.53; 1,861 amino acids); g, CombFold (TM-score 0.67; 1,817 amino acids), PDB 7D3U (1,873 amino acids) and AFM (TM-score 0.97; 1,789 amino acids); h, CombFold + crosslinks (TM-score 0.85; 1,627 amino acids), PDB 7CGP (1,570 amino acids) and AFM (TM-score 0.67; 1,477 amino acids).]

Fig. 2 | Accuracy of CombFold on Benchmark 1. a, The top-N (N = 1, 5, 10) success rate of CombFold (blue) and AFM (orange). AFM produces only five predictions. b, Predicted confidence versus the TM-score for CombFold. c, Success rate of AFM in producing pairwise interactions as measured by the pairwise connectivity versus the TM-score of the models produced by CombFold. d, TM-score of AFM models versus CombFold models. e, eIF2B:eIF2 complex: CombFold model (left) and cryo-EM structure (right). The model contains over 1,500 additional amino acids (marked with red circles). f, GID E3 ubiquitin ligase complex: high-quality CombFold model (left), cryo-EM structure (middle) and inaccurate AFM model (right). g, Multiple resistance and pH adaptation (Mrp) complex: inaccurate CombFold model (left), cryo-EM structure (middle) and high-quality AFM model (right). h, Human mitochondrial translocase TIM22: high-quality model by CombFold, integrating experimental crosslinking data (left), cryo-EM structure (middle) and inaccurate AFM model (right). Crosslinks are shown as blue lines.


[Figure 3 panels: a, success rate for top 1, top 5 and top 10 predictions (legend: CombFold high, CombFold acceptable, MoLPC high, MoLPC acceptable); b, success rate for homomers and heteromers; c, MoLPC TM-score versus CombFold TM-score; d, TM-score versus predicted confidence (ρ = 0.44); e, TM-score versus number of amino acids (ρ = −0.09); f, TM-score versus pairwise connectivity (ρ = 0.79); g–i, CombFold predictions (top) and PDB structures (bottom) for PDB 2V7Q (TM-score 0.93), PDB 6SSI (TM-score 0.74) and PDB 3LAY (TM-score 0.47).]

Fig. 3 | Accuracy of CombFold on Benchmark 3. a, The top-N (N = 1, 5, 10) success rate of CombFold (blue) and MoLPC (orange). b, Top-1 success rate for homomers and heteromers. c, TM-score comparison for CombFold and MoLPC. d, Predicted confidence versus the TM-score for CombFold. e, The number of complex amino acids versus the top-1 TM-score. f, The success rate of AFM in producing pairwise interactions as measured by the pairwise connectivity versus the TM-score. g, High-quality model of F1-ATPase (top) versus the X-ray structure (bottom). CombFold prediction contains 159 additional amino acids that are not modeled in the X-ray structure, providing full structural coverage. h, Acceptable-quality model of Erwinia ligand-gated ion channel in complex with nanobodies (top) versus X-ray structure (bottom). The channel is accurately modeled; however, the location of nanobodies is incorrect. i, Incorrect model of zinc resistance-associated protein from Salmonella enterica (top) versus X-ray structure (bottom).

crosslinks (Extended Data Fig. 3c,d). Integrating crosslinks increased the top-1 success rate to 76% (compared to 64% without crosslinks). We compared CombFold to AlphaLink53 and HADDOCK55 with the same set of crosslinks and obtained a success rate of 8% and 4%, respectively (Extended Data Fig. 3c,d).

Accuracy on Benchmark 3 (mostly homomers). We obtain a top-1 success rate of 57% on this benchmark, accurately modeling 87 out of 153 complexes (Fig. 3a and Table 1). Moreover, most of the successful predictions (75 out of 87) are of high quality (TM-score >0.8). When top-10 predictions are considered, the success rate is 58% and 82 out of 89 are of high quality. The higher fraction of complexes with high-quality models compared to heteromeric Benchmarks 1 and 2 demonstrates the challenge of assembling heteromeric complexes with high accuracy where multiple intersubunit orientations need to be optimized simultaneously. The predicted confidence correlates with the TM-score (Pearson r = 0.44, Fig. 3d). Moreover, the accuracy of CombFold does not decrease with an increase in complex size (Pearson r = −0.09, Fig. 3e).

CombFold success rate correlates with the success of AFM in producing structures of pairwise interactions as measured by the pairwise connectivity (Pearson r = 0.79, Fig. 3f). This correlation is higher than for Benchmark 1 complexes, as in the assembly of homomeric structures, CombFold relies mainly on one or two pairwise interactions.


[Figure 4 panels: a and c are structures; b and d plot predicted confidence versus number of copies (5–15).]

Fig. 4 | Stoichiometry prediction. a, A structure of mitochondrial ATP synthase with bound native cardiolipin (PDB 6TDX). Circled is a symmetrical structure formed from ten copies of subunit c. b, CombFold predicted confidence as a function of the number of copies of subunit c. c, A structure of PelC dodecamer (PDB 5T11). d, CombFold predicted confidence for PelC dodecamer as a function of the number of copies in input stoichiometry.
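
The use of predicted confidence as a function of copy number (panels b and d) to narrow down candidate stoichiometries can be sketched as follows. This is only an illustration under assumptions: the function name, the `margin` cutoff and the confidence values are hypothetical and are not taken from the paper, and the sketch presumes that a confidence has already been obtained for each candidate copy number.

```python
def shortlist_stoichiometries(confidence_by_copies, margin=5.0):
    """Return the copy numbers whose confidence is within `margin` of the best.

    confidence_by_copies maps a candidate copy number to the predicted
    confidence of the complex assembled with that many copies.
    """
    if not confidence_by_copies:
        raise ValueError("no candidate stoichiometries given")
    best = max(confidence_by_copies.values())
    return sorted(
        copies for copies, confidence in confidence_by_copies.items()
        if best - confidence <= margin
    )


# Made-up confidences for 2 to 8 copies of a hypothetical subunit.
toy_confidence = {2: 61.0, 3: 64.0, 4: 90.0, 5: 88.0, 6: 70.0, 7: 66.0, 8: 63.0}
shortlist = shortlist_stoichiometries(toy_confidence)
```

A fixed margin is just one possible selection rule; any criterion that picks out the copy numbers with a clear rise in confidence would serve the same purpose.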

As a result, CombFold accuracy is limited by the reported success rate of ~60% for AFM in predicting pairwise protein–protein interactions (refs. 11–13). In contrast, in the assembly of heteromeric structures, multiple pairwise interactions are considered, and pairwise interaction can form indirectly even if it is not predicted correctly by AFM. Therefore, the success rate of CombFold on heteromeric complexes is higher (Table 1 and Fig. 2). While heteromeric complexes are asymmetric by definition, they can include local symmetry resulting from multiple copies of one or more subunits (ref. 56). Benchmark 3 contains four fully asymmetric complexes (without multiple subunit copies) and CombFold was able to assemble three with acceptable quality. The performance of CombFold on asymmetric structures is assessed on Benchmarks 1 and 2, which are almost entirely asymmetric (Extended Data Fig. 2).

For comparison, the top-1 success rate of MoLPC on Benchmark 3 is 28% and top-10 is 31% (Fig. 3a,c) (ref. 35). This difference is attributed to our utilization of multiple AFM models and the assembly algorithm that performs a more exhaustive combinatorial and hierarchical search compared to the Monte Carlo Tree Search used by MoLPC. When Benchmark 3 complexes are divided into homomers and heteromers, there is no significant difference for our method, while there is a gap in favor of homomers for MoLPC (Fig. 3b).

Application for predicting complexes without known structure. Complex Portal is a database that contains manually curated information on stable macromolecular complexes (ref. 36). We queried the database for all complexes with over 5,000 amino acids, known stoichiometry, and without homology to any experimentally determined structure (Methods) to obtain 28 complexes from three organisms (Homo sapiens, Mus musculus and Saccharomyces cerevisiae). High-confidence structures were found for seven complexes (Extended Data Figs. 5 and 6).

One of the high-confidence predictions is the human Elongator holoenzyme complex, which consists of six proteins, Elp1–6, two copies of each. A dimer of Elp123 subunits interacts with the Elp456 subcomplex. Partial homologous structures of S. cerevisiae are available, with larger subcomplexes published recently (ref. 57). The structure predicted by CombFold is consistent with the published homologous structure (Extended Data Fig. 5a,b). Moreover, the predicted structure can be used to explain the effect of mutations. We extracted all the pathogenic mutations from ClinVar (ref. 58) (Supplementary Table 2) and classified them on the basis of the predicted structure into those that could disrupt protein core or protein–protein interactions (Extended Data Fig. 5c,d).

Stoichiometry prediction. The major obstacle to applying our method to known interactions and complexes is the need for stoichiometry information. Our assembly algorithm can be applied to a set of subunits without stoichiometry using the AFM-predicted representative structures and pairwise interactions as follows. Different stoichiometries can be enumerated using the same AFM models as an input and the confidence prediction can be used to estimate the correct stoichiometry. This enables us to perform the resource-intensive AFM calculation once and sample possible stoichiometries with the fast assembly algorithm.

Here we present two examples of this application. The first is the complex of mitochondrial ATP synthase with bound native cardiolipin that contains ten copies of ATP synthase subunit c forming a symmetrical cylinder (Fig. 4a). We used CombFold to predict complexes with 14 stoichiometries: 2–15 copies of subunit c and the correct number of copies for all the other subunits. There is a significant increase in predicted confidence for assemblies with 10, 11 and 14 copies of subunit c (Fig. 4b), indicating that confidence can be used to narrow down the set of possible stoichiometries.

Another example is the PelC dodecamer from Paraburkholderia phytofirmans. This is a symmetrical complex composed of 12 copies of lipoprotein (Fig. 4c). We applied CombFold to predict complexes with 14 stoichiometries (2–15 copies of the PelC subunit). For 13 or more copies no structure could be assembled without major steric clashes. There is a spike in the predicted confidence for assemblies with 11 or 12 copies (Fig. 4d). This demonstrates not only that confidence is an indicator of stoichiometry, but that the ability to assemble is another indicator.

Discussion

We present an approach to predict the structure of large multisubunit protein complexes based on substructures predicted by AFM for pairs or larger subsets of input subunits. Our method is powered by the combinatorial assembly algorithm that exhaustively enumerates best-scoring assembly trees resulting in accurately predicted assemblies. Moreover, information that can be converted into distance restraints, such as crosslinking mass spectrometry datasets, can be integrated into the assembly algorithm for higher accuracy (Extended Data Fig. 3c,d). We validate the approach on four datasets with top-10 success rate of 57–74% for both homomeric and heteromeric assemblies (Figs. 2 and 3, Table 1, Extended Data Fig. 3 and Supplementary Note 1). Moreover, CombFold is able to extend by 20% the structural coverage of experimentally solved large complexes where the modeled structure often does not fully cover the sequences. This enables the application of CombFold to extend the coverage of solved structures.

Most complexes could be assembled by CombFold using single chains as subunits. However, for some complexes, dividing chains into domain-level subunits is beneficial for correct assembly, such as CASP15 targets H1137 and T1169. While our method supports domain-level assembly, the decision of whether to split into domains is left to the user. Subcomplexes are often known based on prior knowledge or can be inferred from single-chain structures, such as intertwined domains in CASP targets H1137 and H1114. In these cases, our method can enforce the specific assembly order to compute the known subcomplexes followed by the generation of the whole assembly.

Currently, our success rate is limited by the ability of AFM to produce pairwise subunit interactions (Figs. 2c and 3f). In this regard, approaches that enhance the AFM sampling by enabling dropout at inference can be useful (refs. 14,59). Additional pairwise orientations might


[Figure: a, Best pairwise models (PDB 7ZKQ), subunits i–v, graph edges labeled with DockQ and PAE; b, AFMv3 (TM-score 0.72); c, CombFold (TM-score 0.85); d, CombFold + crosslinks (TM-score 0.89).]

Fig. 5 | The advantage of hierarchical assembly over global AFM. a, Experimental structure of the early Pp module assembly intermediate of complex I (left) and the interaction graph (right). The node colors correspond to subunit colors. The edges are shown for all subunit pairs that have close contacting amino acids. Edges are labeled with the highest DockQ generated by AFM in the first stage of CombFold and their average PAE. b–d, Predicted models (top) with the quality of their pairwise interactions (DockQ, PAE) mapped on the interaction graph (bottom) for AFM (b), CombFold (c) and CombFold with crosslinks (d). Accurate pairwise interactions (DockQ >0.23) are in red. Crosslinks are shown as blue lines. CombFold assembly order is indicated on the graph with numbers in parentheses (blue).
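
The interaction graph of Fig. 5 suggests a simple connectivity check: assembling N subunits requires a set of accurate pairwise interactions that forms a spanning tree, that is, N − 1 edges connecting all subunits. The sketch below tests this with a small union-find structure. It is an illustration only: the function name is hypothetical and the edge list is invented (the edge-to-node assignment of the figure is not reproduced here); only the DockQ > 0.23 threshold comes from the caption.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (subunit, subunit, DockQ)


def accurate_spanning_tree(
    subunits: Iterable[str],
    edges: Iterable[Edge],
    min_dockq: float = 0.23,
) -> List[Edge]:
    """Greedily pick accurate edges (best DockQ first) that join components.

    Returns N - 1 edges when the accurate edges connect all N subunits,
    and fewer edges otherwise.
    """
    parent: Dict[str, str] = {s: s for s in subunits}

    def find(node: str) -> str:
        # Follow parent links to the component root, with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree: List[Edge] = []
    for a, b, dockq in sorted(edges, key=lambda e: e[2], reverse=True):
        if dockq <= min_dockq:
            continue  # not an accurate interaction
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
            tree.append((a, b, dockq))
    return tree


# Invented example graph over five subunits (values are not from Fig. 5).
subunits = ["i", "ii", "iii", "iv", "v"]
toy_edges = [
    ("i", "ii", 0.50),
    ("i", "v", 0.90),
    ("ii", "iii", 0.30),
    ("iii", "iv", 0.50),
    ("iv", "v", 0.60),
    ("ii", "v", 0.01),
]
tree = accurate_spanning_tree(subunits, toy_edges)
can_assemble = len(tree) == len(subunits) - 1
```

When more accurate edges exist than the N − 1 needed, several spanning trees are possible, which corresponds to the multiple assembly pathways discussed in the text.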

be obtained from pairwise docking methods (refs. 28,31,60) as in the original CombDock method (ref. 17). This will enable us to further increase the success rate of our method.

We compare CombFold to other complex structure prediction methods. Docking-based methods such as HADDOCK (ref. 55) are unable to predict large complexes (ref. 35) (Extended Data Fig. 3c). When compared to the Monte-Carlo Tree Search assembly (MoLPC) that is mainly applicable to homomeric complexes, our combinatorial algorithm doubles the success rate from ~30% to ~60% (Fig. 3a,b). This improvement is particularly significant for heteromeric complexes, where the larger number of subunit combinations leads to an increased number of pairwise interactions. The superior performance of CombFold compared to MoLPC can be attributed to several factors. First, by employing a more exhaustive combinatorial assembly algorithm, and implementing clustering during assembly, we are able to better enumerate the many possible interactions between subunits, resulting in a higher number of accurate assemblies. Second, the enumeration process of CombFold is more strongly based on the confidence score of each transformation, which correlates with accuracy (Extended Data Fig. 7a–e), and therefore, CombFold is able to select the more confident and accurately predicted interactions (Extended Data Fig. 7f). Third, the usage of a unified representation results in each subunit model being the most confident AFM-generated model of this subunit, which results in an overall more accurate complex structure. Lastly, implementation details such as a more relaxed steric clashes filtering stage, and AFM prediction for groups of more than three subunits efficiently, can be more effective when implementing assembly-based methods.

We also compare CombFold to end-to-end AFM, which is considered state of the art for predicting entire complexes. We find that AFM is still limited compared to assembly methods by the maximal total length of the complex and lack of diversity in the generated structures. Most complexes that are accurately predicted by AFM are also accurately assembled by CombFold based on the pairwise interactions from AFM (Fig. 2d). Two primary reasons account for CombFold’s superior performance compared to AFM. First, the stage that generates pairwise subunit interactions enables us to find a higher number of accurately predicted pairs. For example, for the early Pp module assembly intermediate of complex I, we find six pairwise interactions of acceptable quality (DockQ

>0.23, Fig. 5a). As a result in the assembly stage, several assembly pathways are possible because only four pairwise interactions that produce a spanning tree of all subunits are needed to assemble the complex. In contrast, AFM applied on the whole complex correctly predicts only three pairwise interactions (Fig. 5b). Second, even if the pairwise interaction was not predicted correctly by AFM, it can still form during the assembly process (Fig. 5d, subunits iii–v). This also applies to other end-to-end (single step) methods, such as RosettaFold2 and AlphaLink.

While some complexes assemble into stable structures, others are dynamic and exist in multiple states. The heterogeneity can be both compositional with subunits that interact transiently or conformational with flexible proteins or a combination of both (ref. 61). Addressing this heterogeneity is still challenging. For example, compositional heterogeneity can be addressed similarly to stoichiometry by enumerating compositions during assembly. The conformational heterogeneity is currently addressed based on additional structural information, such as cryo-EM (refs. 62–64), cryo-electron tomography (ref. 65), crosslinking mass spectrometry (ref. 66) and single-molecule FRET (ref. 67). The Bayesian approach that can account for most sources of uncertainty in data without overfitting is often used for determining structural ensembles (ref. 68). This approach estimates the probability of a model, given information available about the system, including both prior knowledge and newly acquired experimental data. It was successfully integrated into data-driven MD simulations and adopted for multiple types of data, including cryo-EM density maps (ref. 62) and contact or distance information from multiple sources (ref. 69). Our current implementation can integrate distance-based information into the assembly stage and generate multiple models that are consistent with the data. Moreover, models generated by CombFold can be used as starting points for generating dynamic ensembles using data-driven simulation approaches, such as CryoFold (refs. 70,71).

Large datasets of experimentally observed protein–protein interactions and assemblies are available from Complex Portal, Corum and STRING (refs. 36,72,73). In addition, crosslinking mass spectrometry is providing large datasets of interactions (ref. 74). These datasets can be used by CombFold, including crosslinks that can be converted into distance restraints and integrated into the assembly stage. While the major bottleneck in applying assembly methods on these datasets is unknown stoichiometry, we demonstrate that our approach can be extended to enumerate stoichiometries (Fig. 4) and we plan to further develop this capability to enable the assembly of complexes without known stoichiometry.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-024-02174-0.

References
1. Grigoriev, A. On the number of protein–protein interactions in the yeast proteome. Nucleic Acids Res. 31, 4157–4161 (2003).
2. Dunham, B. & Ganapathiraju, M. K. Benchmark evaluation of protein–protein interaction prediction algorithms. Molecules 27, 41 (2021).
3. Stumpf, M. P. H. et al. Estimating the size of the human interactome. Proc. Natl Acad. Sci. USA 105, 6959–6964 (2008).
4. Sousa, J. S. et al. Structural basis for energy transduction by respiratory alternative complex III. Nat. Commun. 9, 1728 (2018).
5. Wang, W. et al. Atomic structure of human TOM core complex. Cell Discov. 6, 67 (2020).
6. Groves, J. T. & Kuriyan, J. Molecular mechanisms in signal transduction at the membrane. Nat. Struct. Mol. Biol. 17, 659–665 (2010).
7. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
8. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
9. Moriwaki, Y. AlphaFold2 can also predict heterocomplexes. All you have to do is input the two sequences you want to predict and connect them with a long linker. Twitter https://twitter.com/Ag_smith/status/1417063635000598528 (2021).
10. Baek, M. Twitter post: adding a big enough number for residue_index feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure/magenta: predicted model w/residue_index modification). Twitter https://twitter.com/minkbaek/status/1417538291709071362 (2021).
11. Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2022).
12. Yin, R., Feng, B. Y., Varshney, A. & Pierce, B. G. Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci. 31, e4379 (2022).
13. Zhu, W., Shenoy, A., Kundrotas, P. & Elofsson, A. Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics 39, btad424 (2023).
14. Wallner, B. AFsample: improving multimer prediction with AlphaFold using aggressive sampling. Bioinformatics 39, btad573 (2023).
15. Alber, F. et al. Determining the architectures of macromolecular assemblies. Nature 450, 683–694 (2007).
16. Dominguez, C., Boelens, R. & Bonvin, A. M. J. J. HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125, 1731–1737 (2003).
17. Inbar, Y., Benyamini, H., Nussinov, R. & Wolfson, H. J. Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics 19, i158–i168 (2003).
18. Esquivel-Rodríguez, J., Yang, Y. D. & Kihara, D. Multi-LZerD: multiple protein docking for asymmetric complexes. Proteins 80, 1818–1833 (2012).
19. Kuzu, G., Keskin, O., Nussinov, R. & Gursoy, A. Modeling protein assemblies in the proteome. Mol. Cell. Proteom. 13, 887–896 (2014).
20. Batista, P. R., Neto, M. O. & Perahia, D. Integrative Structural Biology of Proteins and Macromolecular Assemblies: Bridging Experiments and Simulations (Frontiers Media SA, 2022).
21. Ward, A. B., Sali, A. & Wilson, I. A. Biochemistry. Integrative structural biology. Science 339, 913–915 (2013).
22. Russel, D. et al. Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol. 10, e1001244 (2012).
23. van Zundert, G. C. P. et al. The HADDOCK2.2 Web server: user-friendly integrative modeling of biomolecular complexes. J. Mol. Biol. 428, 720–725 (2016).
24. Rantos, V., Karius, K. & Kosinski, J. Integrative structural modeling of macromolecular complexes using Assembline. Nat. Protoc. 17, 152–176 (2022).
25. Rout, M. P. & Sali, A. Principles for integrative structural biology studies. Cell 177, 1384–1403 (2019).
26. Kim, S. J. et al. Integrative structure and functional anatomy of a nuclear pore complex. Nature 555, 475–482 (2018).
27. Stahl, K., Graziadei, A., Dau, T., Brock, O. & Rappsilber, J. Protein structure prediction with in-cell photo-crosslinking mass spectrometry and deep learning. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01704-z (2023).
28. Schneidman-Duhovny, D., Inbar, Y., Nussinov, R. & Wolfson, H. J. PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res. 33, W363–W367 (2005).
29. Katchalski-Katzir, E. et al. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl Acad. Sci. USA 89, 2195–2199 (1992).


30. Kozakov, D. et al. The ClusPro web server for protein-protein docking. Nat. Protoc. 12, 255–278 (2017).
31. Pierce, B. G. et al. ZDOCK server: interactive docking prediction of protein–protein complexes and symmetric multimers. Bioinformatics 30, 1771–1773 (2014).
32. Moal, I. H., Torchala, M., Bates, P. A. & Fernández-Recio, J. The scoring of poses in protein–protein docking: current capabilities and future directions. BMC Bioinform. 14, 286 (2013).
33. Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B. & Sali, A. Optimized atomic statistical potentials: assessment of protein interfaces and loops. Bioinformatics 29, 3158–3166 (2013).
34. Inbar, Y., Benyamini, H., Nussinov, R. & Wolfson, H. J. Prediction of multimolecular assemblies by multiple docking. J. Mol. Biol. 349, 435–447 (2005).
35. Bryant, P. et al. Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. Nat. Commun. 13, 6028 (2022).
36. Meldal, B. H. M. et al. Complex Portal 2018: extended content and enhanced visualization tools for macromolecular complexes. Nucleic Acids Res. 47, D550–D558 (2019).
37. Rappsilber, J. The beginning of a beautiful friendship: cross-linking/mass spectrometry and modelling of proteins and multi-protein complexes. J. Struct. Biol. 173, 530–540 (2011).
38. Braitbard, M., Schneidman-Duhovny, D. & Kalisman, N. Integrative structure modeling: overview and assessment. Annu. Rev. Biochem. 88, 113–135, https://doi.org/10.1146/annurev-biochem-013118-111429 (2019).
39. Lenz, S. et al. Reliable identification of protein–protein interactions by crosslinking mass spectrometry. Nat. Commun. 12, 3564 (2021).
40. Bonomi, M. et al. Determining protein complex structures based on a Bayesian model of in vivo Förster resonance energy transfer (FRET) data. Mol. Cell. Proteom. 13, 2812–2823 (2014).
41. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
42. Ozden, B., Kryshtafovych, A. & Karaca, E. Assessment of the CASP14 assembly predictions. Proteins 89, 1787–1799 (2021).
43. Baek, M. et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. Preprint at bioRxiv https://doi.org/10.1101/2023.05.24.542179 (2023).
44. Lafita, A. et al. Assessment of protein assembly prediction in CASP12. Proteins 86, 247–256 (2018).
45. Xue, L. C., Rodrigues, J. P., Kastritis, P. L., Bonvin, A. M. & Vangone, A. PRODIGY: a web server for predicting the binding affinity of protein–protein complexes. Bioinformatics 32, 3676–3678 (2016).
46. Shi, Y. et al. A strategy for dissecting the architectures of native macromolecular assemblies. Nat. Methods 12, 1135–1138 (2015).
47. Sali, A. From integrative structural biology to cell biology. J. Biol. Chem. 296, 100743 (2021).
48. Rodrigues, J. P. G. L. M. & Bonvin, A. M. J. J. Integrative computational modeling of protein interactions. FEBS J. 281, 1988–2003 (2014).
49. Leitner, A., Faini, M., Stengel, F. & Aebersold, R. Crosslinking and mass spectrometry: an integrated technology to understand the structure and function of molecular machines. Trends Biochem. Sci. 41, 20–32 (2016).
50. Iacobucci, C., Götze, M. & Sinz, A. Cross-linking/mass spectrometry to get a closer view on protein interaction networks. Curr. Opin. Biotechnol. 63, 48–53 (2020).
51. Wheat, A. et al. Protein interaction landscapes revealed by advanced in vivo cross-linking-mass spectrometry. Proc. Natl Acad. Sci. USA 118, e2023360118 (2021).
52. Wippel, H. H., Chavez, J. D., Tang, X. & Bruce, J. E. Quantitative interactome analysis with chemical cross-linking and mass spectrometry. Curr. Opin. Chem. Biol. 66, 102076 (2022).
53. Stahl, K., Brock, O. & Rappsilber, J. Modelling protein complexes with crosslinking mass spectrometry and deep learning. Preprint at bioRxiv https://doi.org/10.1101/2023.06.07.544059 (2023).
54. Valpadashi, A. et al. Defining the architecture of the human TIM22 complex by chemical crosslinking. FEBS Lett. 595, 157–168 (2021).
55. Dominguez, C., Boelens, R. & Bonvin, A. M. HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125, 1731–1737 (2003).
56. Duarte, J. M., Dutta, S., Goodsell, D. S. & Burley, S. K. Exploring protein symmetry at the RCSB Protein Data Bank. Emerg. Top. Life Sci. 6, 231–243 (2022).
57. Jaciuk, M. et al. Cryo-EM structure of the fully assembled Elongator complex. Nucleic Acids Res. https://doi.org/10.1093/nar/gkac1232 (2023).
58. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
59. Johansson-Åkhe, I. & Wallner, B. Improving peptide-protein docking with AlphaFold-Multimer using forced sampling. Front. Bioinform. 2, 959160 (2022).
60. Comeau, S. R., Gatchell, D. W., Vajda, S. & Camacho, C. J. ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20, 45–50 (2004).
61. Schneidman-Duhovny, D., Pellarin, R. & Sali, A. Uncertainty in integrative structural modeling. Curr. Opin. Struct. Biol. 28, 96–104 (2014).
62. Bonomi, M., Pellarin, R. & Vendruscolo, M. Simultaneous determination of protein structure and dynamics using cryo-electron microscopy. Biophys. J. 114, 1604–1613 (2018).
63. Scheres, S. H. W. Processing of structurally heterogeneous Cryo-EM Data in RELION. Methods Enzymol. 579, 125–157 (2016).
64. Singharoy, A. et al. Molecular dynamics-based refinement and validation for sub-5 Å cryo-electron microscopy maps. eLife 5, e16105 (2016).
65. Zimmerli, C. E. et al. Nuclear pores dilate and constrict in cellulo. Science 374, eabd9776 (2021).
66. Ziemianowicz, D. S. et al. IMProv: a resource for cross-link-driven structure modeling that accommodates protein dynamics. Mol. Cell. Proteom. 20, 100139 (2021).
67. Lerner, E. et al. Toward dynamic structural biology: two decades of single-molecule Förster resonance energy transfer. Science 359, eaan1133 (2018).
68. Rieping, W., Habeck, M. & Nilges, M. Inferential structure determination. Science 309, 303–306 (2005).
69. MacCallum, J. L., Perez, A. & Dill, K. A. Determining protein structures by combining semireliable data with atomistic physical models by Bayesian inference. Proc. Natl Acad. Sci. USA 112, 6985–6990 (2015).
70. Shekhar, M. et al. CryoFold: determining protein structures and data-guided ensembles from cryo-EM density maps. Matter 4, 3195–3216 (2021).
71. Chang, L., Mondal, A., MacCallum, J. L. & Perez, A. CryoFold 2.0: cryo-EM structure determination with MELD. J. Phys. Chem. A 127, 3906–3913 (2023).
72. Szklarczyk, D. et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646 (2022).
73. Giurgiu, M. et al. CORUM: the comprehensive resource of mammalian protein complexes—2019. Nucleic Acids Res. 47, D559–D563 (2019).


74. Zheng, C. et al. XLink-DB: database and software tools for storing and visualizing protein interaction topology data. J. Proteome Res. 12, 1989–1995 (2013).

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024


Methods
CombFold method
Definition of subunits. A subunit is a sequence that can be either an independent chain of the complex or a part of a chain (for example, a certain domain). Sometimes it is necessary to divide a chain into a number of subunits, either because the chain is too long to be predicted by AFM or because domains are connected by a long linker and are not in spatial proximity. In case a chain is too long for modeling with other chains, or if it is known to contain a long inter-domain linker, it is best to divide it into structural domains based on predicted disordered regions using tools, such as IUPred3 (ref. 75).

In Benchmark 4, each subunit was defined as a single chain according to definitions supplied by CASP. The two targets that are long single chains (T1165 and T1169) were divided into subunits according to IUPred3 (ref. 75). The predicted disordered regions connecting the domains were not included in the prediction. In all other benchmarks a full chain was used as a subunit as defined in the SEQRES segment of the PDB entry for almost all cases. Due to a high number of long chains in Benchmarks 2 and 3, we opted for a simple split procedure without relying on predicted disorder regions. In Benchmark 2, long chains in five complexes (PDBs 8HIL, 8F50, 8ADL, 8A3T and 7OZN) were divided into subunits evenly until every subunit pair could be predicted by AFMv3. In Benchmark 3, long chains in two complexes (PDBs 1I50 and 6KWY) were divided into two subunits, one containing the first 1,000 amino acids and the other with the rest. For Complex Portal predictions, the UniProt sequences were divided similarly to Benchmark 2.

AlphaFold2 structure prediction. In the first stage, we run AFM on each possible pairing of the subunits. Proteins, both homomers and heteromers, have the ability to create intertwined structures where the interacting chains exchange small segments or compact protein substructures. These interactions can result in a wide range of quaternary arrangements, including dimers, or higher-order oligomers (ref. 76). To account for this, AFM prediction is applied for larger subsets of three to five subunits as follows. For each subunit, we select the most likely interacting subunits based on the pairwise PAE interaction score and use them to build larger subsets (Methods). Here we limit our calculations to the total length of input sequences of 1,800, which can be run on standard GPUs.

(Cα–Cα distance <8 Å), the transformation between the subunits is calculated. We can mark the predicted interacting structure for two subunits A and B, and two representative structures for those subunits A′ and B′. Notice that even though A and A′ are the same molecules, the different interactions in each AFM model will result in different structures and different reference frames for A and A′. We would like to calculate a transformation between the representatives B′ to A′ that will result in the interaction interface as close as possible to that of the examined model pair A and B. To achieve this, the transformation $T_1$ that aligns A′ on A is calculated by computing the transformation that minimizes root mean square deviation (RMSD) (refs. 78,79). Similarly, the transformation $T_2$ that aligns B′ on B is calculated. Finally, the desired transformation is composed as $T_2 \circ T_1^{-1}$. A problem arises when a subunit has a disordered region: this region will be folded differently in each predicted model, which can substantially affect the alignment and the resulting transformation. Therefore, during the alignment, we consider only amino acids that have a high pLDDT score (>80) or at least half of the amino acids with the highest pLDDT.

Each transformation is scored using the PAE score of the two subunits. PAE score is computed by AFM for any two amino acids in the structure, predicting their alignment error relating to each other. The PAE score values are between 0 and 30, with lower values corresponding to a lower predicted error. The transformation score is calculated and normalized to be between 1 and 100 by the equation $\max\{1,\ 100 - P^2/4\}$, where $P$ is the average value of PAE of the two interacting subunits. This expression gives the score quadratic properties so that small differences in low $P$ scores (which are usually at least 1) will be meaningful, while for high $P$ scores, there is not much difference between the score of transformations as it is predicted to be inaccurate.

Multiple possibilities for scoring were considered, including PAE, the minimal PAE, the interface PAE of the interacting amino acids only, interface predicted TM-score (ipTM) and interface pLDDT (ipLDDT), which is widely used (refs. 12,35,80). All scores had a comparable correlation with Cα RMSD (Pearson r of ~0.5–0.6, Extended Data Fig. 7a–e). The advantage of our PAE-based score is that incorrect interfaces consistently have low scores (Extended Data Fig. 7e). Our analysis of average PAE distributions of all AFM pairwise interaction modes versus the ones that were selected for top-1 assembly models revealed that CombFold indeed selects the interactions with lower PAE scores (Extended Data Fig. 7f).
AlphaFold2 runs were performed using ColabFold77 with default
parameters (without templates), producing five structures per run. Combinatorial assembly of subunits
Subunits were inputted as separate chains. For Benchmarks 1 and 3, The input to the assembly stage is a list of representative structures of
we used AFMv2 and AlphaFold-ptm to obtain ten structural mod- subunits and a list of pairwise transformations between subunits. The
els. For comparison to CombFold on Benchmark 1, only end-to-end output is a list of assembled complexes containing all the subunits. If all
AFMv2 was used. For Benchmark 2, CASP15 and Complex Portal the subunits can not be assembled, the algorithm outputs partial com-
predictions, we used AFMv3 only, as it was not trained on these tar- plexes containing the largest number of input subunits. The assembly
gets. CombFold predictions on Benchmark 2 were compared to algorithm proceeds with N iterations, where N is the number of input
end-to-end AFMv3. subunits. In each iteration, the size of the subcomplexes created is
increased, until the Nth iteration, where the subcomplexes computed
Extracting representative subunit structures. Each subunit struc- contain all input subunits.
ture from AFM predictions is ranked on the basis of the mean plDDT Each iteration contains three stages: subcomplexes expansion,
score using all predicted structures from AFM runs for pairs and larger filtering and clustering. The first stage creates new subcomplexes
subsets. The structure with the maximal score is selected as the ‘repre- based on smaller subcomplexes from previous iterations and pairwise
sentative subunit structure’ for the assembly stage. Additional criteria transformations that were provided to the algorithm. Each new sub-
were examined as possible ranking scores including the average PAE complex is scored on the basis of the scores of the pairwise transforma-
score for the structure, the maximal plDDT or the interaction score with tions that were used to generate it. The second stage filters assembled
other subunits in AFM prediction. There were no significant differences subcomplexes with steric clashes between subunits. The third stage
between the described possibilities; the mean plDDT, which is easy to clusters subcomplexes with the same subunit composition and saves
calculate and more widely used, was chosen. K best-scoring subcomplexes. Optionally, the final structures can be
relaxed to resolve steric clashes.
Computing pairwise transformations. The method computes for each
pair of subunits a list of possible transformations between them based Expansion stage. In this stage, we attempt to connect pairs of subcom-
on their interaction models from AlphaFold2. All pairs of subunits are plexes that have no overlapping subunits and with the total number of
extracted from multisubunit predictions. For each pair, if it is interacting i subunits, where i is the iteration number. For each pair of subunits in

Nature Methods
Article https://doi.org/10.1038/s41592-024-02174-0

the two subcomplexes (of sizes k and i − k), a new larger subcomplex is computed for each input pairwise transformation between those subunits. The transformation is applied to all the subunits of the second subcomplex, thus bringing it to the first subcomplex.

There is a special reward for scoring symmetrical subcomplexes with over five identical subunits transformed with the same pairwise subunit transformation. This reward compensates for the assembly being based on pairwise subunit interactions, compared to the full assembly by AFM, which is likely to result in lower PAE scores if a symmetrical structure was formed. Therefore, if a symmetric structure was generated on the basis of pairwise subunit transformations, the new score is calculated as S + S × (100 − S)/100, where S is the original score of the transformation.

Filtering stage. As the pairwise transformations can be at least partially inaccurate, applying some of them can result in subcomplexes with steric clashes or violated distance constraints and restraints. Steric clashes are checked for all backbone atoms with pLDDT higher than 80 because the representative structures can contain disordered regions, which are likely to clash with other subunits as they are left static during the assembly (Extended Data Fig. 1). A backbone atom of one subunit is considered as clashing if its center penetrates by more than 1 Å into the surface of another subunit. The steric clash test is performed for all pairs of subunits, one from each subcomplex. A subcomplex is filtered out if over 5% of a subunit's backbone atoms clash with another subunit.

Distance constraints are imposed on different subunits from the same chain to enforce sequence connectivity. A subcomplex is discarded if the distance between consecutive amino acids from two subunits is greater than the number of linker amino acids multiplied by 3 Å.

Clustering stage. RMSD clustering is performed to cluster subcomplexes containing the same subunits. We have used iterative clustering, starting from the best-scoring subcomplex with the RMSD threshold of 1 Å. However, a default RMSD calculation does not account for multiple copies of the same subunit. This means that for a subcomplex with p copies of identical subunits, there will be p! equivalent subcomplexes. In this case, to compare the two subcomplexes we need to find the correspondence between copies of subunits from different subcomplexes that minimizes the RMSD. Incorrect correspondence will lead to high RMSD for similar subcomplexes. To avoid the enumeration of p! configurations, we implemented a heuristic that superimposes only the centroids of the subunits using the starting-order subunit correspondence. After the initial superimposition, the correspondence for each pair of identical subunits is swapped and the RMSD is recalculated using centroids. If the RMSD has decreased, we proceed with the new correspondence. The swap process is repeated until there is no further RMSD decrease. The final correspondence between subunits is used to calculate the Cα RMSD between the two subcomplexes.

After clustering, only the K best-scored subcomplexes of size i will be saved for the next iteration (on the presented benchmarks K = 100). Clustering aids in diversifying the stored subcomplexes and avoiding the dominance of suboptimal ones in the set of subcomplexes for the next iteration.

Relaxation. As a result of using representative subunit structures, CombFold may produce structures with steric clashes in interfaces, mainly in side-chains. Therefore, it is recommended to perform an extra step of relaxation of the structure by gradient descent using the Amber (ref. 81) force field, similar to AlphaFold. This step substantially reduces the clashscore calculated by MolProbity (ref. 82) (Extended Data Fig. 3g) while not affecting the structure considerably (change in Cα RMSD <1 Å in all targets of Benchmark 2).

Data integration. To consider known interactions between subunits, we group the input subunits into subcomplexes based on the data. Each such group will be assembled separately, followed by the assembly of the groups and remaining subunits into a larger complex. Therefore, the information is used to enforce a specific assembly order that is consistent with the known interactions.

The crosslinking mass spectrometry information is converted into distance restraints. A restraint is considered satisfied if the Cα–Cα distance is below a distance threshold. The threshold is defined by the user on the basis of the length of the crosslinker. In the case of ambiguity of crosslinked residues due to multiple copies of the same subunit, we require that one of the possible distances restrained by the crosslink is below the distance threshold. CombFold accounts for the uncertainty in the crosslinking data and in the subunit structures as follows. The uncertainty in the data is accounted for by weighting each crosslink according to its confidence based on the experimental evidence ($w_1$), such as the false discovery rate (ref. 83). To account for uncertainty in the subunit structures, each crosslink is weighted by the average AFM pLDDT score of the two crosslinked amino acids ($w_2$). The satisfaction ratio of a subcomplex is calculated as the sum of weights of satisfied distance restraints divided by the sum of weights of all restraints within the given subcomplex (equation (1)). The score of each subcomplex is multiplied by the satisfaction ratio. Consequently, as more restraints are fulfilled, the score increases, making it more probable for the subcomplex to avoid being filtered. A subcomplex is also filtered in the filtering stage if it violates some minimal percentage of its restraints (default 10%).

$$\text{satisfaction ratio} = \frac{\sum_{\text{satisfied}} w_1 \times w_2}{\sum_{\text{all}} w_1 \times w_2} \qquad (1)$$

Predicted confidence. CombFold predicts the confidence of the assembled structure as a weighted score of the pairwise transformation scores ($S_T$) used in the assembly stage. To calculate the weight of a given transformation ($W_T$), we split the complex into two subcomplexes using the transformation and the complex assembly tree. The weight of the transformation is the number of amino acids in the smaller subcomplex. The idea is that some transformations have a larger effect on the final global structure of the complex, as they affect a larger number of amino acids. The final score is normalized by the total weight of all the transformations used in the assembly stage (equation (2)).

$$\text{predicted confidence} = \frac{\sum_{T} W_T \times S_T}{\sum_{T} W_T} \qquad (2)$$

Performance analysis
Runtimes. CombFold runtime is dominated by the AFM prediction runs for subunit pairs and larger subsets. On Benchmark 1, the average GPU time for AFM predictions was 709 and 1,429 s for subunit pairs and larger subsets, respectively, running on NVIDIA A30 with 24 GB of memory. However, since our method requires O(N²) AFM predictions for pairs and O(N) AFM predictions for larger subsets, the average total GPU time per complex was 7,093 and 15,404 s for subunit pairs and larger subsets, respectively. It is also important to note that the first stage of CombFold that performs AFM calculations can be trivially distributed into shorter AFM jobs that can run in parallel. In comparison, the average GPU runtime required for AFM for end-to-end modeling of an entire complex was 5,154 s running on NVIDIA RTX A6000 with 48 GB of memory (n = 17, only cases where AFM was able to produce models were considered, Extended Data Fig. 8). It is important to note that the CombFold runtime is higher for heteromeric complexes containing more unique chains compared to homomeric complexes of similar size, as multiple identical copies of a subunit will use the same AFM interaction models. Benchmark 1 is designed to contain heteromeric complexes with many unique chains; homomeric complexes, such as in Benchmark 3, have lower runtimes.
For example, a symmetrical structure with ten identical chains requires much less GPU time in CombFold compared to naive end-to-end AFM (as we only need to run a job for two copies of the chains, which is much faster compared to ten copies). The runtime of the unified representation and combinatorial assembly stages is negligible compared to AFM and is on average 80–600 s on the different benchmarks on a single central processing unit. In contrast to the generation of pairwise subunit interactions stage, the assembly stage is faster for heteromeric complexes with a higher number of unique chains. The assembly time is much faster compared to MoLPC, where the reported average assembly stage takes 13,000 s.

Pairwise connectivity. Given a set of pairwise transformations and a target complex structure, this metric measures how many of the pairwise transformations between subunits from the target complex are present in the set. A graph is built, where each node is a subunit in the target complex and an edge is present if there exists a transformation in the set between those subunits for which the DockQ (ref. 84) score relative to the transformation in the target complex is at an acceptable level (DockQ >0.23). We calculate the connected components of this graph. The pairwise connectivity ratio is defined as the ratio between the number of amino acids in the largest connected component and the total number of amino acids in the complex. A single connected component in the graph (pairwise connectivity 1.0) indicates that there are pairwise transformations that can lead to the assembly of the complex. In contrast, multiple connected components indicate that accurate assembly is not possible with available transformations.

Comparison to HADDOCK, AlphaLink and RosettaFold2. HADDOCK and AlphaLink were tested using the simulated crosslinks for Benchmark 2. For HADDOCK (v2.4 with CNSv1.3) the input subunits were the same representative subunits that were used for CombFold assembly. For AlphaLink (v2.2), a model that was trained on restraints with an upper bound of 25 Å on the Cα–Cα distances was used. RosettaFold2 was tested using RF_apr23 model weights on Benchmark 2 without crosslinks.

Comparison to MoLPC. MoLPC evaluation used a TM-score above 0.8 to define a high-quality prediction. Here we use the same definition of high-quality prediction. We find that a prediction with a TM-score of 0.7 can have a correct global shape (Figs. 3h and 5b). Therefore, we define an additional acceptable-quality category for predictions with a TM-score above 0.7. In the original MoLPC publication, the success rate was calculated as a fraction of benchmark cases with a high-quality prediction out of cases where at least one assembly was obtained. Note that MoLPC was able to obtain some predictions for 91 out of 175 Benchmark 3 cases. Here we define a success rate as a fraction of benchmark cases with an acceptable-quality prediction out of all benchmark cases. In addition, while MoLPC has presented separate success rates for AFM-based or FoldDock-based pipelines, we have considered results from both pipelines in our calculated success rate. We recalculated the success rate of MoLPC according to our definitions, resulting in slightly different values.

Visualizations. Protein complexes were visualized using ChimeraX (ref. 85). Graphs were created using Matplotlib (ref. 86).

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The PDB codes for Benchmarks 1–3, scripts and data for manuscript figures are part of the repository https://github.com/dina-lab3D/CombFold.

Code availability
CombFold assembly is implemented using C++. The code, Colab notebook, and tutorial for CombFold are available at https://github.com/dina-lab3D/CombFold. There is also a Code Ocean capsule available for running the assembly algorithm at https://codeocean.com/capsule/8791899.

References
75. Erdős, G., Pajkos, M. & Dosztányi, Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 49, W297–W303 (2021).
76. Wodak, S. J., Malevanets, A. & MacKinnon, S. S. The landscape of intertwined associations in homooligomeric proteins. Biophys. J. 109, 1087–1100 (2015).
77. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
78. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 32, 922–923 (1976).
79. Kabsch, W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 34, 827–828 (1978).
80. He, G., Liu, J., Liu, D. & Guijun, Z. GraphGPSM: a global scoring model for protein structure using graph neural networks. Brief. Bioinform. 24, bbad219 (2023).
81. Hornak, V. et al. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 65, 712–725 (2006).
82. Williams, C. J. et al. MolProbity: more and better reference data for improved all-atom structure validation. Protein Sci. 27, 293–315 (2018).
83. Leitner, A. et al. Toward increased reliability, transparency, and accessibility in cross-linking mass spectrometry. Structure 28, 1259–1268 (2020).
84. Basu, S. & Wallner, B. DockQ: a quality measure for protein–protein docking models. PLoS ONE 11, e0161879 (2016).
85. Pettersen, E. F. et al. UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 30, 70–82 (2021).
86. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

Acknowledgements
D.S.-D. and B.S. are supported by the Israeli Science Foundation (ISF 1466/18), NIH NIAID (R01AI163011-01A1) and Minerva Stiftung. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Molecular graphics and analyses were performed with UCSF ChimeraX, developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, with support from National Institutes of Health R01-GM129325 and the Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases.

Author contributions
Conceptualization was carried out by B.S. and D.S.-D. The methodology, software development, investigation, data curation, benchmarking, validation and visualization were developed by B.S. Supervision and project administration were carried out by D.S.-D. Writing of the paper was done by B.S. and D.S.-D.

Competing interests
The authors declare no competing interests.


Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41592-024-02174-0.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41592-024-02174-0.

Correspondence and requests for materials should be addressed to Dina Schneidman-Duhovny.

Peer review information Nature Methods thanks Matteo Degiacomi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Reprints and permissions information is available at www.nature.com/reprints.


Extended Data Fig. 1 | CombFold filtering visualization. For each assembly tree, in each step, CombFold joins two previously assembled subcomplexes into many new subcomplexes by applying input transformations between pairs of subunits. These new subcomplexes are filtered to discard suboptimal subcomplexes. The first filter discards subcomplexes that cross a threshold of allowed steric clashes between amino acids of different subunits; in this example, the threshold is 5%. The second filter discards subcomplexes that do not satisfy enough of the distance restraints present in the subcomplex; here the threshold is 70%. The last filter scores each subcomplex based on the used transformation scores and the distance restraints satisfaction rate.
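The two hard filters in this figure can be sketched in a few lines of Python. This is a minimal illustration, not the CombFold C++ implementation: the `Restraint` container, the function names and the precomputed `clash_fraction` argument (the largest fraction of a subunit's backbone atoms that clash) are hypothetical, and the 5% and 70% thresholds are the example values quoted in the caption. The weighted ratio follows equation (1) in Methods.

```python
from dataclasses import dataclass


@dataclass
class Restraint:
    """One crosslink-derived distance restraint (hypothetical container)."""
    distance: float   # Cα–Cα distance measured in the subcomplex, in Å
    threshold: float  # maximum distance allowed by the crosslinker, in Å
    w1: float         # confidence of the crosslink from the experiment
    w2: float         # mean pLDDT of the two crosslinked residues


def satisfaction_ratio(restraints):
    """Weighted fraction of satisfied restraints, as in equation (1)."""
    total = sum(r.w1 * r.w2 for r in restraints)
    if total == 0:
        return 1.0  # assumption: with no restraints, nothing is violated
    satisfied = sum(r.w1 * r.w2 for r in restraints if r.distance < r.threshold)
    return satisfied / total


def passes_filters(clash_fraction, restraints, max_clash=0.05, min_satisfied=0.70):
    """Apply the steric clash filter first, then the restraint filter."""
    if clash_fraction > max_clash:
        return False
    return satisfaction_ratio(restraints) >= min_satisfied
```

For instance, a subcomplex with 2% clashing backbone atoms and two restraints, one satisfied (weights 1.0 and 90) and one violated (weights 0.5 and 60), has a satisfaction ratio of 90/120 = 0.75 and passes both example thresholds.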


Extended Data Fig. 2 | Heteromeric benchmark datasets. Heteromeric complexes (colored by chain) from (a) Benchmark 1 and (b) Benchmark 2.


Extended Data Fig. 3 | Accuracy of CombFold on Benchmark 2. (a) The Top-N (N = 1, 5, 10) success rate of CombFold (blue), AFMv3 (orange), and RosettaFold2 (green). (b) TM-score of AFMv3 models vs. CombFold models for Top-5 results. (c) The Top-N (N = 1, 5, 10) success rate of CombFold (blue), CombFold with crosslinks (turquoise), AlphaLink (purple) and HADDOCK (brown). (d) TM-score of CombFold models with crosslinks vs. without crosslinks for Top-1 results. (e) Interface contact similarity (ICS) of CombFold vs. AFMv3 for Top-1 model. (f) Comparison of PRODIGY predicted dissociation constants for interfaces of experimental structures vs. interfaces of structure models generated by CombFold. Spearman correlation of 0.55. (g) Distributions of clashscores calculated using MolProbity for interfaces in CombFold output models (left, N = 17) and the same models after relaxation (right, N = 17). Error bars indicate maxima, mean, and minima from top to bottom, respectively.


Extended Data Fig. 4 | Accuracy of pairwise predictions for AFMv2 and AFMv3. DockQ scores of pairwise interactions predicted by AFM on Benchmark 1 (AFMv2, N = 469) and Benchmark 2 (AFMv3, N = 445), for which the PAE-based score is over 50. The median score is 0.70 and 0.78 for AFMv2 and AFMv3, respectively. Error bars indicate maxima, mean, and minima from top to bottom, respectively.
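DockQ also underlies the pairwise connectivity metric defined in Methods, where an edge joins two subunits if an available transformation reaches DockQ >0.23. The sketch below illustrates that definition with a small union-find; the function name and the input dictionaries are hypothetical, and it assumes that the best DockQ per subunit pair has already been computed elsewhere.

```python
def pairwise_connectivity(subunit_lengths, dockq_by_pair, min_dockq=0.23):
    """Fraction of amino acids in the largest connected component.

    subunit_lengths: dict mapping subunit name -> number of amino acids
    dockq_by_pair:   dict mapping (name_a, name_b) -> best DockQ among the
                     available transformations for that pair
    """
    parent = {name: name for name in subunit_lengths}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # An edge exists only when the transformation is of acceptable quality.
    for (a, b), dockq in dockq_by_pair.items():
        if dockq > min_dockq:
            parent[find(a)] = find(b)

    # Sum amino acids per connected component.
    component_size = {}
    for name, length in subunit_lengths.items():
        root = find(name)
        component_size[root] = component_size.get(root, 0) + length

    return max(component_size.values()) / sum(subunit_lengths.values())
```

With subunits A (100 amino acids), B (200) and C (100), an acceptable A–B transformation and an unacceptable B–C one give a connectivity of 300/400 = 0.75, indicating that C cannot be placed from the available transformations.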


Extended Data Fig. 5 | Modeling the human Elongator holoenzyme complex. (a) CombFold prediction for the human Elongator holoenzyme complex. (b) Part of the complex structure in yeast, as determined by cryo-EM (PDB 8ASV). (c) The interface between Elp1 (green) and Elp2 (orange) with a likely pathogenic mutation P914L in Elp1 is depicted as sticks (red). (d) The interface between Elp4 (light blue) and Elp6 (sky blue) with a pathogenic mutation R289W in Elp4 is depicted as sticks (red).


Extended Data Fig. 6 | Complex Portal Predicted Complexes. Predicted complexes from Complex Portal with High or Medium confidence.
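The confidence referred to here is the weighted score of equation (2) in Methods. As a minimal sketch, assuming each transformation is described by its score and by the number of amino acids on either side of the split it induces in the assembly tree (a hypothetical input format, not the CombFold data structure):

```python
def predicted_confidence(transformations):
    """Weighted mean of transformation scores, as in equation (2).

    transformations: iterable of (score, residues_side_a, residues_side_b),
    where the two residue counts describe the two subcomplexes obtained by
    splitting the complex at that transformation.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score, side_a, side_b in transformations:
        weight = min(side_a, side_b)  # W_T: size of the smaller subcomplex
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        raise ValueError("at least one transformation with non-zero weight is required")
    return weighted_sum / total_weight
```

For two transformations with scores 90 and 60 that split a 1,000-residue complex into 300/700 and 100/900 residues, the weights are 300 and 100 and the confidence is (300 × 90 + 100 × 60)/400 = 82.5.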


Extended Data Fig. 7 | Analysis of pairwise scoring functions. Each graph in (a)–(e) presents a scoring function vs. pairwise RMSD, calculated on Benchmark 2. (a) ipTM. (b) ipLDDT: for each interface the average pLDDT of its amino acids is calculated and those averages are averaged. (c) ipLDDT, where the pLDDT of each interface is weighted by its size. (d) The average PAE scores of all amino acids in the pair. (e) CombFold score, based on PAE as described in Methods. (f) The distribution of average PAE scores for all generated pairwise interactions (left, N = 34,365) vs. the distribution of PAE scores in Top-1 models (right, N = 310) created by CombFold. Error bars indicate maxima, mean, and minima from top to bottom, respectively.
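The PAE-based CombFold score of panel (e) and the symmetry reward described in Methods are both simple closed-form expressions. The following sketch evaluates them; the function names are hypothetical, and the code only restates the formulas max{1, 100 − P²/4} and S + S × (100 − S)/100 given in the text.

```python
def pae_score(mean_pae):
    """Transformation score from the average PAE P of two interacting subunits."""
    return max(1.0, 100.0 - mean_pae ** 2 / 4.0)


def symmetry_reward(score):
    """Boosted score for symmetrical subcomplexes built from one repeated transformation."""
    return score + score * (100.0 - score) / 100.0


if __name__ == "__main__":
    # Low PAE values are well separated; high PAE values collapse to the floor of 1.
    for pae in (2.0, 10.0, 20.0, 30.0):
        print(f"P = {pae:4.1f} Å -> score {pae_score(pae):5.1f}")
    print("reward for S = 50:", symmetry_reward(50.0))
```

The quadratic term means that a change in P from 2 to 10 Å lowers the score from 99 to 75, whereas any P of 20 Å or more maps to the minimum score of 1. The reward leaves a score of 100 unchanged and lifts a score of 50 to 75.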


Extended Data Fig. 8 | Runtime Analysis. CombFold runtime vs. the number of unique subunits. Calculated on all cases in Benchmark 1. Pearson correlation of 0.74.
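The dependence of runtime on the number of unique subunits follows from how many pairwise AFM jobs are needed, since identical copies reuse the same interaction models. The sketch below enumerates those jobs under stated assumptions: the function name and inputs are hypothetical, and skipping the self-pair for single-copy subunits is an assumption of this example rather than a documented CombFold rule.

```python
from itertools import combinations_with_replacement


def pair_jobs(unique_subunits, copy_number):
    """List the distinct subunit pairs that need an AFM prediction.

    unique_subunits: iterable of unique subunit names
    copy_number:     dict mapping subunit name -> number of copies in the
                     complex (names that are missing default to one copy)
    """
    jobs = []
    for a, b in combinations_with_replacement(sorted(unique_subunits), 2):
        # Assumption: a subunit is paired with itself only if the complex
        # contains at least two copies of it.
        if a == b and copy_number.get(a, 1) < 2:
            continue
        jobs.append((a, b))
    return jobs
```

A homomer with ten identical chains needs a single job (two copies of the chain), whereas a heteromer with five different single-copy subunits needs ten, which is consistent with the O(N²) growth in pairwise predictions noted in Methods.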

Nature Methods

Welding Procedure Specification (WPS) : As Per Asme Sec - Ix 2004 EDITION
100% (2)
Welding Procedure Specification (WPS) : As Per Asme Sec - Ix 2004 EDITION
3 pages
Python for Chemistry: An introduction to Python algorithms, Simulations, and Programing for Chemistry (English Edition)
From Everand
Python for Chemistry: An introduction to Python algorithms, Simulations, and Programing for Chemistry (English Edition)
Dr. M. Kanagasabapathy
5/5 (1)
s41586 024 07487 W - Reference
No ratings yet
s41586 024 07487 W - Reference
45 pages
Accurate Structure Prediction of Bimolecular Interactions With AlphaFold3
No ratings yet
Accurate Structure Prediction of Bimolecular Interactions With AlphaFold3
24 pages
ALPHAFOLD
No ratings yet
ALPHAFOLD
16 pages
Materi 1 - AI driven Innovation
No ratings yet
Materi 1 - AI driven Innovation
30 pages
Main
No ratings yet
Main
15 pages
Improved Protein Structure Prediction Using Potentials From Deep Learning
No ratings yet
Improved Protein Structure Prediction Using Potentials From Deep Learning
22 pages
Alphafill: Enriching Alphafold Models With Ligands and Cofactors
No ratings yet
Alphafill: Enriching Alphafold Models With Ligands and Cofactors
14 pages
Gkab 1061
No ratings yet
Gkab 1061
6 pages
Alpha Fold 2
No ratings yet
Alpha Fold 2
3 pages
2022 Nat Met Colabfold
No ratings yet
2022 Nat Met Colabfold
10 pages
AlphaFold - Laterst - s41586 021 03819 2
No ratings yet
AlphaFold - Laterst - s41586 021 03819 2
12 pages
Highly Accurate Protein Structure Prediction With Alphafold: Article
No ratings yet
Highly Accurate Protein Structure Prediction With Alphafold: Article
12 pages
DeepMind AlphaFold A Revolutionary Advance in Protein Structure Prediction
No ratings yet
DeepMind AlphaFold A Revolutionary Advance in Protein Structure Prediction
8 pages
2403.04395v1
No ratings yet
2403.04395v1
19 pages
Structure Prediction of Protein-Ligand Complexes From Sequence Information With Umol
No ratings yet
Structure Prediction of Protein-Ligand Complexes From Sequence Information With Umol
12 pages
Cyclic Peptide Structure Prediction and Design Using AlphaFold
No ratings yet
Cyclic Peptide Structure Prediction and Design Using AlphaFold
25 pages
olanders-et-al-2024-challenge-for-deep-learning-protein-structure-prediction-of-ligand-induced-conformational-changes
No ratings yet
olanders-et-al-2024-challenge-for-deep-learning-protein-structure-prediction-of-ligand-induced-conformational-changes
14 pages
s41586 021 03819 2 - Reference
No ratings yet
s41586 021 03819 2 - Reference
16 pages
2502.09372v1
No ratings yet
2502.09372v1
22 pages
Alpha Fold
No ratings yet
Alpha Fold
9 pages
Deepmind's Ai Predicts Structures For A Vast Trove of Proteins
No ratings yet
Deepmind's Ai Predicts Structures For A Vast Trove of Proteins
1 page
Protn STR
No ratings yet
Protn STR
1 page
TR_20211112_许锦波_基于深度学习的蛋白质结构预测
No ratings yet
TR_20211112_许锦波_基于深度学习的蛋白质结构预测
47 pages
Reading Assignment Protein Structures For All
No ratings yet
Reading Assignment Protein Structures For All
3 pages
AlphaFold Protein Structure Database in 2024 Providing Structure Coverage for Over 214 Million Protein Sequences
No ratings yet
AlphaFold Protein Structure Database in 2024 Providing Structure Coverage for Over 214 Million Protein Sequences
8 pages
알파폴드1논문
No ratings yet
알파폴드1논문
27 pages
Base Paper
No ratings yet
Base Paper
14 pages
nihms-1751143
No ratings yet
nihms-1751143
8 pages
A Method For Multiple-Sequence-Alignment - Free Protein Structure Prediction Using A - Protein Language Model
No ratings yet
A Method For Multiple-Sequence-Alignment - Free Protein Structure Prediction Using A - Protein Language Model
12 pages
kmiecik2016
No ratings yet
kmiecik2016
39 pages
AlphaFold 2 - Why It Works and Its Implications For Understanding The Relationships of Protein Sequence, Structure, and Function - PMC
No ratings yet
AlphaFold 2 - Why It Works and Its Implications For Understanding The Relationships of Protein Sequence, Structure, and Function - PMC
10 pages
Innovative Computing Review (ICR) : Issn: 2791-0024 ISSN: 2791-0032 Homepage
No ratings yet
Innovative Computing Review (ICR) : Issn: 2791-0024 ISSN: 2791-0032 Homepage
17 pages
Protein Folding and Assembly Literature Review
No ratings yet
Protein Folding and Assembly Literature Review
6 pages
skolnick-et-al-2021-alphafold-2-why-it-works-and-its-implications-for-understanding-the-relationships-of-protein
No ratings yet
skolnick-et-al-2021-alphafold-2-why-it-works-and-its-implications-for-understanding-the-relationships-of-protein
5 pages
s41586 021 03828 1 - Reference
No ratings yet
s41586 021 03828 1 - Reference
23 pages
Alpha Fold
No ratings yet
Alpha Fold
16 pages
Hydrophobic Residue Patterning in - Strands and Implications For - Sheet Nucleation
No ratings yet
Hydrophobic Residue Patterning in - Strands and Implications For - Sheet Nucleation
124 pages
AlphaFold-Revolutionizing-Structural-Biology
No ratings yet
AlphaFold-Revolutionizing-Structural-Biology
10 pages
Science Abj8754
No ratings yet
Science Abj8754
6 pages
Protein Structure Determination: Bookmark This Page
No ratings yet
Protein Structure Determination: Bookmark This Page
25 pages
A Structural Biology Community Assessment of AlphaFold2 Applications
No ratings yet
A Structural Biology Community Assessment of AlphaFold2 Applications
19 pages
Btac 625
No ratings yet
Btac 625
5 pages
Report-2
No ratings yet
Report-2
1 page
Report-2
No ratings yet
Report-2
1 page
ijms-25-08426
No ratings yet
ijms-25-08426
21 pages
Accurate prediction of protein–nucleic acid
No ratings yet
Accurate prediction of protein–nucleic acid
14 pages
struct2go
No ratings yet
struct2go
7 pages
d41586-022-02083-2
No ratings yet
d41586-022-02083-2
2 pages
Reviews: Advances in Protein Structure Prediction and Design
No ratings yet
Reviews: Advances in Protein Structure Prediction and Design
17 pages
Acs Jcim 2c01400
No ratings yet
Acs Jcim 2c01400
8 pages
Advances in Protein Structure Prediction and Design
No ratings yet
Advances in Protein Structure Prediction and Design
17 pages
Module 5 notes
No ratings yet
Module 5 notes
151 pages
Structural bioinformatics
No ratings yet
Structural bioinformatics
23 pages
Machine Learning For Protein Folding and Dynamics: Sciencedirect
No ratings yet
Machine Learning For Protein Folding and Dynamics: Sciencedirect
8 pages
Ijms 24 13543 v2
No ratings yet
Ijms 24 13543 v2
20 pages
Bio Articles3
No ratings yet
Bio Articles3
14 pages
NOBEL-24 By Arpan Ghosh, 1st Sem
No ratings yet
NOBEL-24 By Arpan Ghosh, 1st Sem
1 page
Intelligent Technologies for Research and Engineering
From Everand
Intelligent Technologies for Research and Engineering
S. Kannadhasan
No ratings yet


nature methods

CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2

Ben Shor & Dina Schneidman-Duhovny

Received: 17 May 2023

Accepted: 9 January 2024

Nature Methods | Volume 21 | March 2024 | 477–487 477


(1) Generation of pairwise subunit interactions

(2) Unified representation

(3) Combinatorial assembly of subunits


Table 1 | CombFold evaluation benchmarks

1 Asymmetric complexes (released 35 5–20 1,300–8,000 60% 74% 26% (AFMv2)


CombFold (TM-score 0.79; PDB 6I3M (4,680 amino acids)


a Best pairwise models

TM-score 0.72 TM-score 0.85 TM-score 0.89

