Progress and Challenges in Protein Structure Prediction - Zhang 2008

Depending on whether similar structures are found in the PDB library, protein structure prediction can be categorized into template-based modeling and free modeling. Although threading is an efficient tool for detecting structural analogs, advances in its methodology have reached a steady state. Encouraging progress is observed in structure refinement, which aims at drawing template structures closer to the native; this has been driven mainly by the use of multiple structure templates and the development of hybrid knowledge-based and physics-based force fields. For free modeling, exciting examples have been witnessed in folding small proteins to atomic resolution. However, predicting structures for proteins larger than 150 residues remains a challenge, with bottlenecks from both force field and conformational search.

Address
Center for Bioinformatics and Department of Molecular Biosciences, University of Kansas, 2030 Becker Drive, Lawrence, KS 66047, United States

Corresponding author: Zhang, Yang ([email protected])

Current Opinion in Structural Biology 2008, 18:342–348
This review comes from a themed issue on Sequence and Topology
Edited by Nick Grishin and Sarah Teichmann
Available online 22nd April 2008
0959-440X/$ – see front matter
© 2008 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.sbi.2008.02.004

Introduction
In recent years, despite many debates, structural genomics is probably one of the most noteworthy efforts in protein structure determination. It aims to obtain 3D models of all proteins by an optimized combination of experimental structure solution and computer-based structure prediction [1,2]. Two factors will dictate the success of structural genomics: experimental structure determination of optimally selected proteins and efficient computer modeling algorithms. Based on about 40 000 structures in the PDB library (many are redundant) [3], 4 million models/fold-assignments can be obtained by a simple combination of the PSI-BLAST search and the comparative modeling technique [4]. Development of more sophisticated and automated computer modeling approaches will dramatically enlarge the scope of modelable proteins in the structural genomics project.

The crucial problems/efforts in the field of protein structure prediction include: first, for sequences with similar structures in the PDB (especially those with a weak or distant homologous relation to the target), how to identify the correct templates and how to refine the template structure closer to the native; second, for sequences without appropriate templates, how to build models of correct topology from scratch. The progress made along these directions was assessed in the recent CASP7 experiment [5] under the categories of template-based modeling (TBM) and free modeling (FM). Here, I will review the new progress and challenges in these directions.

Template-based modeling
The canonical procedure of TBM consists of four steps: first, finding known structures (templates) related to the sequence to be modeled (target); second, aligning the target sequence to the template structure; third, building structural frameworks by copying the aligned regions or by satisfying the spatial restraints from templates; fourth, constructing the unaligned loop regions and adding side-chain atoms. The first two steps are actually done in a single procedure called threading (or fold recognition) [6,7] because the correct selection of templates relies on an accurate alignment. Similarly, the last two steps are performed simultaneously since the atoms of the core and loop regions are in close interaction.

The existence of similar structures in the PDB is a necessary precondition for successful TBM. An important question is how complete the current PDB structure library is. Figure 1 shows a distribution of the best templates found by structural alignment [8] for 1413 representative single-domain proteins between 80 and 200 residues. Remarkably, even excluding the homologous templates of sequence identity >20%, all the target proteins have at least one structural analog in the PDB with a Cα root-mean-squared deviation (rmsd) to the target <6 Å covering >70% of the regions. The average rmsd and coverage are 2.96 Å and 86%, respectively. Zhang and Skolnick [9] recently showed that high-quality full-length models could be built for all the protein targets with an average rmsd of 2.25 Å when using the best templates in the PDB. These data demonstrate that the structural universe of the current PDB library is essentially complete for solving the protein structure problem, at least for single-domain proteins. However, most of the target–template pairs at this level of sequence identity (15%) are difficult to identify by threading. In fact, after excluding the templates of sequence identity >30%, only two-thirds of the proteins could be assigned by the current threading techniques to the templates of a correct top-
Figure 1

position of the target MSA and the log-odds of the amino acid in the template MSA, the profile [19]. There are alternatives in calculating the PPA scores [20]. The profile-alignment-based methods demonstrated advantages in several recent blind tests [21,22,23]. In LiveBench-8 [21], for example, all top four servers (BASD/MASP/MBAS, SFST/STMP, FFAS03, and ORF2/ORFS) were based on the sequence PPA. In CAFASP [22] and the recent CASP Server Section [23], several sequence-profile-based methods were ranked at the top of single-threading servers. Wu and Zhang [24] recently showed that the accuracy of the sequence PPAs can be further improved by about 5–6% by incorporating a variety of additional structural information.
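As an illustration of the profile–profile alignment (PPA) scoring discussed above, the following is a minimal sketch, not the scoring function of any particular server. It assumes that the score of a column pair is the sum, over amino acids, of the frequency in the target MSA column multiplied by the log-odds value in the template profile column, and that columns are aligned by a plain global dynamic program with a linear gap penalty. The function names, the gap value, and the toy profiles are hypothetical.

```python
from typing import Dict, List

Column = Dict[str, float]  # amino acid letter -> frequency or log-odds value


def ppa_column_score(target_freq: Column, template_logodds: Column) -> float:
    """Score one column pair: sum over amino acids of frequency * log-odds.

    Amino acids missing from the template column contribute 0 (assumption).
    """
    return sum(f * template_logodds.get(aa, 0.0) for aa, f in target_freq.items())


def align_profiles(target: List[Column], template: List[Column],
                   gap: float = -1.0) -> float:
    """Global (Needleman-Wunsch style) alignment score of two profiles.

    `gap` is a hypothetical linear gap penalty; real threading programs
    typically use more elaborate, position-dependent penalties.
    """
    n, m = len(target), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + ppa_column_score(target[i - 1], template[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy two-column profiles (made-up numbers, for illustration only).
toy_target = [{"A": 1.0}, {"G": 0.5, "S": 0.5}]
toy_template = [{"A": 2.0, "G": -1.0}, {"G": 1.0, "S": 1.0}]
toy_score = align_profiles(toy_target, toy_template)
```

In this sketch only the optimal score is returned; recovering the alignment itself would require a traceback through the `dp` table.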
Table 1
Multiple servers from the same lab are represented by the highest-ranked one.

The meta-server predictors have dominated the server predictions in previous experiments (e.g. CAFASP4 [28], LiveBench-8 [21], and CASP6 [30]). In the recent CASP7 experiment [23], however, Zhang-Server (an automated server based on profile–profile threading and I-TASSER structure refinement [31]) clearly outperforms the others (including the meta-servers that take it as an input [29]). A list of the top 10 automated servers in the CASP7 experiment is shown in Table 1. On the one hand, these data highlight the challenge that MQAP methods face in correctly ranking and selecting the best models; on the other hand, the success of the composite threading-plus-refinement servers (such as Zhang-Server, ROBETTA, and MetaTasser) demonstrates the advantage of structure refinement in TBM prediction.

observation was recently made by Summa and Levitt [37], who exploited different molecular mechanics (MM) potentials (AMBER99, OPLS-AA, GROMOS96, and ENCAD) on the refinement of 75 proteins by in vacuo energy minimization. The authors found that a knowledge-based atomic contact potential based on PDB statistics outperforms all the traditional MM potentials by moving almost all the test proteins closer to the native state, while the MM potentials, except for AMBER99, essentially drive the decoys away from the native. The vacuum simulation without solvation may be part of the reason for the failure of the MM potentials. Nevertheless, this observation demonstrates the potential of hybrid knowledge-based and physics-based potentials in protein structure refinement.
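To make the notion of a knowledge-based contact potential derived from PDB statistics more concrete, here is a minimal sketch using the generic inverse-Boltzmann construction, E(a,b) = -kT ln(N_obs/N_exp). This is an assumption chosen for illustration: the text above does not give the functional form used by Summa and Levitt, and the random-mixing reference state, the pseudocount, and all names below are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def contact_potential(contacts: Iterable[Pair],
                      composition: Dict[str, float],
                      kT: float = 1.0,
                      pseudocount: float = 1.0) -> Dict[Pair, float]:
    """Inverse-Boltzmann contact energies from observed contact counts.

    contacts    : atom-type (or residue-type) pairs seen in contact in a
                  structure library (order within a pair is ignored)
    composition : fraction of each type in the library (sums to 1)
    Expected counts assume random mixing of types (reference-state assumption).
    """
    observed = Counter(tuple(sorted(p)) for p in contacts)
    total = sum(observed.values())
    types = sorted(composition)
    potential: Dict[Pair, float] = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            mixing = 1.0 if a == b else 2.0  # unordered a-b pairs count twice
            expected = total * mixing * composition[a] * composition[b]
            ratio = (observed[(a, b)] + pseudocount) / (expected + pseudocount)
            potential[(a, b)] = -kT * math.log(ratio)
    return potential


# Toy library (made-up counts): C-O contacts are over-represented.
toy_contacts = [("C", "O")] * 3 + [("C", "C")]
toy_potential = contact_potential(toy_contacts, {"C": 0.5, "O": 0.5})
```

Pairs seen more often than expected receive negative (favorable) energies and under-represented pairs receive positive ones; a hybrid force field in the sense discussed above would add such terms to physics-based ones.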
and side-chain center of mass) with a purely knowledge-based force field. One of the major contributions to the refinements is the use of multiple threading templates, where the consensus spatial restraint is more accurate than that from any individual template. Second, the composite knowledge-based energy terms have been extensively optimized using large-scale structure decoys [41], which helps coordinate the complicated correlations between different interaction terms.

The progress of threading template refinement has been assessed in the recent CASP7 experiment, where the assessors compared the predicted models with the best structural template (or ‘virtual predictor group’) and commented that ‘The best group in this respect (24, Zhang) managed to achieve a higher GDT-TS score than the virtual group in more than half the assessment units and a higher GDT-HA score in approximately one-third of cases’ [42]. This comparison may not entirely reflect the template refinement ability of the algorithms, because the predictors actually start from threading templates rather than the best structural alignments, and the latter require information about the native structure, which was not available when the predictions were made. On the other hand, a global GDT score comparison may favor the full-length models because the template alignment is shorter than the models. In a direct comparison of the rmsd over the same aligned regions, we find that the first I-TASSER model is closer to the native than the best initial template in 86 of 105 TBM cases, while the other 13 (6) cases are worse than (equal to) the template. The average rmsd is 4.9 and 3.8 Å for the templates and models, respectively, over the same aligned regions [31].

Free modeling
When structural analogs do not exist in the PDB library or cannot be successfully identified by threading (which is more often the case, as shown by Figure 1), the structure prediction has to be generated from scratch. This type of prediction has been termed ‘ab initio’ or ‘de novo’ modeling, a term that may easily be understood as modeling ‘from first principles’. In CASP7, it is named ‘free modeling’, which I think reflects more appropriately the status of the field, since the most efficient methods in this category still use hybrid approaches including both knowledge-based and physics-based potentials. Evolutionary information is often used in generating sparse spatial restraints or identifying local structural building blocks.

The best-known idea for free modeling is probably the one pioneered by Bowie and Eisenberg, who assembled new tertiary structures using small fragments (mainly 9-mers) cut from other PDB proteins [43]. On the basis of a similar idea, Baker and coworkers developed ROSETTA [40], which has worked extremely well for free modeling in the CASP experiments and made the fragment assembly approach popular in the field. In the new developments of ROSETTA [44,45], the authors first assemble structures in a reduced knowledge-based model with conformations specified by the heavy backbone atoms and Cβ atoms. In the second stage, Monte Carlo simulations with an all-atom physics-based potential are performed to refine the details of the low-resolution models. An exciting achievement was demonstrated in CASP6 by generating a model for T0281 (70 residues) that is 1.6 Å away from the crystal structure. In CASP7, ROSETTA built a model for T0283 (112 residues) with rmsd = 1.8 Å over 92 residues (Figure 2, left panel). Despite this significant success, the computing cost of the procedure (150 CPU days for a small protein of <100 residues) is still too high for routine use.

Another successful free modeling approach, called TASSER [36] by Zhang and Skolnick, constructs 3D models based on a purely knowledge-based approach. Continuous fragments of various sizes are excised from threading alignments and used to reassemble protein structures in an on-and-off lattice system. A newer version, I-TASSER, was recently developed by Wu et al. [46]; it refines the TASSER cluster centroids by iterative Monte Carlo simulations. Although the procedure uses structural fragments and spatial restraints from threading templates, it often constructs models of correct topology even when the topologies of individual templates are incorrect. In CASP7, among 19 FM and FM/TBM targets, I-TASSER builds correct topology (3–5 Å) for 7 cases with sequences up to 155 residues long. Figure 2 (right panel) shows one example, T0382 (123 residues), where all initial templates have a wrong topology (>9 Å) but the final model is 3.6 Å away from the X-ray structure.

Significant efforts have been made on purely physics-based protein folding and structure prediction. The very first milestone of successful ab initio protein folding is probably the 1998 work of Duan and Kollman, who folded the villin headpiece (a 36-mer) by MD simulations in explicit solvent for two months on parallel supercomputers, yielding models as close as 4.5 Å to the native [47]. With the help of worldwide-distributed computers, this small protein was recently folded by Pande and coworkers [48] to 1.7 Å with a total simulation time of 300 μs or approximately 1000 CPU years. To reduce the computing cost, Scheraga and coworkers [49] developed a reduced physics-based model, called UNRES, which represents protein conformations by Cα, side-chain center, and a virtual peptide group. The low-energy UNRES models are then converted to all-atom representations based on ECEPP/3. In CASP6, a structural genomics target, TM0487 (T0230, 102 residues), was folded to a structure within 7.3 Å by this approach. Using ASTRO-FOLD with ECEPP/3 optimization, Floudas and coworkers [50] recently constructed a model of 5.2 Å for a four-helical bundle protein of 102 residues in a double-blind prediction.
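Several of the methods in this section (fragment assembly in ROSETTA, the refinement stage of ROSETTA, and TASSER/I-TASSER) search conformational space by Monte Carlo simulation. The sketch below illustrates the general idea with a toy fragment-insertion move and the standard Metropolis acceptance criterion. It is not the move set or energy function of any of these programs: the torsion-angle representation, the fragment library, the quadratic toy energy, and all names are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def metropolis_accept(delta_e: float, kT: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept downhill moves, accept uphill
    moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def fragment_monte_carlo(angles: Sequence[float],
                         fragments: Sequence[Sequence[float]],
                         energy: Callable[[List[float]], float],
                         steps: int, kT: float,
                         rng: random.Random) -> Tuple[List[float], float]:
    """Replace random windows of the chain by library fragments and keep
    the lowest-energy conformation visited."""
    current = list(angles)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    for _ in range(steps):
        frag = rng.choice(fragments)
        start = rng.randrange(len(current) - len(frag) + 1)
        trial = current[:start] + list(frag) + current[start + len(frag):]
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_cur, kT, rng):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best


def toy_energy(conf: List[float]) -> float:
    """Hypothetical energy: squared deviation from a helix-like angle of -60."""
    return sum((a + 60.0) ** 2 for a in conf)


start_angles = [180.0] * 9
library = [[-60.0] * 3, [-120.0] * 3]  # made-up 3-residue fragments
best_conf, best_energy = fragment_monte_carlo(
    start_angles, library, toy_energy, steps=200, kT=100.0,
    rng=random.Random(0))
```

With a funnel-like toy energy such as this one, the search converges quickly; the text above attributes the high cost for real proteins to the combination of a rugged force field and the size of the conformational space.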
Figure 2
Representative examples of free modeling in CASP7 generated by two different approaches. T0283 (left panel) is a TBM target (from Bacillus halodurans) of 112 residues, but the model is generated by all-atom ROSETTA (a hybrid knowledge-based and physics-based approach) [45] based on free modeling; it has a TM-score of 0.74 and an rmsd of 1.8 Å over the first 92 residues (the overall rmsd is 13.8 Å, mainly because of the misorientation of the C-terminus). T0382 (right panel) is an FM/TBM target (from Rhodopseudomonas palustris CGA009) of 123 residues; the model is generated by I-TASSER (a purely knowledge-based approach) [31], with a TM-score of 0.66 and an rmsd of 3.6 Å. Blue and red represent the model and the crystal structure, respectively.
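The caption above quotes both rmsd and TM-score for the same model. The following sketch shows why the two can disagree, using the TM-score definition of Zhang and Skolnick [51]: TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24 (L − 15)^(1/3) − 1.8, where L is the target length. It is a simplification under stated assumptions: the per-residue Cα distances d_i are taken as given from one fixed superposition (the actual TM-score searches for the superposition that maximizes the sum), the lower bound of 0.5 Å on d0 is an assumption for very short chains, and the toy distances are invented, not the real T0283 model.

```python
import math
from typing import Sequence


def rmsd(distances: Sequence[float]) -> float:
    """Root-mean-squared deviation over the given per-residue distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances     : Calpha-Calpha distances (in Angstrom) of aligned residues
    target_length : number of residues in the target (native) structure
    """
    if target_length > 15:
        d0 = max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    else:
        d0 = 0.5  # assumed floor; the d0 formula is not meaningful here
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Toy 112-residue model: 92 residues fit well, 20 tail residues are far off.
toy_distances = [1.8] * 92 + [30.0] * 20
toy_rmsd = rmsd(toy_distances)
toy_tm = tm_score(toy_distances, target_length=112)
```

Because each residue contributes at most 1/L to the TM-score, a misplaced tail lowers it only modestly, whereas the squared distances in the rmsd let the same tail dominate the overall value.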
a golf-hole-like energy landscape without middle-range funnel should not be the one taken in nature, which can be a deeper reason for the failure of conformational search. Thus, the bottleneck for free modeling comes from the lack of both funnel-like force fields and efficient space searching, especially for proteins of larger sizes.

Acknowledgements
The project is supported in part by KU Start-up Fund 06194, the Alfred P. Sloan Foundation, and Grant Number R01GM083107 of the National Institute of General Medical Sciences.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
of special interest
of outstanding interest

1. Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, Lin D, Sali A, Studier FW, Swaminathan S: Structural genomics: beyond the human genome project. Nat Genet 1999, 23:151-157.

2. Chandonia JM, Brenner SE: The impact of structural genomics: expectations and outcomes. Science 2006, 311:347-351.
The authors review and assess the gain and loss of the structural genomics project in the past five years in contrast with traditional structural biology.

3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28:235-242.

4. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D et al.: MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 2006, 34:D291-D295.
MODBASE is a database of 3D models built by the MODELLER pipeline for all protein sequences in SwissProt based on available structural templates in the PDB library.

5. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction (CASP) — round VII. Proteins 2007, 69(Suppl 8):3-9.

6. Bowie JU, Luthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170.

7. Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358:86-89.

8. Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33:2302-2309.

9. Zhang Y, Skolnick J: The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci U S A 2005, 102:1029-1034.
Using the best available templates, TASSER could build high-quality models for all single-domain proteins. This shows that the current structure set in PDB is essentially complete for the protein structure prediction problem, though most of the templates are not detectable by current threading approaches.

10. Skolnick J, Kihara D, Zhang Y: Development and large scale benchmark testing of the PROSPECTOR 3.0 threading algorithm. Proteins 2004, 56:502-518.

11. Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A: FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res 2005, 33:W284-W288.

12. Zhou H, Zhou Y: Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 2005, 58:321-328.

13. Ginalski K, Pas J, Wyrwicz LS, von Grotthuss M, Bujnicki JM, Rychlewski L: ORFeus: detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res 2003, 31:3804-3807.

14. Shi J, Blundell TL, Mizuguchi K: FUGUE: sequence–structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 2001, 310:243-257.

15. Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14:846-856.

16. Soding J: Protein homology detection by HMM–HMM comparison. Bioinformatics 2005, 21:951-960.
The sequence–HMM alignment is extended to the pair-wise profile HMM–HMM alignment for the remote homology detection. The HHsearch is one of the best single-threading servers in CASP7.

17. Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999, 287:797-815.

18. Cheng J, Baldi P: A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006, 22:1456-1463.

19. Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 1987, 84:4355-4358.

20. Sadreyev R, Grishin N: COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003, 326:317-336.

21. Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005, 14:240-245.

22. Fischer D, Rychlewski L, Dunbrack RL Jr, Ortiz AR, Elofsson A: CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins 2003, 53(Suppl 6):503-516.

23. Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T: Automated server predictions in CASP7. Proteins 2007, 69(Suppl 8):68-82.
It is an official assessment paper for the structure prediction servers in CASP7, which is especially helpful for the users who want to find appropriate servers for generating their own structure prediction.

24. Wu ST, Zhang Y: MUSTER: improving protein sequence profile–profile alignments by using multiple sources of structure information. Proteins 2008 doi: 10.1002/prot.21945.

25. Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 19:1015-1018.

26. Wu ST, Zhang Y: LOMETS: a local meta-threading-server for protein structure prediction. Nucleic Acids Res 2007, 35:3375-3382.
LOMETS is a new meta-server with all individual threading programs installed locally, which ensures a quick collection and selection of multiple threading results.

27. Fischer D: 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 2003, 51:434-441.

28. Fischer D: Servers for protein structure prediction. Curr Opin Struct Biol 2006, 16:178-182.

29. Wallner B, Elofsson A: Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 2007, 69(Suppl 8):184-193.
The Pcons-server generates structure predictions by ranking and selecting models generated by other programs. It shows that the structural consensus is the most robust score for protein model selection.

30. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction (CASP) — round 6. Proteins 2005, 61(Suppl 7):3-7.

31. Zhang Y: Template-based modeling and free modeling by I-TASSER in CASP7. Proteins 2007, 69(Suppl 8):108-117.
Template structures can be refined significantly closer to the native by a purely knowledge-based I-TASSER modeling. I-TASSER also generated the correct topology for 7 of 19 free modeling targets in CASP7.

32. Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A: Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins 2005, 61(Suppl 7):27-45.

33. Tramontano A, Morea V: Assessment of homology-based predictions in CASP5. Proteins 2003, 53(Suppl 6):352-368.

34. Lee MR, Tsai J, Baker D, Kollman PA: Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 2001, 313:417-430.

35. Wroblewska L, Skolnick J: Can a physics-based, all-atom potential find a protein’s native structure among misfolded structures? I. Large scale AMBER benchmarking. J Comput Chem 2007, 28:2059-2066.
AMBER plus GB solvation potential can discriminate the native from the roughly minimized structural decoys. After a longer MD simulation, however, the energy–rmsd correlation vanishes. This finding partially explains the discrepancy between the discrimination ability and some unsuccessful folding/refinement results of the physics-based potentials.

36. Zhang Y, Skolnick J: Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci U S A 2004, 101:7594-7599.

37. Summa CM, Levitt M: Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci U S A 2007, 104:3177-3182.
The in vacuo energy minimization experiments show that a knowledge-based atomic contact potential from the PDB statistics outperforms all traditional molecular mechanics potentials in driving the protein structure decoys toward the native state.

38. Misura KM, Chivian D, Rohl CA, Kim DE, Baker D: Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc Natl Acad Sci U S A 2006, 103:5361-5366.
The hybrid approaches of the ROSETTA structure assembly combined with atomic refinements guided by spatial restraints are shown to be able to draw 22 of 39 template models closer to the native.

39. Chen J, Brooks CL III: Can molecular dynamics simulations provide high-resolution refinement of protein structure? Proteins 2007, 67:922-930.
CHARMM22/GBSW with spatial restraints are able to refine four of five CASP6 CM targets with up to 1 Å rmsd reduction, a new progress of the MD-based structure refinements.

40. Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268:209-225.

41. Zhang Y, Kolinski A, Skolnick J: TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys J 2003, 85:1145-1164.

42. Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T: Assessment of CASP7 predictions for template-based modeling targets. Proteins 2007, 69(Suppl 8):38-56.
The paper assesses the template-based modeling category, which includes 108 out of a total of 123 targets/domains in CASP7. Progress in the template refinement is highlighted.

43. Bowie JU, Eisenberg D: An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proc Natl Acad Sci U S A 1994, 91:4436-4440.

44. Bradley P, Misura KM, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science 2005, 309:1868-1871.
This is the first work to report successful high-resolution modeling cases by free modeling. It states that atomic potentials have the lowest energy near the native state and the bottleneck for high-resolution free modeling is the insufficient conformation search.

45. Das R, Qian B, Raman S, Vernon R, Thompson J, Bradley P, Khare S, Tyka MD, Bhat D, Chivian D et al.: Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 2007, 69(Suppl 8):118-128.
The paper summarizes the recent progress of ROSETTA using distributed computing resource and the performance of ROSETTA@home on the CASP7 targets.

46. Wu ST, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 2007, 5:17.
By iterative TASSER assembly, I-TASSER is able to generate medium-resolution to high-resolution models for small proteins without using homologous templates. The computing cost is significantly lower than the corresponding atomic-based structure predictions.

47. Duan Y, Kollman PA: Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998, 282:740-744.

48. Zagrovic B, Snow CD, Shirts MR, Pande VS: Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J Mol Biol 2002, 323:927-937.

49. Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M et al.: Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc Natl Acad Sci U S A 2005, 102:7547-7552.
By using a reduced physics-based approach, UNRES is able to generate correct topologies for proteins up to 102 residues.

50. Klepeis JL, Wei Y, Hecht MH, Floudas CA: Ab initio prediction of the three-dimensional structure of a de novo designed protein: a double-blind case study. Proteins 2005, 58:560-570.

51. Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57:702-710.