SRI RAMACHANDRA
INSTITUTE OF HIGHER EDUCATION AND RESEARCH
(Deemed to be University)
Protein structure prediction
Dr. UDHAYA LAVINYA B.
ASST. PROFESSOR, DEPT. OF BMS, SRIHER (DU)

Introduction
• Proteins differ from one another primarily in their amino acid sequences
• These differences produce different three-dimensional shapes and structures, and therefore different biological functions in cells
• It is much easier to obtain protein sequences than to determine their structures
• The UniProt/TrEMBL database currently contains more than 85 million protein sequences
• On the structure side, X-ray crystallography and NMR spectroscopy are currently the two major experimental techniques for protein structure determination
• Both are time- and labor-intensive, and each has its own technical limitations for different protein targets
• As of April 2017, the number of protein structures in the PDB had grown to about 120,000, which is still less than 0.2% of the protein sequences in UniProt

Secondary structure prediction
• Protein secondary structure refers to the local conformation of a protein's polypeptide backbone
• There are two regular secondary structure states, α-helix (H) and β-strand (E), and one irregular type, the coil region (C)
• Kabsch and Sander developed the Dictionary of Secondary Structure of Proteins (DSSP), a secondary structure assignment method
• DSSP automatically assigns each residue to one of eight states (H, E, B, T, S, L, G, and I) according to hydrogen-bonding patterns
• These eight states are often simplified into three states: helix, sheet, and coil
• The most widely used convention designates G, H, and I as helix; B and E as sheet; and all other states as coil
• Most commonly, the prediction problem is formulated as follows: given a protein sequence, predict whether each amino acid is in an α-helix (H), a β-strand (E), or a coil region (C)
• Prediction is usually evaluated by Q3 accuracy: the percentage of residues whose three-state secondary structure is predicted correctly

Secondary structure prediction
• Many statistical and machine learning approaches have been developed to predict secondary structure
• The Chou-Fasman method, one of the first approaches for predicting protein secondary structure, uses a combination of statistical and heuristic rules
• The GOR method formalizes the secondary structure prediction problem within an information-theoretic framework
• Position-specific scoring matrices (PSSMs) generated by PSI-BLAST capture evolutionary information and have produced the most significant improvements in protein secondary structure prediction
• Machine learning methods perform well by exploiting evolutionary information as well as statistical information about amino acid subsequences
• For example, neural network (NN), hidden Markov model (HMM), support vector machine (SVM), and k-nearest neighbor methods have had substantial success, and Q3 accuracy has reached about 80%

Secondary structure prediction
• Prediction accuracy has improved continuously over the years, especially through
  • hybrid or ensemble methods and
  • evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences
• The highest Q3 accuracy without relying on structure templates is now 82–84%
• DeepCNF is a deep learning extension of conditional neural fields (CNF), which integrate conditional random fields and shallow neural networks
• The overall performance of DeepCNF is significantly better than that of other state-of-the-art methods, breaking the long-standing ~80% accuracy barrier
• More recently, SPIDER3 improved secondary structure prediction by capturing non-local interactions with long short-term memory (LSTM) bidirectional recurrent neural networks
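
The eight-to-three state reduction, the Q3 measure, and the simplest form of ensemble prediction described in the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, any DSSP symbol other than G, H, I, B, or E is treated as coil, and ties in the vote are broken in favour of the earliest predictor, which is an arbitrary choice rather than something the slides specify.

```python
from __future__ import annotations

# Common eight-to-three reduction: G, H, I -> helix (H); B, E -> strand (E);
# every other symbol is treated as coil (C).
DSSP_TO_THREE = {"G": "H", "H": "H", "I": "H", "B": "E", "E": "E"}


def reduce_to_three_states(dssp: str) -> str:
    """Map a DSSP eight-state string to the three states H, E and C."""
    return "".join(DSSP_TO_THREE.get(state, "C") for state in dssp)


def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def majority_vote(predictions: list[str]) -> str:
    """Per-residue consensus of several three-state predictions.

    Ties go to the label from the earliest predictor in the list
    (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        # max() returns the first element with the highest count.
        consensus.append(max(column, key=column.count))
    return "".join(consensus)


if __name__ == "__main__":
    observed = reduce_to_three_states("HHHHGTTSEEEB-")
    consensus = majority_vote(["HHHHHCCCEEECC", "HHHHCCCCEEEEC", "HHHCHCCEEEEEC"])
    print(observed, consensus, q3_accuracy(consensus, observed))
```

Real ensemble predictors usually combine per-residue class probabilities rather than hard labels; the hard vote is used here only because it is the easiest variant to follow.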

Frequently used tools
PSRSM
• Protein Secondary Structure Prediction based on Data Partition and Semi-Random Subspace Method
• Partitions the training dataset by protein sequence length and uses a semi-random subspace technique to train multiple classifiers
• Combines the classifiers' predictions by majority vote
• Reported Q3 accuracy ranges from 85% to 86.38% across datasets, outperforming many existing methods
PSSpred
• Neural network-based tool that uses multiple sequence alignments gathered through PSI-BLAST
• Trains separate neural networks for secondary structure prediction on amino acid frequency data
• The final prediction combines the results of seven different neural network predictors
JPred
• Server that uses multiple neural networks trained on PSI-BLAST and HMMER profiles to predict both secondary structure and solvent accessibility
• Input formats: accepts sequences in various formats, including FASTA, and allows batch submission of multiple sequences
PSIPRED
• Uses two feed-forward neural networks to analyze PSI-BLAST output profiles for secondary structure prediction
• It is regarded as one of the most reliable tools in the field
RaptorX-SS8
• Uses conditional neural fields to predict both three-state and eight-state secondary structure from protein sequences
• It is recognized for its effectiveness in structure prediction tasks

Tertiary structure prediction
• The three-dimensional arrangement of all the atoms in a single polypeptide chain
• Crucial for the protein's function
• Formed through
  • interactions among the side chains (R groups) of the protein's amino acids and
  • interactions between these side chains and the polypeptide backbone

Anfinsen’s dogma

Methods
• Similar sequences from the same evolutionary family often adopt similar protein structures
• This forms the foundation of homology modeling
• Homology modeling is the most accurate way to predict a protein's structure: it uses a homologous structure in the PDB as a template
• With the rapid growth of the PDB, an increasing proportion of target proteins can be predicted by homology modeling
• When no structure with obvious sequence similarity to the target protein can be found in the PDB, it is often still possible to find proteins with structural similarity to the target
• The method for identifying such template structures in the PDB is called threading, or fold recognition
• It matches the target sequence against homologous and distantly homologous structures using an alignment algorithm and takes the best matches as structural templates
• Threading works on the premise that protein structure is highly conserved in evolution and that the number of unique structural folds in nature is limited
• Both homology modeling (based on sequence comparison) and threading (based on fold recognition) are template-based structure prediction methods

Frequently used tools
FALCON2
• Integrates template-based modeling (ProALIGN) and ab initio prediction (ProFOLD)
• Uses both approaches together to improve prediction accuracy: ProALIGN aligns the target protein with known templates, while ProFOLD uses a neural network to estimate inter-residue distances
• The server includes quality assessment tools that select the best candidate structures from the predictions
AlphaFold
• Deep learning-based approach
• Developed by DeepMind; uses attention mechanisms to model the relationships between amino acids
• It has set new benchmarks in structure prediction, particularly in the CASP competitions, predicting complex structures with high accuracy
I-TASSER
• Threading and fragment assembly
• I-TASSER predicts protein structures by threading the target sequence through known structures and assembling fragments from the resulting templates
• It is widely used to generate structural models when experimental data are lacking

Frequently used tools
Phyre2
• Template-based modeling
• Predicts protein structures by aligning sequences with known structures and building models from these alignments
• Provides a user-friendly interface for submitting sequences and receiving structural predictions
MODELLER
• Homology modeling
• Builds models from homologous proteins with known structures, allowing users to create accurate models of target proteins
• Offers extensive options for model refinement and evaluation
RaptorX
• Remote homology detection and threading
• Combines template-based methods with ab initio approaches to predict protein structures
• Provides detailed structural predictions along with confidence scores
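
Template-based methods depend on choosing a good template for the target sequence. The sketch below illustrates the simplest version of that step: ranking candidate templates by percent sequence identity. It assumes the target has already been aligned to each template (gaps written as "-"), and it counts identity over columns where both sequences have a residue, which is only one of several definitions in use. The function names, template labels, and sequences are hypothetical; the tools listed above rely on more sensitive profile-based and threading scores rather than raw identity.

```python
from __future__ import annotations


def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over alignment columns without a gap in either row."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def rank_templates(
    alignments: dict[str, tuple[str, str]],
) -> list[tuple[str, float]]:
    """Sort candidate templates from highest to lowest percent identity."""
    scored = [
        (name, percent_identity(target, template))
        for name, (target, template) in alignments.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pairwise alignments: (aligned target, aligned template).
    candidates = {
        "template_B": ("MKTAYIAK", "MR-AFIGK"),
        "template_A": ("MKTAYIAK", "MKTAYLAK"),
    }
    for name, identity in rank_templates(candidates):
        print(f"{name}: {identity:.1f}% identity")
```

High identity generally points toward homology modeling, while low identity is the regime where threading or fold recognition becomes the more useful route.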