Introduction to Bioinformatics. 9. Machine Learning for Protein Structure Prediction #1. Course 341, Department of Computing, Imperial College, London. © Simon Colton
Remember the Scenario: We have found a gene in mice which, when active, makes them immune to a disease. The gene codes for a protein, the protein has a shape, and the shape dictates what it does. Humans share 96% of their genes with mice. So, what does the human protein look like?
The Database Approach: If two sequences are similar, then they are very likely to code for similar proteins. Find the best match for the mouse gene, in terms of sequence, from a large database of individual human genes or from a database of families of genes. Infer the protein structure from knowledge of the matched genes. If we are lucky, the structure of one of them may already be known.
There is another way… Machine learning is a general set of techniques for teaching a computer to make predictions by observing given correct predictions (being trained). A special type of prediction is the classification of objects into classes, e.g., images into faces/cars/landscapes, or drugs into toxic/non-toxic (binary classification). We want to predict a protein’s structure given its sequence.
A Good Approach: Look at regions of a protein, i.e., runs of consecutive residues. Define ways to describe the regions, so that we can infer the structure of a protein from a description of all its regions. Learn methods for predicting what type of region a particular residue will be in. Apply this to a protein sequence to find contiguous regions with the same description. Put the regions together to predict the entire structure.
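For example: the residue sequence G A G D G A N A A A is fed to a trained predictor, which labels the residues Alpha Alpha Alpha Alpha Inter Inter Beta Beta Beta Beta. Further processing then groups these labels into an alpha helix and a beta sheet.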
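The "further processing" step on this slide can be sketched as follows. This is only a minimal illustration, assuming the predictor returns one label per residue; the function name `contiguous_regions` is hypothetical and not from the lecture.

```python
from itertools import groupby


def contiguous_regions(labels):
    """Collapse per-residue labels into (label, start, end) regions.

    Positions are 0-based and `end` is inclusive.
    """
    regions = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        regions.append((label, position, position + length - 1))
        position += length
    return regions


# The sequence and labels from the slide.
sequence = "GAGDGANAAA"
labels = ["Alpha"] * 4 + ["Inter"] * 2 + ["Beta"] * 4

regions = contiguous_regions(labels)
for label, start, end in regions:
    # Print each region's label with the residues it covers.
    print(label, sequence[start:end + 1])
```

Grouping adjacent identical labels is the simplest possible post-processing; a real system would likely also smooth out isolated, implausible labels.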
Two Main Questions: (1) How do we describe protein structures? What are alpha helices and beta sheets? Covered in the next lecture. (2) How do we train our predictors? Covered in this lecture (and the start of the next…).
Machine Learning in a Nutshell: Examples in, predictor out. Learning is by example: more examples generally give better predictors. For some methods the examples are used once; for other methods they are used repeatedly.
Machine Learning Considerations: What is the problem for the predictor to address? What is the nature of our data? How will we represent the predictor? How will we train the predictor? How will we test how good the predictor is?
Types of Learning Problems in Bioinformatics: Class membership (e.g., predictive toxicology). Prediction of sequences (e.g., sequences of protein sub-structures). Classification hierarchies (e.g., folds, families, super-families). Shape descriptions (e.g., binding site descriptions). Temporal models (e.g., activity of cells, metabolic pathways).
Learning Data: Data comes in many forms, in particular: objects (to be classified/predicted for); classifications/predictions of objects; features of objects (to use in the prediction). Problems with data: imprecise information (e.g., badly recorded data); irrelevant information (e.g., additional features); incorrect information (e.g., wrong classifications); missing classifications; missing features for sets of objects.
Types of Representations: Logical (decision trees, grammars, logic programs): symbolic, understandable representations. Probabilistic (neural networks, Hidden Markov Models, SVMs): mathematical functions, not easy to understand. Mixed (Bayesian networks, Stochastic Logic Programs): have the advantages of both, but are more difficult to work with.
Advantages of Representations: Probabilistic models can handle noisy and imprecise data; are useful when there is a notion of uncertainty (in data/hypothesis); are well-founded (300 years of development); and have good statistical algorithms for estimation. Logical models offer richness of description; extensibility (probabilities, sequences, space, time, actions); clarity of results; and are well-founded (2400 years of development).
Decision Tree Representations: The input is a set of features describing an example/situation. The tree makes many “if-then” choices, and the leaves are decisions. As a logical representation: each “if-then” is an implication, each branch (a path from the root to a leaf) is a conjunction, and the different branches together comprise a disjunction.
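To make the logical reading concrete, here is a small sketch in which a tree is stored as nested dictionaries and each root-to-leaf path is read off as a conjunction of feature tests. The features (`hydrophobic`, `charged`) and the tree itself are invented for illustration and are not taken from the lecture.

```python
# A hypothetical tree: internal nodes test a feature, leaves are decisions.
tree = {
    "feature": "hydrophobic",
    "branches": {
        "yes": {
            "feature": "charged",
            "branches": {"yes": "Inter", "no": "Alpha"},
        },
        "no": "Beta",
    },
}


def classify(node, example):
    """Follow the if-then choices from the root until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["feature"]]]
    return node


def rules(node, conditions=()):
    """Return one (conjunction of tests, decision) pair per branch."""
    if not isinstance(node, dict):
        return [(conditions, node)]
    found = []
    for value, subtree in node["branches"].items():
        found += rules(subtree, conditions + ((node["feature"], value),))
    return found


decision = classify(tree, {"hydrophobic": "yes", "charged": "no"})
for conditions, leaf in rules(tree):
    tests = " AND ".join(f"{feature}={value}" for feature, value in conditions)
    print(f"IF {tests} THEN {leaf}")
```

The list returned by `rules` is the disjunction: an example is classified by whichever single conjunction it satisfies.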
Artificial Neural Networks: Layers of nodes. The input is transformed into numbers. Weighted sums are fed into nodes, and high or low numbers come out of nodes; a threshold function determines whether the output is high or low. Output nodes will “fire” or not, which determines the classification for an example.
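A single node of this kind can be sketched in a few lines. The weights and threshold below are hand-picked assumptions chosen so that the unit behaves like a logical AND; they are not learned and not from the lecture.

```python
def threshold_unit(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


# Hand-picked values: the unit fires only when both inputs are 1.
and_weights = [1.0, 1.0]
and_threshold = 1.5

outputs = [
    threshold_unit(pair, and_weights, and_threshold)
    for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
print(outputs)
```

A network stacks layers of such units, feeding the outputs of one layer in as the inputs of the next.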
Logic Program Representations: Logic programs are a subset of first-order logic. They consist of sets of Horn clauses. A Horn clause is a conjunction of literals implying a single literal. Logic programs can easily be interpreted, and are at the heart of the Prolog programming language.
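As a rough illustration of the "conjunction implies a single literal" shape, the sketch below stores propositional Horn clauses as (body, head) pairs and derives consequences by forward chaining. This is a deliberate simplification: Prolog works with first-order clauses and answers queries by backward chaining, and the proposition names here are hypothetical.

```python
def forward_chain(facts, clauses):
    """Repeatedly apply Horn clauses until no new literal can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            # A clause fires when every literal in its body is already known.
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


# Each clause reads: body literals (a conjunction) imply the head literal.
clauses = [
    (("hydrophobic", "short_side_chain"), "helix_former"),
    (("helix_former", "neighbours_helix"), "in_alpha_helix"),
]
facts = {"hydrophobic", "short_side_chain", "neighbours_helix"}

derived = forward_chain(facts, clauses)
print(sorted(derived - facts))
```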
Learning Decision Trees: Problem: which feature does each node in the tree test, and what happens for each case? The ID3 algorithm uses a notion of “information gain”, based on entropy (how (dis)organised the data is). It chooses the feature with the highest information gain as the node to add to the tree next, then restricts the examples for the next node.
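The node-selection step can be sketched as below, using the standard definitions of entropy and information gain. The toy examples, feature names and class labels are assumptions made up for illustration; only the choice of the best feature is shown, not the full recursive tree construction.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, labels, feature):
    """Reduction in entropy obtained by splitting the examples on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in {example[feature] for example in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_feature(examples, labels, features):
    """Pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(examples, labels, f))


# Hypothetical toy data: "hydrophobic" separates the classes, "charged" does not.
examples = [
    {"hydrophobic": "yes", "charged": "no"},
    {"hydrophobic": "yes", "charged": "yes"},
    {"hydrophobic": "no", "charged": "no"},
    {"hydrophobic": "no", "charged": "yes"},
]
labels = ["alpha", "alpha", "beta", "beta"]

chosen = best_feature(examples, labels, ["hydrophobic", "charged"])
print(chosen)
```

After choosing a feature, the same procedure would be applied to each subset of examples that reaches a child node, which is the "restricts examples for next node" step.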
Learning Artificial Neural Networks: First problem: the layer structure, usually chosen through trial and error. Main problem: choosing the weights, using a back-propagation algorithm to train them. Each example is given in turn. If it is currently correctly classified, that’s fine. If not, the errors from the output are passed back and propagated in order to change the weights throughout. Only very small changes are made (to avoid un-doing good work). Once all examples have been given, we start again, until some termination condition (e.g., accuracy) is met. This often requires thousands of such training ‘epochs’.
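The epoch loop can be illustrated with a deliberately reduced case: a single sigmoid unit trained by gradient descent on squared error. This is not full back-propagation, since there are no hidden layers to propagate errors through, and it nudges the weights in proportion to the error on every example rather than only on misclassified ones. The learning rate, epoch count and the OR training set are all assumptions for the sketch.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_unit(examples, epochs=2000, learning_rate=0.5):
    """Train one sigmoid unit; each pass over all examples is one epoch."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output
            # Small step along the negative gradient of the squared error.
            step = learning_rate * error * output * (1.0 - output)
            weights = [w + step * x for w, x in zip(weights, inputs)]
            bias += step
    return weights, bias


def predict(weights, bias, inputs):
    """Classify as 1 when the unit's output is at least 0.5."""
    activation = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    return 1 if activation >= 0.5 else 0


# Toy training set: logical OR of two inputs.
or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_unit(or_examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in or_examples]
print(predictions)
```

A fixed number of epochs is used here for simplicity; the slide's termination condition would instead stop once accuracy is good enough.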
Learning Logic Programs: A notion of generality is used: one logic program is more general than another if the other follows from it (sort of) [subsumption]. A search space of logic programs is defined, constrained using a language bias. Some programs search from general to specific, using rules of deduction to go between sentences. Other programs search from specific to general, using inverted rules of deduction. Search is guided by the performance of the LP with respect to classifying the training examples, and by information-theoretic calculations to avoid over-specialisation.
Testing Learned Predictors #1: It is imperative to test on unseen examples. We cannot report accuracy on examples which have been used to train the predictor, because the results will be heavily biased. Simple method: hold back. When the number of examples is > 200 (roughly), split into a training set and a test set (e.g., 80%/20%). Never let the training algorithm see the test set. Report the accuracy of the predictor on the test set only. We have to worry about statistics with smaller numbers. This kind of testing came a little late to bioinformatics: beware conclusions drawn about badly tested predictors.
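A hold-back split might look like the sketch below. The function name, the fixed random seed and the 250 stand-in examples are assumptions for illustration; the 80%/20% proportion is the one suggested on the slide.

```python
import random


def hold_out_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction of them as a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled examples are held back for testing only.
    return shuffled[n_test:], shuffled[:n_test]


# 250 stand-in examples, identified here simply by number.
training_set, test_set = hold_out_split(range(250))
print(len(training_set), len(test_set))
```

Shuffling before splitting matters when the examples are stored in some meaningful order, for instance grouped by class.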
N-Fold Cross Validation: Leave one out: for m < 20 examples, train on m-1 examples and test the predictor on the left-out example. Do this for every example and report the average accuracy. N-fold cross validation: randomly split into n mutually exclusive sets (partitions). For every set S, train using all examples from the other n-1 sets, test the predictor on S, and record the accuracy. Report the average accuracy over all the sets. 10-fold cross validation is common.
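The procedure can be sketched generically, with the learner passed in as a function. Everything named here is hypothetical: `train` and `accuracy` are caller-supplied, and the placeholder learner used in the example simply memorises the most common training label so that the code runs end to end.

```python
import random
from collections import Counter


def n_fold_partitions(examples, n, seed=0):
    """Randomly split the examples into n mutually exclusive sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def cross_validate(examples, n, train, accuracy, seed=0):
    """Average test accuracy over n folds; n = len(examples) is leave-one-out."""
    folds = n_fold_partitions(examples, n, seed)
    scores = []
    for i, test_fold in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predictor = train(training)
        scores.append(accuracy(predictor, test_fold))
    return sum(scores) / len(scores)


def train_placeholder(training):
    """Placeholder learner: remember the most common training label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def placeholder_accuracy(predicted_label, test_fold):
    return sum(label == predicted_label for _, label in test_fold) / len(test_fold)


# 100 stand-in examples as (identifier, label) pairs: 70 alpha, 30 beta.
data = [(i, "alpha") for i in range(70)] + [(i, "beta") for i in range(30)]
mean_accuracy = cross_validate(data, 10, train_placeholder, placeholder_accuracy)
print(mean_accuracy)
```

Every example is used for testing exactly once, so no accuracy figure is ever computed on data the corresponding predictor was trained on.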
Testing Learned Predictors #2: Often there are different considerations for different contexts, e.g., false positives/negatives in medical diagnosis. Confusion matrix for binary prediction tasks:
              Predicted F              Predicted T
Actually F    number = a               number = b (false pos)
Actually T    number = c (false neg)   number = d
Let t = a+b+c+d. Predictive accuracy = (a+d)/t. Majority class = max((a+b)/t, (c+d)/t). Precision = Selectivity = d/(b+d). Recall = Sensitivity = d/(c+d).
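The formulas on this slide translate directly into code. The counts in the example are made up purely to show the arithmetic.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from a binary confusion matrix.

    a = true negatives, b = false positives,
    c = false negatives, d = true positives.
    """
    t = a + b + c + d
    return {
        "accuracy": (a + d) / t,
        "majority_class": max((a + b) / t, (c + d) / t),
        "precision": d / (b + d),
        "recall": d / (c + d),
    }


# Hypothetical counts for 100 test examples.
metrics = confusion_metrics(a=50, b=10, c=5, d=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision is undefined when nothing is predicted T (b + d = 0) and recall is undefined when nothing is actually T (c + d = 0); this sketch does not guard against those cases.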
Comparing Learning Methods: A very simple method is the majority class predictor: predict everything to be in the majority class. Trained predictors must beat this to be credible. N-fold cross validation results are compared to show an advance in predictor technology. However, accuracy is not the only consideration: speed, memory and comprehensibility also matter.
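A baseline check along these lines could be sketched as follows. The label lists and the accuracy figure attributed to a "trained" predictor are invented for illustration.

```python
from collections import Counter


def majority_class_accuracy(training_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(training_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


# Hypothetical toxicology labels.
training_labels = ["non-toxic"] * 60 + ["toxic"] * 40
test_labels = ["non-toxic"] * 13 + ["toxic"] * 7

baseline = majority_class_accuracy(training_labels, test_labels)

# Assumed test accuracy of some trained predictor, for comparison only.
trained_accuracy = 0.80
print(f"baseline {baseline:.2f}, trained {trained_accuracy:.2f}")
print("credible" if trained_accuracy > baseline else "no better than guessing")
```

On heavily imbalanced data the baseline can be very high, which is why a raw accuracy figure means little without it.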
Overfitting: It’s easy to over-train predictors. If a predictor is substantially better on the training set than on the test set, it is overfitting. Essentially, it has memorised aspects of the examples rather than generalising properties of them. This is bad: think of a completely new example. Individual learning schemes have coping methods. An easy general approach to avoiding overfitting: maintain a validation set to perform tests on during training. When performance on the validation set degrades, stop learning. Be careful of blips in predictive accuracy (leave a while, then come back). Note: the validation set shouldn’t be used as the testing set.
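The validation-set approach, including tolerance for short blips, can be sketched as a stopping rule over a sequence of per-epoch validation accuracies. The `patience` parameter and the accuracy values are assumptions for illustration; in practice each value would be measured after a real training epoch.

```python
def early_stopping_epoch(validation_accuracies, patience=3):
    """Return the index of the epoch whose predictor should be kept.

    Training stops once `patience` epochs pass without a new best
    validation accuracy, so a brief dip does not end training.
    """
    best_epoch = 0
    best_accuracy = float("-inf")
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy = epoch, accuracy
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical validation accuracies: a blip at epoch 2, a real decline after epoch 3.
history = [0.60, 0.70, 0.68, 0.74, 0.73, 0.72, 0.71, 0.80]
kept = early_stopping_epoch(history)
print(kept)
```

With these numbers training stops at epoch 6, so the late value 0.80 is never reached; choosing the patience is a trade-off between wasted epochs and stopping too soon.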


Lecture 9 slides: Machine learning for Protein Structure ...

  • 1. Introduction to Bioinformatics 9. Machine Learning for Protein Structure Prediction #1 Course 341 Department of Computing Imperial College, London © Simon Colton
  • 2. Remember the Scenario We have found a gene in mice Which when active makes them immune to a disease The gene codes for a protein, the protein has a shape and the shape dictates what it does Humans share 96% of their genes with mice So, what does the human protein look like?
  • 3. The Database Approach If two sequences are sequentially similar Then they are very likely to code for similar proteins Find the best match for the mouse gene In terms of sequences From a large database of individual human genes Or from a database of families of genes Infer protein structure from knowledge of matched genes If lucky, a structure of one of them may already be known
  • 4. There is another way… Machine learning: general set of techniques For teaching a computer to make predictions By observing given correct predictions (being trained) Special type of prediction Classification of objects into classes E.g., images into faces/cars/landscapes E.g., drugs into toxic/non-toxic ( binary classification) We want to predict a protein’s structure Given its sequence
  • 5. A Good Approach Look at regions of a protein i.e., lengths of residues Define ways to describe the regions So that we can infer the structure of a protein From a description of all its regions Learn methods for predicting: What type of region a particular residue will be in Apply this to a protein sequence To find contiguous regions with same description Put regions together to predict entire structure
  • 6. For example G A G D G A N A A A Alpha Alpha Alpha Alpha Inter Inter Beta Beta Beta Beta Trained Predictor Alpha Helix Beta Sheet Further Processing
  • 7. Two Main Questions How do we describe protein structures? What are alpha helices and beta-sheets? Covered in the next lecture How do we train our predictors? Covered in this lecture (and the start of the next…)
  • 8. Machine Learning in a Nutshell Examples in Predictor out Learning is by example More examples, better predictors For some methods, the examples are used once For other methods, they are used repeatedly
  • 9. Machine Learning Considerations What is the problem for the predictor to address? What is the nature of our data? How will we represent the predictor? How will we train the predictor? How will we test how good the predictor is?
  • 10. Types of Learning Problems in Bioinformatics Class membership e.g., predictive toxicology Prediction of sequences e.g., sequences of protein sub-structures Classification hierarchies e.g., folds, families, super-families Shape descriptions e.g., binding site descriptions Temporal models e.g., activity of cells, metabolic pathways
  • 11. Learning Data Data comes in many forms, in particular: Objects (to be classified/predicted for) Classifications/predictions of objects Features of objects (to use in the prediction) Problems with data Imprecise information (e.g., badly recorded data) Irrelevant information (e.g., additional features) Incorrect information (e.g., wrong classifications) Missing classifications Missing features for sets of objects
  • 12. Types of Representations Logical Decision trees , grammars, logic programs Symbolic, understandable representations Probabilistic Neural networks , Hidden Markov Models , SVMs Mathematical functions, not easy to understand Mixed Bayesian Networks, Stochastic Logic Programs Have advantages of both, more difficult to work with
  • 13. Advantages of Representations Probabilistic models: Can handle noisy and imprecise data Useful when there is a notion of uncertainty (in data/hypothesis) Well-founded (300 years of development) Good statistical algorithms for estimation Logical models Richness of description Extensibility - probabilities, sequences, space, time, actions Clarity of results Well-founded (2400 years of development)
  • 14. Decision Tree Representations Input is a set of features Describing an example/situation Many “if-then” choices Leaves are decision Logical representation: “ If then” is implication Branches are conjunctions Different branches comprise A disjunction
  • 15. Artificial Neural Networks: Layers of nodes. Input is transformed into numbers; weighted sums of these numbers are fed into nodes; high or low numbers come out of nodes. A threshold function determines whether the output is high or low. Whether the output nodes “fire” or not determines the classification for an example.
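A single node of such a network can be sketched as follows. This is an illustrative simplification with a hard threshold; the weights and threshold value are arbitrary assumptions chosen so that the node behaves like a logical OR.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 (the node "fires") if the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


# Arbitrary illustrative weights: the node fires if either input is on.
for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(inputs, threshold_unit(inputs, weights=[0.6, 0.6], threshold=0.5))
```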
  • 16. Logic Program Representations: Logic programs are a subset of first-order logic. They consist of sets of Horn clauses. Horn clause: a conjunction of literals implying a single literal. They can easily be interpreted. They are at the heart of the Prolog programming language.
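As a rough illustration of how Horn clauses are used, the sketch below applies forward chaining to propositional Horn clauses, each written as a (body, head) pair. It is a deliberate simplification: real logic programs have variables and are usually run by Prolog-style backward chaining. The predicate names are hypothetical.

```python
def forward_chain(facts, clauses):
    """Repeatedly apply Horn clauses (body literals imply head) until nothing new follows."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head not in known and all(literal in known for literal in body):
                known.add(head)
                changed = True
    return known


# Hypothetical clauses: (conjunction of literals) -> single literal.
CLAUSES = [
    (("has_helix", "has_sheet"), "mixed_fold"),
    (("mixed_fold", "binds_atp"), "kinase_like"),
]
print(sorted(forward_chain({"has_helix", "has_sheet", "binds_atp"}, CLAUSES)))
```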
  • 17. Learning Decision Trees: Problem: which feature should each node in the tree test, and what happens in each case? ID3 algorithm: uses a notion of “information gain”, based on entropy (how disorganised the data is); chooses the feature with the highest information gain as the node to add to the tree next; then restricts the examples considered for the next node to those that follow the corresponding branch.
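The feature-selection step of ID3 can be sketched as below, using the standard definitions of Shannon entropy and information gain. The toy examples, feature names and labels are hypothetical, and the recursion that builds the full tree is omitted.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, labels, feature):
    """Reduction in entropy obtained by splitting the examples on one feature."""
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_feature(examples, labels, features):
    """The feature ID3 would test at the next node."""
    return max(features, key=lambda f: information_gain(examples, labels, f))


# Hypothetical toy data: "charged" separates the classes perfectly, "aromatic" not at all.
EXAMPLES = [
    {"charged": "yes", "aromatic": "no"},
    {"charged": "yes", "aromatic": "yes"},
    {"charged": "no", "aromatic": "no"},
    {"charged": "no", "aromatic": "yes"},
]
LABELS = ["toxic", "toxic", "non-toxic", "non-toxic"]
print(best_feature(EXAMPLES, LABELS, ["charged", "aromatic"]))
```

After choosing a feature, ID3 would recurse on each subset of examples, which is the "restricts examples for the next node" step.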
  • 18. Learning Artificial Neural Networks: First problem: the layer structure, usually chosen through trial and error. Main problem: choosing the weights, which are trained with the back-propagation algorithm. Each example is given in turn: if it is currently correctly classified, that’s fine; if not, the errors from the output are passed back and propagated in order to change the weights throughout. Only very small changes are made (to avoid undoing good work). Once all examples have been given, we start again, until some termination condition (e.g., accuracy) is met. This often requires thousands of such training ‘epochs’.
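A minimal sketch of this training loop, assuming a network with one hidden layer, sigmoid nodes and a squared-error measure (all assumptions; the lecture does not fix these details). Unlike the description above, this variant adjusts the weights a little on every example in proportion to its error, rather than only on misclassified ones. The starting weights, learning rate and stopping threshold are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network():
    """Two inputs, two hidden nodes, one output; arbitrary small starting weights."""
    return {
        "w_hidden": [[0.1, -0.2], [0.3, 0.2]],
        "b_hidden": [0.0, 0.0],
        "w_out": [0.2, -0.1],
        "b_out": 0.0,
    }


def forward(net, inputs):
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, output


def train_epoch(net, examples, rate=0.1):
    """Present every example once, passing the output error back through the weights."""
    for inputs, target in examples:
        hidden, output = forward(net, inputs)
        delta_out = (target - output) * output * (1 - output)
        # Hidden-layer error signals use the output weights before they are updated.
        delta_hidden = [delta_out * w * h * (1 - h)
                        for w, h in zip(net["w_out"], hidden)]
        for j, h in enumerate(hidden):
            net["w_out"][j] += rate * delta_out * h  # small change only
        net["b_out"] += rate * delta_out
        for j, d in enumerate(delta_hidden):
            for i, x in enumerate(inputs):
                net["w_hidden"][j][i] += rate * d * x
            net["b_hidden"][j] += rate * d


def total_error(net, examples):
    return sum((target - forward(net, inputs)[1]) ** 2 for inputs, target in examples)


# Toy task (logical OR), trained for up to 2000 epochs or until the error is small.
OR_EXAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = make_network()
before = total_error(net, OR_EXAMPLES)
for epoch in range(2000):
    train_epoch(net, OR_EXAMPLES)
    if total_error(net, OR_EXAMPLES) < 0.05:  # termination condition
        break
after = total_error(net, OR_EXAMPLES)
print(before, after)
```

The small `rate` plays the role of the "very small changes" in the slide, and each pass through `train_epoch` is one epoch.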
  • 19. Learning Logic Programs: A notion of generality is used: one logic program is more general than another if the second follows from the first (roughly speaking; formally, subsumption). A search space of logic programs is defined, constrained using a language bias. Some programs search from general to specific, using rules of deduction to go between sentences; other programs search from specific to general, using inverted rules of deduction. Search is guided by: the performance of the logic program in classifying the training examples; information-theoretic calculations to avoid over-specialisation.
  • 20. Testing Learned Predictors #1: It is imperative to test on unseen examples. Accuracy cannot be reported on examples which have been used to train the predictor, because the results will be heavily biased. Simple method: hold back. When the number of examples is greater than about 200, split them into a training set and a test set (e.g., 80%/20%); never let the training algorithm see the test set; report the accuracy of the predictor on the test set only. With smaller numbers of examples, we have to worry about the statistics. This kind of testing came a little late to bioinformatics: beware conclusions drawn from badly tested predictors.
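The hold-back method can be sketched as follows. The function names, the fixed random seed and the representation of an example as a (features, label) pair are assumptions made for illustration.

```python
import random


def hold_back_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction of them as an unseen test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def accuracy(predict, test_set):
    """Fraction of (features, label) pairs in the test set that predict() gets right."""
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return correct / len(test_set)


training_set, test_set = hold_back_split(range(250))
print(len(training_set), len(test_set))
```

Only `training_set` would be passed to the learning algorithm; `accuracy` is then reported on `test_set` alone.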
  • 21. N-Fold Cross Validation: Leave-one-out (for m < 20 examples): train on m-1 examples and test the predictor on the left-out example; do this for every example and report the average accuracy. N-fold cross validation: randomly split the examples into n mutually exclusive sets (partitions); for every set S, train using all examples from the other n-1 sets, test the predictor on S and record the accuracy; report the average accuracy over all the sets. 10-fold cross validation is common.
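A minimal sketch of n-fold cross validation. The `train` and `accuracy` arguments are hypothetical callables supplied by the user: `train(examples)` returns a predictor and `accuracy(predictor, examples)` returns a number between 0 and 1. Leave-one-out is the special case where n equals the number of examples.

```python
import random


def n_fold_partitions(examples, n, seed=0):
    """Randomly split the examples into n mutually exclusive sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def cross_validate(examples, n, train, accuracy, seed=0):
    """Train on n-1 sets, test on the remaining one, and average over all n choices."""
    folds = n_fold_partitions(examples, n, seed)
    scores = []
    for i, test_fold in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predictor = train(training)
        scores.append(accuracy(predictor, test_fold))
    return sum(scores) / len(scores)


folds = n_fold_partitions(range(20), 10)
print([len(fold) for fold in folds])
```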
  • 22. Testing Learned Predictors #2: There are often different considerations in different contexts, e.g., false positives versus false negatives in medical diagnosis. Confusion matrix for binary prediction tasks: actually F and predicted F, number = a; actually F and predicted T, number = b (false pos); actually T and predicted F, number = c (false neg); actually T and predicted T, number = d. Let t = a+b+c+d. Predictive accuracy = (a+d)/t. Majority class = max((a+b)/t, (c+d)/t). Precision = Selectivity = d/(b+d). Recall = Sensitivity = d/(c+d).
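The measures on this slide can be computed directly from the four counts. The sketch below uses the slide's own symbols (a, b, c, d, t); the example counts are made up.

```python
def confusion_metrics(a, b, c, d):
    """a: actually F, predicted F;  b: actually F, predicted T (false positive);
    c: actually T, predicted F (false negative);  d: actually T, predicted T."""
    t = a + b + c + d
    return {
        "accuracy": (a + d) / t,
        "majority_class": max((a + b) / t, (c + d) / t),
        "precision": d / (b + d),
        "recall": d / (c + d),
    }


# Made-up counts for illustration.
metrics = confusion_metrics(a=50, b=10, c=5, d=35)
print(metrics)
```

With these counts the accuracy (0.85) beats the majority class baseline (0.6), while precision and recall show how the errors divide between false positives and false negatives.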
  • 23. Comparing Learning Methods: A very simple method is the majority class predictor: predict everything to be in the majority class. Trained predictors must beat this to be credible. N-fold cross validation results are compared to show an advance in predictor technology. However, accuracy is not the only consideration: speed, memory and comprehensibility also matter.
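The majority class baseline is simple enough to sketch in a few lines. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def majority_class_predictor(training_labels):
    """Return a predictor that ignores its input and always answers the majority class."""
    majority = Counter(training_labels).most_common(1)[0][0]
    return lambda example: majority


def majority_class_accuracy(labels):
    """Accuracy the majority class predictor achieves on these labels."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


labels = ["non-toxic", "toxic", "non-toxic", "non-toxic"]
baseline = majority_class_predictor(labels)
print(baseline("any example"), majority_class_accuracy(labels))
```

A trained predictor whose cross validation accuracy is no better than `majority_class_accuracy` has learned nothing useful about the features.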
  • 24. Overfitting: It’s easy to over-train predictors. If a predictor is substantially better on the training set than on the test set, it is overfitting: essentially, it has memorised aspects of the examples, rather than generalising properties of them. This is bad: think of a completely new example. Individual learning schemes have coping methods. An easy general approach to avoiding overfitting: maintain a validation set to perform tests on during training; when performance on the validation set degrades, stop learning. Be careful of blips in predictive accuracy (leave it a while, then come back). Note: the validation set shouldn’t also be used as the test set.
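The validation-set approach can be sketched as a generic loop. The two callables, `train_one_epoch` and `validation_accuracy`, are hypothetical hooks into whatever learning method is being used, and the `patience` parameter is one assumed way of tolerating short blips before stopping.

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              max_epochs=1000, patience=10):
    """Stop once validation accuracy has not improved for `patience` epochs in a row."""
    best_accuracy = float("-inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        current = validation_accuracy()
        if current > best_accuracy:
            best_accuracy, best_epoch = current, epoch
        elif epoch - best_epoch >= patience:
            break  # performance has degraded for too long, not just a blip
    return best_epoch, best_accuracy


# Scripted validation curve standing in for a real learner; note the blip at epoch 4.
curve = [0.5, 0.6, 0.7, 0.65, 0.72, 0.7, 0.69, 0.68, 0.6]
state = {"epoch": 0}


def fake_epoch():
    state["epoch"] += 1


def fake_validation():
    return curve[state["epoch"] - 1]


result = train_with_early_stopping(fake_epoch, fake_validation,
                                   max_epochs=len(curve), patience=3)
print(result)
```

In practice the weights from `best_epoch` would be kept, and the final accuracy would still be reported on a separate test set.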