
Machine Learning for Biomedical Applications
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology

Web Site:
http://webs.iiitd.edu.in/raghava/
Welcome to BIO542(MLBA)
• Course : Machine learning for biomedical Applications
• Instructor: Prof. G. P. S. Raghava ([email protected], [email protected])
• TAs
• Dilraj Kaur ([email protected]), Sanjay K. Mohanty ([email protected]), Pradeep Singh
([email protected]), Shreya Mishra ([email protected]), Shalini Sharma
([email protected])
• Important URLs & email
• Mailing list : [email protected]
• Google Classroom: joining code goqqfsn
• https://classroom.google.com/c/Mzc5MzczNDc3NTE1?cjc=goqqfsn
• Website: http://webs.iiitd.edu.in/raghava/
• Please go through the academic dishonesty policy very carefully
• https://www.iiitd.ac.in/education/resources/academicdishonesty
• Visiting hours: Students may visit between 4:30 and 5:30 PM (A-302, New Academic Building) for any questions/doubts/discussion. Check availability in advance.
Course Description
This course is designed for students from a wide range of backgrounds, including biology, medical science, pharmacology, bioinformatics and computer science. The course is divided into the following three sections: i) major challenges in the field of biomedical science, ii) introduction to and implementation of machine learning techniques for developing prediction models, and iii) solving biomedical problems using machine learning techniques. The course will help students develop novel methods for solving real-life problems in the field of biological and health sciences. An attempt will be made to bridge the gap between students and world-class researchers; students will be exposed to highly accurate methods based on machine learning techniques (research papers).
Post conditions
(Expectations from students after course)
• Knowledge of biomedical applications
• Classification of biomolecules
• Prediction of inhibitors/drugs
• Models for predicting disease associated genes
• Image-based disease classification
• Ability to develop models using machine learning techniques
• Major ML techniques: SVM, ANN, Random Forest, KNN & HMM
• Feature engineering
• Generating features for biomolecules and biomedical images.
• Feature selection techniques
• Dimension reduction techniques (e.g., PCA)
• Evaluation of prediction/classification models
• Parameters for measuring performance of models
• Cross-validation techniques for training/testing
• Internal/ external validation of models
Performance Evaluation
Group Activities
• Assignments: 20% (group activity)
Individual Activities
• Mid-sem Exam: 30% (individual)
• End-sem Exam: 30% (individual)
• Quiz: 20% (individual)
Performance Evaluation
(Group activity)
Assignments: There will be two assignments of 10 marks each. Assignments will be submitted by groups of at most three students and will be based on Kaggle in-class competitions.
Performance Evaluation
(Individual Activity)
• Quiz: A total of three quizzes will be conducted in class. The best two will be used for evaluation, with a weightage of 20%.
• Mid-sem Exam: An online exam will be conducted, with a weightage of 30%.
• End-sem Exam: An online exam will be conducted, with a weightage of 30%.
Week-wise plan

• Week1: Introduction to course and case study


• Week2: Python and machine learning libraries
• Week3: Concept of classification and regression
• Week4: Artificial neural network
• Week5: Protein therapeutics and feature generation
• Week6: Evaluation of models
• Week7: Computer-aided drug design
Week-wise plan

• Week8: Feature selection or reduction


• Week9: KNN and SVM
• Week10: Ensemble classifiers and random forest
• Week11: Genetic biomarkers for disease diagnostics
• Week12: Deep Learning using PyTorch
• Week13: Image classification
Causes of Diseases and Possible Solutions
(Other contributing factors: accidents, environment, food, age of organism)

Causes of Diseases
• Disease-associated pathogens (virus, bacteria, fungus etc.)
• Disorder or malfunction (e.g., cancer)
• Malnutrition (healthy food)
• Side-effects of drugs
• Mental health & stress

Possible Solutions
• Understanding biology at the genome level
• Drugs, particularly against drug-resistant diseases
• Subunit or epitope-based vaccines
• Disease biomarkers for early detection
• Drug biomarkers
Biomedical Applications
Concept Level
★ Proteome annotation ★ Drug discovery ★ Vaccine design ★ Biomarkers

Molecules or Objects
• Proteins & Peptides: structure prediction, subcellular localization, therapeutic application, ligand binding
• Gene Expression: disease biomarkers, drug biomarkers, mRNA expression, copy number variation
• Chemoinformatics: drug design, chemical descriptors, QSAR models, personalized inhibitors
• Image Annotation: image classification, medical images, disease classification, disease diagnostics
Five Kingdoms of Living Organisms
Cell: the minimum unit of life

Examples of cells:
• Neuron: nerve cell from the mammalian brain
• Paramecium: single-celled protozoan
• Chlamydomonas: green alga
• Saccharomyces cerevisiae: yeast cell
• Helicobacter pylori: bacterium that causes stomach ulcers
Human Cell
Major molecules: Proteins, DNA, RNA
Proteins
• Most cellular activities are performed by proteins
• A protein is a polymer of 20 natural amino acids
• FASTA is a commonly used format to represent a protein sequence
• A FASTA file contains amino acids in single-letter code
• The sequence of a protein is also called its primary structure
Protein Synthesis (Gene Expression) Notes
Proteins (Review)
• Proteins make up all living materials
• Proteins are made of 20 types of natural amino acids
Protein Sequences in FASTA format
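An illustrative FASTA record is shown below; the header (the line starting with '>') and the sequence are made up purely for illustration, not taken from any real database entry:

>example_protein_1 hypothetical example protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ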
Major molecules: Proteins, DNA, RNA
DNA (Deoxyribonucleic acid)
• It is a polymer of 4 nucleotides (A, T, G, C)
• It is a double chain (two complementary strands)
• One strand is 5’ -> 3 ‘ and another 3’ -> 5’
Major molecules: Proteins, DNA, RNA
RNA
• RNA molecules are similar to DNA
• Uracil (U) instead of thymine (T)
• Normally RNA is single-stranded
• DNA has a single function, but RNAs have many different functions
• mRNA, tRNA, rRNA, miRNA, siRNA, etc.
The genome is our Genetic Blueprint

• Nearly every human cell contains 23 pairs of chromosomes
  • 1-22 and XY or XX
  • XY = Male
  • XX = Female
• The length of chr 1-22, X and Y together is ~3.2 billion bases (about 2 meters of DNA per diploid cell)
The Genome is Who We Are on the Inside!
Information is coded in DNA.
• Chromosomes consist of DNA
  • molecular strings of A, C, G & T
  • base pairs: A-T, C-G
• Genes
  • DNA sequences that encode proteins
  • less than 3% of the human genome
• Transcription
  • DNA -> RNA
• Translation
  • RNA -> Protein
Central dogma of molecular biology
• mRNA carrying the DNA code then passes through the pores of the nucleus and attaches to a ribosome.
Transcription, Translation and Protein synthesis

Transcription

• Process of copying DNA to RNA


• Does not need a primer to start
• Can involve multiple RNA polymerases
• Divided into 3 stages
• Initiation
• Elongation
• Termination
Genes in Genomes
Genes to Protein
Translation (six open reading frame)
Genetic Codes
• A series of three adjacent bases in an mRNA molecule that codes for a specific amino acid is called a codon.
• Each tRNA has 3 nucleotides (the anticodon) that are complementary to the codon in mRNA.
• Each tRNA codes for a different amino acid.
• mRNA carrying the DNA instructions and tRNA carrying
amino acids meet in the ribosomes.
• Amino acids are joined together to make a protein.

Polypeptide = Protein
Summary: Gene Expression
mRNA levels indirectly measure gene activity.
• The activity (expression) of a gene can be determined by the presence of its complementary mRNA.
• Every cell contains the same DNA.
• Genes code for proteins through the intermediary of mRNA.
• Cells differ in which DNA (genes) is active at any one time.
Gene Expression (microarray)
[Figure: labelled sample hybridized to probes on a chip, visualized as a pseudo-colour image]
DNA sequencing
• Sanger sequencing techniques
• Maxam–Gilbert sequencing (1977-80)
• Pyrosequencing (1993)
• Next generation sequencing techniques
Genome Gallery
Genome size of important species
Species | Chromosomes | Genome size (bp)
Bacteriophage λ (virus) | 1 | 5×10^4
Escherichia coli | 1 | 5×10^6
S. cerevisiae (yeast) | 32 | 1×10^7
Caenorhabditis elegans (worm) | 12 | 5×10^8
D. melanogaster (fruit fly) | 8 | 2×10^8
Homo sapiens (human) | 46 | 3×10^9
World of OMICs
[Figure: from organ/tissue and cell to nucleus, chromosomes (23 pairs) and chromatin]
• Genomics: DNA (4 chemicals: A, T, G, C), ~3×10^9 bases
• Epigenomics: DNA and chromatin modifications (methylation, acetylation)
• Transcriptomics: mRNA (copies), miRNA, non-coding RNA
• Proteomics: proteins (20 chemicals: A, C, D, ...)
• Glycomics: sugars (including sugars attached to proteins)
• Lipidomics: lipids
• Metabolomics: metabolites
Briefings in Bioinformatics, 22(2), 2021, 936–945
doi: 10.1093/bib/bbaa259
Computer-aided prediction and design of IL-6 inducing peptides: IL-6 plays a crucial
role in COVID-19
Anjali Dhall†, Sumeet Patiyal†, Neelam Sharma, Salman Sadullah Usmani and Gajendra P. S. Raghava

Abstract
Interleukin 6 (IL-6) is a pro-inflammatory cytokine that stimulates acute phase responses, hematopoiesis
and specific immune reactions. Recently, it was found that IL-6 plays a vital role in the progression of
COVID-19, which is responsible for the high mortality rate. In order to facilitate the scientific community
to fight against COVID-19, we have developed a method for predicting IL-6 inducing peptides/epitopes.
The models were trained and tested on experimentally validated 365 IL-6 inducing and 2991 non-inducing
peptides extracted from the immune epitope database. Initially, 9149 features of each peptide were
computed using Pfeature, which were reduced to 186 features using the SVC-L1 technique. These features
were ranked based on their classification ability, and the top 10 features were used for developing
prediction models. A wide range of machine learning techniques has been deployed to develop models.
Random Forest-based model achieves a maximum AUROC of 0.84 and 0.83 on training and independent
validation dataset, respectively. We have also identified IL-6 inducing peptides in different proteins of
SARS-CoV-2, using our best models to design vaccine against COVID-19. A web server named as IL-6Pred
and a standalone package has been developed for predicting, designing and screening of IL-6 inducing
peptides (https://webs.iiitd.edu.in/raghava/il6pred/).
Material and methods
•Dataset preparation and pre-processing
• We extracted 583 experimentally validated IL-6 inducing peptides from the IEDB database.
• We removed all identical peptides and peptides having a length greater than 25 amino acids.
• Finally, we obtained 365 IL-6 inducing peptides and 2991 non-IL-6 inducing peptides.
Name of descriptor | Description of descriptor | Number of features (vector length)
AAC | Amino acid composition | 20
DPC | Dipeptide composition | 400
TPC | Tripeptide composition | 8000
ABC | Atomic and bond composition | 9
RRI | Residue repeat information | 20
DDOR | Distance distribution of residues | 20
SE | Shannon entropy of protein | 1
SER | Shannon entropy of all amino acids | 20
SEP | Shannon entropy of physicochemical properties | 25
CTD | Conjoint triad calculation of the descriptors | 343
CeTD | Composition-enhanced transition distribution | 187
PAAC | Pseudo amino acid composition | 23
APAAC | Amphiphilic pseudo amino acid composition | 29
QSO | Quasi-sequence order | 46
SOCN | Sequence order coupling number | 6
For feature selection, we used the SVC-L1-based technique, which implements the support vector classifier (SVC) with a linear kernel, penalized with L1 regularization. We used SVC-L1 because it performs several methods to select the best features from a large feature vector, and it is extremely fast compared with other techniques [44]. Its primary purpose is to minimize the objective function, which considers the loss function and regularization. The SVC-L1 method selects the non-zero coefficients and then applies the L1 penalty to select relevant features to reduce dimensions. L1 regularization creates sparse models during the optimization process by driving the coefficients of some features to zero, thereby selecting them out of the model. The 'C' parameter regulates the sparsity and is directly proportional to the number of selected features; the lower the value of 'C', the fewer features are selected. We used the default value of 0.01 for parameter 'C' [45]. Based on this technique, 186 important features (Supplementary Table S1) were identified from the 9149-feature set.
After that, these 186 features were ranked based on their importance in classifying peptides using the program feature-selector. The program feature-selector ranks features using a DT-based algorithm, Light Gradient Boosting Machine, which calculates the rank of a feature based on the number of times the feature is used to split the data across all trees [46]. These top-ranked features were examined to understand the nature of IL-6 inducing peptides. Furthermore, we applied machine learning on the selected features and computed the performance on the top 10, 20, 30, ..., and 186 features, respectively.

Cross-validation
We used 5-fold cross-validation and external validation to train, test and evaluate our prediction models. In the past, several studies used an 80:20 proportion for splitting the complete dataset into training and validation datasets [47-50]. We also used this standard protocol in this study, where 80% of the data (i.e. 292 IL-6 inducing and 2393 non-IL-6 inducing peptides) was used for training and the remaining 20% (i.e. 73 IL-6 inducing and 598 non-IL-6 inducing peptides) was used for external validation. Then, we implemented the standard 5-fold cross-validation technique, which has been frequently used in previous studies [51, 52]. Firstly, the entire training dataset is divided into five equivalent sets or folds, with all five folds having the same number of positive and negative examples. Then, four folds were used for training, while the fifth fold was utilized for testing. This procedure was iterated five times so that each set was used for testing.

Evaluation parameters
In order to evaluate the efficiency of different prediction models, we used well-established evaluation parameters. In this study, we used both threshold-dependent and threshold-independent parameters; we measured threshold-dependent parameters such as sensitivity (Sens), specificity (Spec) and accuracy (Acc) with the help of the following equations. We also used the standard threshold-independent parameter Area Under the Receiver Operating Characteristic (AUROC) curve to measure the performance of the models. The AUROC curve is generated by plotting sensitivity against (1-specificity) at various thresholds. These parameters were calculated using the following equations:

Sensitivity = TP / (TP + FN) × 100     (1)
Specificity = TN / (TN + FP) × 100     (2)
Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100     (3)

where TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative.

Architecture of web server
A web server named 'IL6Pred' (https://webs.iiitd.edu.in/raghava/il6pred) was developed to predict IL-6 inducing and non-inducing peptides. The front end of the web server was developed using HTML5, JAVA, CSS3 and PHP scripts. It is based on responsive templates which adjust the screen based on the size of the device. It is compatible with almost all modern devices such as mobile, tablet, iMac and desktop. The web server incorporates five major modules: Predict, Design, Protein Scan, Motif Scan and Blast Scan.

Results
In this study, we used 365 peptides as a positive dataset, which can induce the IL-6 cytokine. The negative dataset includes 2991 peptides, which do not induce the IL-6 cytokine. All the analyses and predictions were performed on the IL-6 inducing and non-inducing epitopes or peptides.

Positional analysis
In this analysis, we study the preference of a particular amino acid at a specific position in the peptide string; we create a TSL for the IL-6 inducing (positive) and non-inducing (negative) peptides as represented in Figure 2. The most significant amino acid residue represents the relative abundance in the sequence. It is important to note that the first eight positions represent the N-terminal residues of peptides, and the last eight positions represent the C-terminus of peptides. We observed that the amino acid residue 'L' is mostly preferred at the 2nd, 4th, 5th, 6th, 7th, 10th, 11th, 12th, 13th, 14th, 15th and 16th positions in the IL-6 inducing peptides. It means that 'L' is preferred in N-terminus as well as C-terminus residues. Besides, residue 'I' is found to be most abundant at positions 1, 4 and 7 in IL-6 inducing peptides; it means that 'I' is preferred in N-terminus residues. On the other hand, amino acid residue 'A' dominates at the 4th, 8th and 16th positions in non-IL-6 inducing peptides.

Compositional analysis
In this analysis, we computed the amino acid composition (AAC) for both positive and negative datasets. The average composition of IL-6 inducing and non-inducing peptides is shown in Figure 3. The average composition of residues such as I, L and S is higher in IL-6 inducing peptides than in non-IL-6 peptides. Besides, residues such as A, D and G are more abundant in non-IL-6 peptides as compared with IL-6 inducing peptides.
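To make the workflow above concrete, here is a minimal scikit-learn sketch (not the authors' original code): L1-penalized linear SVC feature selection with C=0.01 as stated in the text, followed by a classifier and the evaluation parameters of Eqs. (1)-(3). The synthetic X/y data, the variable names and the choice of keeping the top 10 features are illustrative assumptions only.

# Sketch: SVC-L1 feature selection followed by model evaluation (illustrative, not the published pipeline)
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic stand-in for the peptide feature matrix and labels (1 = IL-6 inducer, 0 = non-inducer)
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# L1-penalized linear SVC (C=0.01 as in the text); keep the 10 highest-weight features
svc_l1 = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=10000).fit(X_train, y_train)
selector = SelectFromModel(svc_l1, prefit=True, threshold=-np.inf, max_features=10)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

clf = RandomForestClassifier(random_state=1).fit(X_train_sel, y_train)
y_pred = clf.predict(X_test_sel)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn) * 100                    # Eq. (1)
specificity = tn / (tn + fp) * 100                    # Eq. (2)
accuracy = (tp + tn) / (tp + fp + tn + fn) * 100      # Eq. (3)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test_sel)[:, 1])
print(sensitivity, specificity, accuracy, auroc)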

Prediction models

Machine learning-based prediction models
We developed prediction models using various classifiers such as RF, DT, GNB, XGB and LR. Firstly, we computed the features of the IL-6 inducers and non-inducers from the Pfeature composition-based module. A total of 9149 features were generated by Pfeature, and then we implemented the SVC-L1 feature selection technique to select the most relevant features, i.e. 186 features, as shown in Supplementary Table S1. With this feature set, we applied various machine learning models. RF attains maximum performance with AUROC 0.893 and 0.863 and accuracy 75.79 and 73.32 on the training and validation datasets, respectively, with balanced sensitivity and specificity. XGB also performed well on the training and validation datasets with AUROC 0.87 and 0.82 and accuracy 86.29 and 84.65, respectively, but there is a considerable difference in sensitivity and specificity. Other classifiers, such as DT, LR, KNN and GNB, perform poorly on the training and validation datasets, as represented in Table 2.

Performance of top-ranked features
All 186 features were ranked based on their importance according to their normalized and cumulative score, with the help of the feature-selector tool. Furthermore, we evaluated the performance of the different feature sets. We identified the feature set with the minimum number of features which discriminates between IL-6 inducers and non-inducers with high AUROC and accuracy. Therefore, we built different models on the top (10, 20, 30, ... and 186) features, respectively, and evaluated performance on the training and validation datasets. In order to understand the difference between the positive and negative datasets, we computed the average values of the top-10 features of IL-6 inducing and non-inducing peptides, as represented in Table 3. The top-10 selected features have reasonable discriminatory power in terms of AUROC and accuracy. RF achieves maximum performance with accuracy (77.39 and 73.47) and AUROC (0.84 and 0.83) on the training and validation datasets, with balanced sensitivity and specificity, respectively, as represented in Table 4 and Figure 4. The performance of the 10, 20, 30, ... and 186 selected feature sets is provided in Supplementary Table S2.

Services to the scientific community
In order to serve the scientific community, we developed a user-friendly prediction web server that integrates different modules to predict IL-6 inducing peptides. The prediction models used in the study are implemented in the web server. Users can predict whether a given query peptide is IL-6 inducing or non-inducing based on the prediction model score at different thresholds. The web server has five important modules: (i) Predict; (ii) Design; (iii) Protein Scan; (iv) Motif Scan; and (v) Blast Scan.
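A hedged sketch of how such a multi-classifier comparison with stratified 5-fold cross-validation might look in scikit-learn: the dataset, hyperparameters and model list are illustrative assumptions (XGBoost is omitted since it is a separate package), and AUROC is the reported metric as in the study.

# Illustrative comparison of several classifiers with 5-fold cross-validation (not the authors' pipeline)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "RF": RandomForestClassifier(random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=5000),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUROC = {scores.mean():.3f}")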

Case study: IL-6 inducing peptides in spike proteins of SARS-CoV-2


The spike protein of the novel coronavirus massively induces the release of the proinflammatory cytokine IL-6. We downloaded the SARS-CoV-2 proteins of five different countries, namely India (MT539168), China (NC_004718), USA (MT536976), Germany (MT539726) and Italy (MT528239), from NCBI (https://www.ncbi.nlm.nih.gov/sars-cov-2/). We identified 222 IL-6 inducing peptides out of 1259 peptides from the spike proteins of all the countries.
Python Quick Review

Prof. Gajendra P.S. Raghava


Head, Department of Computational Biology

Reference book:Python (Notes for Professionals). https://GoalKicker.com/PythonBook

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide
Development Environments
Major IDE

1. PyDev with Eclipse


2. Spyder
3. Komodo
4. Emacs
5. Vim
6. TextMate
7. Gedit
8. Idle
9. Jupyter notebook
10. NotePad++ (Windows)
11. BlueFish (Linux)
Getting Started with Python
• Check python version: python --version
• https://www.python.org/downloads/
• Start the python interpreter: python
  >>> print("Hello, World")
  Hello, World
• Running from a file
  • Create a file "abc.py" using any editor
  • Enter the line print("Hello, World") and save the file
  • Run the program from the terminal: python abc.py
    Hello, World
• Run a command from the terminal:
  python -c 'print("Hello, World")'

# Make an executable program: insert the following as the first line
#!/usr/bin/python

# Make the file executable:
> chmod +x ./abc.py

# Type the script name to execute
% ./abc.py
Built-in Modules and Functions
A module is a file containing Python definitions and statements. A function is a piece of code which executes some logic.

>>> dir()                  # check names in the current scope
>>> dir(__builtins__)
> pip install scikit-learn
>>> from sklearn import datasets
>>> dir(datasets)
>>> import anticp2mod
>>> import math
>>> dir(math)
>>> math.log(16)

Data Types
Example
a = 2                               # Integer
pi = 3.14                           # Float
c = 'A'                             # String
print(type(a))
a, b, c = 1, 2, 3
x = y = [7, 8, 9]                   # list (multiple assignment)
x = [123, 'abcd', 10.2, 'd']        # list
dic = {'name': 'red', 'age': 10}    # dictionary
Block Indentation (4 spaces or a tab)

Conditions
if number > 2:
    print("Number is bigger than 2.")
elif number < 2:
    print("Number is smaller than 2.")
else:
    print("Number is 2.")

Loops
while i < 7:
    print(i)
    if i == 4:
        print("Breaking from loop")
        break
    i += 1

for i in (0, 1, 2, 3, 4, 5):
    if i == 2 or i == 4:
        continue
    print(i)

Input & Output
print(a, b, c)

# Input from a file
fileobj = open('shoppinglist.txt', 'r')
content = fileobj.read()
print('file content', content)

# Read from stdin
foo = input("Enter value of foo")
print(foo)

with open('myfile.txt', 'w') as f:
    f.write("Line 1")
    f.write("Line 2")
    f.write("Line 3")
    f.write("Line 4")
Important points
• Indentation matters to code meaning
• Block structure indicated by indentation
• First assignment to a variable creates it
• Variable types don’t need to be declared.
• Python figures out the variable types on its own.
• Assignment is = and comparison is ==
• For numbers + - * / % are as expected
• Special use of + for string concatenation and % for string
formatting (as in C’s printf)
• Logical operators are words (and, or, not)
not symbols
• The basic printing command is print
Important points
Whitespace is meaningful in Python: especially
indentation and placement of newlines
•Use a newline to end a line of code
Use \ when must go to next line prematurely
•No braces {} to mark blocks of code, use consistent
indentation instead
• First line with less indentation is outside of the block
• First line with more indentation starts a nested block
• Colons start a new block in many constructs, e.g. function definitions, then clauses
Naming Rules
• Names are case sensitive and cannot start with a number. They can
contain letters, numbers, and underscores.
bob Bob _bob _2_bob_ bob_2 BoB
• There are some reserved words:
and, assert, break, class, continue, def, del, elif,
else, except, exec, finally, for, from, global, if,
import, in, is, lambda, not, or, pass, print, raise,
return, try, while
Comments
• Start comments with #, rest of line is ignored
• Can include a “documentation string” as the first line of a
new function or class you define
• Development environments, debugger, and other tools use
it: it’s good style to include one
def fact(n):
    """fact(n) assumes n is a positive integer
    and returns the factorial of n."""
    assert(n > 0)
    return 1 if n == 1 else n * fact(n - 1)
A Code Sample
x = 34 - 23            # A comment.
y = "Hello"            # Another one.
z = 3.45
if z == 3.45 or y == "Hello":
    x = x + 1
    y = y + " World"   # String concat.
print(x)
print(y)
Loops
# while loop
a = 10
while a > 0:
    print(a)
    a -= 1

# for loop
for a in range(10):
    print(a)

# Common while loop idiom:
f = open(filename, "r")
while True:
    line = f.readline()
    if not line:
        break

# Common for loop idiom:
a = [3, 1, 4, 1, 5, 9]
for i in range(len(a)):
    print(a[i])
Input from Terminal and File

f = open("foo", "r")      # file() is Python 2 only; use open() in Python 3
line = f.readline()
print(line, end="")
f.close()
Files: Input
input = open('data', 'r')      # Open the file for input
S = input.read()               # Read the whole file into one string
S = input.read(N)              # Read N bytes (N >= 1)
L = input.readlines()          # Returns a list of line strings
Files: Output
output = open('data', 'w')     # Open the file for writing
output.write(S)                # Writes the string S to the file
output.writelines(L)           # Writes each of the strings in list L to the file
output.close()                 # Manual close
Sequence Types
1. Tuple: (‘john’, 32, [CMSC])
• A simple immutable ordered sequence of
items
• Items can be of mixed types, including
collection types
2. Strings: “John Smith”
• Immutable
• Conceptually very much like a tuple
3. List: [1, 2, ‘john’, (‘up’, ‘down’)]
• Mutable ordered sequence of items of mixed
types
Sequence Types 1
• Define tuples using parentheses and commas
>>> tu = (23, ‘abc’, 4.56, (2,3),
‘def’)
• Define lists are using square brackets and commas
>>> li = [“abc”, 34, 4.34, 23]
• Define strings using quotes (“, ‘, or “““).
>>> st = “Hello World”
>>> st = ‘Hello World’
>>> st = “““This is a multi-line
string that uses triple quotes.”””
Sequence Types 2
• Access individual members of a tuple, list, or string
using square bracket “array” notation
• Note that all are 0 based…
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> tu[1] # Second item in the tuple.
‘abc’

>>> li = [“abc”, 34, 4.34, 23]


>>> li[1] # Second item in the list.
34

>>> st = “Hello World”


>>> st[1] # Second character in string.
‘e’
Tuples
• What is a tuple?
• A tuple is an ordered collection which
cannot
be modified once it has been created.
• In other words, it's a special array, a
read-only array.
• How to make a tuple? In round brackets
• E.g.,
>>> t = ()
>>> t = (1, 2, 3)
>>> t = (1, )
>>> t = 1,
>>> a = (1, 2, 3, 4, 5)
>>> print(a[1])    # 2
Compound Data Type: List
• List:
• A container that holds a number
of other objects, in a given
order
• Defined in square brackets
a = [1, 2, 3, 4, 5]
print(a[1])             # number 2
some_list = []
some_list.append("foo")
some_list.append(12)
print(len(some_list))   # 2
Positive and negative indices
>>> t = (23, ‘abc’, 4.56, (2,3),
‘def’)
Positive index: count from the left, starting with 0
>>> t[1]
‘abc’
Negative index: count from right, starting with –1
>>> t[-3]
4.56
Slicing: return copy of a subset
>>> t = (23, ‘abc’, 4.56, (2,3),
‘def’)

Return a copy of the container with a subset of the


original members. Start copying at the first index,
and stop copying before second.
>>> t[1:4]
(‘abc’, 4.56, (2,3))
Negative indices count from end
>>> t[1:-1]
(‘abc’, 4.56, (2,3))
Slicing: return copy of a subset
>>> t = (23, ‘abc’, 4.56, (2,3),
‘def’)
Omit first index to make copy starting from
beginning of the container
>>> t[:2]
(23, ‘abc’)
Omit second index to make copy starting at first
index and going to end
>>> t[2:]
(4.56, (2,3), ‘def’)
Copying the Whole Sequence
• [ : ] makes a copy of an entire sequence
>>> t[:]
(23, ‘abc’, 4.56, (2,3), ‘def’)
• Note the difference between these two lines for
mutable sequences
>>> l2 = l1 # Both refer to 1 ref,
# changing one affects
both
>>> l2 = l1[:] # Independent copies,
two refs
Operations in Tuple
• Indexing e.g., T[i]
• Slicing e.g., T[1:5]
• Concatenation e.g., T + T
• Repetition e.g., T * 5
• Membership test e.g., ‘a’ in T
• Length e.g., len(T)
The ‘in’ Operator
• Boolean test whether a value is inside a container:
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False
• For strings, tests for substrings
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
• Be careful: the in keyword is also used in the syntax of for loops and list
comprehensions
The + Operator

The + operator produces a new tuple, list, or string whose value is the
concatenation of its arguments.

>>> (1, 2, 3) + (4, 5, 6)


(1, 2, 3, 4, 5, 6)

>>> [1, 2, 3] + [4, 5, 6]


[1, 2, 3, 4, 5, 6]

>>> “Hello” + “ ” + “World”


‘Hello World’
The * Operator

• The * operator produces a new tuple, list, or string that “repeats” the
original content.

>>> (1, 2, 3) * 3
(1, 2, 3, 1, 2, 3, 1, 2, 3)

>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]

>>> “Hello” * 3
‘HelloHelloHello’
Methods in string
• upper() ▪ strip(), lstrip(), rstrip()
• lower() ▪ replace(a, b)
• capitalize() ▪ expandtabs()
• count(s) ▪ split()
• find(s) ▪ join()
• rfind(s) ▪ center(), ljust(), rjust()
• index(s)
Lists are mutable
>>> li = [‘abc’, 23, 4.34, 23]
>>> li[1] = 45
>>> li
[‘abc’, 45, 4.34, 23]

• We can change lists in place.


• Name li still points to the same memory
reference when we’re done.
Tuples are immutable
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> t[2] = 3.14

Traceback (most recent call last):


File "<pyshell#75>", line 1, in -toplevel-
tu[2] = 3.14
TypeError: object doesn't support item assignment

• You can’t change a tuple.


• You can make a fresh tuple and assign its reference to a
previously used name.
>>> t = (23, ‘abc’, 3.14, (2,3), ‘def’)
• The immutability of tuples means they’re faster than
lists.
Operations on Lists Only
>>> li = [1, 11, 3, 4, 5]

>>> li.append(‘a’) # Note the


method syntax
>>> li
[1, 11, 3, 4, 5, ‘a’]

>>> li.insert(2, ‘i’)


>>>li
[1, 11, ‘i’, 3, 4, 5, ‘a’]
The extend method vs +
• + creates a fresh list with a new memory ref
• extend operates on list li in place.
>>> li.extend([9, 8, 7])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7]

• Potentially confusing:
• extend takes a list as an argument.
• append takes a singleton as an argument.
>>> li.append([10, 11, 12])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7, [10, 11, 12]]
Operations on Lists Only
Lists have many methods, including index, count, remove,
reverse, sort
>>> li = [‘a’, ‘b’, ‘c’, ‘b’]
>>> li.index(‘b’) # index of 1st
occurrence
1
>>> li.count(‘b’) # number of
occurrences
2
>>> li.remove(‘b’) # remove 1st
occurrence
>>> li
[‘a’, ‘c’, ‘b’]
Operations on Lists Only
>>> li = [5, 2, 6, 8]

>>> li.reverse() # reverse the list *in place*


>>> li
[8, 6, 2, 5]

>>> li.sort() # sort the list *in place*


>>> li
[2, 5, 6, 8]

>>> li.sort(key=some_function)
# sort in place using a user-defined key function
Operations in List
▪ append • Indexing e.g., L[i]
▪ insert • Slicing e.g., L[1:5]
▪ index • Concatenation e.g., L + L
▪ count • Repetition e.g., L * 5
▪ sort • Membership test e.g., ‘a’ in L
▪ reverse • Length e.g., len(L)
▪ remove
▪ pop
▪ extend
List vs. Tuple
• What are common characteristics?
• Both store arbitrary data objects
• Both are of sequence data type
• What are differences?
• Tuple doesn’t allow modification
• Tuple doesn’t have methods
• Tuple supports format strings
• Tuple supports variable length parameter in function call.
• Tuples slightly faster
Summary: Tuples vs. Lists
• Lists are slower but more powerful than tuples
• Lists can be modified, and they have lots of handy operations and methods
• Tuples are immutable and have fewer features
• To convert between tuples and lists use the list() and tuple()
functions:
li = list(tu)
tu = tuple(li)
Python Libraries

Prof. Gajendra P.S. Raghava


Head, Department of Computational Biology

Reference book:Python (Notes for Professionals). https://GoalKicker.com/PythonBook

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
● NumPy
● SciPy
● Pandas
#Import Python Libraries
● SciKit-Learn
import numpy as np
import scipy as sp
Visualization libraries import pandas as pd
● matplotlib import matplotlib as mpl
● Seaborn import seaborn as sns

NumPy
● NumPy is the fundamental package needed for scientific
computing with Python. It contains:
● a powerful N-dimensional array object
● basic linear algebra functions
● basic Fourier transforms
● sophisticated random number capabilities
● tools for integrating Fortran code
● tools for integrating C/C++ code
● Official documentation
● http://docs.scipy.org/doc/
● The NumPy book
● http://web.mit.edu/dvp/Public/numpybook.pdf
● Example list
● https://docs.scipy.org/doc/numpy/reference/routines.html
More about NumPy
● Python does numerical computations slowly
  ● 1000 x 1000 matrix multiply: a Python triple loop takes > 10 min, NumPy takes ~0.03 seconds
● Applications of NumPy
  ● Mathematics (alternative to MATLAB)
  ● Plotting (Matplotlib)
  ● Backend (Pandas)
  ● Machine learning (TensorFlow)
● Comparison with lists
  ● Faster than lists
  ● Elementwise operations (a*b)
  ● Less memory
  ● Convenient to use
● Structured lists of numbers: vectors, matrices, images, tensors, ConvNets
Arrays – Numerical Python (Numpy)
● Lists are OK for storing small amounts of one-dimensional data
>>> a = [1,3,5,7,9]
>>> print(a[2:4])
[5, 7]
>>> b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
>>> print(b[0])
[1, 3, 5, 7, 9]
>>> print(b[1][2:4])
[6, 8]

>>> a = [1,3,5,7,9]
>>> b = [3,5,6,7,9]
>>> c = a + b
>>> print(c)
[1, 3, 5, 7, 9, 3, 5, 6, 7, 9]

• But lists can't be used directly with arithmetical operators (+, -, *, /, ...)
• We need efficient arrays with arithmetic and better multidimensional tools
• Numpy: >>> import numpy
• Similar to lists, but much more capable, except fixed size
Numpy – Creating matrices
>>> l = [[1, 2, 3], [3, 6, 9], [2, 4, 6]] # create a list
>>> a = numpy.array(l) # convert a list to an array
>>>print(a)
[[1 2 3]
[3 6 9]
[2 4 6]]
>>> a.shape
(3, 3)
>>> print(a.dtype) # get type of an array
int64

# or directly as matrix
>>> M = array([[1, 2], [3, 4]])
>>> M.shape
(2,2)
>>> M.dtype
dtype('int64')
NumPy functions
min(), max(), abs(), add(), multiply(), binomial(), polyfit(), cumprod(), cumsum(), randint(), floor(), shuffle(), histogram(), transpose()

# as vectors from lists
>>> a = numpy.array([1,3,5,7,9])
>>> b = numpy.array([3,5,6,7,9])
>>> c = a + b
>>> print(c)
[ 4  8 11 14 18]
>>> c.shape
(5,)
Numpy – ndarray attributes
⮚ndarray.ndim: the number of axes (dimensions) of the array i.e. the rank.
⮚ndarray.shape: the dimensions of the array. This is a tuple of integers indicating the size of the
array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length
of the shape tuple is therefore the rank, or number of dimensions, ndim.
⮚ndarray.size: the total number of elements of the array, equal to the product of the elements of
shape.
⮚ndarray.dtype: an object describing the type of the elements in the array. One can create or
specify dtype's using standard Python types. NumPy provides many, for example bool_, character,
int_, int8, int16, int32, int64, float_, float8, float16, float32, float64, complex_, complex64,
object_.
⮚ndarray.itemsize: the size in bytes of each element of the array. E.g. for elements of type float64,
itemsize is 8 (=64/8), while complex32 has itemsize 4 (=32/8) (equivalent to
ndarray.dtype.itemsize).
⮚ndarray.data: the buffer containing the actual elements of the array. Normally, we won't need to
use this attribute because we will access the elements in an array using indexing facilities.
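A small sketch illustrating these attributes on a tiny array (the values in the comments are what the calls print for this example):

import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.ndim)       # 2        (number of axes)
print(a.shape)      # (2, 3)
print(a.size)       # 6        (total number of elements)
print(a.dtype)      # float64
print(a.itemsize)   # 8        (bytes per element for float64)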
Numpy – array methods – sorting
>>> arr = numpy.array([4.5, 2.3, 6.7, 1.2, 1.8, 5.5])
>>> arr.sort()                    # acts on the array itself
>>> print(arr)
[ 1.2 1.8 2.3 4.5 5.5 6.7]
>>> x = numpy.array([4.5, 2.3, 6.7, 1.2, 1.8, 5.5])
>>> numpy.sort(x)
array([ 1.2, 1.8, 2.3, 4.5, 5.5, 6.7])
>>> print(x)
[ 4.5 2.3 6.7 1.2 1.8 5.5]
>>> s = x.argsort()
>>> s
array([3, 4, 1, 0, 5, 2])
>>> x[s]
array([ 1.2, 1.8, 2.3, 4.5, 5.5, 6.7])
>>> y = numpy.array([1.5, 2.3, 4.7, 6.2, 7.8, 8.5])   # a companion array reordered with the same indices
>>> y[s]
array([ 6.2, 7.8, 2.3, 1.5, 8.5, 4.7])
SciPy : Python Library for science/Engineering

● It depends on the NumPy library


● file input/output
● statistics
● optimization
● numerical integration
● linear algebra
● Fourier transforms
● signal processing
● image processing
● ODE solvers
SciPy – Linear Algebra
from numpy import *
from scipy import linalg
import matplotlib.pyplot as plt

c1,c2= 5.0,2.0
i = r_[1:11]
xi = 0.1*i
yi = c1*exp(-xi)+c2*xi
zi = yi + 0.05*max(yi)*random.randn(len(yi))

A = c_[exp(-xi)[:,newaxis],xi[:,newaxis]]
c,resid,rank,sigma = linalg.lstsq(A,zi)

xi2 = r_[0.1:1.0:100j]
yi2 = c[0]*exp(-xi2) + c[1]*xi2

plt.plot(xi,zi,'x',xi2,yi2)
plt.axis([0,1.1,3.0,5.5])
plt.xlabel('$x_i$')
plt.title('Data fitting with linalg.lstsq')
plt.show()
Example: Linear regression
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

n = 50                              # number of points
x = np.linspace(-5, 5, n)           # create x axis data
a, b = 0.8, -4
y = np.polyval([a, b], x)
yn = y + np.random.randn(n)         # add some noise
(ar, br) = np.polyfit(x, yn, 1)
yr = np.polyval([ar, br], x)
err = np.sqrt(sum((yr - yn)**2) / n)   # compute the mean square error
print('Linear regression using polyfit')
print('Input parameters: a=%.2f b=%.2f' % (a, b))
print('Regression: a=%.2f b=%.2f, ms error=%.3f' % (ar, br, err))
plt.title('Linear Regression Example')
plt.plot(x, y, 'g--')
plt.plot(x, yn, 'k.')
plt.plot(x, yr, 'r-')
plt.legend(['original', 'plus noise', 'regression'])
plt.show()
(a_s, b_s, r, p, stderr) = stats.linregress(x, yn)
print('Linear regression using stats.linregress')
print('parameters: a=%.2f b=%.2f' % (a, b))
print('regression: a=%.2f b=%.2f, std error=%.3f' % (a_s, b_s, stderr))
Example: Least squares fit
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

fp = lambda v, x: v[0]/(x**v[1])*np.sin(v[2]*x)   # parametric function
v_real = [1.7, 0.0, 2.0]
fn = lambda x: fp(v_real, x)                      # fn to generate noisy data
e = lambda v, x, y: (fp(v, x) - y)                # error function
n, xmin, xmax = 30, 0.1, 5                        # generate noisy data to fit
x = np.linspace(xmin, xmax, n)
y = fn(x) + np.random.rand(len(x))*0.2*(fn(x).max() - fn(x).min())
v0 = [3., 1, 4.]                                  # initial parameter values
v, success = leastsq(e, v0, args=(x, y), maxfev=10000)   # perform fit
print('Fit parameters: ', v)
print('Original parameters: ', v_real)
X = np.linspace(xmin, xmax, n*5)                  # plot results
plt.plot(x, y, 'ro', X, fp(v, X))
plt.show()
Reading data using pandas
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

#Read csv file


df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None, na_values=['NA'])


pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5','df')
Data Frame data types
Pandas Type Native Python Type Description
object string The most general dtype. Will be
assigned to your column if column
has mixed types (numbers and
strings).
int64 int Numeric characters. 64 refers to
the memory allocated to hold this
character.
float64 float Numeric characters with decimals.
If a column contains numbers and
NaNs(see below), pandas will
default to float64, in case your
missing value has a decimal.
datetime64, timedelta[ns] N/A (but see the datetime module Values meant to hold time data.
in Python’s standard library) Look into these for time series
experiments.

Data Frames attributes
Python objects have attributes and methods.

df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions

size number of elements


shape return a tuple representing the dimensionality
values numpy representation of the data

Data Frames methods
Unlike attributes, python methods have parenthesis.
All attributes and methods can be listed with a dir() function: dir(df)

df.method() description
head( [n] ), tail( [n] ) first/last n rows

describe() generate descriptive statistics (for numeric columns only)

max(), min() return max/min values for all numeric columns

mean(), median() return mean/median values for all numeric columns

std() standard deviation

sample([n]) returns a random sample of the data frame

dropna() drop all the records with missing values


Data Frames: Slicing

When selecting one column, it is possible to use single set of brackets, but the
resulting object will be a Series (not a DataFrame):
In [ ]: #Select column salary:
df['salary']

When we need to select more than one column and/or make the output to be a
DataFrame, we should use double brackets:
In [ ]: #Select column salary:
df[['rank','salary']]

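A brief illustrative session tying the attributes, methods and slicing together. The small DataFrame below is made up for illustration, using the same column names as the Salaries example above:

import pandas as pd

# Small illustrative frame with the columns used on the slides above
df = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "AsstProf", "Prof"],
    "discipline": ["A", "B", "A", "B"],
    "service": [20, 8, 2, 25],
    "sex": ["Male", "Female", "Male", "Female"],
    "salary": [120000, 95000, 80000, 140000],
})

print(df.dtypes)               # column types (object, int64, ...)
print(df.shape)                # (4, 5)
print(df.head(2))              # first 2 rows
print(df.describe())           # summary statistics for numeric columns
print(df["salary"].mean())     # single brackets -> a Series
print(df[["rank", "salary"]])  # double brackets -> a DataFrame with two columns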
Python Libraries for Data Science
matplotlib:
▪ python 2D plotting library which produces publication quality figures in a
variety of hardcopy formats

▪ a set of functionalities similar to those of MATLAB

▪ line plots, scatter plots, barcharts, histograms, pie charts etc.

▪ relatively low-level; some effort needed to create advanced visualization

Link: https://matplotlib.org/

import matplotlib.pyplot as plt

xs = range(-100, 100, 10)
x2 = [x**2 for x in xs]
negx2 = [-x**2 for x in xs]

plt.plot(xs, x2)
plt.plot(xs, negx2)
plt.xlabel("x")
plt.ylabel("y")                # incrementally modify the figure
plt.ylim(-2000, 2000)
plt.axhline(0)                 # horiz line
plt.axvline(0)                 # vert line
plt.savefig("quad.png")        # save your figure to a file
plt.show()                     # show it on the screen
Python Libraries for Data Science
Seaborn:
▪ based on matplotlib

▪ provides high level interface for drawing attractive statistical graphics

▪ Similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

Why use modules?
● Code reuse
● Routines can be called multiple times within a program
● Routines can be used from multiple programs
● Namespace partitioning
● Group data together with functions used for that data
● Implementing shared services or data
● Can provide global data structure that is accessed by
multiple subprograms
Simple functions: ex.py
"""factorial done recursively and iteratively"""

def fact1(n):                  # iterative
    ans = 1
    for i in range(2, n + 1):
        ans = ans * i
    return ans

def fact2(n):                  # recursive
    if n <= 1:
        return 1
    else:
        return n * fact2(n - 1)
Simple functions: ex.py
>>> import ex
>>> ex.fact1(6)
720
>>> ex.fact2(200)
788657867364790503552363213932185…000000
>>> ex.fact1
<function fact1 at 0x902470>
>>> fact1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'fact1' is not defined
Defining a class
# Define class
class thingy:
def __init__(self, value): # defining instance
self.value = value
def showme(self): # defining method
print("value = %s" % self.value)

# Call class and usage


t = thingy(20)
t.showme()
scikit-learn: Machine Learning in Python

● Python library for data mining and data analysis


● Accessible to everybody, and reusable in various contexts
● Built on NumPy, SciPy, and matplotlib
● Open source, commercially usable
Major Features
● Classification, Regression
● Clustering, Dimensionality reduction
● Model selection, Feature selection
scikit-learn: Machine Learning in Python
Support Vector Machine
■ Code sample
>>> from sklearn import svm
>>> classifier = svm.SVC()
>>> classifier.fit(X_train, y_train)
>>> y_pred = classifier.predict(X_test)

IRIS data set: four features (X1, X2, X3, X4) and one class label (Y)
Datasets in SKlearn
load_boston          Load and return the boston house-prices dataset (regression).
load_iris            Load and return the iris dataset (classification).
load_diabetes        Load and return the diabetes dataset (regression).
load_digits          Load and return the digits dataset (classification).
load_linnerud        Load and return the physical exercise linnerud dataset.
load_wine            Load and return the wine dataset (classification).
load_breast_cancer   Load and return the breast cancer wisconsin dataset (classification).
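A minimal end-to-end sketch combining one of these built-in datasets with the SVM classifier shown above; the train/test split, random seed and accuracy reporting are illustrative choices, not from the slides:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset (4 features X1..X4, class label Y)
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))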
References
● Python Homepage
• http://www.python.org
● Python Tutorial
• http://docs.python.org/tutorial/
● Python Documentation
• http://www.python.org/doc
● Python Library References
● http://docs.python.org/release/2.5.2/lib/lib.html
● Python Add-on Packages:
● http://pypi.python.org/pypi
Regression and Classification Models
Prof. Gajendra P.S. Raghava
Head, Department of Computational
Biology

Slides collected from different resources for


teaching

Web Site: http://webs.iiitd.edu.in/raghava/


Regression and Classification
• Linear regression
• Simple linear regression
• Multiple linear regression
• Classification
• Logistic regression
• Logistic model for multiple variable
• Machine learning
• KNN, SVM, ANN, HMM, RF
Types of Regression Models
• 1 explanatory variable → Simple regression models (linear or non-linear)
• 2+ explanatory variables → Multiple regression models (linear or non-linear)
Linear Equations
Y = mX + b
m = slope = change in Y / change in X
b = Y-intercept
Linear Regression Model
1. The relationship between the variables is a linear function:

Y_i = β_0 + β_1 X_i + ε_i

where Y_i is the dependent (response) variable, X_i is the independent (explanatory) variable, β_0 is the population Y-intercept, β_1 is the population slope and ε_i is the random error.
Least Squares
1. 'Best fit' means the differences between the actual Y values and the predicted Y values are a minimum:

Σ_{i=1..n} (Y_i − Ŷ_i)² = Σ_{i=1..n} ε̂_i²

2. LS minimizes the Sum of the Squared Errors (SSE).
Least Squares Graphically
LS minimizes Σ_{i=1..n} ε̂_i² = ε̂_1² + ε̂_2² + ε̂_3² + ε̂_4²
[Figure: observed points scattered around the fitted line Ŷ_i = β̂_0 + β̂_1 X_i, with residuals ε̂_1 ... ε̂_4; e.g., Y_2 = β̂_0 + β̂_1 X_2 + ε̂_2]
Coefficient estimation
• Prediction equation: ŷ_i = β̂_0 + β̂_1 x_i
• Sample slope: β̂_1 = SS_xy / SS_xx = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
• Sample Y-intercept: β̂_0 = ȳ − β̂_1 x̄
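A short numeric sketch of these formulas; the small x/y data below are made up purely for illustration:

import numpy as np

# Tiny illustrative dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.1, 5.9])

# Sample slope and intercept from the least-squares formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)                 # intercept and slope

# Same result using a library fit
print(np.polyfit(x, y, 1))    # [slope, intercept]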
Multiple Linear Regression Models

Introduction
• For example, suppose that the effective life of a cutting tool depends on the cutting speed and the tool angle. A possible multiple regression model could be

Y = β_0 + β_1 x_1 + β_2 x_2 + ε

where Y is the tool life, x_1 is the cutting speed and x_2 is the tool angle.

Multiple Linear Regression Models
Least Squares Estimation of the Parameters
Categorical Response Variables
Examples:
• Whether or not a person smokes (binary response): Y = {Non-smoker, Smoker}
• Success of a medical treatment (binary response): Y = {Survives, Dies}
• Opinion poll responses (ordinal response): Y = {Agree, Neutral, Disagree}

P(y|x) = e^(α + βx) / (1 + e^(α + βx))
The Logistic Curve
LOGIT(p) = ln( p / (1 − p) ) = z  ⇔  p = exp(z) / (1 + exp(z))
[Figure: sigmoid curve of p (probability) against z (log odds)]
The Logistic Regression Model

Logistic Regression:
ln( P(Y) / (1 − P(Y)) ) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_K X_K

Linear Regression:
Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_K X_K + ε
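A small sketch of fitting a logistic model in scikit-learn; the one-feature dataset and the query point are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative binary-classification data: one feature, labels 0/1
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimated beta_0 and beta_1 of the logit
print(model.predict_proba([[2.2]]))    # [P(Y=0), P(Y=1)] for a new X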
Ridge Regression
• The effect of this equation is to add a shrinkage penalty of the form λ Σ_j β_j², where the tuning parameter λ is a positive value.
• This has the effect of shrinking the estimated beta coefficients towards zero. It turns out that such a constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance.
• Note that when λ = 0, the penalty term has no effect, and ridge regression will produce the OLS estimates. Thus, selecting a good value for λ is critical (can use cross-validation for this).
The Lasso

• One significant problem of ridge regression is that the penalty term will never
force any of the coefficients to be exactly zero.

• Thus, the final model will include all p predictors, which creates a challenge in
model interpretation

• A more modern machine learning alternative is the lasso.

• The lasso works in a similar way to ridge regression, except it uses a different
penalty term that shrinks some of the coefficients exactly to zero.
The Lasso (cont.)
Elastic Net Regression
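A brief sketch comparing these penalized regressions in scikit-learn; the synthetic data and the alpha values are illustrative assumptions, chosen only to show the lasso's exact zeros:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data: 100 samples, 10 features, only the first 3 truly informative
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3*X[:, 0] - 2*X[:, 1] + 0.5*X[:, 2] + rng.randn(100) * 0.1

ridge = Ridge(alpha=1.0).fit(X, y)                     # shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)                     # can set some coefficients exactly to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))   # note the exact zeros on uninformative features
print(np.round(enet.coef_, 2))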
Classification & Prediction
• Handling linear data
• Handling non-linear data
• Machine learning for non-linear data
  • Artificial neural networks (ANN)
  • Support vector machine (SVM)
  • Hidden Markov model (HMM)
  • K-nearest neighbor (K-NN)
  • Random forest classifier
Comparison of Supervised, Unsupervised and Reinforcement Learning
Supervised | Unsupervised | Reinforcement
Trained using labeled data | Trained on unlabeled data | Learning based on feedback
Regression or classification | Clustering and association | Real-time learning (game, robot)
Labeled data for training | No labeled data (patterns) | No predefined data (from scratch)
Supervision | No supervision | No supervision
Forecast outcomes | Discover patterns | Learn like a child from actions
Map input data to output label | Understand patterns | Learn using trial and error
Direct feedback | No feedback | Reward/penalty system
KNN, SVM, LR, RF | K-means, C-means, Apriori | Q-learning, SARSA
Classification & Prediction
• Handling linear data
• Handling non-linear data
• Machine learning for non-linear data
  • Artificial neural networks (ANN)
  • Support vector machine (SVM)
  • Hidden Markov model (HMM)
  • K-nearest neighbor (K-NN)
  • Random forest classifier
Introduction to Neural Network (Perceptron)
The neuron calculates a weighted sum of inputs and compares it to a threshold. If the
sum is higher than the threshold, the output is set to 1, otherwise to -1.

Non-linearity

Example: A simple single unit adaptive network
• The network has 2 inputs and one output. All are binary. The output is
  • 1 if W0·I0 + W1·I1 + Wb > 0
  • 0 if W0·I0 + W1·I1 + Wb ≤ 0
• We want it to learn simple OR: output a 1 if either I0 or I1 is 1.
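A tiny sketch of training such a single unit on OR with the classic perceptron update rule; the learning rate, initial weights and epoch count are illustrative assumptions:

import numpy as np

# Inputs (I0, I1) and desired OR outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 1, 1, 1])

w = np.zeros(2)      # W0, W1
wb = 0.0             # bias weight
lr = 0.1             # learning rate

for epoch in range(20):
    for x, t in zip(X, T):
        o = 1 if np.dot(w, x) + wb > 0 else 0   # threshold unit
        w += lr * (t - o) * x                   # nudge weights toward the target
        wb += lr * (t - o)

print(w, wb)
print([1 if np.dot(w, x) + wb > 0 else 0 for x in X])   # [0, 1, 1, 1]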
Hidden layers Neural Networks
• Layers of nodes
• Input is transformed into
numbers
• Weighted averages are fed into
nodes
• High or low numbers come out
of nodes
• A Threshold function determines
whether high or low
• Output nodes will “fire” or not
• Determines classification
• For an example

Backpropagation neural network
Deep neural network is simply a feedforward network
Introduction to Support Vector Machine (SVM)

• Proposed by Boser, Guyon and Vapnik in 1992


• Gained increasing popularity in late 1990s.
• Highly effective on small datasets
• Minimal over-optimization
• Classification (binary, multi-class), regression, clustering
• Most popular implementations: SMO, LIBSVM and SVMlight
• Tuning SVM parameters is a challenge (hit and trial)
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a higher-dimensional space (e.g., x → (x, x²))?
[Figure: 1-D data on the x axis, before and after mapping to (x, x²)]
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0

f(x) = sign(wTx + b)
Maximum Margin Classification
• Maximizing the margin is good according to intuition and PAC
theory.
• Implies that only support vectors are important; other training
examples are ignorable.
K-NEAREST NEIGHBOR METHOD (KNN)
• Classification of an unknown object based on similarity/distance to annotated objects
• Search for similar objects in a database of known objects
• Different names
• Memory based reasoning
• Example based learning
• Instance based reasoning
• Case based reasoning
KNN – Number of Neighbors
• If K=1, select the nearest neighbor
• If K>1,
• For classification select the most frequent neighbor.
• Voting or average concept
• Preference/weight to similarity/distance
• For regression calculate the average of K neighbors.

Weight to Instance
• All instances or examples are not equally reliable
• Weight an instance based on its success in prediction
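A small scikit-learn sketch of K-nearest neighbor classification, including the distance-weighted voting mentioned above; the dataset, K and split are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# K=5 with voting; weights="distance" gives closer neighbors more say
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))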
Random Forest Algorithm
• Random forest (or random forests) is an ensemble classifier that consists of
many decision trees
• Tin Kam Ho of Bell Labs proposed random decision forests in 1995
• Decision trees are one of the most popular learning methods.
• One type of decision tree is called CART: classification and regression tree.
• CART uses greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions.
To 'play tennis' or not
[Decision tree: root node Outlook (sunny / overcast / rain); the sunny branch splits on Humidity (high → No, normal → Yes); the overcast branch → Yes; the rain branch splits on Windy (true → No, false → Yes)]
A new test example: (Outlook == rain) and (Windy == false)
Pass it down the tree -> the decision is Yes.
Decision trees involve greedy, recursive partitioning.
• Simple dataset with two predictors
• Greedy, recursive partitioning along TI and PE
Random Forest Classifier
[Figure: training data with N examples and M features]
• Draw bootstrap samples from the training data
• Construct a decision tree from each bootstrap sample
• At each node, choose the split feature only among m < M randomly selected features
• To classify a new example, pass it down all trees and take the majority vote
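A short scikit-learn sketch making these ideas concrete; the dataset and hyperparameter values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees built from bootstrap samples
    max_features="sqrt",    # m < M features considered at each split
    bootstrap=True,
    random_state=0,
)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))   # prediction = majority vote of the trees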
Artificial Neural Network
&
Hidden Markov Model
Prof. Gajendra P.S. Raghava
Head, Department of Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide
History of Neural Network
• 1943: McCulloch and Pitts model neural networks based on their
understanding of neurology.
• 1950: Farley and Clark (model biological behavior)
• 1958: Perceptron by Rosenblatt 1958
• 1969: Minsky & Papert showed the limitations of perceptron, killing
research for a decade
• 1974: Back-propagation by Werbos, revitalizes the field
• 1990: Applications in medicine, marketing, risk management
Classification & Prediction
• Handling linear data
• Handling non-linear data
• Machine learning for non-linear data
  • Artificial neural networks (ANN)
  • Support vector machine (SVM)
  • Hidden Markov model (HMM)
  • K-nearest neighbor (K-NN)
  • Random forest classifier
Introduction to Neural Network (Perceptron)
The neuron calculates a weighted sum of inputs and compares it to a threshold. If the
sum is higher than the threshold, the output is set to 1, otherwise to -1.

Non-linearity

Example: A simple single unit adaptive network
• The network has 2 inputs,
and one output. All are
binary. The output is
• 1 if W0I0 + W1I1 + Wb > 0
• 0 if W0I0 + W1I1 + Wb ≤ 0

• We want it to learn simple OR: output a 1 if either I0 or I1 is 1.
Perceptrons
• Initial proposal of connectionist networks
• Rosenblatt, 50’s and 60’s
• Essentially a linear discriminant composed of
nodes, weights
[Figure: inputs I1, I2, I3 with weights W1, W2, W3 feeding a single output node O through an activation function.]

O = 1 if Σi wi Ii + θ > 0; O = 0 otherwise
Perceptron Example
Inputs I1 = 2 and I2 = 1, weights W1 = 0.5 and W2 = 0.3, threshold θ = -1:
2(0.5) + 1(0.3) + (-1) = 0.3 > 0, so O = 1

Learning Procedure:
• Randomly assign weights (between 0 and 1)
• Present inputs from the training data
• Get output O; nudge the weights to move the result toward the desired output T
• Repeat; stop when there are no errors, or enough epochs are completed
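A minimal sketch of this learning procedure for the OR example above, in plain NumPy; the learning rate and epoch limit are arbitrary choices.

```python
import numpy as np

# Learn OR with a single threshold unit: O = 1 if w0*I0 + w1*I1 + wb > 0 else 0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 1, 1, 1])            # desired outputs for OR

rng = np.random.default_rng(0)
w = rng.random(2)                     # random weights between 0 and 1
wb = rng.random()                     # bias weight
c = 0.1                               # learning rate (arbitrary)

for epoch in range(100):
    errors = 0
    for x, t in zip(X, T):
        o = 1 if np.dot(w, x) + wb > 0 else 0
        # nudge the weights toward the desired output T
        w += c * (t - o) * x
        wb += c * (t - o)
        errors += int(o != t)
    if errors == 0:                   # stop when every training example is correct
        break

print("weights:", w, "bias:", wb, "epochs used:", epoch + 1)
```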
How might you use a perceptron network?
• This (and other networks) is generally used to learn how to make classifications
• Say you have collected some data regarding the diagnosis of patients with heart
disease
• Age, Sex, Chest Pain Type, Resting BPS, Cholesterol, …, Diagnosis (<50% diameter narrowing,
>50% diameter narrowing)

• 67,1,4,120,229,…, 1
• 37,1,3,130,250,… ,0
• 41,0,2,130,204,… ,0

• Train network to predict heart disease of new patient


LMS Learning
LMS = Least Mean Square learning systems, more general than the previous perceptron learning rule. The concept is to minimize the total error, as measured over all training examples P. O is the raw output, as calculated by O = Σi wi Ii + θ.

Distance(LMS) = (1/2) ΣP (TP − OP)²

E.g. if we have two patterns and T1=1, O1=0.8, T2=0, O2=0.5, then D = (0.5)[(1−0.8)² + (0−0.5)²] = 0.145

We want to minimize the LMS by moving the weights from W(old) toward W(new) down the error surface E, with learning rate c.
LMS Gradient Descent
• Using LMS, we want to minimize the error. We can do this by finding the direction
on the error surface that most rapidly reduces the error rate; this is finding the slope
of the error function by taking the derivative. The approach is called gradient
descent (similar to hill climbing).
To compute how much to change the weight for link k:

Oj = f( Σk Ik Wk )

Δwk = −c ∂Error/∂wk

∂Oj/∂wk = Ik f′( Σk Ik Wk )

Chain rule: ∂Error/∂wk = (∂Error/∂Oj) (∂Oj/∂wk)

∂Error/∂Oj = ∂[ (1/2) ΣP (TP − OP)² ]/∂Oj = ∂[ (1/2) (Tj − Oj)² ]/∂Oj = −(Tj − Oj)
(We can remove the sum since we are taking the partial derivative with respect to Oj.)

Therefore: Δwk = −c ( −(Tj − Oj) ) Ik f′(ActivationFunction) = c (Tj − Oj) Ik f′(ActivationFunction)
Optimizing concave/convex function
• Maximum of a concave function = minimum of a convex
function
Gradient ascent (concave) / Gradient descent (convex)

Gradient ascent rule


Activation Function
• To apply the LMS learning rule, also known as the
delta rule, we need a differentiable activation
function.

Δwk = c Ik (Tj − Oj) f′(ActivationFunction)

Old (step function): O = 1 if Σi wi Ii + θ ≥ 0, otherwise O = 0
New (sigmoid): O = 1 / (1 + e^−(Σi wi Ii + θ))
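A minimal sketch of the delta rule with the sigmoid activation, for which f′(net) = O(1 − O); the input values, learning rate and number of steps are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

I = np.array([2.0, 1.0])      # inputs (illustrative values)
w = np.array([0.5, 0.3])      # weights
theta = -1.0                  # bias / threshold term
T = 1.0                       # desired (target) output
c = 0.5                       # learning rate (arbitrary)

for step in range(20):
    net = np.dot(w, I) + theta
    O = sigmoid(net)
    # delta rule: dw_k = c * I_k * (T - O) * f'(net); for the sigmoid f'(net) = O * (1 - O)
    delta = c * (T - O) * O * (1 - O)
    w += delta * I
    theta += delta            # the threshold is treated as a weight on a constant input of 1

print("output after training:", sigmoid(np.dot(w, I) + theta))
```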
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
• Forced output to 0 or 1, while LMS uses the net output
• Guaranteed to separate, if no error and is linearly separable
• Otherwise it may not converge
• Gradient Descent Method:
• May oscillate and not converge
• May converge to wrong answer
• Will converge to some minimum even if the classes are not linearly
separable, unlike the earlier perceptron training method
Activation functions
• The activation function is generally non-linear. Linear functions are limited
because the output is simply proportional to the input.

Hidden layers Neural Networks
• Layers of nodes
• Input is transformed into
numbers
• Weighted averages are fed into
nodes
• High or low numbers come out
of nodes
• A Threshold function determines
whether high or low
• Output nodes will “fire” or not
• This determines the classification for an example
Backpropagation neural network
Example of cascade neural network
Deep neural network is simply a feedforward network
Example of Markov Model

[Figure: two-state Markov model over ‘Rain’ and ‘Dry’ with the transition probabilities listed below.]
• Two states : ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3 , P(‘Dry’|‘Rain’)=0.7 ,
P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .

Markov Chain Models

• a Markov chain model is defined by:


• a set of states
• some states emit symbols
• other states (e.g. the begin state) are silent
• a set of transitions with associated probabilities
• the transitions emanating from a given state define a distribution over the possible next
states

Markov Chain Models
• given some sequence x of length L, we can ask how probable the sequence is given our model
• for any probabilistic model of sequences, we can write this probability as
  P(x) = P(xL | xL−1, …, x1) P(xL−1 | xL−2, …, x1) … P(x1)
• key property of a (1st order) Markov chain: the probability of each xi depends only on xi−1, so
  P(x) = P(x1) Π i=2..L P(xi | xi−1)
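A minimal sketch that evaluates the probability of a state sequence under the Rain/Dry model above, using the first-order Markov property:

```python
# Two-state Markov chain from the example above
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {
    "Rain": {"Rain": 0.3, "Dry": 0.7},
    "Dry":  {"Rain": 0.2, "Dry": 0.8},
}

def sequence_probability(states):
    # P(x) = P(x1) * product over i of P(x_i | x_{i-1})  (first-order Markov property)
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

# e.g. P(Dry, Dry, Rain, Rain) = 0.6 * 0.8 * 0.2 * 0.3 = 0.0288
print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))
```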
Higher Order Markov Chains

• An nth order Markov chain over some alphabet is equivalent to a


first order Markov chain over the alphabet of n-tuples

• Example: a 2nd order Markov model for DNA can be treated as a


1st order Markov model over alphabet:
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG,
and TT (i.e. all possible dinucleotides)

Protein Annotation
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/


Why Functional annotation of a protein?
• All living things are made up of four classes of macromolecules
• Carbohydrates
• Lipids
• Protein
• Nucleic Acids
• Few facts about proteins
• Called the servants of living organisms
• Most of the functions of a living organism are performed by proteins
• More than 50% of the dry mass of most cells
• Protein functions include structural support, storage, transport, cellular
communications, movement, and defense against foreign substances
• Made of 20 natural amino acids (polymers of amino acids)
More about proteins
Prediction of Major Function of Proteins
Annotation at Protein Level
• Protein Tertiary Structure
• Classification of proteins
• Subcellular localization of proteins
• Protein-protein interaction
• Evolution of proteins
Residue Level Annotation
• Surface accessibility of residues
• Residues interact with other molecules
• Post-translational modification
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large, well-annotated protein database
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
Function by Homology or Similarity
Assumption: Similar sequences have similar function
• BLAST searches against databases (GenBank, PIR, ProDom, BIND)

[Workflow: query sequence -> BLAST against a sequence database -> retrieve homologues -> parse their annotations into features.]

Databases Are Key


Important Databases for Annotation
Sequence Databases
⚫ GenBank (minimal annotation)
⚫ PIR (slightly better annotation)
⚫ UniProt (even better annotation)
⚫ Organism-specific databases

Structure Databases
⚫ RCSB-PDB (http://www.rcsb.org/pdb/)
⚫ SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
⚫ SATPdb (https://webs.iiitd.edu.in/raghava/satpdb/)
⚫ ccPDB (https://webs.iiitd.edu.in/raghava/ccpdb/)

Interaction Databases
⚫ STRING (https://string-db.org/)
⚫ INTACT (https://www.ebi.ac.uk/intact/)

Expression Databases
⚫ GEO (https://www.ncbi.nlm.nih.gov/geo/)
⚫ GDC (https://portal.gdc.cancer.gov/)

Cancer Databases
⚫ CancerDR (https://webs.iiitd.edu.in/raghava/cancerdr/)
⚫ TumorHope (https://webs.iiitd.edu.in/raghava/tumorhope/)

Databases for Therapeutics
⚫ THPdb (https://webs.iiitd.edu.in/raghava/thpdb/)
⚫ CancerPPD (https://webs.iiitd.edu.in/raghava/cancerppd/)
⚫ ParaPEP (https://webs.iiitd.edu.in/raghava/parapep/)
⚫ CPPsite 2.0 (https://webs.iiitd.edu.in/raghava/cppsite/)
⚫ AntiTbPDB (https://webs.iiitd.edu.in/raghava/antitbpdb/)
⚫ TopicalPdb (https://webs.iiitd.edu.in/raghava/topicalpdb/)

Databases for Vaccines
⚫ PRRDB 2.0 (https://webs.iiitd.edu.in/raghava/prrdb2/)
BLAST compares sequences
• BLAST takes a query sequence
• Compares with sequences in biological databases
• Lists those that appear to be similar to the query sequence
• The “hit list”
• Tells you why it thinks they are homologs
• BLAST makes suggestions
• YOU make the conclusions
• E-value
• The chance that the match could be random
• The lower the E-value, the more significant the match (E =
0, sequences are identical)

Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large, well-annotated protein database
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
Adapted from the Internet
https://webs.iiitd.edu.in/raghava/pfeature/
Compositional Similarity (Composition)
 Correlation between compositions
 Compositional distance
  • Euclidean distance: dist = √( Σ k=1..n (pk − qk)² )
  • Manhattan distance
  • Minkowski distance
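A minimal sketch of comparing two composition vectors with the distance and correlation measures above; the composition values are invented for illustration.

```python
import numpy as np

# Illustrative composition vectors of two sequences (fractions of a few residue types)
p = np.array([0.10, 0.05, 0.20, 0.15, 0.50])
q = np.array([0.12, 0.07, 0.18, 0.13, 0.50])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(p - q))           # sum of absolute differences
correlation = np.corrcoef(p, q)[0, 1]       # Pearson correlation between the compositions

print(euclidean, manhattan, correlation)
```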
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large, well-annotated protein database
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
Feature based annotation

[Workflow: query sequence -> search against a pattern/motif database -> parse the matched patterns into features.]

Example annotations: Pfam; PF00234; tryp_alpha_amyl; 1. | PROSITE; PS00940; GAMMA_THIONIN; 1. | PROSITE; PS00305; 11S_SEED_STORAGE; 1.

Pattern/motif databases:
• PROSITE - http://www.expasy.ch/
• BLOCKS - http://blocks.fhcrc.org/
• DOMO - http://www.infobiogen.fr/services/domo/
• PFAM - http://pfam.wustl.edu/
• PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS/
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large, well-annotated protein database
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
What is Subcellular Localization?

⚫ Organelles
⚫ Membranes
⚫ Compartments
⚫ Micro-environments
Prediction of molecular
interactions in Proteins
Computational methods for
predicting protein interactions
Protein level annotation
➢ DNA binding proteins
➢ RNA binding proteins
➢ Ligand binding proteins
➢ Interacting pair of proteins

Residue level annotation
➢ Prediction/identification of
  ➢ DNA interacting residues
  ➢ ATP interacting residues
  ➢ RNA interacting residues
  ➢ Glycosylation sites

Structure based techniques
➢ Docking techniques
➢ Need structure
➢ Any type of interaction
➢ Time consuming

Sequence based techniques
➢ Generation of features
➢ Classification techniques
➢ Knowledge based techniques
➢ Need data for training
➢ Suitable for high throughput
Prediction of Interaction: Case
studies

Protein level annotation


➢ DNA binding proteins
➢DNAbinder: A web server for predicting DNA binding proteins
➢Interacting pair of proteins
➢ProPrInt: Predicting Protein-Protein Interactions
Residue level annotation
➢ATP interacting residues
➢ATPint: Prediction of ATP interacting residues
Kumar M, Gromiha MM, Raghava GP. Identification of
DNA-binding proteins using support vector machines and
evolutionary profiles. BMC Bioinformatics. 8:463.
Datasets for Training and Testing
• DNAset (Main dataset)
  • 146 DNA binding proteins
  • 250 Non-binding proteins
• DNAaset (Alternate dataset)
  • 1153 DNA binding proteins
  • 1153 Non-binding proteins
• DNAiset (Independent dataset)
  • 92 DNA binding proteins
  • 100 Non-binding proteins
• All datasets are non-redundant (25%): no two proteins share more than 25% sequence similarity
Example of Protein-Protein Interaction
Protein name Amino acid sequence Label

P1 MLGHLPGHGTRALGRHLGHLPG ----- DNA-binding


P2 GGGRDESEWERWAAGHCDSA ----- DNA-binding
P3 TGHHHSSAAQWERAASWERY ----- Non-binding
P4 LLMNVCDFGHTYRRRWWWA ----- Non-binding
P5 RDSRGTESEWERWAAGHCDSA ----- ???

Amino Acid Composition

Dipeptide Composition
Example of Protein-Protein Interaction
Protein name Amino acid sequence Label

P1 MLGHLPGHGTRALGRHLGHLPG ----- DNA-binding


P2 GGGRDESEWERWAAGHCDSA ----- DNA-binding
P3 TGHHHSSAAQWERAASWERY ----- Non-binding
P4 LLMNVCDFGHTYRRRWWWA ----- Non-binding
P5 RDSRGTESEWERWAAGHCDSA ----- ???

Amino Acid Composition feature matrix:
PN | A C D E F G H I K L M N P Q R S T V W Y | Label
P1 | … | 1
P2 | … | 1
P3 | … | 0
P4 | … | 0
P5 | … | ?
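A minimal sketch of turning a sequence into a 20-dimensional amino-acid-composition vector, using two of the toy sequences from the slide:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    # fraction of each of the 20 standard amino acids in the sequence
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

proteins = {
    "P1": "MLGHLPGHGTRALGRHLGHLPG",   # toy sequences from the slide
    "P3": "TGHHHSSAAQWERAASWERY",
}
for name, seq in proteins.items():
    print(name, ["%.2f" % v for v in aac(seq)])
```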
Protein features and
vector encoding
• Amino Acid Composition: 20
• Split (4 part): 20*4 = 80
• Dipeptide Composition : 400
• Evolutionary information
• PSI-BLAST
• Search against NR
• E-value 0.01
• PSSM profile
• PSSM-400
• PSSM-420
• PSSM-21
Performance of Different Models
Performance of
Different Models
Prediction of Interaction: Case
studies

Protein level annotation


➢ DNA binding proteins
➢DNAbinder: A web server for predicting DNA binding proteins
➢Interacting pair of proteins
➢ProPrInt: Predicting Protein-Protein Interactions
Residue level annotation
➢ATP interacting residues
➢ATPint: Prediction of ATP interacting residues
ProPrInt: Predicting Protein-Protein Interactions
Rashid M, Ramasamy S, Raghava GP. A simple approach for predicting protein-
protein interactions. Curr Protein Pept Sci. 2010 Nov;11(7):589-600.

• Alignment free method


• Prediction of protein interaction pair in Escherichia coli,
Saccharomyces cerevisiae, and Helicobacter pylori
• E. coli (1082 positive and 13840 negative interactions)
• S. cerevisiae (10517 positive and equal number of negative
examples generated randomly)
• H. pylori (1458 interactions and an equal number of non-interactions)
Example of Protein-Protein Interaction
First Protein Second Protein Interaction

P1: ALGRHLGHLPGHTHGKLPMN ----- P2: MLGHLPGHGTRALGRHLGHLPG ----- Interacting


P3: GGGGGRGRDSAWWWEEEE ----- P4: GGGRDESEWERWAAGHCDSA ----- Interacting
P1: ALGRHLGHLPGHTHGKLPMN ----- P5: TGHHHSSAAQWERAASWERY ----- Non-Interacting
P3: GGGGGRGRDSAWWWEEEE ----- P6: LLMNVCDFGHTYRRRWWWA ----- Non-interacting
P2: MLGHLPGHGTRALGRHLPG ----- P4: GGGRDESEWERWAAGHCDSA ----- ???

Amino Acid Composition

Dipeptide Composition
Example of Composition based Features
First Protein Second Protein Interaction

P1: ALGRHLGHLPGHTHGKLPMN ----- P2: MLGHLPGHGTRALGRHLGHLPG ----- Interacting


P3: GGGGGRGRDSAWWWEEEE ----- P4: GGGRDESEWERWAAGHCDSA ----- Interacting
P1: ALGRHLGHLPGHTHGKLPMN ----- P5: TGHHHSSAAQWERAASWERY ----- Non-Interacting
P3: GGGGGRGRDSAWWWEEEE ----- P6: LLMNVCDFGHTYRRRWWWA ----- Non-interacting
P2: MLGHLPGHGTRALGRHLPG ----- P4: GGGRDESEWERWAAGHCDSA ----- ???

1st Protein (20) 2nd Protein (20)


Label
A C D E F G H I K L M N P Q R S T V W Y A C D E F G H I K L M N P Q R S T V W Y
1
1
1
0
0
0
?
?
?
Rashid M, Ramasamy S, Raghava GP. A simple
approach for predicting protein-protein interactions.
Curr Protein Pept Sci. 2010 Nov;11(7):589-600.
Creation of Datasets
• E. coli
• 1082 interacting-pairs (Positive examples)
• 13840 non-interacting pairs (non-colocalized proteins)
• S. cerevisiae
• 10517 Interacting-pairs (Positive examples)
• 10517 Non-interacting pairs (generated randomly)
• H. pylori
• 1458 Interacting-pairs (Positive examples)
• 1458 Non-interacting pairs (Random pairs)
Rashid M, Ramasamy S, Raghava GP. A simple approach
for predicting protein-protein interactions. Curr Protein
Pept Sci. 2010 Nov;11(7):589-600.
Feature Extraction
• Amino acid composition: 20 * 2 = 40
• Split (4 parts) amino acid composition = (20 * 4) * 2 = 160
• Dipeptide composition: 400 * 2 = 800
• Biochemical descriptors: 6 groups based on similar properties
• Mono-Composition : 6 * 2 = 12
• Di-Composition: (6*6)*2= 36*2 = 72
• Pseudo amino acid composition: 21 * 2 = 42
• Position-specific scoring matrix (PSSM): 400 * 2 = 800

Wrapper based Attribute Selection


Prediction of Interaction: Case
studies

Protein level annotation


➢ Interacting pair of proteins
➢ProPrInt: Predicting Protein-Protein Interactions
➢DNA binding proteins
➢DNAbinder: A web server for predicting DNA binding proteins
Residue level annotation
➢ATP interacting residues
➢ATPint: Prediction of ATP interacting residues
➢RNA interacting residues
➢Pprint: A web server for predicting RNA interacting residues
Chauhan, J.S., Mishra, N.K. & Raghava, G.P. Identification of ATP
binding residues of a protein from its primary sequence. BMC
Bioinformatics 10, 434 (2009).

Creation of dataset

• 360 ATP binding protein chains from SuperSite

• 267 non-redundant PDB chains (using CD-HIT software)

• 168 protein chains having ATP binding sites

• ATP interacting residues assigned using the software Ligand Protein Contact (LPC)
Example of ATP interacting Residues
(interacting residues are shown in lower case; the label refers to the central residue of each pattern)
Generate Patterns of length 7
1A0I_A::VNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNI
XXXVNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNIXXX

Pattern Label
XXXVNik Non-Interacting
XXVNikT Non-Interacting
XVNikTN Interacting
VNikTNP Interacting
NikTNPf Non-Interacting
ikTNPfk Non-Interacting
…….. …….
rGNIXXX
RGHRIGH ?
Generate Patterns of length 9
1A0I_A::VNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNI
XXXXVNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNIXXXX

Pattern Label
XXXXVNikT Non-Interacting
XXXVNikTN Non-Interacting
XXVNikTNP Interacting
XVNikTNPf Interacting
VNikTNPfk Non-Interacting
NikTNPfka Non-Interacting
…….. …….
VrGNIXXXX
ARGHRIGHV ?
Binary profile of the pattern XVNikTNPf (length 9, alphabet of 21 symbols: 20 amino acids + X):
PATTERN = 9 × 21 = 189 binary values

      X V N i k T N P f
A     0 0 0 0 0 0 0 0 0
C     0 0 0 0 0 0 0 0 0
D     0 0 0 0 0 0 0 0 0
E     0 0 0 0 0 0 0 0 0
F     0 0 0 0 0 0 0 0 1
G     0 0 0 0 0 0 0 0 0
H     0 0 0 0 0 0 0 0 0
I     0 0 0 1 0 0 0 0 0
K     0 0 0 0 1 0 0 0 0
L     0 0 0 0 0 0 0 0 0
M     0 0 0 0 0 0 0 0 0
N     0 0 1 0 0 0 1 0 0
P     0 0 0 0 0 0 0 1 0
Q     0 0 0 0 0 0 0 0 0
R     0 0 0 0 0 0 0 0 0
S     0 0 0 0 0 0 0 0 0
T     0 0 0 0 0 1 0 0 0
V     0 1 0 0 0 0 0 0 0
W     0 0 0 0 0 0 0 0 0
Y     0 0 0 0 0 0 0 0 0
X     1 0 0 0 0 0 0 0 0

Reading the columns one after another converts the pattern into a single binary vector of 189 values, e.g. position 1 (X) contributes 0,0,…,0,1 and position 2 (V) contributes 0,…,0,1,0,0,0.
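A minimal sketch of this binary encoding; the 21-letter alphabet (20 amino acids plus X for terminal padding) and the window of length 9 follow the example above.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"   # 20 amino acids + X (padding / unknown)

def binary_profile(pattern):
    # one 21-bit block per position; a single 1 marks the residue at that position
    bits = []
    for residue in pattern.upper():
        block = [0] * len(ALPHABET)
        block[ALPHABET.index(residue)] = 1
        bits.extend(block)
    return bits

vec = binary_profile("XVNIKTNPF")    # window of length 9 from the example
print(len(vec))                      # 9 * 21 = 189
```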
Therapeutic Application of
Proteins or Peptides
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Adapted from the Internet
Protein Sequence in FASTA Format
[Overview: computational analysis and applications of therapeutic proteins/peptides]
• Drug/inhibitor/vaccine/diagnostics for disease
• Peptide–protein interaction; peptide structure and docked structure
• Adaptive immunity: B-cell and T-cell epitopes; innate immunity: Toll-like receptors
• Anti-bacterial/cancer/viral peptides; mimotopes for disease diagnostics
• Structure determination and structure prediction: natural, non-natural, modified bonds
• Synthesis: phage display, SPPS, codon shuffling; natural bioactive peptides from metagenomics
• Mimotopes for B/T epitopes; mimicking of drug molecules
• ADMET: proteolytic enzymes, size, half-life
• Optimization for oral delivery: trans., adjuvant, function/structure
https://webs.iiitd.edu.in/raghava/thpdb/
Important Facts
 Success rate
 Phase I -> II ~84% for biologics; 63% for small molecules
 Phase II -> III ~53% for biologics; 38% for small molecules.

 Approval
 Phase III -> FDA approval ~74% biologics; ~61% small molecules

 Peptide therapeutics market is expected to reach USD 48.04 billion by


the year 2025 (report by Grand View Research, Inc.)
Concept of Drug and Vaccine

• Concept of Drug
• Kill invading foreign pathogens
• Inhibit the growth of pathogens

• Concept of Vaccine
• Generate memory cells
• Train the immune system to face various existing disease agents
History of Immunization
• Children who recovered from smallpox were protected
• Immunity was induced by a process known as variolation
• Variolation spread to England and America
• Stopped due to the risk of death
• Edward Jenner found that protection against smallpox could be achieved by
inoculation with material from an individual infected with cowpox
• This process was called vaccination (cowpox is vaccinia)
• The inoculum was termed a vaccine
• Protective antibodies were developed
Biomolecules Based Vaccines

[Figure: vaccine types — whole organism (attenuated), purified antigen, and epitopes (subunit vaccine), e.g. T-cell epitopes.]
Different arms of Immune System
Disease Causing Agents

Pathogens/Invaders
Exogenous processing of Pathogenic antigens
(MHC Class II binders or T-helper Epitopes)
Prediction of CTL Epitopes (Cell-mediated immunity)
Web servers for designing epitope-based vaccine
T-Cell Epitopes
• Propred: Promiscuous MHC class-II binders
• Propred1: Promiscuous MHC class-I binders
• MHCBN: Database of MHC binders and non-binders
• IL4Pred: Prediction of interleukin-4 inducing peptides
• Pcleavage: Proteome cleavage sites
• TAPpred: Prediction of TAP binders
• CTLpred: Prediction of CTL epitopes

B-Cell Epitopes
• BCIpep: Database of B-cell epitopes
• Lbtope: Prediction of B-cell epitopes
• ALGpred: Allergens and IgE epitopes
• IgPred: Antibody-specific epitopes

Vaccine Adjuvants
• PRRDB: A database of PRRs & ligands
• VaccineDA: DNA-based adjuvants
• imRNA: Immunomodulatory RNAs
• VaccinePAD: Peptide-based adjuvants
• PolysacDB: Polysaccharide antigens
Drug Delivery
Peptide based Inhibitors
Peptide Stability
Toxicity
Protein Sequence in FASTA Format
Nucleotide Sequence in FASTA Format
Tweets
Generation of Features from
String Specifically for Protein
Prof. Gajendra P.S. Raghava
Head, Department of Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Tweets
Protein Sequence in FASTA Format
Nucleotide Sequence in FASTA Format
Text Classification-Applications
Examples:
• Arranging news stories
• Classify business names by industry
• Identification of spam email
• Categorization of pdf files as ResearchPaper
• Movie reviews as favorable, …
• Identification of interesting research papers
• Classify jokes as Funny, NotFunny
• Prediction of relevant web sites

Applications:
 Web pages: recommending, Yahoo-like classification
 Newsgroup messages: recommending, spam filtering
 News articles: personalized newspaper
 Email messages: routing, prioritizing, folderizing, spam filtering
Text Classification-Definition
• Text classification is the assignment of text documents
to one or more predefined categories based on their
content.
[Figure: a classifier assigns each text document to one of the classes A, B, or C.]
• The classifier:
• Input: a set of m hand-labeled documents (x1,y1),....,(xm,ym)
• Output: a learned classifier f:x → y
Text Classification-Representing Texts
Example document:
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS — BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
• Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
• Maize Mar 48.0, total 48.0 (nil).
• Sorghum nil (nil)
• Oilseed export registrations were:
• Sunflowerseed total 15.0 (7.9)
• Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....

f(document) = y

? What is the simplest useful representation for the document x being classified?
Representing text: a list of words
• Removing stop words
  • Punctuations
  • Prepositions
  • Pronouns, etc.
• Stemming
  • walk, walker, walked, walking -> walk
• The document becomes a list of words:
(argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …)
Word Frequency
word        freq
grain(s)      3
oilseed(s)    2
total         3
wheat         1
maize         1
soybean       1
tonnes        1
...          ...

If the order of words doesn't matter, x can be a vector of word frequencies. “Bag of words”: a long sparse vector x = (…, fi, …), where fi is the frequency of the i-th word in the vocabulary.

Categories: grain, wheat
Feature generation

• Fingerprints or presence of a word in document

• Term or word frequency (TF)

• Inverse document frequency (IDF) = log [Nd/(1 + Nt)] , where Nd is


total documents, Nt is number of documents contain term t .

• TF-IDF = TF ✗ IDF

• Composition: term frequency is divided by length of document
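A minimal sketch of TF, IDF and TF-IDF exactly as defined above (IDF = log[Nd/(1 + Nt)]); the toy documents are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "grain oilseed total wheat maize soybean tonnes grain total grain total".split(),
    "wheat prices total export".split(),
    "football match report".split(),
]

def tf(term, doc):
    return Counter(doc)[term]                   # raw term frequency

def idf(term, docs):
    nd = len(docs)                              # total number of documents
    nt = sum(1 for d in docs if term in d)      # documents containing the term
    return math.log(nd / (1 + nt))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("grain", docs[0]), idf("grain", docs), tf_idf("grain", docs[0], docs))
```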


Document Features

Document D1 (length = 20):

Words      | Frequency | Fingerprint | Composition | tf-idf | tfc-weight | ltc-weight | Entropy-W
grain(s)   |     3     |      1      |    0.15     |   3    |            |            |
oilseed(s) |     2     |      1      |    0.10     |   3    |            |            |
total      |     3     |      1      |    0.15     |   6    |            |            |
wheat      |     1     |      1      |    0.05     |   4    |            |            |
maize      |     1     |      1      |    0.05     |   2    |            |            |
soybean    |     1     |      1      |    0.05     |   1    |            |            |
tonnes     |     1     |      1      |    0.05     |   2    |            |            |
IIIT       |     0     |      0      |    0        |   0    |            |            |
Delhi      |     0     |      0      |    0        |   0    |            |            |
Feature generation (Cont.)
• tfc-weighting
• It considers the normalized length of documents (M).

• ltc-weighting
• It considers the logarithm of the word frequency to reduce the effect of large differences in
frequencies.

• Entropy weighting
Example: “the dog smelled like a skunk”

What is an N-gram?
• An n-gram in case of text
  • Unigram: n-gram of size 1
  • Bigram: n-gram of size 2
    “# the”, “the dog”, “dog smelled”, “smelled like”, “like a”, “a skunk”, “skunk #”
  • Trigram: n-gram of size 3
    “# the dog”, “the dog smelled”, “dog smelled like”, “smelled like a”, “like a skunk”, “a skunk #”
  • Item: phonemes, syllables, letters, words, others
• In case of a protein, the n-gram is called
  • Amino acid composition (n = 1)
  • Dipeptide composition (n = 2)
  • Tripeptide composition (n = 3)
  • Item: 20 amino acids, properties
• In case of a nucleotide sequence
  • Mono-nucleotide composition
  • Di-nucleotide composition
  • Tri-nucleotide composition
  • Item: 4 nucleotides, properties
Feature generation
(Word Embedding)
• Transforming words into feature vectors

• One-hot encoding
• Binary profile

Word       | IIIT | Delhi | tonnes | soybean | grain
grain(s)   |  0   |   0   |   0    |    0    |   1
oilseed(s) |  0   |   0   |   0    |    0    |   0
total      |  0   |   0   |   0    |    0    |   0
wheat      |  0   |   0   |   0    |    0    |   0
maize      |  0   |   0   |   0    |    0    |   0
soybean    |  0   |   0   |   0    |    1    |   0
tonnes     |  0   |   0   |   1    |    0    |   0
IIIT       |  1   |   0   |   0    |    0    |   0
DELHI      |  0   |   1   |   0    |    0    |   0
Example code using SKlearn
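The slide itself shows a code screenshot; a comparable sketch using scikit-learn's text vectorizers (bag-of-words counts and TF-IDF) on toy sentences might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the dog smelled like a skunk",
    "argentine grain board figures show crop registrations",
]

# Bag-of-words counts (unigrams and bigrams)
count_vec = CountVectorizer(ngram_range=(1, 2))
X_counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF weighted features
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```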
Tokenization of Text
Protein Sequence in FASTA Format
Pfeature: A web server for computing
features of proteins and peptides
http://webs.iiitd.edu.in/raghava/pfeature/
Manual of Pfeature :
https://webs.iiitd.edu.in/raghava/pfeature/Pfeature_Manual.pdf
Example
AAC

AAC_NT(5)

AAC_CT(5)
Example

AAC_Rest(3,3)

AAC_Split (3)
Manual of Pfeature :
https://webs.iiitd.edu.in/raghava/pfeature/Pfeature_Manual.pdf
PCPi = Pi / L, where PCPi is the composition of physico-chemical property i;
Pi is the sum of property i over the residues and L is the length of the sequence.
Evaluation or Benchmarking of
Methods
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/


Topics to be covered

• Importance of unbiased evaluation


• Creation of datasets
• Different types of datasets
• Internal and external validation
• Evaluation of classification methods
• Measuring performance of regression methods
• K-cross validation techniques
• Evaluation on independent or validation data
• Case studies or examples from literature
Importance of unbiased evaluation

• Reinventing the wheel

• Established datasets for evaluating methods

• Standard parameters for evaluation

• Significance test to understand difference

• Standard protocols for creating datasets


Source of datasets
• Primary Databases (Structure of Proteins)
• PDB: Database of protein structures
• PubChem: Databases of BioAssays
• CancerPPD: A database of anticancer peptides
• ArrayExpress: Maintain expression data of genes
• Secondary databases
• ccPDB: a collection of datasets compiled from PDB
• Swiss-Prot: Annotated database of proteins
• Extracting information from literature
• Repositories of datasets
• ccPDB: maintain large number of datasets
Important points on datasets

• Removing redundancy in dataset


• Protein structure prediction (30%)
• Subcellular localization (40 to 90%)
• Peptide (100%, ?)
• Balanced dataset (equal number in each class)
• Realistic dataset
• Creating mixture of two
• Training , testing and validation
• Internal validation
• External validation
UniProt a database of protein sequence
http://www.uniprot.org/
• History (Swiss-Prot to UniProt)
• Swiss-Prot was created in 1986 by Amos Bairoch during Ph.D. at SIB
• Later developed by Rolf Apweiler at the EBI
• TrEMBL (automatic translation of EMBL) was introduced to maintain pace
• UniProt launched in 2003,(PROSITE, PRINTS, Prodom, SMART, PFAM etc)
• UniProt/SwissProt
• Manually curated highly-annotated sequences from SwissProt & PIRSF
• Contain taxonomy, citations, GO terms, motifs etc.
• Rule-based annotations including InterPro domains and motifs,
• UniProt/TREMBL
• Automatically translated from genomes including predicted as well as RefSeq genes.
• Automated rule-based annotations.
• UniProt Reference (UniRef)
• UniRef100, UniRef90, UniRef50: Less than 100%, 90% and 50% sequence
identity
Prediction of Subcellular Localization of Eukaryotic Proteins
(ESLPred, NNPSL, SubLoc)

• SWISSPROT database release 33.0


• Subcellular location was found for 15775 out of 52205 proteins
• Filters (following type of sequences were removed)
• Fragments of larger proteins
• Contained ambiguities (X within the sequence )
• Annotated in multiple locations
• Annotated based on similarity (probable)
• Transmembrane proteins
• Plant sequences
• 5134 out of 15775 sequences
• Finally 3420 sequences (Non-redundant, 90% identity)
• 2427 eukaryotic proteins in major locations
• 1097 nuclear, 684 cytoplasmic, 321 mitochondrial and 325
extracellular proteins
Prediction of Subcellular Localization of Prokaryotic Proteins
(PSLPred, CELLO, PSORT-B)

• SWISSPROT database release 40.29


• Gram-negative bacterial sequences, subcellular annotated
• Filters (following type of sequences were removed)
• Fragments of larger proteins
• Contained ambiguities (X within the sequence )
• Annotated in multiple locations
• Annotated based on similarity (probable)
• Only experimentally validated (literature)
• 1302 Prokaryotic gram-negative proteins (single location)
• 248 cytoplasmic, 268 inner membrane, 244 periplasmic, 352 outer membrane
and 190 extracellular
Subcellular localization of mycobacterial proteins
(TBpred)
• Swiss-Prot release 48
• Initially, we got 1365 mycobacterial proteins
• Filters (following type of sequences were removed)
• Fragments of larger proteins
• Contained ambiguities (X within the sequence )
• Annotated in multiple locations
• Annotated based on similarity (probable)
• 882 proteins belong to 13 subcellular locations
• Four major locations 852 mycobacterial proteins
• 340 cytoplasmic, 402 integral membranes, 50 secretory and 60 lipid anchor
• Non-redundant criteria reduce data drastically
• BLAST-cluster at e-value 10 e-4 (around 26% percent identity)
• 5 sets are created in such a way that no similar sequence in any two sets
• Single set may have redundant sequences
Error Rate
Classification Confusion Matrix
Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

Overall error rate = (25+85)/3000 = 3.67%


Accuracy = 1 – err = (201+2689)/3000 = 96.33%
If multiple classes, error rate is:
(sum of misclassified records)/(total records)

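A minimal sketch reproducing the numbers above from the confusion matrix:

```python
# Confusion matrix from the slide: rows = actual class, columns = predicted class (1, 0)
TP, FN = 201, 85      # actual 1 predicted as 1 / as 0
FP, TN = 25, 2689     # actual 0 predicted as 1 / as 0

total = TP + FN + FP + TN
error_rate = (FN + FP) / total        # misclassified / total = 110 / 3000
accuracy = (TP + TN) / total          # 1 - error rate

print(f"error rate = {error_rate:.4f}, accuracy = {accuracy:.4f}")   # 0.0367, 0.9633
```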
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50


If >= 0.50, classify as “1”
If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004

• If cutoff is 0.50: 13 records are classified as “1”


• If cutoff is 0.80: seven records are classified as “1”
Confusion Matrix for Different
Cutoffs
Cut off Prob.Val. for Success (Updatable) 0.25

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 11 1
non-owner 4 8

Cut off Prob.Val. for Success (Updatable) 0.75

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 7 5
non-owner 1 11
Cross-validation Techniques for Evaluation
Cross Validation
• Jack Knife Test
• LOOCV: One for testing and rest for training
• K-fold cross validation
• K-fold cross validation reduces to LOOCV when K = N (number of samples)
K-fold Cross Validation



Which kind of Cross Validation?
Cross evaluation
• Bootstrapping Technique
• Statistical technique for estimating distribution
Cross validation Techniques
• Monte Carlo Method
• Extracting samples randomly for testing and training
Cross validation (ANN)
• Three Way Split Technique
• Training data: A set of examples used for learning: to fit the parameters of the classifier.
• Validation set: A set of examples used to tune the parameters of a classifier.
• Test set: Assess the performance of a fully-trained classifier.
Cross validation
• Dis-Joint test

Cross validation (Large Dataset)
• 2-fold cross validation
• One fold for training another fold for testing
• Reverse the training and testing folds

• Bagging (bootstrap aggregation): randomly pick samples (~10%) for training and testing
• Repeat process number of times
• Check variation in performance
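A minimal sketch of K-fold cross-validation (and, commented out, leave-one-out) with scikit-learn on synthetic data; the classifier and fold count are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, LeaveOneOut

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # synthetic data
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, rotate 5 times
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))

# Leave-one-out (jackknife-style) cross-validation: K = N
# loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())   # slower; uncomment to run
```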
A comparison of the Bootstrap & Jackknife

• Bootstrap
  • Yields slightly different results when repeated on the same data (when estimating the standard error)
  • Not bound to theoretical distributions
• Jackknife
  • Less general technique
  • Explores sample variation differently
  • Yields the same result each time
  • Similar data requirements
Threshold Dependent Parameters for Evaluation

                         Actual Positive    Actual Negative
Predicted Positive            TP                  FP          -> PPV
Predicted Negative            FN                  TN          -> NPV
                          Sensitivity         Specificity

Example (Sick vs. Healthy):

                         Actual Positive (Sick)   Actual Negative (Healthy)
Predicted Positive             TP = 2                   FP = 18            PPV = 2/(2+18) = 10%
Predicted Negative             FN = 1                   TN = 182           NPV = 182/(1+182) = 99.5%
                         Sensitivity = 2/(2+1)    Specificity = 182/(18+182)
                               = 66.67%                 = 91%
Measures for evaluating
classification models
Sensitivity : Percentage coverage of positive is the percentage
of positive samples predicted as positive.
Specificity or percentage coverage of negative is the
percentage of negative samples predicted as negative.
Positive predictive value (PPV): Probability of correct prediction of
positively predicted samples.
Negative predictive value (NPV): Probability of correct prediction of
negatively predicted samples.
Accuracy: Percentage of correctly predicted examples (both
correct positive and correct negative prediction).
Matthews Correlation Coefficient (MCC): It penalizes both
under- and over-prediction.
F1: Measures the harmonic mean of precision and recall.
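A minimal sketch computing these measures from TP, FP, FN and TN, using the Sick/Healthy example above:

```python
import math

# Sick/Healthy example from the slide: TP=2, FP=18, FN=1, TN=182
TP, FP, FN, TN = 2, 18, 1, 182

sensitivity = TP / (TP + FN)                        # coverage of positives (recall)
specificity = TN / (TN + FP)                        # coverage of negatives
ppv = TP / (TP + FP)                                # positive predictive value (precision)
npv = TN / (TN + FN)                                # negative predictive value
accuracy = (TP + TN) / (TP + TN + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)    # harmonic mean of precision and recall

print(f"Sens={sensitivity:.2%} Spec={specificity:.2%} PPV={ppv:.2%} "
      f"NPV={npv:.2%} Acc={accuracy:.2%} MCC={mcc:.3f} F1={f1:.3f}")
```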
Confusion Matrix for
Multiclass Classifier
Following table shows the confusion matrix of a classification problem
with six classes labeled as C1, C2, C3, C4, C5 and C6.

Class C1 C2 C3 C4 C5 C6
C1 52 10 7 0 0 1
C2 15 50 6 2 1 2
C3 5 6 6 0 0 0
C4 0 2 0 10 0 1
C5 0 1 0 0 7 1
C6 1 3 0 1 0 24

Predictive accuracy?
Area Under Curve (AUC or AUCROC)
Threshold Independent Parameter

True positive rate (TPR) = Sensitivity ; False positive rate (FPR) = (1 – specificity)
Regression Method
Actual MP    Predicted MP
  12.5           14.0
  67.0           71.3
  71.2           68.7
 115.9          121.0
  32.7           29.8
  45.7           49.3
  79.8           76.8
 127.3          125.1
  57.6           50.2
  37.2           33.8
Σ = 646.90    Σ = 640.0
Σ(MP²) = 53580.21    Σ(MP²) = 53169.64

R² = 1 − [ Σ i=1..n (MPact − MPpred)² ] / [ Σ i=1..n (MPact − mean(MPact))² ]

Q² = 1 − [ Σ i=1..n (MPact − MPpred)² ] / [ Σ i=1..n (MPact − MPtrain)² ],
where MPtrain = (1/m) Σ i=1..m MPact (mean over the training set)

MAE = (1/n) Σ i=1..n | MPact − MPpred |

RMSECV = √[ (1/M) Σ i=1..M (RMSE)² ]
MEASURING THE MODEL ACCURACY

Evaluation of regression-
based methods

• Pearson's correlation (R): measure relation between actual/observed and


predicted values
• Coefficient of determination (R2): Coefficient of determination is the statistical
parameter for proportion of variability in model.
• Q2 is another very important statistical parameter for the determination of
variability in model.
• MAE/AAE is the mean of absolute errors between actual and predicted values.
• RMSECV is the aggregate root mean squared error of the cross-validation.
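A minimal sketch computing R, R², MAE and RMSE for the actual/predicted values in the table above, using scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Actual and predicted values from the table on the previous slide
actual = np.array([12.5, 67.0, 71.2, 115.9, 32.7, 45.7, 79.8, 127.3, 57.6, 37.2])
pred   = np.array([14.0, 71.3, 68.7, 121.0, 29.8, 49.3, 76.8, 125.1, 50.2, 33.8])

r = np.corrcoef(actual, pred)[0, 1]                  # Pearson's correlation
r2 = r2_score(actual, pred)                          # coefficient of determination
mae = mean_absolute_error(actual, pred)              # mean absolute error
rmse = np.sqrt(mean_squared_error(actual, pred))     # root mean squared error

print(f"R={r:.3f}  R2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```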
Statistical Measures
• z-Test- The Z-test compares sample and population means
to determine if there is a significant difference.
• t-Test: The t-test assesses whether the means of two groups
are statistically different from each other. This analysis is
appropriate whenever you want to compare the means of
two groups.
• P-test: It tests the validity of the null hypothesis which
states a commonly accepted claim about a population.
Sklearn Examples
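The original slide presumably shows screenshots; a compact end-to-end sketch in the spirit of this course (composition features for toy sequences, an SVM, cross-validated MCC) is given below. Every sequence, label and parameter here is invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, matthews_corrcoef

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    # amino acid composition: fraction of each residue in the sequence
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

# Toy positive (label 1) and negative (label 0) sequences -- invented for illustration
positives = ["MLGHLPGHGTRALGRHLGHLPG", "GGGRDESEWERWAAGHCDSA", "RKKRRQRRRPPQGSQTHQVSL"]
negatives = ["TGHHHSSAAQWERAASWERY", "LLMNVCDFGHTYRRRWWWAA", "AAAAAGGGGGLLLLLVVVVV"]

X = np.array([aac(s) for s in positives + negatives])
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = SVC(kernel="rbf")
# 3-fold cross-validation with MCC as the scoring function
scores = cross_val_score(clf, X, y, cv=3, scoring=make_scorer(matthews_corrcoef))
print("MCC per fold:", scores)
```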
