Machine Learning for Biomedical Applications
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Web Site:
http://webs.iiitd.edu.in/raghava/
Welcome to BIO542(MLBA)
• Course : Machine learning for biomedical Applications
• Instructor: Prof. G. P. S. Raghava ([email protected], [email protected])
• TAs
• Dilraj Kaur ([email protected]), Sanjay K. Mohanty ([email protected]), Pradeep Singh
([email protected]), Shreya Mishra ([email protected]), Shalini Sharma
([email protected])
• Important URLs & email
• Mailing list : [email protected]
• Google Classroom: joining code goqqfsn
• https://classroom.google.com/c/Mzc5MzczNDc3NTE1?cjc=goqqfsn
• Website: http://webs.iiitd.edu.in/raghava/
• Please go through the academic dishonesty policy very carefully
• https://www.iiitd.ac.in/education/resources/academicdishonesty
• Visiting hours: Students may visit between 4:30 and 5:30 PM (A-302, New Academic Building)
for any questions/doubts/discussion. Check availability in advance.
Course Description
This course is designed for students from a wide range of
backgrounds, such as biology, medical science, pharmacology,
bioinformatics and computer science. The course is divided into the
following three sections: i) major challenges in the field of
biomedical science, ii) introduction to and implementation of
machine learning techniques for developing prediction
models, and iii) solving biomedical problems using machine
learning techniques. This course will help students
develop novel methods for solving real-life problems in the
field of biological and health sciences. An attempt will be made to
bridge the gap between students and world-class researchers;
students will be exposed to highly accurate methods based on
machine learning techniques (research papers).
Post conditions
(Expectations from students after course)
• Knowledge of biomedical applications
• Classification of biomolecules
• Prediction of inhibitors/drugs
• Models for predicting disease associated genes
• Image-based disease classification
• Ability to develop models using machine learning techniques
• Major ML techniques: SVM, ANN, Random Forest, KNN & HMM
• Feature engineering
• Generating features for biomolecules and biomedical images.
• Feature selection techniques
• Dimension reduction techniques (e.g., PCA)
• Evaluation of prediction/classification models
• Parameters for measuring performance of models
• Cross-validation techniques for training/testing
• Internal/ external validation of models
Performance Evaluation
Group Activities
Assignments: 20% Group activity
Individual Activities
Mid-sem Exam: 30% Individual
End-sem Exam: 30% Individual
Quiz: 20% Individual
Performance Evaluation
(Group activity)
Assignments: There will be two assignments of 10 marks
each. Assignments will be submitted by groups of at most
three students and will be based on Kaggle in-class
competitions.
Performance Evaluation
(Individual Activity)
• Quiz: A total of three quizzes will be conducted in class. The best
two will be used for evaluation; the weightage will be 20%.
• Mid-sem Exam: An online exam will be conducted; the
weightage will be 30%.
• End-sem Exam: An online exam will be conducted; the
weightage will be 30%.
Week-wise plan
Causes of Diseases
• Disease-associated pathogens (virus, bacteria, fungus, etc.)
• Disorder or malfunction (e.g., cancer)
• Malnutrition (healthy food)
• Side-effects of drugs
• Mental health & stress
Possible Solutions
• Understanding biology at the genome level
• Drugs, particularly against drug-resistant diseases
• Subunit or epitope-based vaccines
• Disease biomarkers for early detection
• Drug biomarkers
Biomedical Applications
Concept level: ★Proteome annotation ★Drug discovery ★Vaccine design ★Biomarkers
Molecules or objects:
• Proteins & peptides: structure prediction, subcellular localization, therapeutic applications, ligand binding
• Gene expression: disease biomarkers, drug biomarkers, mRNA expression, copy number variation
• Chemoinformatics: drug design, chemical descriptors, QSAR models, personalized inhibitors
• Image annotation: image classification, medical images, disease classification, disease diagnostics
Five kingdoms of living organisms
Cell: the minimum unit of life
(Figure: example cells such as a neuron, Paramecium, Chlamydomonas, Saccharomyces cerevisiae and Helicobacter pylori.)
• Genes
• DNA sequences that encode
proteins
• less than 3% of human
genome
•Transcription
•DNA -> RNA
•Translation
•RNA -> Protein
Central dogma of molecular biology
• mRNA then passes out through the pores of the nucleus, carrying the DNA code, and attaches to a ribosome.
Transcription, Translation and Protein Synthesis
(Figure: transcription of DNA into mRNA, followed by translation; polypeptide = protein.)
Summary
• mRNA levels indirectly measure gene activity
• Gene expression: the activity of a gene can be determined by the presence of its complementary mRNA
• Cells differ in which DNA (genes) is active at any one time
Gene Expression
(Microarray figure: a labelled sample is hybridized to probes on a chip, producing a pseudo-colour image.)
DNA sequencing
• Sanger sequencing techniques
• Maxam–Gilbert sequencing (1977-80)
• Pyrosequencing (1993)
• Next generation sequencing techniques
Genome Gallery
Genome size of important species
Species | Chromosomes | Genome size (bp)
Bacteriophage λ (virus) | 1 | 5×10⁴
Escherichia coli | 1 | 5×10⁶
S. cerevisiae (yeast) | 32 | 1×10⁷
Caenorhabditis elegans (worm) | 12 | 5×10⁸
D. melanogaster (fruit fly) | 8 | 2×10⁸
Homo sapiens (human) | 46 | 3×10⁹
(Overview figure of the 'omics' layers: Glycomics (sugars), Lipidomics (lipids), Metabolomics, Epigenomics (chromosomes, 23 pairs, with methylation and acetylation marks) and Proteomics (amino acid sequences).)
Abstract
Interleukin 6 (IL-6) is a pro-inflammatory cytokine that stimulates acute phase responses, hematopoiesis
and specific immune reactions. Recently, it was found that IL-6 plays a vital role in the progression of
COVID-19, which is responsible for the high mortality rate. In order to facilitate the scientific community
to fight against COVID-19, we have developed a method for predicting IL-6 inducing peptides/epitopes.
The models were trained and tested on experimentally validated 365 IL-6 inducing and 2991 non-inducing
peptides extracted from the immune epitope database. Initially, 9149 features of each peptide were
computed using Pfeature, which were reduced to 186 features using the SVC-L1 technique. These features
were ranked based on their classification ability, and the top 10 features were used for developing
prediction models. A wide range of machine learning techniques has been deployed to develop models.
Random Forest-based model achieves a maximum AUROC of 0.84 and 0.83 on training and independent
validation datasets, respectively. We have also identified IL-6 inducing peptides in different proteins of
SARS-CoV-2, using our best models, to design vaccines against COVID-19. A web server named IL6Pred
and a standalone package have been developed for predicting, designing and screening IL-6 inducing
peptides (https://webs.iiitd.edu.in/raghava/il6pred/).
Material and methods
•Dataset preparation and pre-processing
•We extracted 583 experimentally validated IL-6 inducing peptides from the IEDB database.
•We removed all identical peptides and peptides longer than 25 amino acids.
•Finally, we obtained 365 IL-6 inducing peptides and 2991 non-IL-6 inducing peptides.
Name of descriptor | Description of descriptor | Number of features (length of vector)
AAC | Amino acid composition | 20
DPC | Dipeptide composition | 400
TPC | Tripeptide composition | 8000
ABC | Atomic and bond composition | 9
RRI | Residue repeat information | 20
DDOR | Distance distribution of residues | 20
SE | Shannon entropy of protein | 1
SER | Shannon entropy of all amino acids | 20
SEP | Shannon entropy of physicochemical properties | 25
CTD | Conjoint triad calculation of the descriptors | 343
CeTD | Composition-enhanced transition distribution | 187
PAAC | Pseudo amino acid composition | 23
APAAC | Amphiphilic pseudo amino acid composition | 29
We used the SVC-L1-based feature selection technique, which implements a support vector classifier (SVC) with a linear kernel penalized with L1 regularization. We used SVC-L1 because it performs well at selecting the best features from a very large feature vector and is extremely fast compared with other techniques [44]. Its primary purpose is to minimize an objective function that combines the loss function and the regularization term. The SVC-L1 method keeps the non-zero coefficients, applying the L1 penalty to select relevant features and reduce dimensionality. L1 regularization creates sparse models during the optimization process by driving some of the coefficients to exactly zero. The 'C' parameter regulates the sparsity and is directly proportional to the number of selected features: the lower the value of 'C', the fewer features are selected. We used the default value of 0.01 for parameter 'C' [45]. Based on this technique, 186 important features (Supplementary Table S1) were identified from the 9149-feature set.
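A minimal, hypothetical sketch of such an SVC-L1 selection step, assuming scikit-learn; the data here is a random placeholder with the same shape as described above, not the actual peptide dataset.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X = np.random.rand(200, 9149)          # placeholder feature matrix (peptides x Pfeature features)
y = np.random.randint(0, 2, 200)       # placeholder labels (IL-6 inducing vs non-inducing)

# The L1 penalty drives many coefficients to exactly zero; C controls the sparsity
svc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=10000).fit(X, y)
selector = SelectFromModel(svc, prefit=True)
X_selected = selector.transform(X)     # keep only features with non-zero coefficients
print(X_selected.shape)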
After that, these 186 features were ranked based on their importance in classifying peptides using the program feature-selector. The program feature-selector ranks features using a decision-tree-based algorithm, Light Gradient Boosting Machine, which scores each feature by the number of times it is used to split the data across all trees [46]. These top-ranked features were examined to understand the nature of IL-6 inducing peptides. Furthermore, we applied machine learning to the selected features and computed the performance on the top 10, 20, 30, …, and 186 features, respectively.

Architecture of web server
A web server named 'IL6Pred' (https://webs.iiitd.edu.in/raghava/il6pred) was developed to predict IL-6 inducing and non-inducing peptides. The front end of the web server was developed using HTML5, JAVA, CSS3 and PHP scripts. It is based on responsive templates which adjust to the screen size of the device, so it is compatible with almost all modern devices such as mobiles, tablets, iMacs and desktops. The web server incorporates five major modules: Predict, Design, Protein Scan, Motif Scan and Blast Scan.
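A hypothetical sketch of the LightGBM-based ranking step described above (ranking features by how often they are used for splits). It assumes the lightgbm package; the data is a random placeholder.

import numpy as np
from lightgbm import LGBMClassifier

X = np.random.rand(200, 186)           # the 186 selected features (placeholder values)
y = np.random.randint(0, 2, 200)

model = LGBMClassifier(importance_type="split").fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]   # split counts, highest first
print(ranking[:10])                    # indices of the top-10 features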
Results
In this study, we used 365 peptides as the positive dataset, which can induce IL-6 cytokine. The negative dataset includes 2991 peptides, which do not induce IL-6 cytokine. All the analyses and predictions were performed on the IL-6 inducing and non-inducing epitopes or peptides.

Cross-validation
We used the 5-fold cross-validation and external validation technique to train, test and evaluate our prediction models. In the past, several studies used an 80:20 proportion for splitting the complete dataset into training and validation datasets [47-50]. We also used this standard protocol in this study, where 80% of the data (i.e. 292 IL-6 inducing and 2393 non-IL-6 inducing peptides) was used for training and the remaining 20% (i.e. 73 IL-6 inducing and 598 non-IL-6 inducing peptides) was used for external validation. We then implemented the standard 5-fold cross-validation evaluation technique, which has frequently been used in previous studies [51, 52]. First, the entire training dataset is divided into five equivalent sets or folds, with all 5 folds having the same number of positive and negative examples. Then, 4 folds are used for training while the fifth fold is utilized for testing. This procedure is iterated five times so that each set is used for testing.

Positional analysis
In this analysis, we studied the preference for particular amino acids at specific positions in the peptide string; we created a two-sample logo (TSL) for the IL-6 inducing (positive) and non-inducing (negative) peptides, as represented in Figure 2. The most significant amino acid residue represents the relative abundance in the sequence. It is important to note that the first eight positions represent the N-terminal residues of the peptides and the last eight positions represent the C-terminus. We observed that the residue 'L' is mostly preferred at the 2nd, 4th, 5th, 6th, 7th, 10th, 11th, 12th, 13th, 14th, 15th and 16th positions in the IL-6 inducing peptides, which means that 'L' is preferred in both N-terminal and C-terminal residues. Besides, residue 'I' is found to be most abundant at positions 1, 4 and 7 in IL-6 inducing peptides, meaning that 'I' is preferred in N-terminal residues. On the other hand, residue 'A' dominates at the 4th, 8th and 16th positions in non-IL-6 inducing peptides.
Evaluation parameters
In order to evaluate the efficiency of different prediction models, we used well-established evaluation parameters. In this study, we used both threshold-dependent and threshold-independent parameters. The threshold-dependent parameters, such as sensitivity (Sens), specificity (Spec) and accuracy (Acc), were measured with the help of the following equations:

Sensitivity = TP / (TP + FN) × 100    (1)
Specificity = TN / (TN + FP) × 100    (2)
Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100    (3)

where TP = true positive, FP = false positive, TN = true negative and FN = false negative. We also used the standard threshold-independent parameter, the Area Under the Receiver Operating Characteristic (AUROC) curve, to measure the performance of the models. The AUROC curve is generated by plotting sensitivity against (1 − specificity) at various thresholds.

Compositional analysis
In this analysis, we computed the amino acid composition (AAC) for both the positive and negative datasets. The average composition of IL-6 inducing and non-inducing peptides is shown in Figure 3. The average composition of residues such as I, L and S is higher in IL-6 inducing peptides than in non-IL-6 peptides, whereas residues such as A, D and G are more abundant in non-IL-6 peptides as compared with IL-6 inducing peptides.
Prediction models
Machine learning-based prediction models
We developed prediction models using various classifiers such as RF, DT, GNB, XGB and LR. Firstly, we computed the features of the IL-6 inducers and non-inducers using the Pfeature composition-based module. A total of 9149 features were generated by Pfeature, and we then applied the SVC-L1 feature selection technique to select the most relevant features, i.e. 186 features, as shown in Supplementary Table S1. With this feature set, we applied various machine learning models. RF attains maximum performance with AUROC 0.893 and 0.863 and accuracy 75.79 and 73.32 on the training and validation datasets, respectively, with balanced sensitivity and specificity. XGB also performed well on the training and validation datasets with AUROC 0.87 and 0.82 and accuracy 86.29 and 84.65, respectively, but with a considerable difference between sensitivity and specificity. Other classifiers, such as DT, LR, KNN and GNB, perform poorly on the training and validation datasets, as represented in Table 2.

Our aim was to obtain models with a minimum number of features that discriminate between IL-6 inducers and non-inducers with high AUROC and accuracy. Therefore, we built different models on the top (10, 20, 30, …, and 186) features, respectively, and evaluated their performance on the training and validation datasets. In order to understand the difference between the positive and negative datasets, we computed the average values of the top-10 features of IL-6 inducing and non-inducing peptides, as represented in Table 3. The top-10 selected features have reasonable discriminatory power in terms of AUROC and accuracy. RF achieves maximum performance with accuracy (77.39 and 73.47) and AUROC (0.84 and 0.83) on the training and validation datasets with balanced sensitivity and specificity, respectively, as represented in Table 4 and Figure 4. The performance of the 10, 20, 30, …, and 186 selected feature sets is provided in Supplementary Table S2.
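A hypothetical sketch of the kind of protocol described above (Random Forest on top-10 features, 80:20 split, AUROC). It assumes scikit-learn; the feature values are random placeholders, only the class sizes mirror the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X = np.random.rand(3356, 10)                   # 365 positives + 2991 negatives, top-10 features (placeholder)
y = np.array([1] * 365 + [0] * 2991)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1]))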
# open a file, read a single line, then close it
f = open("foo", "r")
line = f.readline()
print(line, end="")
f.close()
Files: Input
input_file = open('data', 'r')   # open the file 'data' for reading
• The + operator produces a new tuple, list, or string whose value is the concatenation of its arguments.
• The * operator produces a new tuple, list, or string that "repeats" the original content.
>>> (1, 2, 3) * 3
(1, 2, 3, 1, 2, 3, 1, 2, 3)
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> "Hello" * 3
'HelloHelloHello'
Methods in string
• upper(), lower(), capitalize()
• count(s), find(s), rfind(s), index(s)
• strip(), lstrip(), rstrip()
• replace(a, b), expandtabs()
• split(), join()
• center(), ljust(), rjust()
Lists are mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
• Potentially confusing:
• extend takes a list as an argument.
• append takes a singleton as an argument.
>>> li.append([10, 11, 12])
>>> li
[1, 2, 'i', 3, 4, 5, 'a', 9, 8, 7, [10, 11, 12]]
Operations on Lists Only
Lists have many methods, including index, count, remove, reverse, sort
>>> li = ['a', 'b', 'c', 'b']
>>> li.index('b')     # index of first occurrence
1
>>> li.count('b')     # number of occurrences
2
>>> li.remove('b')    # remove first occurrence
>>> li
['a', 'c', 'b']
Operations on Lists Only
>>> li = [5, 2, 6, 8]
>>> li.sort(key=some_function)
# sort in place, ordering elements by a user-defined key function
Operations in List
▪ Methods: append, insert, index, count, sort, reverse, remove, pop, extend
• Indexing, e.g., L[i]
• Slicing, e.g., L[1:5]
• Concatenation, e.g., L + L
• Repetition, e.g., L * 5
• Membership test, e.g., 'a' in L
• Length, e.g., len(L)
List vs. Tuple
• What are common characteristics?
• Both store arbitrary data objects
• Both are of sequence data type
• What are the differences?
• Tuples don't allow modification
• Tuples have almost no methods (only count and index)
• Tuples support format strings
• Tuples support variable-length parameters in function calls
• Tuples are slightly faster
Summary: Tuples vs. Lists
• Lists slower but more powerful than tuples
• Lists can be modified, and they have lots of handy operations and methods
• Tuples are immutable and have fewer features
• To convert between tuples and lists use the list() and tuple()
functions:
li = list(tu)
tu = tuple(li)
Python Libraries
NumPy
● NumPy is the fundamental package needed for scientific
computing with Python. It contains:
● a powerful N-dimensional array object
● basic linear algebra functions
● basic Fourier transforms
● sophisticated random number capabilities
● tools for integrating Fortran code
● tools for integrating C/C++ code
● Official documentation
● http://docs.scipy.org/doc/
● The NumPy book
● http://web.mit.edu/dvp/Public/numpybook.pdf
● Example list
● https://docs.scipy.org/doc/numpy/reference/routines.html
More about Numpy
●Python does numerical computations slowly: for a 1000 x 1000 matrix multiply, a pure-Python triple loop takes > 10 min, while NumPy takes ~0.03 seconds.
●Applications of NumPy
● Mathematics (alternative to MATLAB)
● Plotting (Matplotlib)
● Backend (Pandas)
● Machine learning (TensorFlow)
●Comparison with lists (structured lists of numbers)
● Faster than lists
● Fewer operations (a*b)
● Less memory
● Convenient to use
●NumPy arrays naturally represent vectors, matrices, images, tensors, ConvNets
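A small timing sketch (not from the slides) illustrating the loop-versus-NumPy difference mentioned above; the matrix size is reduced so the pure-Python loop finishes in a few seconds.

import time
import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.time()
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]   # triple loop
t1 = time.time()
D = A @ B                      # NumPy matrix multiplication
t2 = time.time()
print("loop: %.2f s, numpy: %.4f s" % (t1 - t0, t2 - t1))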
Arrays – Numerical Python (Numpy)
● Lists are OK for storing small amounts of one-dimensional data
>>> a = [1,3,5,7,9]
>>> print(a[2:4])
[5, 7]
>>> b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
>>> print(b[0])
[1, 3, 5, 7, 9]
>>> print(b[1][2:4])
[6, 8]
● But "+" concatenates lists rather than adding them element-wise
>>> a = [1,3,5,7,9]
>>> b = [3,5,6,7,9]
>>> c = a + b
>>> print(c)
[1, 3, 5, 7, 9, 3, 5, 6, 7, 9]

# or directly as a matrix
>>> M = numpy.array([[1, 2], [3, 4]])
>>> M.shape
(2, 2)
>>> M.dtype
dtype('int64')
NumPy functions
min(), max(), abs(), add(), multiply(), binomial(), polyfit(), cumprod(), cumsum(), randint(), shuffle(), floor(), histogram(), transpose()

# as vectors from lists
>>> a = numpy.array([1,3,5,7,9])
>>> b = numpy.array([3,5,6,7,9])
>>> c = a + b
>>> print(c)
[ 4  8 11 14 18]
>>> c.shape
(5,)
Numpy – ndarray attributes
⮚ndarray.ndim: the number of axes (dimensions) of the array i.e. the rank.
⮚ndarray.shape: the dimensions of the array. This is a tuple of integers indicating the size of the
array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length
of the shape tuple is therefore the rank, or number of dimensions, ndim.
⮚ndarray.size: the total number of elements of the array, equal to the product of the elements of
shape.
⮚ndarray.dtype: an object describing the type of the elements in the array. One can create or
specify dtypes using standard Python types. NumPy also provides many of its own, for example bool_,
int8, int16, int32, int64, float16, float32, float64, complex64, complex128 and object_.
⮚ndarray.itemsize: the size in bytes of each element of the array. E.g. for elements of type float64,
itemsize is 8 (= 64/8), while for int16 it is 2 (= 16/8) (equivalent to ndarray.dtype.itemsize).
⮚ndarray.data: the buffer containing the actual elements of the array. Normally, we won't need to
use this attribute because we will access the elements in an array using indexing facilities.
Numpy – array methods - sorting
>>> arr = numpy.array([4.5, 2.3, 6.7, 1.2, 1.8, 5.5])
>>> arr.sort() # acts on array itself
>>> print(arr)
[ 1.2 1.8 2.3 4.5 5.5 6.7]
>>> x = numpy.array([4.5, 2.3, 6.7, 1.2, 1.8, 5.5])
>>> numpy.sort(x)
array([ 1.2, 1.8, 2.3, 4.5, 5.5, 6.7])
>>> print(x)
[ 4.5 2.3 6.7 1.2 1.8 5.5]
>>> s = x.argsort()
>>> s
array([3, 4, 1, 0, 5, 2])
>>> x[s]
array([ 1.2, 1.8, 2.3, 4.5, 5.5, 6.7])
>>> y = numpy.array([1.5, 2.3, 4.7, 6.2, 7.8, 8.5])   # a second array, reordered with the same indices
>>> y[s]
array([ 6.2, 7.8, 2.3, 1.5, 8.5, 4.7])
SciPy: Python Library for Science/Engineering
# example: linear least-squares data fitting with scipy.linalg.lstsq
from numpy import exp, newaxis, random, r_, c_
from scipy import linalg
import matplotlib.pyplot as plt

c1, c2 = 5.0, 2.0
i = r_[1:11]
xi = 0.1*i
yi = c1*exp(-xi) + c2*xi
zi = yi + 0.05*max(yi)*random.randn(len(yi))      # noisy observations
A = c_[exp(-xi)[:, newaxis], xi[:, newaxis]]      # design matrix
c, resid, rank, sigma = linalg.lstsq(A, zi)
xi2 = r_[0.1:1.0:100j]
yi2 = c[0]*exp(-xi2) + c[1]*xi2
plt.plot(xi, zi, 'x', xi2, yi2)
plt.axis([0, 1.1, 3.0, 5.5])
plt.xlabel('$x_i$')
plt.title('Data fitting with linalg.lstsq')
plt.show()
Example: Linear regression
import numpy as np
from scipy import stats
import pylab as plt

n = 50                              # number of points
x = np.linspace(-5, 5, n)           # create x axis data
a, b = 0.8, -4
y = np.polyval([a, b], x)
yn = y + np.random.randn(n)         # add some noise
(ar, br) = np.polyfit(x, yn, 1)
yr = np.polyval([ar, br], x)
err = np.sqrt(sum((yr - yn)**2)/n)  # compute the mean square error
print('Linear regression using polyfit')
print('Input parameters: a=%.2f b=%.2f' % (a, b))
print('Regression: a=%.2f b=%.2f, ms error= %.3f' % (ar, br, err))
plt.title('Linear Regression Example')
plt.plot(x, y, 'g--')
plt.plot(x, yn, 'k.')
plt.plot(x, yr, 'r-')
plt.legend(['original', 'plus noise', 'regression'])
plt.show()

(a_s, b_s, r, p, stderr) = stats.linregress(x, yn)
print('Linear regression using stats.linregress')
print('Input parameters: a=%.2f b=%.2f' % (a, b))
print('Regression: a=%.2f b=%.2f, std error= %.3f' % (a_s, b_s, stderr))
Example: Least squares fit
from pylab import *
from numpy import *
from matplotlib import *
from scipy.optimize import leastsq
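The slide above shows only the imports; the following is a minimal, hypothetical sketch of how scipy.optimize.leastsq might be used for a least-squares fit (the model and values are illustrative, not from the course).

from numpy import sin, pi, linspace, random
from scipy.optimize import leastsq

def residuals(p, y, x):
    A, k, theta = p
    return y - A * sin(2 * pi * k * x + theta)   # difference between data and model

x = linspace(0, 6, 100)
y_true = 10 * sin(2 * pi * 0.5 * x + pi / 6)
y_meas = y_true + 2 * random.randn(len(x))       # noisy measurements

p0 = [8, 0.4, 0]                                 # initial guess for A, k, theta
p_fit, ok = leastsq(residuals, p0, args=(y_meas, x))
print(p_fit)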
Data Frames attributes
Python objects have attributes and methods.
df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
Data Frames methods
Unlike attributes, Python methods have parentheses.
All attributes and methods can be listed with a dir() function: dir(df)
df.method() description
head( [n] ), tail( [n] ) first/last n rows
When selecting one column, it is possible to use single set of brackets, but the
resulting object will be a Series (not a DataFrame):
In [ ]: #Select column salary:
df['salary']
When we need to select more than one column and/or make the output to be a
DataFrame, we should use double brackets:
In [ ]: #Select column salary:
df[['rank','salary']]
Python Libraries for Data Science
matplotlib:
▪ python 2D plotting library which produces publication quality figures in a
variety of hardcopy formats
Link: https://matplotlib.org/
import matplotlib.pyplot as plt

xs = range(-100, 100, 10)
x2 = [x**2 for x in xs]
negx2 = [-x**2 for x in xs]
plt.plot(xs, x2)                # incrementally modify the figure
plt.plot(xs, negx2)
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-2000, 2000)
plt.axhline(0)                  # horizontal line
plt.axvline(0)                  # vertical line
plt.savefig("quad.png")         # save your figure to a file
plt.show()                      # show it on the screen
Python Libraries for Data Science
Seaborn:
▪ based on matplotlib
Link: https://seaborn.pydata.org/
Why use modules?
● Code reuse
● Routines can be called multiple times within a program
● Routines can be used from multiple programs
● Namespace partitioning
● Group data together with functions used for that data
● Implementing shared services or data
● Can provide global data structure that is accessed by
multiple subprograms
Simple functions: ex.py
"""factorial done recursively and iteratively"""

def fact1(n):
    # iterative version
    ans = 1
    for i in range(2, n + 1):
        ans = ans * i
    return ans

def fact2(n):
    # recursive version
    if n < 1:
        return 1
    else:
        return n * fact2(n - 1)
Simple functions: ex.py
>>> import ex
>>> ex.fact1(6)
720
>>> ex.fact2(200)
78865786736479050355236321393218507…000000
>>> ex.fact1
<function fact1 at 0x902470>
>>> fact1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'fact1' is not defined
Defining a class
# Define class
class thingy:
def __init__(self, value): # defining instance
self.value = value
def showme(self): # defining method
print("value = %s" % self.value)
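A quick usage sketch for the class defined above:

t = thingy(10)      # calls __init__ with value = 10
t.showme()          # prints: value = 10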
Types of regression: simple linear, simple non-linear, multiple linear, multiple non-linear.
Linear Equations
Y = mX + b
where m = slope (change in Y / change in X) and b = Y-intercept.
Linear Regression Model
Yi = β0 + β1·Xi + εi
where Yi is the dependent (response) variable (e.g., CD4+ count) and Xi is the independent (explanatory) variable (e.g., years since seroconversion).
Least Squares
• 'Best fit' means the differences between the actual Y values and the predicted Y values are a minimum:
Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i² is minimized.
(Figure: fitted line Ŷi = β0 + β1·Xi with residuals ε̂1 … ε̂4 between observed points, e.g. Y2 = β0 + β1·X2 + ε2, and the fitted line.)
Coefficient estimation
• Prediction equation: ŷi = β̂0 + β̂1·xi
• Sample slope: β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• Sample Y-intercept: β̂0 = ȳ − β̂1·x̄
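A small numeric sketch of the slope and intercept formulas above; the data values are made up for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)   # SSxy / SSxx
b0 = y.mean() - b1 * x.mean()                                              # intercept
print(b0, b1)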
Multiple Linear Regression Models
Introduction
• For example, suppose that the effective life of a cutting tool depends on the cutting speed and the tool angle. A possible multiple regression model could be
Y = β0 + β1·x1 + β2·x2 + ε
where
Y : tool life
x1 : cutting speed
x2 : tool angle
Categorical Response Variables
Examples:
• Whether or not a person smokes (binary response): Y = Smoker / Non-smoker
• Success of a medical treatment: Y = Survives / Dies
The Logistic Curve
LOGIT(p) = ln[ p / (1 − p) ] = z, and inversely p = exp(z) / (1 + exp(z))
(Figure: the logistic (sigmoid) curve of p (probability) against z (log odds).)
The Logistic Regression Model
Logistic regression:
ln[ P(Y) / (1 − P(Y)) ] = β0 + β1X1 + β2X2 + … + βKXK
Linear regression:
Y = β0 + β1X1 + β2X2 + … + βKXK + ε
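A minimal sketch contrasting the two models above, assuming scikit-learn; the data is a random placeholder.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.random.rand(100, 3)
y_cont = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)   # continuous response Y
y_bin = (y_cont > y_cont.mean()).astype(int)                            # binary response Y

lin = LinearRegression().fit(X, y_cont)
log = LogisticRegression().fit(X, y_bin)
print(lin.coef_, log.coef_)          # estimated beta coefficients
print(log.predict_proba(X[:3]))      # P(Y = 1) through the logistic link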
Ridge Regression
• The effect of this is to add a shrinkage penalty of the form λ·Σj βj² to the least-squares objective.
• Note that when λ = 0, the penalty term has no effect, and ridge regression will produce the OLS estimates. Thus, selecting a good value for λ is critical (cross-validation can be used for this).
The Lasso
• One significant problem of ridge regression is that the penalty term will never
force any of the coefficients to be exactly zero.
• Thus, the final model will include all p predictors, which creates a challenge in
model interpretation
• The lasso works in a similar way to ridge regression, except it uses a different
penalty term that shrinks some of the coefficients exactly to zero.
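A hedged sketch of the shrinkage behaviour described above, assuming scikit-learn and toy data: with an L1 penalty (lasso) some coefficients become exactly zero, while ridge (L2) only shrinks them.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100)       # only 2 informative predictors

ridge = Ridge(alpha=1.0).fit(X, y)                    # alpha plays the role of lambda
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge coefs:", np.round(ridge.coef_, 2))       # all non-zero
print("lasso coefs:", np.round(lasso.coef_, 2))       # most are exactly zero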
Elastic Net Regression
Classification & Prediction
• Handling linear data
• Handling non-linear data
Machine learning for non-linear data:
• Artificial neural networks (ANN)
• Support vector machines (SVM)
• Hidden Markov models (HMM)
• K-nearest neighbors (K-NN)
• Random forest classifiers
Comparison of Supervised, Unsupervised and Reinforcement Learning
Supervised | Unsupervised | Reinforcement
Trained using labeled data | Trained on unlabeled data | Learning based on feedback
Regression or classification | Clustering and association | Real-time learning (games, robots)
Labeled data for training | No labeled data (patterns) | No predefined data (learns from scratch)
Supervision | No supervision | No supervision
Forecast outcomes | Discover patterns | Learn like a child from actions
Map input data to output labels | Understand patterns | Learn using trial and error
Non-linearity
Example: A simple single unit adaptive network
• The network has 2 inputs,
and one output. All are
binary. The output is
• 1 if W0I0 + W1I1 + Wb > 0
• 0 if W0I0 + W1I1 + Wb ≤ 0
Hidden layers Neural Networks
• Layers of nodes
• Input is transformed into
numbers
• Weighted averages are fed into
nodes
• High or low numbers come out
of nodes
• A Threshold function determines
whether high or low
• Output nodes will “fire” or not
• Determines classification
• For an example
Backpropagation neural network
Deep neural network is simply a feedforward network
Introduction to Support Vector Machine (SVM)
• How about… mapping the data to a higher-dimensional space (e.g., x → x²)?
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0
f(x) = sign(wTx + b)
Maximum Margin Classification
• Maximizing the margin is good according to intuition and PAC
theory.
• Implies that only support vectors are important; other training
examples are ignorable.
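A minimal sketch of a maximum-margin classifier with a non-linear kernel, corresponding to the mapping Φ(x) idea above; it assumes scikit-learn and toy two-class data.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)        # not linearly separable in 2D

clf = SVC(kernel="rbf", C=1.0).fit(X, y)             # the kernel implicitly maps to a higher-dimensional space
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))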
K-NEAREST NEIGHBOR METHOD (KNN)
• Classification of unknown object based on similarity/distance of
annotated object
• Search similar objects in a database of known objects
• Different names
• Memory based reasoning
• Example based learning
• Instance based reasoning
• Case based reasoning
KNN – Number of Neighbors
• If K=1, select the nearest neighbor
• If K>1,
• For classification select the most frequent neighbor.
• Voting or average concept
• Preference/weight to similarity/distance
• For regression calculate the average of K neighbors.
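A small sketch of K-NN classification and the effect of K, assuming scikit-learn and toy data (the weights="distance" option gives more weight to closer neighbours, as described above).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [0] * 50)

for k in (1, 5):                       # K = 1 uses the single nearest neighbour; K > 1 votes
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X, y)
    print(k, knn.predict([[0.5, 0.5]]))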
Weight to Instance
• Not all instances or examples are equally reliable
• Weight an instance based on its success in prediction
Random Forest Algorithm
• Random forest (or random forests) is an ensemble classifier that consists of
many decision trees
• Tin Kam Ho of Bell Labs in 1995, proposed random decision forests
Decision trees are one of the most popular learning methods.
One type of decision tree is called CART… classification and regression tree.
CART … greedy, top-down binary, recursive partitioning, that divides feature space
into sets of disjoint rectangular regions.
To 'play tennis' or not.
A new test example: (Outlook == rain) and (Windy == false).
Pass it down the tree: the decision is yes.
(Figure: decision tree with Outlook {sunny, overcast, rain} at the root, Humidity and Windy as internal nodes, and Yes/No leaves.)
Decision trees involve greedy, recursive partitioning.
Random Forest Classifier
• Training data: N examples, M features
• Draw a bootstrap sample of the training data and construct a decision tree from each bootstrap sample
• At each node, choose the split feature from only m < M randomly chosen features
• Repeat to build many trees; for a new example, take the majority vote over all trees
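A sketch mapping the steps above onto scikit-learn's RandomForestClassifier (assumed here): bootstrap samples of the N examples and a random subset of m < M features per split; the data is a toy placeholder.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 20)                                 # N = 300 examples, M = 20 features
y = (X[:, 0] + X[:, 3] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=100,        # number of trees
    bootstrap=True,          # each tree sees a bootstrap sample
    max_features="sqrt",     # m = sqrt(M) features considered at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))        # prediction = majority vote over the trees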
Artificial Neural Network
&
Hidden Markov Model
Prof. Gajendra P.S. Raghava
Head, Department of Computational Biology
Non-linearity
Example: A simple single unit adaptive network
• The network has 2 inputs,
and one output. All are
binary. The output is
• 1 if W0I0 + W1I1 + Wb > 0
• 0 if W0I0 + W1I1 + Wb ≤ 0
Perceptrons
• Initial proposal of connectionist networks
• Rosenblatt, 50's and 60's
• Essentially a linear discriminant composed of nodes and weights
(Figure: inputs I1, I2, I3 with weights W1, W2, W3 feeding a single output node O through an activation function.)
O = 1 if Σi wi·Ii + θ ≥ 0, and 0 otherwise.
Perceptron Example
(Figure: a two-input perceptron with inputs 2 and 1, weights 0.5 and 0.3, and threshold θ = −1.)
Learning Procedure:
• 67,1,4,120,229,…, 1
• 37,1,3,130,250,… ,0
• 41,0,2,130,204,… ,0
• The error is measured between the target output T and the actual output O, as calculated by
D = (1/2) ΣP (TP − OP)²
E.g. if we have two patterns and T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5, then D = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145.
LMS Gradient Descent
• Using LMS, we want to minimize the error. We can do this by finding the direction on the error surface that most rapidly reduces the error rate; this means finding the slope of the error function by taking the derivative. The approach is called gradient descent (similar to hill climbing).
To compute how much to change the weight for link k (with Oj = f(Σk Ik·Wk)):
Δwk = −c · ∂Error/∂wk
Chain rule: ∂Error/∂wk = (∂Error/∂Oj) · (∂Oj/∂wk)
∂Oj/∂wk = Ik · f′(ActivationFunction(Σk Ik·Wk))
∂Error/∂Oj = ∂/∂Oj [ (1/2) ΣP (TP − OP)² ] = ∂/∂Oj [ (1/2) (Tj − Oj)² ] = −(Tj − Oj)
(we can remove the sum since we are taking the partial derivative with respect to Oj)
Putting these together: Δwk = −c · ( −(Tj − Oj) ) · Ik · f′(ActivationFunction)
Optimizing concave/convex functions
• Maximum of a concave function = minimum of a convex function
• Gradient ascent (concave) / gradient descent (convex)
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
• Forced output to 0 or 1, while LMS uses the net output
• Guaranteed to separate, if no error and is linearly separable
• Otherwise it may not converge
• Gradient Descent Method:
• May oscillate and not converge
• May converge to wrong answer
• Will converge to some minimum even if the classes are not linearly
separable, unlike the earlier perceptron training method
Activation functions
• The activation function is generally non-linear. Linear functions are limited
because the output is simply proportional to the input.
Hidden layers Neural Networks
• Layers of nodes
• Input is transformed into
numbers
• Weighted averages are fed into
nodes
• High or low numbers come out
of nodes
• A Threshold function determines
whether high or low
• Output nodes will “fire” or not
• Determines classification
• For an example
Backpropagation neural network
Example of cascade neural network
Deep neural network is simply a feedforward network
Example of Markov Model
(Figure: a two-state transition diagram between 'Rain' and 'Dry'.)
• Two states : ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3 , P(‘Dry’|‘Rain’)=0.7 ,
P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .
Markov Chain Models
• given some sequence x of length L, we can ask how probable the sequence is given our model
• for any probabilistic model of sequences, we can write this probability as
P(x) = P(xL | xL−1, …, x1) · P(xL−1 | xL−2, …, x1) · … · P(x1)
• key property of a (1st order) Markov chain: the probability of each Xi depends only on Xi−1, so the above simplifies to
P(x) = P(x1) · Πi=2..L P(xi | xi−1)
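A tiny sketch applying the Rain/Dry example above: the probability of an observed sequence under a first-order Markov chain, P(x) = P(x1) · Π P(xi | xi−1).

init = {"Rain": 0.4, "Dry": 0.6}
trans = {("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,
         ("Dry", "Rain"): 0.2, ("Dry", "Dry"): 0.8}    # P(next | current)

def sequence_probability(states):
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))   # 0.6*0.8*0.2*0.3 = 0.0288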
Higher Order Markov Chains
Protein Annotation
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
(Flowchart: query sequence → BLAST search against a sequence database → retrieve homologues.)
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large set of well-annotated proteins
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
Adopted from Internet
https://webs.iiitd.edu.in/raghava/pfeature/
Compositional Similarity (Composition)
• Correlation between compositions
• Compositional distance
  • Euclidean distance: dist = sqrt( Σk=1..n (pk − qk)² )
  • Minkowski distance
  • Manhattan distance
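A small sketch of composition-based similarity: computing the amino acid composition of two sequences and the Euclidean distance between the composition vectors (toy sequences, plain Python).

import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]   # fraction of each residue

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

p = aac("MKTAYIAKQR")
q = aac("MKTAYLAKQK")
print(round(euclidean(p, q), 3))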
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large set of well-annotated proteins
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
Feature based annotation
(Flowchart: sequence → find patterns against a pattern database → parse features.)
Example pattern annotations:
Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
Pattern/motif databases:
• PROSITE - http://www.expasy.ch/
• BLOCKS - http://blocks.fhcrc.org/
• DOMO - http://www.infobiogen.fr/services/domo/
• PFAM - http://pfam.wustl.edu/
• PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS/
Major Methods for Annotating Proteins
• Similarity search techniques (BLAST)
• Database scanning using BLAST, FASTA
• It requires a large set of well-annotated proteins
• Sequence composition
• Simple statistical/mathematical methods
• Sequence features, profiles or motifs
• Sophisticated sequence analysis tools
• Prediction or Classification models
• Application of Artificial Intelligence
What is Subcellular Localization?
⚫ Organelles
⚫ Membranes
⚫ Compartments
⚫ Micro-environments
Prediction of molecular interactions in proteins
Computational methods for predicting protein interactions
Protein level annotation:
➢ DNA binding proteins
➢ RNA binding proteins
➢ Ligand binding proteins
➢ Interacting pairs of proteins
Residue level annotation (prediction/identification of):
➢ DNA interacting residues
➢ ATP interacting residues
➢ RNA interacting residues
➢ Glycosylation sites
Structure based techniques:
➢ Docking techniques
➢ Need structure
➢ Any type of interaction
➢ Time consuming
Sequence based techniques:
➢ Generation of features
➢ Classification techniques
➢ Knowledge based techniques
➢ Need data for training
➢ Suitable for high throughput
Prediction of Interaction: Case studies
Dipeptide composition
Example of protein-protein interaction (amino acid composition features):
Protein | Amino acid composition (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) | Label
P1 | … | 1
P2 | … | 1
P3 | … | 0
P4 | … | 0
P5 | … | ? (to be predicted)
Protein features and
vector encoding
• Amino Acid Composition: 20
• Split (4 part): 20*4 = 80
• Dipeptide Composition : 400
• Evolutionary information
• PSI-BLAST
• Search against NR
• E-value 0.01
• PSSM profile
• PSSM-400
• PSSM-420
• PSSM-21
Performance of Different Models
Prediction of Interaction: Case
studies
Dipeptide Composition
Example of Composition based Features
First Protein Second Protein Interaction
Creation of dataset
Pattern Label
XXXVNik Non-Interacting
XXVNikT Non-Interacting
XVNikTN Interacting
VNikTNP Interacting
NikTNPf Non-Interacting
ikTNPfk Non-Interacting
…….. …….
rGNIXXX
RGHRIGH ?
Generate Patterns of length 9
1A0I_A::VNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNI
XXXXVNikTNPfkaVSFVESAIKKALDNAGYLIAeikyDGVrGNIXXXX
Pattern Label
XXXXVNikT Non-Interacting
XXXVNikTN Non-Interacting
XXVNikTNP Interacting
XVNikTNPf Interacting
VNikTNPfk Non-Interacting
NikTNPfka Non-Interacting
…….. …….
VrGNIXXXX
ARGHRIGHV ?
Convert a pattern into binary (one-hot) encoding
Each of the 9 positions in a pattern such as XVNikTNPf is encoded as a 21-dimensional binary vector (20 amino acids + X), so a pattern is represented by 9 × 21 = 189 bits.
(Table: one-hot matrix with one row per symbol A, C, D, …, Y, X and one column per position; e.g., position 1 → X, position 2 → V, position 3 → N, position 4 → I, position 5 → K, position 6 → T, position 7 → N, position 8 → P, position 9 → F.)
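A sketch of the binary (one-hot) encoding described above: each position of a 9-residue pattern becomes a 21-bit vector (20 amino acids plus X), i.e. 9 × 21 = 189 bits in total.

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"     # 20 amino acids plus the padding symbol X

def one_hot(pattern):
    bits = []
    for residue in pattern.upper():
        vec = [0] * len(ALPHABET)
        vec[ALPHABET.index(residue)] = 1
        bits.extend(vec)
    return bits

encoded = one_hot("XVNikTNPf")
print(len(encoded))                    # 189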
Therapeutic Application of
Proteins or Peptides
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
(Figure: challenges in therapeutic protein/peptide development: ADMET and proteolytic enzymes, half-life, size optimization for oral delivery, transport, function/structure, and adjuvants.)
https://webs.iiitd.edu.in/raghava/thpdb/
Important Facts
Success rate
Phase I -> II ~84% for biologics; 63% for small molecules
Phase II -> III ~53% for biologics; 38% for small molecules.
Approval
Phase III -> FDA approval ~74% biologics; ~61% small molecules
• Concept of Drug
  • Kill invading foreign pathogens
  • Inhibit the growth of pathogens
• Concept of Vaccine
  • Generate memory cells
  • Train the immune system to face various existing disease agents
History of Immunization
• Children who recovered from smallpox were protected
  • Immunity was induced by a process known as variolation
  • Variolation spread to England and America
  • It was stopped due to the risk of death
• Edward Jenner found that protection against smallpox could be achieved by
  inoculation with material from an individual infected with cowpox
  • This process was called vaccination (cowpox is vaccinia)
  • The inoculum was termed a vaccine
  • Protective antibodies were developed
Biomolecules Based Vaccines
T cell epitope
Attenuated
Different arms of Immune System
Disease Causing Agents
Pathogens/Invaders
Exogenous processing of Pathogenic antigens
(MHC Class II binders or T-helper Epitopes)
Prediction of CTL Epitopes (Cell-mediated immunity)
Web servers for designing epitope-based vaccines
T-cell epitopes:
• ProPred: promiscuous MHC class II binders
• ProPred1: promiscuous MHC class I binders
• MHCBN: database of MHC binders and non-binders
• IL4Pred: prediction of interleukin-4 inducing peptides
• Pcleavage: proteasome cleavage sites
• TAPpred: prediction of TAP binders
• CTLpred: prediction of CTL epitopes
Classifier
• The classifier:
• Input: a set of m hand-labeled documents (x1,y1),....,(xm,ym)
• Output: a learned classifier f:x → y
Text Classification: Representing Texts
The task is to learn a classifier f(document) = y from raw text such as the following Reuters article:
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
• Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
• Maize Mar 48.0, total 48.0 (nil).
• Sorghum nil (nil)
• Oilseed export registrations were:
• Sunflowerseed total 15.0 (7.9)
• Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....
The simplest useful representation tokenizes the document, optionally removing punctuation, prepositions, pronouns, etc., and applying stemming (walk, walker, walked, walking → walk), turning it into a sequence of tokens:
(argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …)
Word Frequency
For the same article, each word is represented by its frequency:
word | freq
grain(s) | 3
oilseed(s) | 2
total | 3
wheat | 1
maize | 1
soybean | 1
tonnes | 1
... | ...
• TF-IDF = TF × IDF
Feature generation for document D1 (length = 20); possible weighting schemes include frequency, fingerprint (binary), composition, tf-idf, tfc-weight, ltc-weight and entropy weighting:
Word | Frequency | Fingerprint | Composition | …
grain(s) | 3 | 1 | 0.15 | 3
oilseed(s) | 2 | 1 | 0.1 | 3
total | 3 | 1 | 0.15 | 6
wheat | 1 | 1 | 0.05 | 4
maize | 1 | 1 | 0.05 | 2
soybean | 1 | 1 | 0.05 | 1
tonnes | 1 | 1 | 0.05 | 2
IIIT | 0 | 0 | 0 | 0
Delhi | 0 | 0 | 0 | 0
Feature generation (Cont.)
• tfc-weighting
• It considers the normalized length of documents (M).
• ltc-weighting
• It considers the logarithm of the word frequency to reduce the effect of large differences in
frequencies.
• Entropy weighting
What is an N-gram?
• An n-gram in the case of text
  • Unigram: n-gram of size 1
  • Bigram: n-gram of size 2
  • Trigram: n-gram of size 3
  • Item: phonemes, syllables, letters, words, others
Example sentence: "the dog smelled like a skunk"
  Bigrams: "# the", "the dog", "dog smelled", "smelled like", "like a", "a skunk", "skunk #"
  Trigrams: "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk", "a skunk #"
In the case of protein sequences the n-gram is called:
• Amino acid composition (n = 1)
• Dipeptide composition (n = 2)
• Tripeptide composition (n = 3)
• Item: 20 amino acids, properties
In the case of nucleotide sequences:
• Mono-nucleotide composition
• Di-nucleotide composition
• Tri-nucleotide composition
• Item: 4 nucleotides, properties
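A small sketch of n-gram generation for the example sentence above (word bigrams and trigrams, with '#' as the boundary marker).

def ngrams(tokens, n):
    padded = ["#"] + tokens + ["#"]
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]

words = "the dog smelled like a skunk".split()
print(ngrams(words, 2))   # bigrams
print(ngrams(words, 3))   # trigrams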
Feature generation
(Word Embedding)
• Transforming words into feature vectors
AAC_NT(5)
AAC_CT(5)
Example
AAC_Rest(3,3)
AAC_Split (3)
Manual of Pfeature :
https://webs.iiitd.edu.in/raghava/pfeature/Pfeature_Manual.pdf
where PCPi is the physico-chemical property composition for property type i; Pi and L are the sum of property of type i and the length of the sequence, respectively.
Evaluation or Benchmarking of
Methods
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly
Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004
(Confusion matrices at two different cutoff values, owner vs. non-owner: [11, 1; 4, 8] and [7, 5; 1, 11].)
Cross-validation Techniques for Evaluation
Cross Validation
• Jackknife test (LOOCV): one sample for testing and the rest for training
• K-fold cross-validation (equivalent to LOOCV if K = N, the number of samples)
K-fold Cross Validation
• Bootstrap
  • Yields slightly different results when repeated on the same data (when estimating the standard error)
  • Not bound to theoretical distributions
• Jackknife
  • A less general technique
  • Explores sample variation differently
  • Yields the same result each time
  • Similar data requirements
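A minimal sketch of stratified 5-fold cross-validation (each fold keeps the same positive/negative proportion), assuming scikit-learn; the data and model are placeholders.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(np.mean(scores))       # average accuracy over the 5 folds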
Threshold-Dependent Parameters for Evaluation
                       Actual Positive   Actual Negative
Predicted Positive     TP                FP                → PPV
Predicted Negative     FN                TN                → NPV
                       Sensitivity       Specificity

Example (sick vs. healthy):
Predicted Positive: TP = 2, FP = 18 → PPV = 2 / (2 + 18) = 10%
Predicted Negative: FN = 1, TN = 182 → NPV = 182 / (1 + 182) = 99.5%
Sensitivity = 2 / (2 + 1) = 66.67%; Specificity = 182 / (18 + 182) = 91%
Measures for evaluating
classification models
Sensitivity: percentage coverage of positives, i.e. the percentage of positive samples predicted as positive.
Specificity: percentage coverage of negatives, i.e. the percentage of negative samples predicted as negative.
Positive predictive value (PPV): probability of correct prediction for positively predicted samples.
Negative predictive value (NPV): probability of correct prediction for negatively predicted samples.
Accuracy: percentage of correctly predicted examples (both correct positive and correct negative predictions).
Matthews Correlation Coefficient (MCC): penalizes both under- and over-prediction.
F1: the harmonic mean of precision and recall.
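A sketch computing the measures listed above from predictions, assuming scikit-learn (sensitivity corresponds to recall); the labels and scores are toy values.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             matthews_corrcoef, f1_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred  = [1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.6]    # probabilities, used for AUROC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUROC:", roc_auc_score(y_true, y_score))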
Confusion Matrix for
Multiclass Classifier
Following table shows the confusion matrix of a classification problem
with six classes labeled as C1, C2, C3, C4, C5 and C6.
Class C1 C2 C3 C4 C5 C6
C1 52 10 7 0 0 1
C2 15 50 6 2 1 2
C3 5 6 6 0 0 0
C4 0 2 0 10 0 1
C5 0 1 0 0 7 1
C6 1 3 0 1 0 24
Predictive accuracy?
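Working out the predictive accuracy asked for above: the sum of the diagonal of the confusion matrix divided by the total number of examples.

import numpy as np

cm = np.array([[52, 10, 7,  0, 0,  1],
               [15, 50, 6,  2, 1,  2],
               [ 5,  6, 6,  0, 0,  0],
               [ 0,  2, 0, 10, 0,  1],
               [ 0,  1, 0,  0, 7,  1],
               [ 1,  3, 0,  1, 0, 24]])

accuracy = np.trace(cm) / cm.sum()     # 149 / 214
print(round(accuracy, 3))              # about 0.696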
Area Under Curve (AUC or AUCROC)
Threshold Independent Parameter
True positive rate (TPR) = Sensitivity ; False positive rate (FPR) = (1 – specificity)
Regression Methods
Example data, actual vs. predicted melting point (MP):
MPact:  12.5, 67.0, 71.2, 115.9, 32.7, 45.7, 79.8, 127.3, 57.6, 37.2   (Σ = 646.9, Σ of squares = 53580.21)
MPpred: 14.0, 71.3, 68.7, 121.0, 29.8, 49.3, 76.8, 125.1, 50.2, 33.8   (Σ = 640.0, Σ of squares = 53169.64)

R² = 1 − Σi (MPact − MPpred)² / Σi (MPact − M̄P)²
Q² = 1 − Σi (MPact − MPpred)² / Σi (MPact − M̄Ptrain)², where M̄Ptrain = (1/m) Σ MPact
MAE = (1/n) Σi |MPact − MPpred|
RMSECV = sqrt( (1/M) Σi (RMSE)i² )
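A small sketch computing R² and MAE from the actual/predicted melting points listed above.

import numpy as np

mp_act  = np.array([12.5, 67.0, 71.2, 115.9, 32.7, 45.7, 79.8, 127.3, 57.6, 37.2])
mp_pred = np.array([14.0, 71.3, 68.7, 121.0, 29.8, 49.3, 76.8, 125.1, 50.2, 33.8])

ss_res = np.sum((mp_act - mp_pred) ** 2)
ss_tot = np.sum((mp_act - mp_act.mean()) ** 2)
r2  = 1 - ss_res / ss_tot
mae = np.mean(np.abs(mp_act - mp_pred))
print("R^2 = %.3f, MAE = %.2f" % (r2, mae))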
MEASURING THE MODEL ACCURACY
Evaluation of regression-based methods