Unit 6-7: Issues
Underfitting:
Underfitting means the model has a low accuracy score on both the training data and the test data.
An underfit model fails to capture the relationship between the input values and the target variable.
Underfitting happens when the algorithm used to build a prediction model is too simple to learn the patterns present in the training data. In that case, accuracy will be low on the seen training data as well as on unseen test data. It typically happens with simple linear algorithms.
Underfitting
For example, in the figure the model is trained to classify circles and crosses, but the straight-line decision boundary it learns fails to separate either of the two classes properly.
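A minimal sketch of this behaviour, assuming scikit-learn is available; the synthetic data set and parameters are purely illustrative. A linear (straight-line) classifier fitted to data with a nonlinear class boundary scores poorly on both the training and the test split:

```python
# Sketch: a too-simple (linear) classifier underfits a nonlinear problem.
# Dataset and parameters are illustrative, not from the slides.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = LogisticRegression().fit(X_train, y_train)       # straight-line boundary
print("train accuracy:", linear_clf.score(X_train, y_train))  # low
print("test accuracy: ", linear_clf.score(X_test, y_test))    # also low -> underfitting
```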
Overfitting:
Overfitting means the model has a high accuracy score on the training data but a low score on the test data.
An overfit model has essentially memorized the data set it has seen and is unable to generalize what it has learned to an unseen data set. That is why an overfit model results in very poor test accuracy. This typically occurs when the model is highly complex, i.e., it uses a large number of input feature combinations and has too much flexibility.
Overfitting
For example, as shown in the figure below, the model is again trained to classify circles and crosses, but unlike last time it learns too well: it even fits the noise in the data by creating an excessively complex decision boundary (right).
Overfitting
◼ Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
◼ Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
◼ A model which has been overfit will generally have poor predictive performance.
◼ Overfitting depends not only on the number of parameters and the amount of data but also on the conformability of the model structure.
◼ To avoid overfitting, it is necessary to use additional techniques, e.g. cross-validation or pruning (pre- or post-pruning); see the sketch below.
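A minimal sketch of spotting overfitting with one of these techniques (cross-validation), assuming scikit-learn; the data set and polynomial degrees are illustrative assumptions. A very flexible model scores high on its own training data but low under cross-validation:

```python
# Sketch: training score vs. cross-validated score as model complexity grows.
# A large gap between the two signals overfitting.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)    # noisy nonlinear target

for degree in (1, 3, 15):                                 # simple -> very complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(X, y).score(X, y)             # R^2 on the data it was fit on
    cv_score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold cross-validated R^2
    print(f"degree {degree:2d}: train R^2 = {train_score:.2f}, CV R^2 = {cv_score:.2f}")
```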
◼ Reason:
◼ Noise in the training data.
Model Comparison
◼ i. Confusion Matrix
◼ ii. ROC Analysis
◼ iii. Others, such as Gain and Lift Charts and K-S Charts
i. Confusion Matrix (Contingency Table):
◼ A confusion matrix contains information about the actual and predicted classifications made by a classifier.
◼ The performance of such a system is commonly evaluated using the data in the matrix.
◼ Also known as a contingency table or an error matrix, it is a specific table layout that allows visualization of the performance of an algorithm.
◼ Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class | Predicted C1         | Predicted ¬C1
Actual C1                      | True Positives (TP)  | False Negatives (FN)
Actual ¬C1                     | False Positives (FP) | True Negatives (TN)

◼ FPR = 1 − TNR (specificity)
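For reference, the standard rates derived from this table (not spelled out on the slide; these are the usual textbook definitions):

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity (TPR)} &= \frac{TP}{TP + FN} \\
\text{Specificity (TNR)} &= \frac{TN}{TN + FP} \\
\text{FPR} &= \frac{FP}{FP + TN} = 1 - \text{TNR}
\end{align*}
```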
Classifier Evaluation Metrics: Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
◼ Recall: completeness – what % of the actually positive tuples the classifier labeled as positive
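The corresponding formulas, added here for completeness since the slide gives only the verbal descriptions (these are the standard definitions of precision, recall, F1, and the weighted Fβ measure):

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
F_\beta &= \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
\end{align*}
```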
Classifier Evaluation Metrics: Example
ii. ROC Analysis
◼ A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
◼ The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
◼ The ROC curve therefore plots sensitivity (TPR) versus FPR.
◼ ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently of (and prior to specifying) the cost context or the class distribution.
◼ ROC analysis is related in a direct and natural way to
cost/benefit analysis of diagnostic decision making.
Model Selection: ROC Curves
◼ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
◼ Originated from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ Vertical axis represents the true positive rate; horizontal axis represents the false positive rate
◼ The area under the ROC curve is a measure of the accuracy of the model
◼ Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
◼ A model with perfect accuracy will have an area of 1.0
◼ The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
The figure shows the ROC curves of two classification models. The diagonal line, which represents random guessing, is also shown: the closer the ROC curve of a model is to the diagonal line, the less accurate the model.

If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list, so the curve rises steeply from zero. Later, as we start to encounter fewer and fewer true positives and more and more false positives, the curve eases off and becomes more horizontal.

To assess the accuracy of a model, we can measure the area under the curve. Several software packages can perform this calculation. The closer the area is to 0.5, the less accurate the corresponding model; a model with perfect accuracy will have an area of 1.0.
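A minimal sketch of tracing such a curve from a ranked list of classifier scores and measuring the area under it; the labels and scores below are made up for illustration:

```python
# Sketch: build ROC points by sweeping a threshold down the ranked score list,
# then estimate the area under the curve with the trapezoidal rule.
import numpy as np

y_true  = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])                    # 1 = positive class
y_score = np.array([.95, .9, .8, .7, .65, .6, .5, .4, .3, .2])        # classifier confidence

order = np.argsort(-y_score)                  # rank tuples, most likely positive first
y_ranked = y_true[order]
P, N = y_ranked.sum(), len(y_ranked) - y_ranked.sum()

tpr = np.concatenate(([0.0], np.cumsum(y_ranked) / P))      # true positive rate
fpr = np.concatenate(([0.0], np.cumsum(1 - y_ranked) / N))  # false positive rate

auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoidal area
print(f"AUC = {auc:.3f}")                     # 1.0 = perfect, 0.5 = random guessing
```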
Validation
◼ Validation techniques are motivated by two fundamental problems in pattern recognition:
◼ model selection, and
◼ performance estimation
◼ Validation Approaches:
◼ One approach is to use the entire training data to build the classifier and estimate its error rate.
Training Dataset

Class C1: buys_computer = 'yes'
Class C2: buys_computer = 'no'

Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair), class (buys_computer) = ?

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
◼ Approach 1: Random Subsampling
◼ Random subsampling performs K data splits of the entire dataset.
◼ Each split randomly selects a (fixed) number of examples without replacement.
◼ For each data split we retrain the classifier from scratch, as sketched below.
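A minimal sketch of random subsampling (repeated random train/test splits), assuming scikit-learn; the data set, classifier, and split size are illustrative choices:

```python
# Sketch: random subsampling - K independent random splits, classifier retrained
# from scratch on each split, error estimated as the average test error.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
K = 10
errors = []
for k in range(K):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=k)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # retrain from scratch
    errors.append(1.0 - clf.score(X_te, y_te))                    # test error of this split
print(f"estimated error rate: {np.mean(errors):.3f}")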
◼ Approach 2: K-Fold Cross-Validation
◼ K-fold cross-validation is similar to random subsampling.
◼ Create a K-fold partition of the dataset. For each of the K experiments, use K−1 folds for training and the remaining fold for testing; the error estimate is the average of the K error rates (see the sketch below).
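A minimal K-fold cross-validation sketch, assuming scikit-learn; K = 5, the data set, and the classifier are illustrative choices:

```python
# Sketch: K-fold cross-validation - each fold serves once as the test set while
# the other K-1 folds train the model; errors are averaged over the K folds.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    fold_errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
print(f"5-fold estimated error rate: {np.mean(fold_errors):.3f}")
```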
◼ Approach 3: Leave-One-Out Cross-Validation
◼ Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples, so each test fold contains a single example.
Example: 5-Fold Cross-Validation
Can we Reject H0?
◼ We want to study the effect of drug A and drug B for pain.
◼ We recruit 400 individuals with pain and form 200 pairs, pairing individuals with similar pain score, gender, and age.
◼ We then randomly assign drug A to one of the individuals in each pair, and drug B to the other.

The paired outcomes are summarized in a 2×2 table of counts:

n00   n01
n10   n11

◼ Note: the total count of 200 here is the number of pairs, not the number of individuals.
A single test set: McNemar's test
◼ McNemar's test
• It is the corresponding chi-square test for paired data.
• It compares classifiers A and B on a single test set.
• It considers the number of test items where either A or B makes errors:
◼ n11: number of items classified correctly by both A and B
◼ n00: number of items misclassified by both A and B
◼ n01: number of items misclassified by A but not by B
◼ n10: number of items misclassified by B but not by A
Null hypothesis:
◼ A and B have the same error rate; then n01 = n10.
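The test statistic referred to on the next line is not reproduced in the text; presumably it is the standard McNemar chi-square statistic with continuity correction:

```latex
\chi^2 = \frac{\left(\,|n_{01} - n_{10}| - 1\,\right)^2}{n_{01} + n_{10}}, \qquad \text{with 1 degree of freedom}
```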
NOTE: Some books do not use the −1 (continuity correction) in the formula above.
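A minimal sketch of applying this test, assuming SciPy is available; the counts below are made up for illustration, not taken from the slides:

```python
# Sketch: McNemar's test for two classifiers on the same test set.
# n01 = items A got wrong but B got right; n10 = items B got wrong but A got right.
from scipy.stats import chi2

n01, n10 = 25, 10                                 # illustrative disagreement counts
stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)    # continuity-corrected statistic
p_value = chi2.sf(stat, df=1)                     # upper tail of chi-square, 1 d.o.f.
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")    # small p -> reject H0: n01 = n10
```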
McNemar’s test is used to compare the
performance of two classifiers on the same
test set.
Example using R:
Q. We are interested in whether the proportion of stroke patients unable to walk without an assistive device changes after completing a physical therapy (PT) program. The paired counts are cross-tabulated as "Before PT" versus "After PT".

Solution:
Here the p-value (0.00796) < 0.05 (the alpha value), so the null hypothesis (H0: n01 = n10) is rejected.
This means the program had an effect: the outcome after PT differs from that before PT.
Issues Affecting Model Selection
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Summary (I)
◼ Classification is a form of data analysis that extracts models
describing important data classes.
◼ Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
◼ Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
◼ Stratified k-fold cross-validation is recommended for accuracy estimation.
Summary (II)
◼ Significance tests and ROC curves are useful for model selection.
◼ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
◼ No single method has been found to be superior over all others
for all data sets
◼ Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs, further complicating the quest for an overall superior method