
Model Evaluation and Selection
• One would like an estimate of how accurately a classifier can classify data on which it has not been trained.
• What if we have more than one classifier and want to choose the best one?
• These questions are answered through model evaluation and selection.
• What is accuracy?
• How can we estimate it?
• Are some measures of a classifier’s accuracy more appropriate than others?
Terminology-Introduction
• Positive tuples: (tuples of the main class of interest)
• Negative tuples: (all other tuples).
• Example,
• the positive tuples may be buys_computer=yes while the negative tuples are
buys_computer=no.
• Let P be the number of positive tuples and N the number of negative tuples.
Terminology-Introduction
• True positives (TP): These refer to the positive tuples that were correctly
labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled
by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labeled
as positive (e.g., tuples of class buys_computer =no for which the classifier
predicted buys_computer=yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were incorrectly labeled
as negative (e.g., tuples of class buys_computer=yes for which the classifier
predicted buys_computer=no). Let FN be the number of false negatives.
Model Evaluation Metrics
Confusion Matrix
• The confusion matrix is a useful tool for analyzing how well your
classifier can recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right, while FP
and FN tell us when the classifier is getting things wrong.
Confusion Matrix where m≥2
• Given m classes (where m ≥ 2), a confusion matrix is a table of at
least size m by m.
• An entry, CMi,j in the first m rows and m columns indicates the
number of tuples of class i that were labeled by the classifier as
class j.
• For a classifier to have good accuracy, ideally most of the tuples
would be represented along the diagonal of the confusion
matrix, from entry CM1,1 to entry CMm,m, with the rest of the
entries being zero or close to zero. That is, ideally, FP and FN are
around zero.
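To make the definitions above concrete, here is a minimal Python sketch (not from the original slides) that builds an m × m confusion matrix from lists of actual and predicted labels; the names y_true and y_pred are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred, classes):
    """cm[i][j] = number of tuples of actual class i labeled by the classifier as class j."""
    cm = {i: {j: 0 for j in classes} for i in classes}
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# Hypothetical two-class example (positive class: "yes")
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no"]
cm = confusion_matrix(y_true, y_pred, classes=["yes", "no"])

TP = cm["yes"]["yes"]   # positive tuples correctly labeled positive
FN = cm["yes"]["no"]    # positive tuples incorrectly labeled negative
FP = cm["no"]["yes"]    # negative tuples incorrectly labeled positive
TN = cm["no"]["no"]     # negative tuples correctly labeled negative
print(TP, FN, FP, TN)   # 2 1 1 2
```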
Accuracy
• The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,

accuracy(M) = (TP + TN) / (P + N)

This is also known as the recognition rate of the classifier.
Error rate or misclassification rate
• We can also speak of the error rate or misclassification rate of a classifier, M, which is simply 1 - accuracy(M), where accuracy(M) is the accuracy of M.
• This can also be computed as

error rate(M) = (FP + FN) / (P + N)

• If we were to use the training set (instead of a test set) to estimate the error rate of a model, this quantity is known as the resubstitution error.
Class imbalance problem
• Accuracy (recognition rate) is most effective when the class distribution is relatively balanced, i.e., the main class of interest (the positive class) and the other classes (the negative class) are fairly evenly represented.
• It becomes an ineffective measure for imbalanced classes, i.e., situations where the main class of interest (the positive class) is rare and the negative class dominates.
• Examples:
• 1. In fraud detection applications, the class of interest, "fraudulent", occurs much less frequently than the negative class, "non-fraudulent".
• 2. In medical data, the class of interest, "cancerous", occurs much more rarely than the class "non-cancerous". With an accuracy of, say, 97%, the classifier might be correctly labeling only the negative class tuples.
• Therefore, other measures are required that depict separately how well the classifier recognizes positive tuples and how well it recognizes negative tuples.
Sensitivity and Specificity
• Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as

sensitivity = TP / P          specificity = TN / N

• It can be shown that accuracy is a function of sensitivity and specificity:

accuracy = sensitivity × (P / (P + N)) + specificity × (N / (P + N))
Example
• Find the sensitivity, specificity and accuracy for the following data:

Actual \ Predicted      yes       no     Total
yes (positive)           90      210       300
no  (negative)          140     9560      9700
Total                   230     9770    10,000

• The sensitivity of the classifier is 90/300 = 30.00%.
• The specificity is 9560/9700 = 98.56%.
• The classifier's overall accuracy is 9650/10,000 = 96.50%.
• Thus, we note that although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor, given its low sensitivity.
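The arithmetic above can be checked with a few lines of Python; the counts below are taken from the confusion matrix of this example.

```python
TP, FN = 90, 210      # positive tuples: P = TP + FN = 300
FP, TN = 140, 9560    # negative tuples: N = FP + TN = 9700
P, N = TP + FN, FP + TN

sensitivity = TP / P              # 90/300     = 0.3000 -> 30.00%
specificity = TN / N              # 9560/9700  ≈ 0.9856 -> 98.56%
accuracy = (TP + TN) / (P + N)    # 9650/10000 = 0.9650 -> 96.50%

# accuracy expressed as a function of sensitivity and specificity
accuracy_check = sensitivity * P / (P + N) + specificity * N / (P + N)
print(round(sensitivity, 4), round(specificity, 4), round(accuracy, 4))
```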
Precision and Recall
• Precision is defined as what percentage of tuples labeled as positive
are actually such. It is a measure of exactness.
• Recall is defined as what percentage of positive tuples are labeled as
such. It is a measure of completeness
• These measures can be computed as

precision = TP / (TP + FP)
recall = TP / (TP + FN) = TP / P = sensitivity
• A perfect precision score of 1.0 for a class C means that every tuple
that the classifier labeled as belonging to class C does indeed belong to
class C.
• However, it does not tell us anything about the number of class C
tuples that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C
was labeled as such, but it does not tell us how many other tuples
were incorrectly labeled as belonging to class C.
• There tends to be an inverse relationship between precision and recall,
where it is possible to increase one at the cost of reducing the other.
Problem
• Find the precision and recall for the confusion matrix given in the previous example.
• The precision for the yes (positive) class is 90/230 = 39.13%.
• The recall is 90/300 = 30.00%.
F measure and Fβ measure
• An alternative way to use precision and recall is to combine them into a single measure.
• This is the approach of the F measure (also known as the F1 score or F-score) and the Fβ measure. They are defined as

F = (2 × precision × recall) / (precision + recall)
Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

where β is a non-negative real number.
• The F measure is the harmonic mean of precision and recall. It gives equal weight to precision and recall.
• The Fβ measure is a weighted measure of precision and recall. It assigns β² times as much weight to recall as to precision. Commonly used Fβ measures are F2 (which weights recall twice as much as precision) and F0.5 (which weights precision twice as much as recall).
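A minimal sketch of these formulas in Python, reusing the counts from the earlier example (the helper names are illustrative, not from the slides):

```python
def precision(TP, FP):
    return TP / (TP + FP)

def recall(TP, FN):
    return TP / (TP + FN)

def f_beta(prec, rec, beta=1.0):
    """beta = 1 gives the F (harmonic-mean) measure; beta = 2 weights recall
    twice as much as precision; beta = 0.5 weights precision twice as much."""
    b2 = beta * beta
    return (1 + b2) * prec * rec / (b2 * prec + rec)

prec = precision(90, 140)   # 90/230 ≈ 0.3913
rec = recall(90, 210)       # 90/300 = 0.3000
print(round(f_beta(prec, rec), 4))            # F1  ≈ 0.3396
print(round(f_beta(prec, rec, beta=2), 4))    # F2
print(round(f_beta(prec, rec, beta=0.5), 4))  # F0.5
```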
Additional measures
In addition to accuracy-based measures, classifiers can also be compared with respect to the
following additional aspects:
• Speed: This refers to the computational costs involved in generating and using the given
classifier.
• Robustness: This is the ability of the classifier to make correct predictions given noisy data
or data with missing values. Robustness is typically assessed with a series of synthetic data
sets representing increasing degrees of noise and missing values.
• Scalability: This refers to the ability to construct the classifier efficiently given large
amounts of data. Scalability is typically assessed with a series of data sets of increasing size.
• Interpretability: This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult to assess.
Decision trees and classification rules can be easy to interpret, yet their interpretability may diminish as they become more complex.
Obtaining reliable classifier accuracy
estimates (I)
• The holdout method: the given data are randomly partitioned into two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model. The model’s accuracy is then estimated with the test set.
• The estimate is pessimistic because only a portion of the initial data is used to derive the model.
• Random subsampling is a variation of the holdout method in which the holdout
method is repeated k times.
• The overall accuracy estimate is taken as the average of the accuracies obtained
from each iteration.
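A minimal sketch of the holdout split and its repetition as random subsampling, assuming a generic train_and_score(train, test) function that builds a model and returns its accuracy on the test set (a hypothetical placeholder, not a library call):

```python
import random

def holdout_split(data, train_fraction=2/3, seed=None):
    """Randomly partition the data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def random_subsampling(data, train_and_score, k=10):
    """Repeat the holdout method k times and average the accuracy estimates."""
    scores = [train_and_score(*holdout_split(data, seed=i)) for i in range(k)]
    return sum(scores) / k
```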
Obtaining reliable classifier accuracy
estimates (II)
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, …, Dk, each of approximately equal size.
• Training and testing are performed k times.
• In iteration i, partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model.
• Unlike the holdout and random subsampling methods, here each sample is
used the same number of times for training and once for testing.
• For classification, the accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total number of tuples in
the initial data.
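A minimal k-fold cross-validation sketch in the same style; train_and_count_correct(train, test) is assumed to train a model and return how many test tuples it classifies correctly.

```python
import random

def k_fold_cross_validation(data, train_and_count_correct, k=10, seed=0):
    """Each fold serves exactly once as the test set; the accuracy estimate is
    the total number of correct classifications divided by the number of tuples."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal-sized folds

    correct_total = 0
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        correct_total += train_and_count_correct(train, test)
    return correct_total / len(shuffled)
```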
Obtaining reliable classifier accuracy
estimates (III)
• The bootstrap method samples the given training tuples uniformly with replacement.
• That is, each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set.
• For instance, imagine a machine that randomly selects tuples for our
training set. In sampling with replacement, the machine is allowed to
select the same tuple more than once.
.632 bootstrap
• Suppose we are given a data set of d tuples.
• The data set is sampled d times, with replacement, resulting in a bootstrap
sample or training set of d samples.
• It is very likely that some of the original data tuples will occur more than once
in this sample.
• The data tuples that did not make it into the training set end up forming the
test set.
• Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
• This is because each tuple is missed on a single draw with probability (1 - 1/d), so the probability that it never enters the sample after d draws is (1 - 1/d)^d ≈ e⁻¹ ≈ 0.368 for large d, leaving about 63.2% of the tuples in the sample.
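A minimal sketch of drawing a single bootstrap sample; running it on a reasonably large data set shows the ~63.2% / ~36.8% split emerge.

```python
import random

def bootstrap_sample(data, seed=None):
    """Sample len(data) tuples uniformly with replacement; the tuples that were
    never selected form the test set (items must be hashable here)."""
    rng = random.Random(seed)
    d = len(data)
    train = [data[rng.randrange(d)] for _ in range(d)]
    chosen = set(train)
    test = [x for x in data if x not in chosen]
    return train, test

data = list(range(10_000))
train, test = bootstrap_sample(data, seed=42)
print(len(set(train)) / len(data))   # ≈ 0.632 distinct tuples in the sample
print(len(test) / len(data))         # ≈ 0.368 left over for the test set
```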
Comparing Classifiers Based on Cost–Benefit
and ROC Curves
• The true positives, true negatives, false positives, and false negatives are also useful in
assessing the costs and benefits (or risks and gains) associated with a classification model.
• The cost associated with a false negative (such as incorrectly predicting that a cancerous
patient is not cancerous) is far greater than those of a false positive (incorrectly yet
conservatively labeling a noncancerous patient as cancerous).
• In such cases, we can weight one type of error more heavily than the other by assigning a different cost to each.
• These costs may consider the danger to the patient, financial costs of resulting therapies,
and other hospital costs.
• Similarly, the benefits associated with a true positive decision may be different than those
of a true negative. Up to now, to compute classifier accuracy, we have assumed equal
costs and essentially divided the sum of true positives and true negatives by the total
number of test tuples.
Comparing Classifiers Based on Cost–Benefit
and ROC Curves
• Receiver operating characteristic curves are a useful visual tool for
comparing two classification models.
• An ROC curve for a given model shows the trade-off between the true
positive rate (TPR) and the false positive rate (FPR).
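A minimal sketch of how the points of an ROC curve can be computed by sweeping a decision threshold over a classifier's continuous scores; the labels and scores below are made-up illustrative values.

```python
def roc_points(labels, scores):
    """labels: 1 = positive, 0 = negative; scores: higher means 'more positive'.
    Returns (FPR, TPR) pairs, one per candidate threshold."""
    P = sum(labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]                       # curve starts at the origin
    for t in sorted(set(scores), reverse=True):
        TP = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        FP = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((FP / N, TP / P))         # (false positive rate, true positive rate)
    return points

labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
print(roc_points(labels, scores))
```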
Increasing classifier Accuracy
• An ensemble for classification is a composite model, made up of a combination of classifiers.
• The individual classifiers vote, and a class label prediction is returned by the ensemble based on
the collection of votes.
• Ensembles tend to be more accurate than their component classifiers.
• Bagging, boosting, and random forests are examples of ensemble methods
Increasing classifier Accuracy
• An ensemble tends to be more accurate than its base classifiers.
• For example, consider an ensemble that performs majority voting.
• That is, given a tuple X to classify, it collects the class label predictions returned from the base classifiers and outputs the class that receives the majority of the votes.
• The base classifiers may make mistakes, but the ensemble will misclassify X only if
over half of the base classifiers are in error.
• Ensembles yield better results when there is significant diversity among the models.
• That is, ideally, there is little correlation among classifiers.
• The classifiers should also perform better than random guessing.
• Each base classifier can be allocated to a different CPU and so ensemble methods
are parallelizable.
Bagging-bootstrap aggregation
• Given a set, D, of d tuples, bagging works as follows. For iteration i (i=1, 2,… , k), a training
set, Di , of d tuples is sampled with replacement from the original set of tuples, D.
• Each training set is a bootstrap sample.
• Because sampling with replacement is used, some of the original tuples of D may not be
included in Di , whereas others may occur more than once.
• A classifier model, Mi , is learned for each training set, Di .
• To classify an unknown tuple, X, each classifier, Mi , returns its class prediction, which
counts as one vote.
• The bagged classifier, M*, counts the votes and assigns the class with the most votes to
X.
• Bagging can be applied to the prediction of continuous values by taking the average of the predictions made by the classifiers for a given test tuple.
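A minimal bagging sketch, assuming a generic train_model(dataset) that returns an object with a predict(x) method; these names are placeholders rather than a specific library API.

```python
import random
from collections import Counter

def bagging(data, train_model, k=10, seed=0):
    """Learn k component classifiers, each on a bootstrap sample Di of the data."""
    rng = random.Random(seed)
    d = len(data)
    models = []
    for _ in range(k):
        Di = [data[rng.randrange(d)] for _ in range(d)]   # sampled with replacement
        models.append(train_model(Di))
    return models

def bagged_predict(models, x):
    """Each component classifier casts one vote; return the majority class."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```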
Boosting
• In boosting, weights are also assigned to each training tuple.
• A series of k classifiers is iteratively learned.
• After a classifier, Mi , is learned, the weights are updated to allow the
subsequent classifier,Mi+1, to “pay more attention” to the training
tuples that were misclassified by Mi .
• The final boosted classifier, M*, combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function of its
accuracy.
AdaBoost (Adaptive Boosting)
• AdaBoost is a popular boosting algorithm.
• Suppose we want to boost the accuracy of a learning method.
• We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2),… , (Xd, yd), where yi
is the class label of tuple Xi.
• Initially, AdaBoost assigns each training tuple an equal weight of 1/d.
• Generating k classifiers for the ensemble requires k rounds through the rest of the
algorithm.
• In round i, the tuples from D are sampled to form a training set, Di , of size d.
• A classifier model, Mi , is derived from the training tuples of Di .
• Its error is then calculated using Di as a test set.
• The weights of the training tuples are then adjusted according to how they were
classified.
AdaBoost (Adaptive Boosting)
• If a tuple was incorrectly classified, its weight is increased.
• If a tuple was correctly classified, its weight is decreased.
• A tuple’s weight reflects how difficult it is to classify— the higher the weight,
the more often it has been misclassified.
• These weights will be used to generate the training samples for the classifier
of the next round.
• The basic idea is that when we build a classifier, we want it to focus more on
the misclassified tuples of the previous round.
• Some classifiers may be better at classifying some “difficult” tuples than
others.
• In this way, we build a series of classifiers that complement each other.
AdaBoost
• To compute the error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified:

error(Mi) = Σj wj × err(Xj)

where err(Xj) is the misclassification error of tuple Xj: if the tuple was misclassified, then err(Xj) is 1; otherwise, it is 0.
If the performance of classifier Mi is so poor that its error exceeds 0.5,
then we abandon it.
Instead, we try again by generating a new Di training set, from which
we derive a new Mi .
AdaBoost
• The error rate of Mi affects how the weights of the training tuples are
updated.
• If a tuple in round i was correctly classified, its weight is multiplied by error(Mi) / (1 - error(Mi)).
• Once the weights of all the correctly classified tuples are updated, the
weights for all tuples (including the misclassified ones) are normalized
so that their sum remains the same as it was before.
• To normalize a weight, we multiply it by the sum of the old weights,
divided by the sum of the new weights.
• As a result, the weights of misclassified tuples are increased and the
weights of correctly classified tuples are decreased, as described before.
AdaBoost
• Unlike bagging, where each classifier was assigned an equal vote, boosting
assigns a weight to each classifier’s vote, based on how well the classifier
performed.
• The lower a classifier’s error rate, the more accurate it is, and therefore, the
higher its weight for voting should be.
• The weight of classifier Mi’s vote is log((1 - error(Mi)) / error(Mi)).
• For each class, c, we sum the weights of each classifier that assigned class c
to X.
• The class with the highest sum is the “winner” and is returned as the class
prediction for tuple X.
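A minimal AdaBoost-style sketch of the weight updates and voting weights described above, assuming a generic train_model(sample) that returns an object with predict(x). For simplicity, the weighted error here is computed over the full data set D rather than the sampled set Di, so this is an illustration of the idea, not the exact algorithm in the slides.

```python
import math
import random

def adaboost(data, labels, train_model, k=10, seed=0):
    rng = random.Random(seed)
    d = len(data)
    weights = [1.0 / d] * d                 # every tuple starts with weight 1/d
    models, alphas = [], []

    for _ in range(k):
        # sample a training set of size d according to the current tuple weights
        idx = rng.choices(range(d), weights=weights, k=d)
        model = train_model([(data[i], labels[i]) for i in idx])

        # weighted error: sum of the weights of the misclassified tuples
        errs = [0.0 if model.predict(data[i]) == labels[i] else 1.0 for i in range(d)]
        error = sum(w * e for w, e in zip(weights, errs))
        if error >= 0.5:                    # too weak: skip (the slides retry with a new Di)
            continue
        error = max(error, 1e-10)           # avoid division by zero for a perfect round

        # correctly classified tuples get weight * error/(1 - error), then normalize
        for i in range(d):
            if errs[i] == 0.0:
                weights[i] *= error / (1.0 - error)
        total = sum(weights)
        weights = [w / total for w in weights]

        models.append(model)
        alphas.append(math.log((1.0 - error) / error))   # classifier's voting weight
    return models, alphas
```

To classify a new tuple, each class sums the voting weights (alphas) of the models that predict it, and the class with the highest sum is returned.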
Random Forests
• Imagine that each of the classifiers in the ensemble is a decision tree classifier, so that the collection of classifiers forms a “forest.”
• The individual decision trees are generated using a random selection
of attributes at each node to determine the split.
• More formally, each tree depends on the values of a random vector
sampled independently and with the same distribution for all trees in
the forest.
• During classification, each tree votes and the most popular class is
returned.
Forest-RC
• Another form of random forest, called Forest-RC, uses random linear combinations of
the input attributes.
• Instead of randomly selecting a subset of the attributes, it creates new attributes (or
features) that are a linear combination of the existing attributes.
• That is, an attribute is generated by specifying L, the number of original attributes to
be combined.
• At a given node, L attributes are randomly selected and added together with
coefficients that are uniform random numbers on [-1, 1].
• F linear combinations are generated, and a search is made over these for the best
split.
• This form of random forest is useful when there are only a few attributes available, so
as to reduce the correlation between individual classifiers.
Random Forests Vs AdaBoost
• Random forests are comparable in accuracy to AdaBoost, yet are
more robust to errors and outliers.
• The generalization error for a forest converges as long as the number
of trees in the forest is large.
• Thus, overfitting is not a problem. The accuracy of a random forest
depends on the strength of the individual classifiers and a measure of
the dependence between them.
• The ideal is to maintain the strength of individual classifiers without
increasing their correlation. Random forests are insensitive to the
number of attributes selected for consideration at each split.
Improving Classification Accuracy of
Class-Imbalanced Data
• Given two-class data, the data are class-imbalanced if the main class of interest
(the positive class) is represented by only a few tuples, while the majority of
tuples represent the negative class.
• The class imbalance problem is closely related to cost-sensitive learning, wherein
the costs of errors, per class, are not equal.
• In medical diagnosis, for example, it is much more costly to falsely diagnose a
cancerous patient as healthy (a false negative) than to misdiagnose a healthy
patient as having cancer (a false positive).
• A false negative error could lead to the loss of life and therefore is much more
expensive than a false positive error.
• Other applications involving class-imbalanced data include fraud detection, the
detection of oil spills from satellite radar images, and fault monitoring.
How to evaluate classifiers on imbalanced data?
• Sensitivity or recall (the true positive rate),
• specificity (the true negative rate),
• the F1 measure, and
• ROC curves, which plot sensitivity versus (1 - specificity) (i.e., the false positive rate),
are better suited than overall accuracy for assessing a classifier on class-imbalanced data.
How to improve the classification accuracy of class-imbalanced data?
General approaches for improving the classification accuracy of class-imbalanced data
include
(1) Oversampling
―Oversampling works by resampling the positive tuples so that the resulting training set
contains an equal number of positive and negative tuples.
―Example SMOTE algorithm
(2) Undersampling
―Undersampling works by decreasing the number of negative tuples. It randomly eliminates
tuples from the majority (negative) class until there are an equal number of positive and
negative tuples.
(3) threshold moving, and
(4) ensemble techniques
―As discussed previously
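A minimal sketch of plain random oversampling and undersampling (note this is not SMOTE, which synthesizes new positive tuples rather than duplicating existing ones); the positives/negatives lists are hypothetical.

```python
import random

def random_oversample(positives, negatives, seed=0):
    """Resample the minority (positive) tuples with replacement until balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra, negatives

def random_undersample(positives, negatives, seed=0):
    """Randomly discard majority (negative) tuples until balanced."""
    rng = random.Random(seed)
    return positives, rng.sample(negatives, len(positives))
```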
Threshold-moving
• It applies to classifiers that, given an input tuple, return a continuous output value.
• That is, for an input tuple, X, such a classifier returns as output a mapping, f(X) ∈ [0, 1].
• Rather than manipulating the training tuples, this method returns a classification decision based on the output values. In the simplest approach, tuples for which f(X) ≥ t, for some threshold t, are considered positive, while all other tuples are considered negative (see the sketch below).
• Other approaches may involve manipulating the outputs by weighting.
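A minimal threshold-moving sketch, assuming the classifier exposes a continuous score f(X) in [0, 1] for each tuple (the scores list below is illustrative):

```python
def classify_with_threshold(scores, t=0.5):
    """Tuples whose score f(X) >= t are labeled positive ("yes"), all others
    negative ("no"). Lowering t favors recall of the rare positive class."""
    return ["yes" if s >= t else "no" for s in scores]

scores = [0.92, 0.40, 0.15, 0.55, 0.08]
print(classify_with_threshold(scores, t=0.5))   # ['yes', 'no', 'no', 'yes', 'no']
print(classify_with_threshold(scores, t=0.3))   # ['yes', 'yes', 'no', 'yes', 'no']
```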
