Model Evaluation and Selection
• One would like an estimate of how accurately the classifier can classify data on which it has not been trained.
• What if we have more than one classifier and want to choose the best
one?
• The answer to these questions is obtained through model evaluation
and selection.
• What is accuracy?
• How can we estimate it?
• Are some measures of a classifier’s accuracy more appropriate than others?
Terminology-Introduction
• Positive tuples: (tuples of the main class of interest)
• Negative tuples: (all other tuples).
• For example, the positive tuples may be buys_computer=yes while the negative tuples are buys_computer=no.
• Let P be the number of positive tuples and N the number of negative tuples.
Terminology-Introduction
• True positives (TP): These refer to the positive tuples that were correctly
labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled
by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labeled
as positive (e.g., tuples of class buys_computer =no for which the classifier
predicted buys_computer=yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were incorrectly labeled
as negative (e.g., tuples of class buys_computer=yes for which the classifier
predicted buys_computer=no). Let FN be the number of false negatives.
Model Evaluation Metrics
Confusion Matrix
• The confusion matrix is a useful tool for analyzing how well your
classifier can recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right, while FP
and FN tell us when the classifier is getting things wrong.
Confusion Matrix where m≥2
• Given m classes (where m ≥ 2), a confusion matrix is a table of at
least size m by m.
• An entry, CMi,j in the first m rows and m columns indicates the
number of tuples of class i that were labeled by the classifier as
class j.
• For a classifier to have good accuracy, ideally most of the tuples
would be represented along the diagonal of the confusion
matrix, from entry CM1,1 to entry CMm,m, with the rest of the
entries being zero or close to zero. That is, ideally, FP and FN are
around zero.
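• A minimal sketch of building such a matrix, assuming the true and predicted labels are given as integer class indices 0..m-1 (the names y_true, y_pred, and m are illustrative):

```python
# Build an m x m confusion matrix CM where CM[i][j] counts the tuples of
# true class i that the classifier labeled as class j.
def confusion_matrix(y_true, y_pred, m):
    cm = [[0] * m for _ in range(m)]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# Example with 3 classes; for a good classifier most counts fall on the diagonal.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
for row in confusion_matrix(y_true, y_pred, 3):
    print(row)
```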
Accuracy
• The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,
accuracy = (TP + TN) / (P + N)
Precision and Recall
• Precision can be thought of as a measure of exactness: precision = TP / (TP + FP), the fraction of tuples labeled as positive that actually are positive.
• Recall is a measure of completeness: recall = TP / (TP + FN) = TP / P, the fraction of positive tuples that are labeled as such. Recall is also known as sensitivity, or the true positive rate.
• A perfect precision score of 1.0 for a class C means that every tuple
that the classifier labeled as belonging to class C does indeed belong to
class C.
• However, it does not tell us anything about the number of class C
tuples that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C
was labeled as such, but it does not tell us how many other tuples
were incorrectly labeled as belonging to class C.
• There tends to be an inverse relationship between precision and recall,
where it is possible to increase one at the cost of reducing the other.
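• As a hedged illustration, the measures above can be computed directly from the four counts TP, TN, FP, and FN (the example values here are made-up):

```python
# Compute accuracy, precision, and recall from the confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)

def precision(tp, fp):
    return tp / (tp + fp)                    # exactness: correct among predicted positives

def recall(tp, fn):
    return tp / (tp + fn)                    # completeness: TP / P, also called sensitivity

tp, tn, fp, fn = 90, 9560, 140, 210          # illustrative counts, not from the text
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```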
Problem
• Find the precision and recall for the following confusion matrix
AdaBoost
• The error rate of classifier Mi is the sum of the weights of the tuples in the training set Di that Mi misclassified:
error(Mi) = Σj wj × err(Xj),
where err(Xj) is the misclassification error of tuple Xj: if the tuple was misclassified, then err(Xj) is 1; otherwise, it is 0.
• If the performance of classifier Mi is so poor that its error exceeds 0.5, then we abandon it.
• Instead, we try again by generating a new training set Di, from which we derive a new Mi.
AdaBoost
• The error rate of Mi affects how the weights of the training tuples are
updated.
• If a tuple in round i was correctly classified, its weight is multiplied by error(Mi) / (1 - error(Mi)).
• Once the weights of all the correctly classified tuples are updated, the
weights for all tuples (including the misclassified ones) are normalized
so that their sum remains the same as it was before.
• To normalize a weight, we multiply it by the sum of the old weights,
divided by the sum of the new weights.
• As a result, the weights of misclassified tuples are increased and the
weights of correctly classified tuples are decreased, as described before.
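• A hedged sketch of one round of this weight update, assuming the per-tuple weights and the round's error rate are already known (all names are illustrative):

```python
# One round of AdaBoost weight updating as described above: correctly classified
# tuples are down-weighted by error/(1 - error), then all weights are renormalized
# so that their sum stays the same as before the update.
def update_weights(weights, correct, error):
    """weights: per-tuple weights; correct: booleans (classified correctly this
    round?); error: error rate of this round's classifier."""
    old_sum = sum(weights)
    new_weights = [w * (error / (1 - error)) if ok else w
                   for w, ok in zip(weights, correct)]
    # Normalize: multiply by (sum of old weights) / (sum of new weights).
    factor = old_sum / sum(new_weights)
    return [w * factor for w in new_weights]

weights = [0.25, 0.25, 0.25, 0.25]
# The misclassified fourth tuple ends up with a larger relative weight.
print(update_weights(weights, [True, True, True, False], error=0.25))
```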
AdaBoost
• Unlike bagging, where each classifier was assigned an equal vote, boosting
assigns a weight to each classifier’s vote, based on how well the classifier
performed.
• The lower a classifier’s error rate, the more accurate it is, and therefore, the
higher its weight for voting should be.
• The weight of classifier Mi’s vote is
log((1 - error(Mi)) / error(Mi))
• For each class, c, we sum the weights of each classifier that assigned class c
to X.
• The class with the highest sum is the “winner” and is returned as the class
prediction for tuple X.
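• A hedged sketch of this weighted vote, assuming each base classifier is a callable that returns a class label for X (names are illustrative):

```python
# Weighted voting: each classifier contributes log((1 - error)/error) to the
# class it predicts for X; the class with the highest total wins.
import math
from collections import defaultdict

def boosted_predict(classifiers, errors, X):
    votes = defaultdict(float)
    for clf, err in zip(classifiers, errors):
        weight = math.log((1 - err) / err)   # vote weight of this classifier
        votes[clf(X)] += weight
    return max(votes, key=votes.get)

# Usage: classifiers are any callables returning a class label for X.
clfs = [lambda X: "yes", lambda X: "no", lambda X: "yes"]
print(boosted_predict(clfs, [0.1, 0.3, 0.4], X=None))
```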
Random Forests
• Imagine that each of the classifiers in the ensemble is a decision tree classifier, so that the collection of classifiers is a “forest.”
• The individual decision trees are generated using a random selection
of attributes at each node to determine the split.
• More formally, each tree depends on the values of a random vector
sampled independently and with the same distribution for all trees in
the forest.
• During classification, each tree votes and the most popular class is
returned.
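• As an illustration, a random forest can be trained with scikit-learn (an assumed dependency, not part of the text); each tree considers a random subset of attributes at every split and the forest predicts by majority vote:

```python
# Train and evaluate a random forest on a small benchmark dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,      # number of trees in the forest
                                max_features="sqrt",   # attributes considered at each split
                                random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```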
Forest-RC
• Another form of random forest, called Forest-RC, uses random linear combinations of
the input attributes.
• Instead of randomly selecting a subset of the attributes, it creates new attributes (or
features) that are a linear combination of the existing attributes.
• That is, an attribute is generated by specifying L, the number of original attributes to
be combined.
• At a given node, L attributes are randomly selected and added together with
coefficients that are uniform random numbers on [-1, 1].
• F linear combinations are generated, and a search is made over these for the best
split.
• This form of random forest is useful when there are only a few attributes available, so
as to reduce the correlation between individual classifiers.
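• A hedged sketch of Forest-RC’s candidate-feature generation at a single node, assuming the tuples reaching the node are held in a NumPy array (names are illustrative):

```python
# Generate F new features at a node, each a linear combination of L randomly
# chosen input attributes with coefficients drawn uniformly from [-1, 1].
import numpy as np

def random_linear_features(X_node, L, F, rng=np.random.default_rng(0)):
    """X_node: (n_samples, n_attributes) array of tuples reaching this node."""
    n_attrs = X_node.shape[1]
    features = []
    for _ in range(F):
        attrs = rng.choice(n_attrs, size=L, replace=False)   # pick L attributes
        coefs = rng.uniform(-1.0, 1.0, size=L)               # random coefficients
        features.append(X_node[:, attrs] @ coefs)            # one combined feature
    # The split search would then be run over these F candidate features.
    return np.column_stack(features)

print(random_linear_features(np.random.rand(5, 4), L=2, F=3).shape)  # (5, 3)
```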
Random Forests Vs AdaBoost
• Random forests are comparable in accuracy to AdaBoost, yet are
more robust to errors and outliers.
• The generalization error for a forest converges as long as the number
of trees in the forest is large.
• Thus, overfitting is not a problem. The accuracy of a random forest
depends on the strength of the individual classifiers and a measure of
the dependence between them.
• The ideal is to maintain the strength of individual classifiers without
increasing their correlation. Random forests are insensitive to the
number of attributes selected for consideration at each split.
Improving Classification Accuracy of
Class-Imbalanced Data
• Given two-class data, the data are class-imbalanced if the main class of interest
(the positive class) is represented by only a few tuples, while the majority of
tuples represent the negative class.
• The class imbalance problem is closely related to cost-sensitive learning, wherein
the costs of errors, per class, are not equal.
• In medical diagnosis, for example, it is much more costly to falsely diagnose a
cancerous patient as healthy (a false negative) than to misdiagnose a healthy
patient as having cancer (a false positive).
• A false negative error could lead to the loss of life and therefore is much more
expensive than a false positive error.
• Other applications involving class-imbalanced data include fraud detection, the
detection of oil spills from satellite radar images, and fault monitoring.
How to predict the class label of imbalanced data?
• With imbalanced classes, overall accuracy can be misleading, so more informative measures are used to assess the classifier:
• sensitivity or recall (the true positive rate),
• specificity (the true negative rate),
• F1, and
• ROC curves, which plot sensitivity versus 1 - specificity (i.e., the false positive rate).
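• As a hedged illustration, the points of an ROC curve (sensitivity versus 1 - specificity) can be computed with scikit-learn, an assumed dependency; y_true and y_score are illustrative names for the true labels and the classifier’s output scores:

```python
# Compute ROC points and the area under the curve for a binary classifier.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]                  # 1 = positive class
y_score = [0.1, 0.3, 0.2, 0.4, 0.8, 0.7, 0.6, 0.5, 0.9, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, y_score)          # fpr = 1 - specificity, tpr = sensitivity
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))
```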
How to improve accuracy on class-imbalanced data?
General approaches for improving the classification accuracy of class-imbalanced data
include
(1) Oversampling
―Oversampling works by resampling the positive tuples so that the resulting training set
contains an equal number of positive and negative tuples.
―Example SMOTE algorithm
(2) Undersampling
―Undersampling works by decreasing the number of negative tuples. It randomly eliminates
tuples from the majority (negative) class until there are an equal number of positive and
negative tuples.
(3) Threshold-moving, and
(4) Ensemble techniques
―As discussed previously
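• A hedged sketch of the random oversampling and undersampling in (1) and (2) above; names are illustrative, and SMOTE (mentioned above) would generate synthetic positive tuples rather than plain copies:

```python
# Balance a two-class training set by resampling positives (oversampling)
# or by randomly dropping negatives (undersampling).
import random

def oversample(positives, negatives, rng=random.Random(0)):
    # Resample (with replacement) the positive tuples until the classes are balanced.
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra, negatives

def undersample(positives, negatives, rng=random.Random(0)):
    # Randomly drop negative tuples until the classes are balanced.
    return positives, rng.sample(negatives, len(positives))

pos = [("p", i) for i in range(5)]
neg = [("n", i) for i in range(50)]
print(len(oversample(pos, neg)[0]), len(undersample(pos, neg)[1]))  # 50, 5
```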
Threshold-moving
• It applies to classifiers that, given an input tuple, return a continuous
output value.
• That is, for an input tuple X, such a classifier returns as output a mapping f(X) ∈ [0, 1].
• Rather than manipulating the training tuples, this method returns a classification decision based on the output values. In the simplest approach, tuples for which f(X) ≥ t, for some threshold t, are considered positive, while all other tuples are considered negative.
• Other approaches may involve manipulating the outputs by weighting.
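• A minimal sketch of threshold-moving, assuming the classifier’s continuous outputs f(X) in [0, 1] are already available (names and values are illustrative):

```python
# Compare each output score against a threshold t; lowering t lets the
# classifier catch more of the rare positive class.
def classify_with_threshold(scores, t):
    return ["positive" if fx >= t else "negative" for fx in scores]

scores = [0.15, 0.40, 0.55, 0.90]               # f(X) for four test tuples
print(classify_with_threshold(scores, t=0.5))   # default threshold
print(classify_with_threshold(scores, t=0.3))   # moved threshold favors positives
```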