Classification
• K-NN is a lazy learner: it does not create any model immediately from the training data, and this is where the "lazy" in lazy learning comes from.
• Lazy learners just memorize the training data, and each time there is a need to make a prediction, they search for the nearest neighbors in the whole training set.
• At the training phase, the KNN algorithm simply stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.
• Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will compare the features of the new image with the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie?
• With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
• The K-NN working can be explained on the basis of the algorithm below (see the sketch after the steps):
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the distance (e.g., Euclidean distance) from the new data point to every training point.
• Step-3: Take the K nearest neighbors according to the calculated distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
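As a concrete illustration of these steps, here is a minimal from-scratch sketch of a KNN classifier. The tiny dataset, the choice K = 3, and the use of Euclidean distance are assumptions made for this example only.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step-2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step-3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step-4: count the neighbors that fall in each category
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    # Step-5: assign the category with the maximum number of neighbors
    return labels[np.argmax(counts)]

# Made-up example: two features, categories 0 ("A") and 1 ("B")
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [3.9, 4.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # prints 0
```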
• The value of K that delivers the best accuracy for both training and testing data is selected (a selection sketch follows this list).
• It is recommended to always select an odd value of K.
• When the value of K is even, a situation may arise in which the number of elements from both groups is equal. In the diagram below, elements from both groups are equal in the inner “Red” circle (k == 4).
• Because an odd value of K guarantees that one of the two groups will be in the majority, K is selected as odd.
• The impact of selecting a smaller or larger K value on the model
• Larger K value: The case of underfitting occurs when the value of
k is increased. In this case, the model would be unable to
correctly learn on the training data.
• Smaller k value: The condition of overfitting occurs when the
value of k is smaller. The model will capture all of the training
data, including noise. The model will perform poorly for the test
data in this scenario.
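A common way to pick K in practice is to try a range of values and keep the one with the best cross-validated accuracy. Below is a minimal sketch assuming scikit-learn and its built-in iris dataset; the candidate K values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd K values with 5-fold cross-validation and keep the best one
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with mean accuracy", scores[best_k])
```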
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be effective when the training data is large.
Disadvantages of KNN Algorithm:
• Always needs to determine the value of K, which may sometimes be complex.
• The computation cost is high, because the distance to every training sample must be calculated for each prediction.
Case-based reasoning
• Case-based reasoning is any kind of problem-solving
approach that uses past solutions to solve similar problems.
• It assumes that knowledge can be acquired through past
experiences, and can help warn you of avenues that will lead
to failure or to help you think of successful past solutions that
could be adapted to the problem at hand.
• For example, Google Maps uses case-based reasoning to tell
you how long your journey will take by examining the
patterns of past users to see how long it took them to get from
point A to point B. Even if your route runs between slightly different points, it makes inferences about how long your journey will take.
Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other metrics to consider?
Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class |          C1           |         ¬C1
C1                             | True Positives (TP)   | False Negatives (FN)
¬C1                            | False Positives (FP)  | True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: Precision = TP / (TP + FP)
Recall: completeness – what % of positive tuples the classifier labeled as positive: Recall = TP / (TP + FN)
F-measure (F1): harmonic mean of precision and recall: F = 2 × Precision × Recall / (Precision + Recall)
(see the computation sketch below)
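A minimal sketch of computing these metrics from a confusion matrix, assuming scikit-learn; the true and predicted labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # classifier's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # 2PR / (P + R)
```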
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3 of the data) for model construction and a test set (e.g., 1/3) for accuracy estimation
Random subsampling: a variation of holdout in which the holdout is repeated k times and the overall accuracy is the average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets (folds) of approximately equal size
At the i-th iteration, use fold Di as the test set and the remaining folds as the training data
*Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data (a cross-validation sketch follows)
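A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn and its iris data; k = 10 follows the slide's suggestion, and the decision tree is an arbitrary choice of classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class distribution of each fold close to the original data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```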
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is
Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
(a sampling sketch follows)
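A minimal sketch of one bootstrap round showing the ~63.2% / 36.8% split, using plain NumPy; the data size d here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                    # number of tuples in the data set
indices = np.arange(d)

# Sample d times with replacement to form the bootstrap training set
train_idx = rng.choice(indices, size=d, replace=True)
# Tuples never selected form the test set
test_idx = np.setdiff1d(indices, train_idx)

print("Fraction of unique tuples in training set:", len(np.unique(train_idx)) / d)  # ~0.632
print("Fraction of tuples in test set:           ", len(test_idx) / d)              # ~0.368
```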
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Suppose we have 2 classifiers, M1 and M2, which one is better?
Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
These mean error rates are just estimates of error on the true population of future
data cases
What if the difference between the 2 error rates is just attributed to chance?
Use a test of statistical significance
Obtain confidence limits for our error estimates
Estimating Confidence Intervals:
Null Hypothesis
Perform 10-fold cross-validation
Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
Use t-test (or Student’s t-test)
Null Hypothesis: M1 & M2 are the same
If we can reject null hypothesis, then
we conclude that the difference between M1 & M2 is statistically significant
Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
If only one test set is available (pairwise comparison: the same cross-validation partitioning is used for M1 and M2):
t = (err(M1) − err(M2)) / sqrt(var(M1 − M2) / k)
where err(M1) and err(M2) are the mean error rates and
var(M1 − M2) = (1/k) Σ_{i=1}^{k} [err(M1)_i − err(M2)_i − (err(M1) − err(M2))]²
If two test sets are available, use a non-paired t-test with
var(M1 − M2) = var(M1)/k1 + var(M2)/k2
where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp.
Estimating Confidence Intervals:
Table for t-distribution
The t-distribution is symmetric
Significance level: e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population
Confidence limit: z = sig/2
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
Compute t. Select significance level (e.g. sig = 5%)
Consult table for t-distribution: Find t value corresponding to k-1 degrees of
freedom (here, 9)
t-distribution is symmetric: typically upper % points of distribution shown → look
up value for confidence limit z=sig/2 (here, 0.025)
If t > z or t < -z, then t value lies in rejection region:
Reject the null hypothesis that the mean error rates of M1 & M2 are the same, and conclude that the difference is statistically significant
Otherwise, conclude that any observed difference is due to chance
(a t-test sketch follows)
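A minimal sketch of this comparison, assuming scikit-learn and SciPy; the two models (a decision tree and naive Bayes) and the dataset are arbitrary stand-ins for M1 and M2.

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Error rates of M1 and M2 on the same 10 cross-validation folds (paired design)
err_m1 = 1 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
err_m2 = 1 - cross_val_score(GaussianNB(), X, y, cv=10)

# Paired t-test with k - 1 = 9 degrees of freedom
t_stat, p_value = stats.ttest_rel(err_m1, err_m2)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant")
else:
    print("Cannot reject the null hypothesis: the difference may be due to chance")
```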
Model Selection: ROC Curves
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true
positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
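A minimal sketch of plotting an ROC curve and computing its area, assuming scikit-learn and matplotlib; the logistic regression model and the dataset are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scores used to rank test tuples by how likely they are to be positive
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
print("Area under the ROC curve:", auc(fpr, tpr))  # 1.0 = perfect, 0.5 = diagonal

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="diagonal (random guessing)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```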
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods:
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the
most votes to X
Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
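A minimal sketch of bagging decision trees, assuming scikit-learn; the base estimator, the number of classifiers (25), and the dataset are illustrative choices, not prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the 25 trees is trained on a bootstrap sample of D;
# the bagged classifier combines their votes
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=25, random_state=0)

print("Single tree accuracy:", cross_val_score(single, X, y, cv=10).mean())
print("Bagged trees accuracy:", cross_val_score(bagged, X, y, cv=10).mean())
```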
Boosting
Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
Boosting algorithm can be extended for numeric prediction
Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
Tuples from D are sampled (with replacement) to form a training set Di
of the same size
Each tuple’s chance of being selected is based on its weight
A classification model Mi is derived from Di
Its error rate is calculated using Di as a test set
If a tuple is misclassified, its weight is increased; otherwise it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σ_j wj × err(Xj)
The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
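A minimal sketch using scikit-learn's AdaBoost implementation; the base learner (depth-1 decision stumps), the number of rounds k = 50, and the dataset are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# k = 50 boosting rounds; each round pays more attention to tuples
# misclassified by the previous stump by increasing their weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
ada = AdaBoostClassifier(stump, n_estimators=50, random_state=0)

print("Single stump accuracy:  ", cross_val_score(stump, X, y, cv=10).mean())
print("Boosted stumps accuracy:", cross_val_score(ada, X, y, cv=10).mean())
```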
Random Forest (Breiman 2001)
Random Forest:
Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split; during classification, each tree votes and the most popular class is returned
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or features) that are linear combinations of the existing attributes, which reduces the correlation between the individual classifiers
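A minimal sketch of a Forest-RI-style random forest via scikit-learn; mapping F (the number of candidate attributes per split) to max_features="sqrt", the forest size of 100 trees, and the dataset are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 100 trees; at each node only sqrt(#attributes) randomly chosen attributes
# are considered as candidates for the split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("Random forest accuracy:", cross_val_score(forest, X, y, cv=10).mean())
```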
Classification of class-imbalanced data sets:
Threshold-moving: moves the decision threshold, t, so that
the rare class tuples are easier to classify, and hence, less
chance of costly false negative errors
Ensemble techniques: Ensemble multiple classifiers
introduced above
Still difficult for class imbalance problem on multiclass tasks
Chapter 8. Classification: Basic Concepts