This document discusses machine learning classification tasks and performance evaluation metrics. It covers classifying images from the MNIST dataset using algorithms like logistic regression and decision trees. Various performance metrics are examined, including accuracy, precision, recall, F1 score, and confusion matrices. Tradeoffs between precision and recall are also addressed.
▪ Classification tasks
▪ Different performance measures for evaluating classifiers

Classification
▪ The most common supervised learning tasks are regression (predicting values) and classification (predicting classes).
▪ We previously explored a regression task, predicting housing values, using various algorithms such as Linear Regression, Decision Trees, and Random Forests.
▪ Now we will turn our attention to classification.

MNIST dataset
▪ The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.
▪ Each image is labeled with the digit it represents.
▪ This set is often called the “Hello World” of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST.
▪ The following code fetches the MNIST dataset.
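The exact fetch code is not reproduced in these notes; the following is a minimal sketch, assuming scikit-learn’s fetch_openml helper is used to download MNIST:

    from sklearn.datasets import fetch_openml

    # Download MNIST from OpenML (cached locally after the first call).
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist["data"], mnist["target"]

    print(X.shape)  # (70000, 784): 70,000 images, 784 features each
    print(y.shape)  # (70000,): one label (the digit, as a string) per image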
MNIST dataset
▪ There are 70,000 images, and each image has 784 features.
▪ This is because each image is 28×28 pixels, and each feature simply represents one pixel’s intensity, from 0 (black) to 255 (white).
▪ The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images).

Training a Binary Classifier
▪ Let’s simplify the problem for now and only try to identify one digit, for example the number 5.
▪ This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and not-5.
▪ A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using the SGDClassifier class.
▪ This classifier is capable of handling very large datasets efficiently.
▪ This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning).

Performance Measures
▪ Evaluating a classifier is often significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic.
▪ There are many performance measures available.
▪ Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).
▪ A much better way to evaluate the performance of a classifier is to look at the confusion matrix.

Performance Measures: Confusion Matrix
▪ The general idea is to count the number of times instances of class A are classified as class B.
▪ For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.
▪ To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to the actual targets.
▪ You could make predictions on the test set, as in the sketch below.
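A minimal sketch of the 5-detector and its confusion matrix, assuming X and y come from the fetch sketch above (variable names such as y_train_5 are illustrative):

    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import confusion_matrix

    # Use the conventional MNIST split: first 60,000 train, last 10,000 test.
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    # Binary targets for the 5-detector: True for 5s, False for every other digit.
    y_train_5 = (y_train == '5')
    y_test_5 = (y_test == '5')

    # Train the SGD classifier on the binary task.
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)

    # Compare predictions against the actual targets.
    y_test_pred = sgd_clf.predict(X_test)
    print(confusion_matrix(y_test_5, y_test_pred))

In the printed matrix, each row corresponds to an actual class (not-5, then 5) and each column to a predicted class.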
Performance Measures: Precision and Recall
▪ The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric.
▪ An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier.
▪ A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%).
▪ This would not be very useful, since the classifier would ignore all but one positive instance.
▪ Therefore, precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): the ratio of positive instances that are correctly detected by the classifier.

Performance Measures: F1 Score
▪ It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers.
▪ The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
▪ As a result, the classifier will only get a high F1 score if both recall and precision are high.
▪ The F1 score favors classifiers that have similar precision and recall.
▪ This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
▪ For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product.
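A minimal sketch of computing these three metrics, assuming y_test_5 and y_test_pred from the 5-detector sketch above:

    from sklearn.metrics import precision_score, recall_score, f1_score

    # precision = TP / (TP + FP): accuracy of the positive predictions
    print(precision_score(y_test_5, y_test_pred))

    # recall = TP / (TP + FN): ratio of positive instances that are detected
    print(recall_score(y_test_5, y_test_pred))

    # F1 = 2 * precision * recall / (precision + recall): harmonic mean
    print(f1_score(y_test_5, y_test_pred))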
Performance Measures: Precision/Recall Tradeoff
▪ On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).
▪ Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff, as the sketch below illustrates.
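A minimal sketch of exploring that tradeoff, assuming sgd_clf, X_train, and y_train_5 from the earlier sketches; the 90% precision target at the end is purely illustrative:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # Decision scores (not hard predictions) for every training instance,
    # obtained via cross-validation so each score comes from a model that
    # never saw that instance during training.
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")

    # Precision and recall for every possible decision threshold.
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Raising the threshold increases precision but reduces recall, and vice
    # versa. For example, the lowest threshold giving at least 90% precision:
    threshold_90 = thresholds[np.argmax(precisions[:-1] >= 0.90)]
    y_pred_90 = (y_scores >= threshold_90)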
Summary
▪ Classification tasks
▪ Different metrics for evaluating the performance of classification algorithms