ML Notes UT-2
CLASSIFICATION
Supervised Machine Learning algorithms can be broadly classified into Regression
and Classification algorithms. Regression algorithms predict the output for continuous
values, but to predict categorical values, we need Classification algorithms.
2.1 Classification Algorithm:
The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations based on training data. In Classification, a
program learns from the given dataset or observations and then classifies new
observations into one of several classes or groups, such as Yes or No, 0 or 1, Spam or Not
Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
Learning technique, it takes labelled input data, meaning each input comes with its
corresponding output.
In a classification algorithm, a discrete output function y = f(x) maps the input variable x to a class label.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the diagram below, in which
there are two classes, Class A and Class B. Points within a class have features that are
similar to each other and dissimilar to those of the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
If the input feature vector to the classifier is a real vector x, then the output score is
y = f(w · x), where w is a vector of weights and f is a function that converts the weighted
sum into the predicted class.
2.2 Performance Evaluation:
• Confusion Matrix: As the target variable is not continuous, a binary classification
model predicts the probability that the target variable is Yes or No. To evaluate such
a model, a metric called the confusion matrix is used, also called the classification
or coincidence matrix. With the help of a confusion matrix, we can calculate the
following important performance measures:
• Accuracy: Accuracy is the simple ratio of the number of correctly classified points
to the total number of points.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
This term tells us how many correct classifications were made out of all classifications,
i.e. how many TPs and TNs there were out of TP + TN + FP + FN. It is the ratio of
"True"s to the sum of "True"s and "False"s.
Use case: Out of all the patients who visited the doctor, how many were correctly
diagnosed as Covid positive and Covid negative.
• Precision: Precision is the fraction of true positive examples among all the
examples that the model classified as positive. In other words, it is the number of
true positives divided by the sum of true positives and false positives:
Precision = TP / (TP + FP)
Low precision: the more False positives the model predicts, the lower the precision.
Use case: Let's take another example of a classification algorithm that marks emails as
spam or not spam. Here, positive means spam, so if important emails get wrongly marked
as positive, useful emails will end up in the "Spam" folder, which is dangerous. Hence, the
classification model with the lowest FP count needs to be selected; in other words, the
model with the highest precision needs to be selected among all the models.
• Recall or Sensitivity: Recall is the fraction of actual positive instances that the
model correctly classifies as positive. It is the number of true positives divided by
the sum of true positives and false negatives:
Recall = TP / (TP + FN)
Low recall: the more False Negatives the model predicts, the lower the recall.
Use case: Out of all the actual Covid patients who visited the doctor, how many were
correctly diagnosed as Covid positive? Hence, the classification model with the lowest
FN count needs to be selected; in other words, the model with the highest recall value
needs to be selected among all the models.
Precision helps us understand how useful the results are. Recall helps us understand how
complete the results are.
• ROC Curves: A Receiver Operating Characteristic curve, or ROC curve, is created by
plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various
threshold settings. The ROC curve is generated by plotting the cumulative distribution
function of the True Positive Rate on the y-axis versus the cumulative distribution
function of the False Positive Rate on the x-axis.
• F-Measure: Once precision and recall have been calculated for a binary classification
problem, the two scores can be combined into the calculation of the F-Measure.
The traditional F measure is calculated as follows:
F-Measure = (2 * Precision * Recall) / (Precision + Recall)
This is the harmonic mean of the two fractions. This is sometimes called the F-Score or
the F1-Score and might be the most common metric used on imbalanced classification
problems.
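A minimal sketch (assuming scikit-learn is available) of computing the measures above from a set of true labels and model predictions; the label arrays below are made up purely for illustration, with 1 standing for the positive (e.g. Covid-positive) class.

# Compute confusion matrix, accuracy, precision, recall, and F1 with scikit-learn
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # classes predicted by some model

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of the two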
2.3 Multi-class Classification: Multiclass classification is a classification task with more
than two classes, where each sample can only be labelled as one class. Each training point
belongs to one of N different classes, and the goal is to construct a function which, given a
new data point, will correctly predict the class to which the new point belongs. (There are
also scenarios in which a given point can belong to multiple categories at once; in its most
basic form, that multi-label problem decomposes trivially into a set of unlinked binary
problems, which can be solved naturally using our techniques for binary classification.)
For example, classification using features extracted from a set of images of fruit, where
each image may either be of an orange, an apple, or a pear. Each image is one sample and
is labelled as one of the 3 possible classes. Multiclass classification assumes that each
sample is assigned to one and only one label - one sample cannot, for example, be both a
pear and an apple.
2.3.1 Binary vs Multiclass Classification:
2. One vs Rest: One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a
heuristic method for using binary classification algorithms for multi-class
classification. It involves splitting the multi-class dataset into multiple binary
classification problems: we train C binary classifiers fc(x), where the data from
class c is treated as positive and the data from all other classes is treated as
negative. A prediction is then made using the model that is the most confident, as
in the sketch below.
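A brief sketch of the one-vs-rest idea using scikit-learn (assumed available); the iris dataset merely stands in for a three-class problem like the fruit example above. OneVsRestClassifier trains one binary logistic regression per class and predicts with the most confident one.

# One-vs-rest: one binary classifier per class, most confident classifier wins
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # 3 classes, so 3 binary problems
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))                # 3 underlying binary classifiers
print(ovr.predict(X[:5]))                  # most confident class per sample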
2.4 Linear Models:
Linear modelling in a classification context consists of regression followed by a
transformation that returns a categorical output, thereby producing a decision
boundary. The two most commonly used linear classification algorithms are logistic regression and
linear support vector machines. In the field of machine learning, the goal of statistical
classification is to use an object's characteristics to identify which class (or group) it
belongs to. A linear classifier achieves this by making a classification decision based on
the value of a linear combination of the characteristics.
The mathematical formula used to make a prediction in binary classification is given below:
ŷ = w[0] * x[0] + w[1] * x[1] + … + w[p] * x[p] + b > 0
The formula is quite like the one used in linear regression, but here, instead of returning
the weighted sum of the features directly, the predicted value is thresholded at zero. If the
weighted sum is less than zero, the class is predicted as -1, and if it is greater than zero,
the class is predicted as +1. This common rule is used in all linear models for
classification.
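A small sketch of this decision rule; the weights w, bias b, and input x below are hypothetical values chosen only to illustrate the thresholding, not the output of any trained model.

# Threshold the weighted sum of the features at zero
import numpy as np

w = np.array([0.5, -1.2, 0.3])   # learned weights (hypothetical values)
b = 0.1                          # learned bias / intercept (hypothetical)
x = np.array([1.0, 0.4, 2.0])    # one input feature vector

score = np.dot(w, x) + b         # weighted sum of the features plus bias
y_hat = 1 if score > 0 else -1   # threshold at zero, as described above
print(score, y_hat)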
2.4.1 Linear Support Vector Machines (SVM):
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning. The goal of the SVM
algorithm is to create the best line or decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane. SVM chooses the extreme
points/vectors that help in creating the hyperplane. These extreme cases are called
support vectors, and hence the algorithm is termed Support Vector Machine. Consider the
diagram below, in which two different categories are classified using a decision boundary
or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of a dog, and we want a model
that can accurately identify whether it is a cat or a dog. Such a model can be created
using the SVM algorithm. We first train our model with lots of images of cats and dogs so
that it can learn their different features, and then we test it with this strange creature.
The SVM creates a decision boundary between the two classes (cat and dog) and chooses
the extreme cases (support vectors) of each class. Based on the support vectors, it will
classify the creature as a cat. Consider the diagram below:
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the closest
points of the two classes to the line; these points are called support vectors.
The distance between the support vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
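A short sketch (assuming scikit-learn) of reading the support vectors and the margin off a fitted linear SVM; the synthetic blob data and the large C value are illustrative choices, and the margin width is computed as 2/||w|| for a linear hyperplane.

# Fit a linear SVM and inspect its support vectors and margin
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=6)
svm = SVC(kernel="linear", C=1000).fit(X, y)

print(svm.support_vectors_)                     # the extreme points defining the margin
w = svm.coef_[0]                                # weight vector of the hyperplane
print("margin width:", 2 / np.linalg.norm(w))   # distance between the two margin lines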
If the data is not linearly separable in its original two dimensions, a third dimension z
can be added (for example, z = x² + y²), and SVM will then divide the dataset into classes
in the following way. Consider the image below: since we are now in 3-D space, the decision
boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with
z = 1, it becomes a circular boundary in the original plane:
Suppose we are given two hyperplanes, one with 100% accuracy (HP1) on the left side
and another with >90% accuracy (HP2) on the right side. Which one would you think is
the correct classifier? Most of us would pick HP2 because of its larger margin, but that is
the wrong answer. The Support Vector Machine would choose HP1, even though it has a
narrow margin, because HP2 goes against the constraint that each data point must lie on
the correct side of the margin and there should be no misclassification. This is the hard
constraint that the Support Vector Machine follows throughout.
HP1 is a hard-margin SVM (left side) while HP2 is a soft-margin SVM (right side). By
default, the Support Vector Machine implements hard-margin SVM, which works well only
if our data is linearly separable. Hard-margin SVM does not allow any misclassification to
happen, so if our data is non-separable/nonlinear, it will not return any hyperplane, as it
will not be able to separate the data. This is where soft-margin SVM comes to the rescue:
it allows some misclassification to happen by relaxing the hard constraints of the Support
Vector Machine. Soft-margin SVM is implemented with the help of the regularization
parameter C, which tells us how much misclassification we want to avoid (see the sketch
below):
– Hard-margin SVM generally corresponds to large values of C.
– Soft-margin SVM generally corresponds to small values of C.
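A sketch of the effect of C with scikit-learn's SVC (a linear kernel is assumed here); the synthetic data and the specific C values are illustrative only. A very large C approximates the hard margin, while a small C yields a soft margin that tolerates more misclassification.

# Compare a (nearly) hard margin with a soft margin via the C parameter
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

hard_like = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C: almost no slack allowed
soft = SVC(kernel="linear", C=0.01).fit(X, y)       # small C: wider, more tolerant margin

# A softer margin typically relies on more support vectors.
print(len(hard_like.support_vectors_), len(soft.support_vectors_))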
2.6 SVM Kernel to handle non-linear data:
SVM can be extended to solve nonlinear classification tasks when the set of samples
cannot be separated linearly. By applying kernel functions, the samples are mapped onto
a high-dimensional feature space, in which the linear classification is possible.
• Gaussian Radial Basis Function (RBF): This is one of the most preferred and widely
used kernel functions in SVM. It is usually chosen for non-linear data, as it helps to
make a proper separation when there is no prior knowledge of the data. It is defined
as K(x1, x2) = exp(-γ ||x1 - x2||²), where the value of γ (gamma) typically varies
from 0 to 1 and must be provided manually.
• Gaussian Kernel: The Gaussian kernel is a very popular kernel function used in
many machine learning algorithms, especially in support vector machines (SVMs).
It is more often used than polynomial kernels when learning from nonlinear
datasets and is usually employed in formulating the classical SVM for nonlinear
problems. The Gaussian kernel function allows the separation of nonlinearly
separable data by mapping the input vectors into a Hilbert space. The Gaussian kernel
is an exponential function involving a norm and a real constant.
• Polynomial: In general, the polynomial kernel is defined as K(x1, x2) = (x1 · x2 + c)^d,
where d is the degree of the polynomial and c is a constant. A short sketch of the RBF
and polynomial kernels follows.
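A sketch (scikit-learn assumed) of fitting the RBF and polynomial kernels on data that is not linearly separable; the gamma, degree, and coef0 values are illustrative and not tuned.

# Kernel SVMs on non-linear data (concentric circles)
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

rbf = SVC(kernel="rbf", gamma=0.5).fit(X, y)            # Gaussian RBF kernel
poly = SVC(kernel="poly", degree=3, coef0=1).fit(X, y)  # polynomial kernel of degree 3

print("RBF training accuracy :", rbf.score(X, y))
print("Poly training accuracy:", poly.score(X, y))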
2.7 Logistic Regression: Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact value 0 or 1, it gives probabilistic values
which lie between 0 and 1. Logistic Regression is much like Linear Regression except
in how it is used: Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems. In Logistic
regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1). The curve from the logistic function indicates the
likelihood of something such as whether the cells are cancerous or not, a mouse is obese
or not based on its weight, etc. Logistic Regression is a significant machine learning
algorithm because it can provide probabilities and classify new data using continuous
and discrete datasets. Logistic Regression can be used to classify the observations using
different types of data and can easily determine the most effective variables used for the
classification. This type of statistical model (also known as logit model) is often used for
classification and predictive analytics. Logistic regression estimates the probability of an
event occurring, such as voted or didn’t vote, based on a given dataset of independent
variables. Since the outcome is a probability, the dependent variable is bounded between
0 and 1. In logistic regression, a logit transformation is applied on the odds—that is, the
probability of success divided by the probability of failure.
The logistic model starts from the straight-line equation y = b0 + b1*x1 + b2*x2 + … + bn*xn.
In Logistic Regression, y can only be between 0 and 1, so to allow for this let's divide the
above equation by (1 - y), which gives the odds:
y / (1 - y), which is 0 for y = 0 and infinity for y = 1.
But we need a range between -infinity and +infinity, so taking the logarithm of the
equation, it becomes:
log(y / (1 - y)) = b0 + b1*x1 + b2*x2 + … + bn*xn
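To connect the two directions of the transformation, here is a small sketch of the logistic (sigmoid) function that inverts the logit above, mapping the unbounded linear score back into a probability between 0 and 1; the coefficients and inputs are hypothetical.

# The sigmoid maps the log-odds back to a probability in (0, 1)
import numpy as np

def sigmoid(z):
    # p = 1 / (1 + e^(-z)), the inverse of the logit log(p / (1 - p)) = z
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.8                    # hypothetical coefficients
x = np.array([-3.0, 0.0, 1.25, 4.0])  # hypothetical input values
z = b0 + b1 * x                       # linear combination (the log-odds)
print(sigmoid(z))                     # probabilities strictly between 0 and 1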
2.7.2 Steps in Logistic Regression: To implement the Logistic Regression using Python,
we will use the same steps as we have done in the previous Regression topics. The steps
are listed below, followed by a compact sketch:
• Data Pre-processing step
• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result (Creation of Confusion matrix)
• Visualizing the test set result.
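A compact sketch of these steps with scikit-learn (assumed available); the breast-cancer dataset is only a stand-in for whichever dataset the notes use, and the visualization step is left as a comment.

# Logistic Regression following the steps above
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# 1. Data pre-processing: split into train/test sets and scale the features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 2. Fit Logistic Regression to the training set
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Predict the test results
y_pred = clf.predict(X_test)

# 4. Test accuracy of the result (confusion matrix)
print(confusion_matrix(y_test, y_pred))
print("Test accuracy:", accuracy_score(y_test, y_pred))

# 5. Visualizing the test-set result would follow here (e.g. with matplotlib)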