Unit 5 - DA - Classification & Clustering
Classification is the task of separating data points into different classes, i.e., assigning a class label to each instance in a dataset based on its features.
The goal of classification is to build a model that accurately predicts the class labels of new instances from their features.
There are two main types of classification: binary classification and multi-class
classification. Binary classification involves classifying instances into two classes,
such as “spam” or “not spam”, while multi-class classification involves classifying
instances into more than two classes.
A cluster is a group of objects that belong to the same class: similar objects are grouped in one cluster and dissimilar objects are placed in another cluster.
In clustering (also called cluster analysis), data objects are grouped so that the objects within one group are similar to each other; one group forms a cluster of data. In cluster analysis, a dataset is divided into different groups based on the similarity of the data.
After the data have been divided into groups, a label is assigned to each group. Working with labelled groups in this way also makes it easier to adapt to changes in the data.
Both classification and clustering are used to place objects into one or more classes based on their features. The two appear to be similar processes, and the basic difference between them is subtle: in classification, predefined labels are assigned to each input instance according to its features, whereas in clustering those labels are missing.
Parameter | Classification | Clustering
Basic | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity, without the help of class labels
Need | It has labels, so a training and testing dataset is needed to verify the model created | There is no need of a training and testing dataset
Complexity | More complex as compared to clustering | Less complex as compared to classification
Prior knowledge of classes | Yes (the number of classes is known) | No (the number of classes is unknown)
Classification and Clustering Example
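As a minimal sketch of the difference (my own illustration, assuming scikit-learn; the toy data and parameters are not from the slides), a classifier is trained with labels while a clustering model only sees the features:

```python
# Minimal sketch (assumed scikit-learn API): classification needs labels, clustering does not.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]     # features of four instances
y = ["A", "A", "B", "B"]                 # class labels, available only in the classification setting

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # supervised: trained on X and y
print(clf.predict([[2, 1]]))                           # predicts one of the predefined labels, here 'A'

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # unsupervised: only X is given
print(km.labels_)                                             # discovered group indices, not named classes
```

The classifier returns one of the predefined labels, while K-Means only returns group indices that still have to be interpreted.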
Distance-based algorithms measure the distance between data points.
Distance measures play an important role in classification and clustering: they provide the foundation for many popular and effective algorithms such as KNN (K-Nearest Neighbors) for classification and K-Means for clustering.
Different distance measures can be chosen depending on the type of data. A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain. Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house) or an event (such as a purchase, a claim, or a diagnosis).
If the distance between two data points is small, there is a high degree of similarity between the objects, and vice versa.
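As a small illustrative sketch (my own code, not from the slides), two common distance measures computed with NumPy:

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(np.abs(a - b))

p, q = [3, 7], [7, 4]        # two data points described by two features
print(euclidean(p, q))       # 5.0 -> a small distance means a high degree of similarity
print(manhattan(p, q))       # 7.0
```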
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, lazy learning algorithm that uses proximity to classify or group an individual data point.
Lazy learning algorithm − it has no specialized training phase and uses all of the training data at classification time.
Non-parametric learning algorithm − it does not assume anything about the underlying data.
The k value in the k-NN algorithm defines how many neighbors will be checked to determine
the classification of a specific query point.
If k is set to 1, the instance is assigned to the same class as its single nearest neighbor. If k is set to 5, the classes of the 5 closest points are checked.
Defining k can be a balancing act, as different values can lead to overfitting or underfitting.
Lower values of k have low bias but high variance, while larger values of k lead to higher bias and lower variance.
The choice of k largely depends on the input data: data with more outliers or noise will likely perform better with higher values of k.
Overall, an odd value of k is recommended to avoid ties in classification, and cross-validation can help to choose the optimal k for the dataset (see the sketch below).
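A hedged sketch of this tuning step, assuming the scikit-learn API and its built-in Iris data purely for illustration:

```python
# Hedged sketch (assumed scikit-learn API): pick an odd k by 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9, 11]:                     # odd values of k help avoid voting ties
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))               # the k with the best cross-validated accuracy
```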
Working of KNN
Given is data from a questionnaire survey with two attributes (X1: Acid durability and X2: Strength), used to classify whether a special paper tissue is good or not.
Classify a new paper tissue with features X1 = 3 and X2 = 7.
Step 3: Determine the K nearest neighbors, i.e., the points at minimum distance, and rank them.

X1 | X2 | Class | Distance to (3, 7) | Rank
7 | 4 | Bad | 5 | 4
3 | 4 | Good | 3 | 1
1 | 4 | Good | 3.6 | 2
Step 4: Apply Simple Majority
There are 2 “Good” and 1 “Bad”. Thus the new paper tissue is classified as Good by simple majority (reproduced in the sketch below).
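The same example can be reworked in a few lines of plain Python; this is my own sketch using the training rows listed in the table above:

```python
# Reworking the paper-tissue example with the training rows shown above (K = 3).
import math

train = [((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)                                    # new tissue: X1 = 3, X2 = 7
k = 3

neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]   # Euclidean distance

votes = [label for _, label in neighbors]
prediction = max(set(votes), key=votes.count)     # simple majority vote
print(neighbors)                                  # distances 3, 3.6 and 5
print(prediction)                                 # 'Good' (2 Good vs 1 Bad)
```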
Given are the customer details (age, income, number of credit cards) and their corresponding class. Find the class label for the new data item (shown in red font):
Customer: RRR, Age: 37, Income: 50K, Cr Card: 2, Class: ?
Dataset
ID3 stands for Iterative Dichotomiser 3, named because the algorithm iteratively (i.e., repeatedly) dichotomizes (i.e., divides) the features into two or more groups at each step.
It is a classification algorithm that follows a top-down greedy approach, selecting at each step the feature that yields the maximum Information Gain (IG), or equivalently the minimum entropy (H) after the split.
Top-down means that the tree is built from the top, and greedy means that at each iteration the best feature at the present moment is selected to create a node.
Entropy is a measure of the amount of uncertainty in the dataset S, i.e.,
H(S) = ∑_{c ∈ C} −p(c) * log₂ p(c), where:
S - the current dataset for which entropy is being calculated,
C - the set of classes in S (e.g., C = {yes, no}),
p(c) - the proportion of the number of elements in class c to the number of elements in set S.
ID3 Algorithm cont…
C = {Yes, No}
Proportion of elements in class Yes: p(Yes) = 9/(9+5) = 9/14
Proportion of elements in class No: p(No) = 5/(9+5) = 5/14
H(S) = -((9/14) * log₂(9/14) + (5/14) * log₂(5/14)) = 0.94
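A small helper (my own sketch) that reproduces this number:

```python
# Entropy of a labelled dataset, reproducing the 9-Yes / 5-No calculation above.
import math

def entropy(counts):
    """H(S) = sum over classes of -p(c) * log2 p(c)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94
```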
Information Gain IG(A, S) tells how much the uncertainty in dataset S was reduced after splitting S on feature A, i.e.,
IG(A, S) = H(S) − I(A), where I(A) = ∑_{t ∈ T} p(t) * H(t) and
H(S) - entropy of dataset S,
T - the subsets created by splitting dataset S on feature A,
p(t) - the proportion of the number of elements in t to the number of elements in dataset S,
H(t) - entropy of subset t,
I(A) - average entropy information for feature A.
Complete entropy of the dataset (4 out of 12 are Yes and 8 out of 12 are No):
H(S) = - p(Yes) * log₂(p(Yes)) - p(No) * log₂(p(No))
= - (4/12) * log₂(4/12) - (8/12) * log₂(8/12)
= 0.528 + 0.390
= 0.918
Information Gain for Regular:

Sr No | Regular | Need Tutor
1 | Yes | Yes
2 | Yes | No
3 | Yes | Yes
4 | Yes | No
5 | No | Yes
6 | No | No
7 | No | Yes
8 | No | No
9 | Yes | No
10 | No | No
11 | No | No
12 | No | No

There are 5 rows (out of 12) with Regular = Yes, for which the target column has 2 Yes and 3 No. There are 7 rows (out of 12) with Regular = No, for which the target column has 2 Yes and 5 No.
H(Regular=Yes) = -(2/5)*log₂(2/5) - (3/5)*log₂(3/5) = 0.5288 + 0.4422 = 0.971
H(Regular=No) = -(2/7)*log₂(2/7) - (5/7)*log₂(5/7) = 0.516 + 0.347 = 0.863
I(Regular) = (5/12)*0.971 + (7/12)*0.863 = 0.908
IG(Regular, S) = H(S) − I(Regular) = 0.918 − 0.908 = 0.010
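A short sketch (my own code) that reproduces these figures for the Regular feature:

```python
# Information gain of the 'Regular' feature on the 12-row Need-Tutor table above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

H_S = entropy([4, 8])                       # target column: 4 Yes, 8 No
H_yes = entropy([2, 3])                     # Regular = Yes rows: 2 Yes, 3 No
H_no = entropy([2, 5])                      # Regular = No rows: 2 Yes, 5 No

I_regular = (5 / 12) * H_yes + (7 / 12) * H_no
print(round(H_S, 3), round(H_S - I_regular, 3))   # ~0.918 and IG(Regular) ~ 0.01
```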
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Second feature – Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log₂(2/4) - (2/4)*log₂(2/4) = 1
H(Temperature=cool) = -(3/4)*log₂(3/4) - (1/4)*log₂(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log₂(4/6) - (2/6)*log₂(2/6) = 0.9179
I(Temperature) = (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811 = 0.911
IG(S, Temperature) = 0.94 − 0.911 = 0.029
Third feature - Humidity
Categorical values - high, normal
H(Humidity=high) = -(3/7)*log₂(3/7) - (4/7)*log₂(4/7) = 0.985
H(Humidity=normal) = -(6/7)*log₂(6/7) - (1/7)*log₂(1/7) = 0.591
I(Humidity) = (7/14)*0.985 + (7/14)*0.591 = 0.788
IG(S, Humidity) = 0.94 − 0.788 = 0.152
Fourth feature – Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log₂(6/8) - (2/8)*log₂(2/8) = 0.811
H(Wind=strong) = -(3/6)*log₂(3/6) - (3/6)*log₂(3/6) = 1
I(Wind) = (8/14)*0.811 + (6/14)*1 = 0.892
IG(S, Wind) = 0.94 − 0.892 = 0.048
When the tree is later grown below the Sunny branch of Outlook, the same calculation on that subset gives:
Information Gain (Sunny, Wind) = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202
What is a hyperplane?
A hyperplane is a generalization of a plane.
In one dimension, it is a point.
In two dimensions, it is a line.
In three dimensions, it is a plane.
In higher dimensions, it is simply called a hyperplane.
The following figure represents data points in one dimension, and the point L is a separating hyperplane.
Here, maximizing the distance between the nearest data points (of either class) and the hyperplane helps to decide the right hyperplane. This distance is called the margin.
Some of you may have selected hyperplane B, as it has a higher margin than A. But here is the catch: SVM selects the hyperplane that classifies the classes accurately before maximizing the margin. Hyperplane B has a classification error, while A has classified everything correctly. Therefore, the right hyperplane is A.
SVM cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
The lone star at the other end acts as an outlier for the star class. The SVM algorithm has a feature to ignore outliers and find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Linear SVM: Linear SVM is used for data that are linearly separable, i.e., a dataset that can be divided into two categories by a single straight line. Such data points are termed linearly separable data, and the classifier used is described as a linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for data that are not linearly separable, i.e., a dataset that a straight line cannot classify. For this, we use something known as the kernel trick, which maps the data points into a higher dimension where they can be separated by planes or other mathematical functions. Such data points are termed non-linear data, and the classifier used is termed a non-linear SVM classifier. A small sketch of the two cases is given below.
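A hedged sketch of the two cases, assuming the scikit-learn SVC API; the circular toy dataset and parameter values are illustrative assumptions:

```python
# Sketch (assumed scikit-learn API): linear SVM vs. a kernel (non-linear) SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # circular class boundary: not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # kernel trick: implicit higher-dimensional mapping

print("linear accuracy:", linear_svm.score(X, y))   # poor on this data
print("rbf accuracy:", rbf_svm.score(X, y))         # much better
```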
The Naïve Bayes algorithm is made up of the two words Naïve and Bayes, which can be described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Now, the value of each term can be obtained by looking at the dataset and substituting it into the equation. For all entries in the dataset, the denominator does not change; it remains constant.
Note: Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Using this function, P(y | x₁, …, xₙ) ∝ P(y) * P(x₁ | y) * … * P(xₙ | y), we can obtain the class given the predictors.
P(Y) = 9/ 14 and P(N) = 5/14 where Y stands for Yes and N stands for No.
The outlook probability is: P(sunny | Y) = 2/9, P(overcast | Y) = 4/9, P(rain | Y) = 3/9,
P(sunny | N) = 3/5, P(overcast | N) = 0, P(rain | N) = 2/5
The temperature probability is: P(hot | Y) = 2/9, P(mild | Y) = 4/9, P(cool | Y) = 3/9,
P(hot | N) = 2/5, P(mild | N) = 2/5, P(cool | N) = 1/5
The humidity probability is: P(high | Y) = 3/9, P(normal | Y) = 6/9, P(high | N) = 4/5,
P(normal | N) = 2/5.
The windy probability is: P(true | Y) = 3/9, P(false | Y) = 6/9, P(true | N) = 3/5, P(false
| N) = 2/5
Now we want to predict “play golf” today with the conditions <outlook = sunny; temperature = cool; humidity = high; windy = true>, so today = (sunny, cool, high, true).
P(Y | today) = P(sunny | Y) * P(cool | Y) * P(high | Y) * P(true | Y) * P(Yes) = 2/9 * 3/9 *
3/9 * 3/9 * 9/ 14 = 0.005
P(N | today) = P(sunny | N) * P(cool | N) * P(high | N) * P(true | N) * P(N) = 3/5 * 1/5 *
4/5 * 3/5 * 5/14 = 0.020
Since the probability of No is larger, we predict that golf will not be played today.
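The arithmetic above can be checked with a few lines of Python (my own sketch, using the probabilities listed earlier):

```python
# Checking the 'play golf today?' prediction with the probabilities listed above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(sunny|Y) P(cool|Y) P(high|Y) P(true|Y) P(Y)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(sunny|N) P(cool|N) P(high|N) P(true|N) P(N)

print(round(p_yes, 3), round(p_no, 3))                 # ~0.005 and ~0.021 (about 0.02), as above
print("Play golf?", "Yes" if p_yes > p_no else "No")   # No
```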
Pros
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons
It assumes independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (sketched below).
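A minimal sketch of Laplace (add-one) smoothing; the function name and the overcast/No example are my own illustration:

```python
# Laplace (add-one) smoothing: avoids a zero probability for categories unseen in training.
def smoothed_probability(count_in_class, class_total, n_categories, alpha=1):
    """P(category | class) with add-alpha smoothing."""
    return (count_in_class + alpha) / (class_total + alpha * n_categories)

# 'overcast' was never seen with class No (0 of the 5 No rows, 3 outlook categories):
print(smoothed_probability(0, 5, 3))   # 0.125 instead of 0, so the product is not wiped out
```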
Clustering is the task of dividing the data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups.
It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
The following is an example of finding clusters of data points based on their income and debt.
The data points clustered together can be treated as one single group. The clusters can be distinguished, and we can identify that there are 3 clusters.
Now, based on the similarity of these clusters, the most similar clusters are combined, and this process is repeated until only a single cluster is left.
Proximity Matrix
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
The diagonal elements of this matrix are always 0, as the distance of a point from itself is always 0. The Euclidean distance formula is used to calculate the rest of the distances. So, to calculate the distance between:
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled (a sketch of this computation is given below).
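A small sketch (my own code) that rebuilds this proximity matrix from the five marks implied above (10, 7, 28, 20, 35):

```python
# Building the proximity matrix from the five marks implied by the distances above.
import numpy as np

marks = np.array([10, 7, 28, 20, 35], dtype=float)     # roll numbers 1..5
proximity = np.abs(marks[:, None] - marks[None, :])    # |m_i - m_j|, i.e. 1-D Euclidean distance

print(proximity.astype(int))
# [[ 0  3 18 10 25]
#  [ 3  0 21 13 28]
#  [18 21  0  8  7]
#  [10 13  8  0 15]
#  [25 28  7 15  0]]
```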
Step 1: First, every point is assigned to an individual cluster. Different colors here represent different clusters; hence, there are 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points with the smallest distance. Then the proximity matrix is updated.
Here, the smallest distance is 3, and hence points 1 and 2 are merged.
Let's look at the updated clusters and update the proximity matrix accordingly. Here, we have taken the maximum of the two marks (7, 10) to represent this cluster. Instead of the maximum, the minimum value or the average value could also be used.
Roll No Mark
(1, 2) 10
3 28
4 20
5 35
Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. We will get
the merged clusters after repeating these steps:
To decide the number of clusters for hierarchical clustering, we make use of a concept called a dendrogram.
A dendrogram is a tree-like diagram that records the sequence of merges or splits.
Let's get back to the student marks example. Whenever we merge two clusters, the dendrogram records the distance between these clusters and represents it in graph form.
Here, we can see that we have merged samples 1 and 2. The vertical line represents the distance between these samples.
Dendrogram cont…
Similarly, we plot all the steps at which we merged clusters and finally obtain a dendrogram like this:
We can clearly visualize the steps of hierarchical clustering: the longer the vertical line in the dendrogram, the larger the distance between those clusters.
Now, we can set a threshold distance and draw a horizontal line (generally, the threshold is set so that it cuts the tallest vertical line). Let's set this threshold to 12 and draw a horizontal line:
The number of clusters is the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we have 2 clusters. One cluster contains samples 1, 2 and 4, and the other contains samples 3 and 5 (a library-based sketch is given below).
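A hedged sketch with SciPy and Matplotlib (assumed APIs, not the slide's exact max-of-marks shortcut): complete linkage uses the maximum pairwise distance between clusters, so the merge heights differ slightly, but cutting the tree between the last two merges recovers the same two clusters.

```python
# Sketch (assumed SciPy/Matplotlib APIs): hierarchical clustering of the five marks and its dendrogram.
# Note: 'complete' linkage uses the maximum pairwise distance between clusters rather than the
# slide's max-of-marks representative, so here {1,2} joins 4 at height 13 instead of 10.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])   # roll numbers 1..5, one feature each
Z = linkage(marks, method="complete")

labels = fcluster(Z, t=15, criterion="distance")            # cut the tree between the last two merges
print(labels)                                               # rolls 1, 2, 4 in one cluster; rolls 3, 5 in the other

dendrogram(Z, labels=["1", "2", "3", "4", "5"])
plt.show()
```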
Hierarchical Clustering: closeness of two clusters
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters; the primary ones are:
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Mahalanobis distance
The K-means clustering algorithm works as follows:
Begin
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Recompute the centroid of each cluster (the mean of its points, which minimizes the within-cluster variance).
Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to Step-7.
Step-7: The model is ready.
End
A minimal sketch of this loop is given after the steps.
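A minimal NumPy sketch of the loop above (my own code; the toy data and K = 2 are assumptions):

```python
# Minimal NumPy sketch of the K-means loop described in the steps above (toy data, K = 2).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])   # two obvious groups

K = 2
centroids = X[rng.choice(len(X), K, replace=False)]          # Step-2: random initial centroids

while True:
    # Step-3/5: assign every point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step-4: recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):                # Step-6: stop when nothing moves
        break
    centroids = new_centroids

print(centroids)                                             # Step-7: final cluster centres
```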
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is
given below:
Let's take the number of clusters K = 2, i.e., we will try to group this dataset into two different clusters.
We need to choose K random points or centroids to form the clusters. These points can be points from the dataset or any other points. Here, we select the two points shown below as the K points; they are not part of the dataset. Consider the below image:
Now we assign each data point of the scatter plot to its closest K-point or centroid, which we determine by calculating the distance between points. So, we draw a median line between the two centroids. Consider the below image:
From the image, it is clear that the points on the left side of the line are nearer to K1, the blue centroid, and the points to the right of the line are closer to the yellow centroid, K2. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity (mean) of the points currently in each cluster and place the new centroids there, as below:
Next, we reassign each data point to the new centroids. For this, we repeat the same process of finding a median line. The median line will be as in the image below:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Working of K-Means Algorithm cont…
We repeat the process of finding the center of gravity of each cluster, so the new centroids will be as shown in the image below:
As we have the new centroids, we again draw the median line and reassign the data points. So, the image will be:
We can see in the previous image that no data points remain on the wrong side of the line, which means our model has converged. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below (an equivalent library-based sketch follows):
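In practice the loop is usually delegated to a library; a hedged sketch with scikit-learn's KMeans (assumed API, toy data):

```python
# The same clustering via a library call (assumed scikit-learn API).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])   # toy x-y data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)      # the two final centroids
print(km.labels_[:10])          # cluster assignments of the first ten points
```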