
Data Analytics (IT-3006)

Kalinga Institute of Industrial Technology


Deemed to be University
Bhubaneswar-751024

School of Computer Engineering

Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission

3 Credit Lecture Note – Unit 5


Course Contents
2

Sr # Major and Detailed Coverage Area Hrs


5 Classification & Clustering 7
Introduction to classification and clustering,
Distance-Based Algorithms: K Nearest Neighbor
(KNN), Decision Tree-Based Algorithms: Decision
Tree (ID3 Algorithm), Support Vector Machines
(Linear), Naïve Bayes.
Overview of Clustering Techniques, Hierarchical
Clustering, Partitioning Methods, K-Means
Algorithm.

School of Computer Engineering


3

School of Computer Engineering


Classification
4

 Classification is the task of separating data points into different classes, i.e.,
assigning a class label to each instance in a dataset based on its features.
The goal of classification is to build a model that accurately predicts the class labels
of new instances based on their features.
 There are two main types of classification: binary classification and multi-class
classification. Binary classification involves classifying instances into two classes,
such as “spam” or “not spam”, while multi-class classification involves classifying
instances into more than two classes.

School of Computer Engineering


Clustering
5

 A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.
 In clustering (also called cluster analysis), data objects are grouped according to
their similarity. One group means one cluster of data. In cluster analysis, the
dataset is divided into different groups based on the similarity of the data.
 After the data has been divided into groups, a label is assigned to each group.
This makes it easier to adapt to changes by classifying new data into the existing groups.

School of Computer Engineering


Classification and Clustering
6

 Both classification and clustering are used to categorize objects into one or more
classes based on their features. They appear to be similar processes, and the basic
difference between them is subtle.
 In classification, predefined labels are assigned to each input instance according to
its features, whereas in clustering those labels are missing.

Parameter: Basic
Classification: Process of classifying the input instances based on their corresponding class labels.
Clustering: Grouping the instances based on their similarity, without the help of class labels.

Parameter: Need
Classification: Labels are available, so training and testing datasets are needed to verify the model created.
Clustering: No training and testing datasets are needed.

Parameter: Complexity
Classification: More complex as compared to clustering.
Clustering: Less complex as compared to classification.

Parameter: Prior knowledge of classes
Classification: Yes (the number of classes is known).
Clustering: No (the number of classes is unknown).
School of Computer Engineering
Classification and Clustering Example
7

Figure: classification assigns objects to predefined classes, whereas clustering identifies similarities among objects.

School of Computer Engineering


8

School of Computer Engineering


The distance-based algorithms
9

 The distance-based algorithms are used to measure the distance between data
points.
 Distance measures play an important role in classification and clustering, i.e., they
provide the foundation for many popular and effective algorithms like KNN (K-
Nearest Neighbors) for classification and K-Means for clustering.
 Different distance measures can be chosen and used depending on the type of
data. A distance measure is an objective score that summarizes the relative
difference between two objects in a problem domain. Most commonly, the two
objects are rows of data that describe a subject (such as a person, car, or house)
or an event (such as a purchase, a claim, or a diagnosis).
 If the distance between two data points is small then there is a high degree of
similarity between the objects, and vice versa.
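A minimal sketch of two common distance measures in Python (the slides do not prescribe an implementation; the sample vectors below are made up for illustration):

import math

def euclidean_distance(a, b):
    # square root of the sum of squared feature differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # sum of absolute feature differences
    return sum(abs(x - y) for x, y in zip(a, b))

row1 = [3, 7]
row2 = [7, 7]
print(euclidean_distance(row1, row2))   # 4.0
print(manhattan_distance(row1, row2))   # 4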

School of Computer Engineering


Distance Measurement Example
10

School of Computer Engineering


Distance Measure
11

Among these, Euclidean distance is generally preferred as it gives equal importance to
each feature.

School of Computer Engineering


KNN
12

 The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric and lazy
learning algorithm, which uses proximity to make classifications about the grouping of an
individual data point.
 Lazy learning algorithm − It does not have a specialized training phase and uses all of
the training data at classification time.
 Non-parametric learning algorithm − It doesn’t assume anything about the underlying
data.
 The k value in the k-NN algorithm defines how many neighbors will be checked to determine
the classification of a specific query point.
 If k is set to 1, the instance will be assigned to the same class as its single nearest neighbor. If k
is set to 5, the classes of 5 closest points are checked.
 Defining k can be a balancing act, as different values can lead to overfitting or underfitting.
Lower values of k can have high variance, but low bias, and larger values of k may lead to high
bias and lower variance.
 The choice of k will largely depend on the input data as data with more outliers or noise will
likely perform better with higher values of k.
 Overall, it is recommended to have an odd number for k to avoid ties in classification, and
cross-validation tactics can help to choose the optimal k for the dataset.
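As a rough illustration (assuming scikit-learn is available; the slides do not name a library, and the points below are made up), k-NN classification with k = 5 looks like this:

from sklearn.neighbors import KNeighborsClassifier

# two made-up groups of 2-D points, labelled A and B
X_train = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [6, 7], [7, 6], [7, 7]]
y_train = ["A", "A", "A", "A", "B", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 neighbours are checked
knn.fit(X_train, y_train)                   # lazy learner: it just stores the data
print(knn.predict([[3, 3]]))                # majority of the 5 neighbours -> ['A']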
School of Computer Engineering
Working of KNN
13

School of Computer Engineering


Working of KNN Example
14

Consider a new data point that needs to
be put into the required category.
1. Firstly, choose the number of
neighbors, so let k=5.
2. Next, calculate the Euclidean
distance between the data points.
The Euclidean distance is the
straight-line distance between two points.
3. By calculating the Euclidean
distance, the nearest neighbors
are derived i.e., three nearest
neighbors in category A and two
nearest neighbors in category B.

School of Computer Engineering


Working of KNN Example cont…
15

As 3 of the 5 nearest neighbors are from
category A, this new data
point must belong to category A.

School of Computer Engineering


KNN Example
16

Given below is data from a questionnaire survey with two attributes (X1: Acid
Durability and X2: Strength), used to classify whether a special paper tissue is good or not.
Classify a new paper tissue with features X1 = 3 and X2 = 7.

X1: Acid Durability X2: Strength Y: Class


7 7 Bad
7 4 Bad
3 4 Good
1 4 Good

School of Computer Engineering


KNN Solution
17

Step 1: Initialize and define K.

K = 1 leads to an overfitting problem.
The number of tuples is even, hence choose an odd value of K.
Let K = 3 for this problem.
Step 2: Compute distance between input sample and training samples using Euclidian
distance measure.

X1: Acid Durability   X2: Strength   Y: Class   Distance from (3, 7)
7                     7              Bad        sqrt((7-3)^2 + (7-7)^2) = 4
7                     4              Bad        sqrt((7-3)^2 + (4-7)^2) = 5
3                     4              Good       sqrt((3-3)^2 + (4-7)^2) = 3
1                     4              Good       sqrt((1-3)^2 + (4-7)^2) = 3.6

School of Computer Engineering


KNN Solution cont…
18

Step 3: Determine K nearest neighbor those are at minimum distance and rank them.

X1: Acid Durability   X2: Strength   Y: Class   Distance from (3, 7)   Rank
7                     7              Bad        4                      3
7                     4              Bad        5                      4
3                     4              Good       3                      1
1                     4              Good       3.6                    2
Step 4: Apply Simple Majority
There are 2 “Good” and 1 “Bad”. Thus the new paper tissue will be classified as Good
due to simple majority.
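The same steps can be reproduced with a short plain-Python sketch (illustrative only; it simply re-computes the table above):

import math
from collections import Counter

training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

# Step 2: Euclidean distance of the query point from every training sample
distances = [(math.dist(query, x), label) for x, label in training]

# Step 3: rank by distance and keep the K nearest neighbours
nearest = sorted(distances)[:k]    # [(3.0, 'Good'), (3.61, 'Good'), (4.0, 'Bad')]

# Step 4: simple majority vote
print(Counter(label for _, label in nearest).most_common(1)[0][0])   # Good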

School of Computer Engineering


KNN Exercise
19

Given below are customer details (age, income, number of credit cards) and their
corresponding class. Find the class label for the new data item in the last row:
Customer RRR, Age 37, Income 50K, Credit Cards 2, Class ?

Customer   Age   Income   No. of Credit Cards   Class
XXX        35    35K      3                     No
YYY        22    50K      2                     Yes
ZZZ        63    200K     1                     No
PPP        59    170K     1                     No
QQQ        25    40K      4                     Yes
RRR        37    50K      2                     ????

School of Computer Engineering


20

School of Computer Engineering


Decision Tree
21
 A decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a
dataset.
 Each node is either used to make a decision (known as decision node) or represent an outcome (known as
leaf node).
 The picture below depicts a decision tree that is used to classify whether a person is Fit or Unfit.
 The decision nodes here are questions like ‘Is the person less than 30 years of age?’, ‘Does the person eat
junk?’, etc. and the leaves are one of the two possible outcomes viz. Fit and Unfit.
 Looking at the decision tree, the following rules can be read off: if a person is less than 30 years of age
and doesn't eat junk food then he is Fit; if a person is less than 30 years of age and eats junk food then he is
Unfit; and so on.
 The initial node is called the root node (colored in blue), the final nodes are called the leaf nodes (colored
in green) and the rest of the nodes are called intermediate or internal nodes. The root and intermediate
nodes represent the decisions while the leaf nodes represent the outcomes.

School of Computer Engineering


Decision Tree-Based Algorithms
22

Dataset

School of Computer Engineering


Decision Tree-Based Algorithms cont…
23

Decision Tree-based algorithms

School of Computer Engineering


ID3 Algorithm
24

 ID3 stands for Iterative Dichotomiser 3 and is named such because the
algorithm iteratively (i.e., repeatedly) dichotomizes(i.e., divides) features
into two or more groups at each step.
 It is a classification algorithm that follows a top-down greedy approach by
selecting, at each step, the feature that yields the maximum Information Gain (IG) or
the minimum Entropy (H).
 The top-down approach means start building the tree from the top and the
greedy approach means that at each iteration, select the best feature at the
present moment to create a node.
 Entropy is a measure of the amount of uncertainty in the dataset S, i.e.,
H(S) = − ∑c∈C p(c) log2 p(c), where:
S - the current dataset for which entropy is being calculated,
C - the set of classes in S (for example, C = {yes, no}),
p(c) - the proportion of the number of elements in class c to the number
of elements in set S.
School of Computer Engineering
ID3 Algorithm cont…
25

Entropy calculation for the following sample (S):


Sample class Sample count
Yes 9
No 5

C ={Yes, No}
Proportion of the number of elements in class Yes i.e., p(Yes) = 9/(9+5) =9/14
Proportion of the number of elements in class No i.e., p(No) = 5/(9+5) =5/14
H(S) = -(9/14 * log₂(9/14) + (5/14) * log₂(5/14)) = 0.94
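The same calculation as a small sketch in plain Python (illustrative; the helper name is ours):

import math

def entropy(counts):
    # H(S) = - sum over classes of p(c) * log2 p(c)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))   # 0.94 for 9 "Yes" and 5 "No" samples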

School of Computer Engineering


Information Gain
26

Information Gain IG(A, S) tells how much uncertainty in the dataset S was
reduced after splitting S on feature A i.e.,
IG(A, S) = H(S) − I(A), where I(A) = ∑t∈T p(t) * H(t) and
H(S) - Entropy of dataset S
T - The subsets created from splitting dataset S by feature A
p(t) - The proportion of the number of elements in t to the number of
elements in dataset S.
H(t) - Entropy of subset t.
I(A) - Average Entropy Information for feature A

School of Computer Engineering


Information Gain Calculation
27

Consider the following dataset for predicting whether a student needs a tutor or not.

Sr No   Regular   Need Tutor
1       Yes       Yes
2       Yes       No
3       Yes       Yes
4       Yes       No
5       No        Yes
6       No        No
7       No        Yes
8       No        No
9       Yes       No
10      No        No
11      No        No
12      No        No
School of Computer Engineering
Information Gain Calculation
28

Complete entropy of the dataset (4 out of 12 are Yes and 8 out of 12 are No) -
H(S) = - p(Yes) * log2(p(Yes)) - p(No) * log2(p(No))
= - (4/12) * log2(4/12) - (8/12) * log2(8/12)
= - (-0.528) - (-0.390)
= 0.918
Information Gain for Regular:
Rows with Regular = Yes (5 out of 12), for which there are 2 Yes and 3 No in the target column:
Sr No   Regular   Need Tutor
1       Yes       Yes
2       Yes       No
3       Yes       Yes
4       Yes       No
9       Yes       No

Rows with Regular = No (7 out of 12), for which there are 2 Yes and 5 No in the target column:
Sr No   Regular   Need Tutor
5       No        Yes
6       No        No
7       No        Yes
8       No        No
10      No        No
11      No        No
12      No        No

School of Computer Engineering


Information Gain Calculation
29

H(regular=YES) = -(2/5)*log2(2/5)-(3/5)*log2(3/5)
= - (- 0.5288) - (- 0.4422) = 0.971

H(regular=NO) = -(2/7)*log2(2/7)-(5/7)*log2(5/7)
= -(-0.516) - (- 0.347) = 0.863

I(Regular) = p(Yes) * H(Regular=Yes) + p(No) * H(Regular=No)

= (5/12)*0.971 + (7/12)*0.863 = 0.908

Information Gain (S, Regular) = H(S) - I(Regular)

= 0.918 - 0.908 = 0.010
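A short sketch in plain Python (illustrative; it repeats the entropy helper from the earlier sketch) that reproduces this calculation:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h_s = entropy([4, 8])        # target column: 4 Yes, 8 No
h_yes = entropy([2, 3])      # Regular = Yes subset: 2 Yes, 3 No
h_no = entropy([2, 5])       # Regular = No subset: 2 Yes, 5 No
i_regular = (5 / 12) * h_yes + (7 / 12) * h_no
print(round(h_s, 3), round(i_regular, 3), round(h_s - i_regular, 3))
# 0.918 0.908 0.01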

School of Computer Engineering


Steps in ID3 algorithm
30

1. Calculate entropy for dataset.


2. For each feature:
2.1. Calculate entropy for all its categorical values.
2.2. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat until the desired tree is obtained (a compact code sketch of these steps follows below).
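A compact sketch of these steps in Python (illustrative only; the function and variable names are ours, and each row is assumed to be a dictionary mapping feature names to values):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, feature):
    # H(S) minus the weighted entropy of the subsets created by splitting on `feature`
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, features):
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not features:                               # no feature left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))   # step 3
    tree = {best: {}}
    for value in set(row[best] for row in rows):   # step 4: recurse on each branch
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

# Usage idea: with rows such as {"Outlook": "Sunny", "Temperature": "Hot", ...} and
# labels such as ["No", "No", "Yes", ...], id3(rows, labels, feature_names) returns a
# nested dictionary representing the decision tree.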

School of Computer Engineering


ID3 Example
31

Use the ID3 algorithm to build a decision tree that predicts whether tennis will be played (Play Tennis).

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
1     Sunny      Hot           High       Weak     No
2     Sunny      Hot           High       Strong   No
3     Overcast   Hot           High       Weak     Yes
4     Rain       Mild          High       Weak     Yes
5     Rain       Cool          Normal     Weak     Yes
6     Rain       Cool          Normal     Strong   No
7     Overcast   Cool          Normal     Strong   Yes
8     Sunny      Mild          High       Weak     No
9     Sunny      Cool          Normal     Weak     Yes
10    Rain       Mild          Normal     Weak     Yes
11    Sunny      Mild          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
13    Overcast   Hot           Normal     Weak     Yes
14    Rain       Mild          High       Strong   No

School of Computer Engineering


ID3 Example cont…
32

Step 1: Entropy of dataset is:


H(S) = - p(Yes) * log2(p(Yes)) - p(No) * log2(p(No))
= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= - (-0.41) - (-0.53) = 0.94
Step 2: Calculate entropy for all its categorical values and information gain for the
features.
First feature – Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0

Average Entropy Information for Outlook-


I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) + p(overcast) *
H(Outlook=overcast) = (5/14)*0.971 + (5/14)*0.971 + (4/14)*0 = 0.693

Information Gain(S, Outlook) = H(S) - I(Outlook) = 0.94 - 0.693 = 0.247

School of Computer Engineering


ID3 Example cont…
33

Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Second feature – Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179

Average Entropy Information for Temperature -


I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool) = (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811 = 0.9108

Information Gain(S, Temperature) = H(S) - I(Temperature) = 0.94 - 0.9108 = 0.0292

School of Computer Engineering


ID3 Example cont…
34

Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Third feature - Humidity
Categorical values - high, normal
H(Humidity=high) = -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.591

Average Entropy Information for Humidity -


I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591 = 0.787

Information Gain (S, Humidity) = H(S) - I(Humidity) = 0.94 - 0.787 = 0.153

School of Computer Engineering


ID3 Example cont…
35

Step 2: Calculate entropy for all its categorical values and information gain for the
features.
Fourth feature – Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1

Average Entropy Information for Wind -


I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1 = 0.892

Information Gain (S, Wind) = H(S) - I(Wind) = 0.94 - 0.892 = 0.048

School of Computer Engineering


ID3 Example cont…
36

Step 3: Find the feature with maximum information gain.


Here, the feature with maximum information gain is Outlook. So, the decision tree built
so far is as follows.
Here, when Outlook = Overcast, it is of pure class (Yes).
Now, the same procedure is to be repeated for the data rows with Outlook value
Sunny, and then for Outlook value Rain.

School of Computer Engineering


ID3 Example cont…
37

Step 4: The same process to be repeated until the tree is obtained


Now, finding the best feature for splitting the data with Outlook=Sunny values
Entropy of Sunny is -
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971
First feature – Temperature with categorical values - hot, mild, cool
H(Sunny, Temperature=hot) = -0-(2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1

Average Entropy Information for Temperature -


I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny,
mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny, Temperature=cool) =
(2/5)*0 + (1/5)*0 + (2/5)*1 = 0.4

Information Gain (Sunny, Temperature) = H(Sunny) - I(Sunny, Temperature) = 0.971 - 0.4


= 0.571
School of Computer Engineering
ID3 Example cont…
38

Step 4: The same process to be repeated until the tree is obtained

Second feature - Humidity with categorical values - high, normal

H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0


H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0

Average Entropy Information for Humidity -


I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny,
normal)*H(Sunny, Humidity=normal) = (3/5)*0 + (2/5)*0 = 0

Information Gain (Sunny, Humidity)= H(Sunny) - I(Sunny, Humidity) = 0.971 - 0 = 0.971

School of Computer Engineering


ID3 Example cont…
39

Step 4: The same process to be repeated until the tree is obtained

Third feature - Wind with categorical values - weak, strong

H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918


H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1

Average Entropy Information for Wind -

I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny,


Wind=strong) = (3/5)*0.918 + (2/5)*1 = 0.9508

Information Gain (Sunny, Wind) = H(Sunny) - I(Sunny, Wind) = 0.971 - 0.9508 = 0.0202

School of Computer Engineering


ID3 Example cont…
40

Step 4: The same process to be repeated until the tree is obtained.


The feature with maximum information gain is Humidity. So, the decision tree built so far
is as follows. Here, when Outlook = Sunny and Humidity = High, it is a pure class of
category "no". And When Outlook = Sunny and Humidity = Normal, it is again a pure class
of category "yes". Therefore, we don't need to do further calculations.

School of Computer Engineering


ID3 Example cont…
41

Step 4: The same process to be repeated until the tree is obtained.


Now, finding the best feature for splitting the data with Outlook=Rain values
Entropy of Rain is -
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (3/5) * log(3/5) - (2/5) * log(2/5)
= 0.971

First feature – Temperature with categorical values - mild, cool


H(Rain, Temperature=cool) = -(1/2)*log(1/2)- (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918

Average Entropy Information for Temperature -


I(Rain, Temperature) = p(Rain, cool)*H(Rain, Temperature=cool) + p(Rain,
mild)*H(Rain, Temperature=mild) = (2/5)*1 + (3/5)*0.918 = 0.9508

Information Gain (Rain, Temperature) = H(Rain) - I(Rain, Temperature) = 0.971 - 0.9508


= 0.0202

School of Computer Engineering


ID3 Example cont…
42

Step 4: The same process to be repeated until the tree is obtained.

Second feature - Wind with categorical values - weak, strong


H(Rain, Wind=weak) = -(3/3)*log(3/3) - 0 = 0
H(Rain, Wind=strong) = 0 - (2/2)*log(2/2) = 0

Average Entropy Information for Wind -


I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
= (3/5)*0 + (2/5)*0 = 0

Information Gain (Rain, Wind)= H(Rain) - I(Rain, Wind) = 0.971 – 0 = 0.971

School of Computer Engineering


ID3 Example cont…
43

Step 4: The same process to be repeated until the tree is obtained.


Here, the feature with maximum information gain is Wind. So, the decision tree built so
far is as follows. Here, when Outlook = Rain and Wind = Strong, it is a pure class of
category "no". When Outlook = Rain and Wind = Weak, it is again a pure class of category
"yes". This is final desired tree for the given dataset.

School of Computer Engineering


ID3 Class Exercise
44

Use the ID3 algorithm to build a decision tree to predict Buy Computer.

Age        Income   Student   Credit Rating   Buy Computer
<=30       High     No        Fair            No
<=30       High     No        Excellent       No
31 ... 40  High     No        Fair            Yes
>40        Medium   No        Fair            Yes
>40        Low      Yes       Fair            No
>40        Low      Yes       Excellent       Yes
31 ... 40  Low      Yes       Excellent       Yes
<=30       Medium   No        Fair            No
<=30       Low      Yes       Fair            Yes
>40        Medium   Yes       Fair            Yes
<=30       Medium   Yes       Excellent       Yes
31 ... 40  Medium   No        Excellent       Yes
31 ... 40  High     Yes       Fair            Yes
>40        Medium   No        Excellent       No

School of Computer Engineering


45

School of Computer Engineering


Support Vector Machines
46
Support Vector Machine (SVM) is an algorithm which can be used for both classification
or regression challenges. In the SVM algorithm, each data item is plotted as a point in n-
dimensional space (where n is number of features) with the value of each feature being
the value of a particular coordinate. Then, classification is performed by finding the
hyperplane that differentiates the two classes.

What is a hyperplane?
A hyperplane is a generalization of a plane.
 In one dimension, it is a point.
 In two dimensions, it is a line.
 In three dimensions, it is a plane.
 In higher dimensions, it is simply called a hyperplane.
The following figure shows data points in one dimension, where the point L is a
separating hyperplane.

School of Computer Engineering


SVM cont…
47

How does SVM works?


 Let’s imagine we have two tags, red and blue, and our
data has two features, x and y. We want a classifier that,
given a pair of (x, y) coordinates, outputs whether it is red
or blue. We plot our already labeled training data on a
plane.

 A support vector machine takes these data points and


outputs the hyperplane (which in two dimensions is
simply a line) that best separates the tags. This line is the
decision boundary: anything that falls to one side of it
we will classify as blue, and anything that falls to the
other as red.
 In 2D, the best hyperplane is simply a line.

School of Computer Engineering


SVM cont…
48
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the
margins from both tags. In other words: the hyperplane (in 2D, it's a line) whose
distance to the nearest element of each tag is the largest.
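A minimal linear-SVM sketch (assuming scikit-learn is available; the slides do not name a library, and the red/blue points below are made up):

from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]   # two well-separated groups
y = ["red", "red", "red", "blue", "blue", "blue"]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.predict([[3, 2], [7, 5]]))    # expected: ['red' 'blue']
print(clf.support_vectors_)             # the points that define the maximum margin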

School of Computer Engineering


SVM cont…
49
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A,
B and C). Now, the job is to identify the right hyperplane to classify star and circle.
Remember a rule of thumb for identifying the right hyperplane: “Select the hyperplane which
segregates the two classes better”. In this scenario, hyperplane “B” has performed
this job excellently.

School of Computer Engineering


SVM cont…
50
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and
C) and all are segregating the classes well. Now, how can we identify the right
hyperplane?

Here, maximizing the distance between the nearest data point (of either class) and the hyperplane
will help us decide the right hyperplane. This distance is called the margin.

School of Computer Engineering


SVM cont…
51
Below, you can see that the margin for hyperplane C is high as compared to both A and B.
Hence, we name the right hyperplane as C. Another compelling reason for selecting the
hyperplane with the higher margin is robustness: if we select a hyperplane having a low
margin, then there is a high chance of misclassification.

School of Computer Engineering


SVM cont…
52
(Scenario-3) Identify the right hyperplane: Use the rules discussed in the previous
scenarios to identify the right hyperplane.

Some of you may have selected the hyperplane B as it has higher margin compared to A.
But, here is the catch, SVM selects the hyperplane which classifies the classes accurately
prior to maximizing margin. Here, hyperplane B has a classification error and A has
classified all correctly. Therefore, the right hyperplane is A.
School of Computer Engineering
SVM cont…
53
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as
one of the stars lies in the territory of the other (circle) class as an outlier.

The star at the other end acts like an outlier for the star class. The SVM algorithm has a feature to
ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say that
SVM classification is robust to outliers.

School of Computer Engineering


SVM cont…
54

(Scenario-5) Find the hyperplane to segregate two
classes: In this scenario, we can’t have a linear
hyperplane between the two classes, so how does
SVM classify these two classes? Till now, we have
only looked at the linear hyperplane.

 SVM can solve this problem easily! It solves the
problem by introducing an additional feature. Here,
we can add a new feature z = x^2 + y^2. Now, plotting
the data points on the x and z axes:
In the above plot, the points to consider are:
 All values of z would be positive, as z is the squared sum of both x and y.
 In the original plot, the red circles appear close to the origin of the x and y axes,
leading to lower values of z, while the stars lie relatively far from the origin, resulting in
higher values of z.
School of Computer Engineering
SVM cont…
55

 In the SVM classifier, it is easy to have a linear


hyperplane between these two classes. But another
question which arises is: do we need to
add this feature manually to have a hyperplane? No,
the SVM algorithm has a technique called the
kernel trick. The SVM kernel takes low dimensional
input space and transforms it to a higher
dimensional space i.e., it converts non-separable
problem to separable problem. It is mostly useful in
non-linear separation problem. Simply put, it does
some extremely complex data transformations, then
finds out the process to separate the data based on
the labels or outputs you’ve defined.
 When we look at the hyperplane in the original input
space, it looks like a circle.

School of Computer Engineering


Types of SVM
56

 Linear SVM: Linear SVM is used for data that is linearly separable, i.e., for a
dataset that can be separated into two categories by a single
straight line. Such data points are termed linearly separable data, and the
classifier used is described as a linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for data that is not linearly
separable, i.e., a straight line cannot be used to classify the dataset. For
this, we use something known as the kernel trick, which maps data points into a
higher dimension where they can be separated using planes or other
mathematical functions. Such data points are termed non-linear data, and
the classifier used is termed a non-linear SVM classifier.
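A sketch contrasting the two types (assuming scikit-learn and NumPy are available; the ring-shaped data and the threshold 1.0 are made up to mimic Scenario-5):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # class = outside the unit circle

print(SVC(kernel="linear").fit(X, y).score(X, y))     # poor fit: not linearly separable
print(SVC(kernel="rbf").fit(X, y).score(X, y))        # kernel trick: near-perfect fit

# Adding the extra feature z = x^2 + y^2 by hand (as in Scenario-5) also lets a
# plain linear SVM separate the same data:
Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
print(SVC(kernel="linear").fit(Z, y).score(Z, y))     # close to 1.0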

School of Computer Engineering


57

The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described
as:
 Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each
feature individually contributes to identifying it as an apple, without depending on the other
features.
 Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

School of Computer Engineering


Naïve Bayes Classifier
58
 Naive Bayes classifier (NBC)
is a classification algorithm. It
rests on one principle: every
pair of features being classified
is independent of each other.
 Consider the problem of
playing golf, and the dataset is
shown on right.
 The need is to classify
whether the day is suitable for
playing golf, given the features
of the day. The columns
represent these features and
the rows represent individual
observations.

School of Computer Engineering


Naïve Bayes Classifier cont…
59
 If we take the first observation of the dataset, we can observe that it is not suitable for playing
golf if the outlook is rainy, the temperature is hot, the humidity is high and it is not windy.
 We make two assumptions. The first is that the predictors are independent, i.e., if the temperature is
hot, it does not necessarily mean that the humidity is high. The second is that all the
predictors have an equal effect on the outcome, i.e., the day being windy does not have more
importance in deciding whether to play golf or not.
 The Bayes theorem can be written as P(y|X) = (P(X|y) * P(y)) / P(X), where the variable y is the
dependent variable (play golf), which represents whether it is suitable to play golf or not given the
conditions, and X represents the independent variables (outlook, temperature, humidity and windy).
 X is given as X = (x1, x2, …, xn), where the xi represent the features, mapped to outlook, temperature,
humidity and windy.
 By substituting for X and expanding using the chain rule, we get:
P(y | x1, …, xn) = (P(x1 | y) * P(x2 | y) * … * P(xn | y) * P(y)) / (P(x1) * P(x2) * … * P(xn))
 Now, the value of each term can be obtained by looking at the dataset and substituting it into the
equation. For all entries in the dataset, the denominator does not change; it remains constant.
Note: Bayes Theorem describes the probability of an event, based on prior knowledge of conditions
that might be related to the event.

School of Computer Engineering


Naïve Bayes Classifier cont…
60

Using the above function, we can obtain the class, given the predictors.
 P(Y) = 9/ 14 and P(N) = 5/14 where Y stands for Yes and N stands for No.
 The outlook probability is: P(sunny | Y) = 2/9, P(overcast | Y) = 4/9, P(rain | Y) = 3/9,
P(sunny | N) = 3/5, P(overcast | N) = 0, P(rain | N) = 2/5
 The temperature probability is: P(hot | Y) = 2/9, P(mild | Y) = 4/9, P(cool | Y) = 3/9,
P(hot | N) = 2/5, P(mild | N) = 2/5, P(cool | N) = 1/5
 The humidity probability is: P(high | Y) = 3/9, P(normal | Y) = 6/9, P(high | N) = 4/5,
P(normal | N) = 1/5.
 The windy probability is: P(true | Y) = 3/9, P(false | Y) = 6/9, P(true | N) = 3/5, P(false
| N) = 2/5
Now we want to predict “play golf” for today with the conditions: <outlook = sunny; temperature
= cool; humidity = high; windy = true>, so today = (sunny, cool, high, true).
P(Y | today) = P(sunny | Y) * P(cool | Y) * P(high | Y) * P(true | Y) * P(Yes) = 2/9 * 3/9 *
3/9 * 3/9 * 9/ 14 = 0.005
P(N | today) = P(sunny | N) * P(cool | N) * P(high | N) * P(true | N) * P(N) = 3/5 * 1/5 *
4/5 * 3/5 * 5/14 = 0.020
Since the probability of No is larger, we predict that golf will not be played today (No).
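The same arithmetic as a tiny sketch in Python (illustrative only, using the conditional probabilities listed above):

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(sunny|Y) P(cool|Y) P(high|Y) P(true|Y) P(Y)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # the corresponding terms for N
print(round(p_yes, 3), round(p_no, 3))           # 0.005 0.021 (0.020 on the slide after rounding)
print("Yes" if p_yes > p_no else "No")           # No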

School of Computer Engineering


Types of Naïve Bayes Classifier
61

 Multinomial Naive Bayes: This is mostly used for document classification
problems, i.e., whether a document belongs to the category of sports, politics,
technology, etc. The features/predictors used by the classifier are the
frequencies of the words present in the document.
 Bernoulli Naive Bayes: This is similar to the multinomial naive bayes but
the predictors are boolean variables. The parameters that we use to predict
the class variable take up only values yes or no, for example if a word occurs
in the text or not.
 Gaussian Naive Bayes: The predictors take up a continuous value and are
not discrete.

School of Computer Engineering


Naïve Bayes Classifier cont…
62

Pros
 It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
 It performs well in the case of categorical input variables compared to numerical
variable(s). For numerical variables, a normal distribution is assumed (bell curve, which
is a strong assumption).
Cons
 The assumption of independent predictors. In real life, it is almost impossible to get a
set of predictors which are completely independent.
 If a categorical variable has a category (in the test data set) which was not observed in
the training data set, then the model will assign it a 0 (zero) probability and will be unable to
make a prediction. This is often known as “Zero Frequency”. To solve this, we can use
a smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.

School of Computer Engineering


Class Exercise
63
Use the Naïve Bayes Classifier
to determine whether a day
(Sunny, Hot, Normal, False)
should be classified as suitable
for playing golf or not.

School of Computer Engineering


Portion
64

School of Computer Engineering


Clustering
65

 Clustering is the task of dividing the data points into a number of groups such
that data points in the same group are more similar to each other than to the
data points in other groups.
 It basically groups objects on the basis of the similarity and dissimilarity
between them.
 Following is an example of finding clusters of data points based on their
income and debt.

School of Computer Engineering


Clustering Example
66

Imagine a number of objects in a basket. Each item has a distinct set of
features (size, shape, color, etc.). Now the task is to group the objects in the
basket. A natural first question to ask is: on what basis should these objects be
grouped? Perhaps size, shape, or color.

School of Computer Engineering


Clustering Example cont…
67

The data points clustered together can be classified into one single group. The
clusters can be distinguished, and we can identify that there are 3 clusters here.

School of Computer Engineering


Clustering Types
68

Clustering can be divided into two subgroups:


 Hard Clustering: each data point either belongs to a cluster completely or
not.
 Soft Clustering: Instead of putting each data point into a single cluster, a
probability or likelihood of that data point belonging to each cluster is assigned.

School of Computer Engineering


Clustering Application
69
 Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
 Biology: It can be used for classification among different species of plants and
animals.
 Libraries: It is used in clustering different books on the basis of topics and
information.
 Insurance: It is used to understand customers and their policies, and to identify
frauds.
 City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
 Earthquake studies: By learning the earthquake affected areas, the dangerous zones
can be determined.
 Healthcare: It can be used in identifying and classifying the cancerous gene.
 Search Engine: It is the backbone behind the search engines. Search engines try to
group similar objects in one cluster and the dissimilar objects far from each other.
 Education: It can be used to monitor students' academic performance. Based on
their scores, students are grouped into different clusters, each cluster denoting a
different level of performance.
School of Computer Engineering
Types of clustering algorithms
70
Since the task of clustering is subjective, the means that can be used for achieving this
goal are plenty. Every methodology follows a different set of rules for defining the
‘similarity’ among data points. The popular types are:
 Connectivity models
 Centroid models
 Distribution models
 Density models
Connectivity models: These models are based on the notion that the data points closer
in data space exhibit more similarity to each other than the data points lying farther away.
These models can follow two approaches. In the first approach, they start with classifying
all data points into separate clusters & then aggregating them as the distance decreases.
In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is subjective.
These models are very easy to interpret but lack scalability for handling big datasets.
Examples of these models are hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of
similarity is derived by the closeness of a data point to the centroid of the clusters. K-
Means clustering algorithm is a popular algorithm that falls into this category. In these
models, the number of clusters required at the end has to be specified beforehand, which
makes it important to have prior knowledge of the dataset.
School of Computer Engineering
Types of clustering algorithms cont…
71
Distribution models: These clustering models are based on the notion of how probable
is it that all data points in the cluster belong to the same distribution. These models often
suffer from overfitting. A popular example of these models is Expectation-maximization
algorithm which uses multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of data
points. They isolate regions of different density and assign the data
points within each such region to the same cluster. Popular examples of density models are
DBSCAN and OPTICS.
Summary
 Connectivity based: Create a hierarchical decomposition of the set of data using
some criterion.
 Centroid based: Construct various partitions and then evaluate them by some
criterion.
 Distribution based: A model is hypothesized for each of the clusters, and the idea is
to find the best fit of that model to the data.
 Density based: Based on connectivity and density functions.

School of Computer Engineering


Hierarchical Clustering
72
Let’s say we have the below points and we want to cluster them into groups.

We can assign each of these points to a separate cluster.

Now, based on the similarity of these clusters, the most similar clusters are combined
together, and this process is repeated until only a single cluster is left.

We are essentially building a hierarchy of clusters. That’s why this algorithm is


called hierarchical clustering.

School of Computer Engineering


Hierarchical Clustering Example
73
Suppose a faculty wants to divide the students into different groups. The faculty has the
marks scored by each student in an assignment and based on these marks, he/she wants
to segment them into groups. There’s no fixed target as to how many groups to have.
Since the faculty does not know what type of students should be assigned to which
group, it cannot be solved as a supervised learning problem. So, hierarchical clustering
is applied to segment the students into different groups. Let’s take a sample of 5
students.
Roll No   Mark
1         10
2         7
3         28
4         20
5         35

Creating a Proximity Matrix
First, a proximity matrix is created which gives the distance between each pair of these
points (marks). Since the distance of each point from each of the other points is
calculated, a square matrix of shape n x n (where n is the number of observations) is
obtained. Let's make the 5 x 5 proximity matrix for this example.

School of Computer Engineering


Hierarchical Clustering Example cont…
74

Proximity Matrix
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. The Euclidean distance formula is used to calculate the rest of the distances. So,
to calculate the distance between
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18 and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.

School of Computer Engineering


Steps to Perform Hierarchical Clustering
75

Step 1: First, each point is assigned to an individual cluster. Different colors here
represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.

Step 2: Next, look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. Then the proximity matrix is updated.

Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
Here, the smallest distance is 3 and hence points 1 and 2 are merged.

School of Computer Engineering


Steps to Perform Hierarchical Clustering cont…
76

Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we
have taken the maximum of the two marks (7, 10) to replace the marks for this cluster.
Instead of the maximum, the minimum value or the average values can also be
considered.
Roll No Mark
(1, 2) 10
3 28
4 20
5 35

Now, the proximity matrix for these clusters is calculated again.

School of Computer Engineering


Steps to Perform Hierarchical Clustering cont…
77

Revised Proximity Matrix


Roll No 1, 2 3 4 5
1, 2 0 18 10 25
3 18 0 8 7
4 10 8 0 15
5 25 7 15 0

Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. We will get
the merged clusters after repeating these steps:

School of Computer Engineering


Steps to Perform Hierarchical Clustering cont…
78

We started with 5 clusters and finally have a single cluster.

School of Computer Engineering


Types of Hierarchical Clustering
79

There are mainly two types of hierarchical clustering:


 Agglomerative hierarchical clustering
 Divisive Hierarchical clustering
Agglomerative Hierarchical Clustering
 Each point is assigned to an individual cluster in this technique. Suppose there are 4
data points, so each of these points would be assigned to a cluster and hence there
would be 4 clusters in the beginning.
 Then, at each iteration, the closest pair of clusters is merged, and this step is repeated
until only a single cluster is left.

School of Computer Engineering


Types of Hierarchical Clustering cont…
80

Divisive Hierarchical clustering


 Divisive hierarchical clustering works in the opposite way. Instead of starting with n
clusters (in case of n observations), we start with a single cluster and assign all the
points to that cluster. So, it doesn’t matter if we have 10 or 1000 data points. All
these points will belong to the same cluster at the beginning.
 Now, at each iteration, the farthest point in the cluster is split off, and this process is
repeated until each cluster contains only a single point.

School of Computer Engineering


Dendrogram
81

 To get the number of clusters for hierarchical clustering, we make use of the concept
called a Dendrogram.
 A dendrogram is a tree-like diagram that records the sequences of merges or splits.
 Let’s get back to the faculty-student example. Whenever we merge two clusters, a
dendrogram records the distance between these clusters and represents it in graph
form.

We have the samples of the dataset


on the x-axis and the distance on the
y-axis. Whenever two clusters are
merged, we will join them in this
dendrogram and the height of the
join will be the distance between
these points.

School of Computer Engineering


Dendrogram cont…
82

We started by merging sample 1 and 2 and the


distance between these two samples was 3.
Let’s plot this in the dendrogram.

Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples.
School of Computer Engineering
Dendrogram cont…
83

Similarly, we plot all the steps where we merged the clusters and finally, we get a
dendrogram like this:

We can clearly visualize the steps of hierarchical clustering. The longer a vertical line in the
dendrogram, the greater the distance between the clusters it joins.

School of Computer Engineering


Dendrogram cont…
84

Now, we can set a threshold distance and draw a horizontal line (Generally, the threshold
is set in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and
draw a horizontal line:

The number of clusters will be the number of vertical lines which are intersected
by the line drawn using the threshold. In the above example, since the red line intersects
2 vertical lines, we will have 2 clusters. One cluster will contain the samples (1, 2, 4) and the
other will contain the samples (3, 5).
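The whole procedure can be reproduced with SciPy (an assumption; the slides describe the steps, not a library). The slides use the maximum mark as the merged cluster's representative, which here gives the same 2-cluster result as complete linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])   # roll numbers 1..5
Z = linkage(marks, method="complete")             # merge history; dendrogram(Z) would plot it
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)   # roll numbers 1, 2 and 4 fall in one cluster; 3 and 5 in the other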
School of Computer Engineering
Hierarchical Clustering closeness of two clusters
85

The decision of merging two clusters is taken on the basis of the closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters, primarily:
 Euclidean distance
 Squared Euclidean distance
 Manhattan distance
 Maximum distance
 Mahalanobis distance

School of Computer Engineering


Portion
86

School of Computer Engineering


Partitioning Methods
87

 Partitioning is used to make solving mathematics problems involving large


numbers easier by separating them into smaller units. For example, 782 can be
partitioned into: 700 + 80 + 2. It helps kids see the true value of each digit.
Rather than seeing 782 as an intimidating number, they'll see it as 700, 80 and
2. Therefore, given n objects or data items, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k <= n.
 That is, it classifies the data into k groups, which together satisfy the following
requirements:
 Each group must contain at least one object
 Each object must belong to exactly one group
 Given k, the number of partitions to construct, a partitioning method creates an
initial partition. It then uses an iterative relocation technique that attempts to improve
the partitioning by moving objects from one group to another.
 The general criterion of a good partitioning is that objects in the same cluster
are “close” or related to each other, whereas objects of different clusters are “far
apart” or very different.

School of Computer Engineering


Partitioning Methods cont…
88

K-Medoids and K-Means are two types of clustering mechanisms in Partition


Clustering.
 The k-means algorithm, where each cluster is represented by the mean value of
the data points in the cluster. The center of a cluster is not necessarily one of the
input data points.
 The k-medoids algorithm, where each cluster is represented by one of the data
points (the medoid) located near the center of the cluster. A medoid can be defined
as the data point of a cluster whose average dissimilarity to all the objects in the
cluster is minimal, i.e., it is the most centrally located point of the cluster.

School of Computer Engineering


Portion
89

School of Computer Engineering


K- Means Clustering Algorithm
90

 It is an iterative algorithm that divides the unlabeled dataset into k different
clusters in such a way that each data point belongs to only one group, and the points
within a group share similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of distances
between the data points and their corresponding cluster centroids. A centroid is a data
point (imaginary or real) at the center of a cluster.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.
 The k-means clustering algorithm mainly performs two tasks:
 Determines the best value for k center points or centroids by an iterative
process.
 Assigns each data point to its closest k-center. Those data points which are
near to the particular k-center, create a cluster.
 Hence each cluster has datapoints with some commonalities, and it is away
from other clusters.
School of Computer Engineering
K- Means Algorithm cont…
91

The below diagram explains the working of the K-means Clustering Algorithm:

School of Computer Engineering


How does the K-Means Algorithm Work?
92

1. Begin
2. Step-1: Select the number K to decide the number of clusters.
3. Step-2: Select K random points or centroids. (They can be points other than those in
the input dataset.)
4. Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
5. Step-4: Calculate the variance and place a new centroid for each cluster.
6. Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid of each cluster.
7. Step-6: If any reassignment occurs, then go to step-4 else go to step-7.
8. Step-7: The model is ready.
9. End
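The numbered steps above can be written as a short plain-Python sketch (illustrative only; the sample points and the function name are ours):

import math
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)           # Step-2: random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # Step-3: assign to the closest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]   # Step-4/5: recompute centroids
        if new_centroids == centroids:             # Step-6/7: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
print(kmeans(pts, k=2)[0])   # roughly (1.0, 2.0) and (10.0, 2.0), order may vary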

School of Computer Engineering


Working of K-Means Algorithm
93

Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is
given below:

School of Computer Engineering


Working of K-Means Algorithm cont…
94

 Let's take the number of clusters as K = 2, to identify the dataset and put the points
into different clusters. It means that here we will try to group these data points into two
different clusters.
 We need to choose K random points or centroids to form the clusters. These
points can either be points from the dataset or any other points. So, here we
are selecting the two points shown below as the K points, which are not part of the
dataset. Consider the below image:

School of Computer Engineering


Working of K-Means Algorithm cont…
95

 Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by calculating the distance between two points. So,
we will draw a median between both the centroids. Consider the below image:

School of Computer Engineering


Working of K-Means Algorithm cont…
96

 From the image, it is clear that the points on the left side of the line are near the K1 or
blue centroid, and the points to the right of the line are close to the yellow centroid,
i.e., K2. Let's color them blue and yellow for clear visualization.

School of Computer Engineering


Working of K-Means Algorithm cont…
97

 As we need to find the closest clusters, we will repeat the process by choosing
new centroids. To choose the new centroids, we compute the center of gravity
of the points in each cluster, and find the new centroids as below:

School of Computer Engineering


Working of K-Means Algorithm cont…
98

 Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like below
image:

From the above image, we can see that one yellow point is on the left side of the line, and
two blue points are to the right of the line. So, these three points will be assigned to new
centroids.
School of Computer Engineering
Working of K-Means Algorithm cont…
99

 As reassignment has taken place, we will again go to step-4, which is
finding new centroids or K-points.

School of Computer Engineering


Working of K-Means Algorithm cont…
100

 We will repeat the process by finding the center of gravity of each cluster, so the
new centroids will be as shown in the below image:

School of Computer Engineering


Working of K-Means Algorithm cont…
101

 As we have got the new centroids, we will again draw the median line and reassign the
data points. So, the image will be:

School of Computer Engineering


Working of K-Means Algorithm cont…
102

 We can see in the previous image that no data points need to be reassigned on either
side of the line, which means our model is formed. Consider the below image:

School of Computer Engineering


Working of K-Means Algorithm cont…
103

 As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:

School of Computer Engineering


How to Determine the optimal K for K-Means?
104

How to determine the optimal value of K in K-Means clustering? There are 2


methods:
 The Elbow Method
 The Silhouette Method
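A sketch of both methods with scikit-learn (an assumption; the blob data below is made up so that k = 3 is the right answer):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])   # three well-separated blobs

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
# The "elbow" in inertia and the highest silhouette score both point to k = 3 here.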

School of Computer Engineering


105

School of Computer Engineering
