Week 02 Classification & Clustering

This document provides an overview of classification and clustering techniques in data science, focusing on supervised and unsupervised learning methods. It covers key algorithms such as Bayesian classification, Support Vector Machines, and various clustering techniques, along with their implementation in R programming. The learning outcomes aim to equip learners with the ability to explore and apply these methods to solve real-world machine learning problems.
Week 02: Classification and Clustering Techniques

Classification and Clustering Techniques

Contributor

Prof. Chandan Chakraborty


Professor
Department of Computer Science & Engineering


LEARNING OUTCOMES

On successful completion of this unit, the learners will be able to:

• Explore an overview of data classification techniques.
• Demonstrate a few significant supervised data classification methods.
• Gain exposure to the clustering techniques most frequently used in solving machine learning problems with unlabeled data.
• Demonstrate the implementation of classification and clustering algorithms for solving problems, using R programming, as case studies.

Contents:

1. Overview
2. Supervised Classification
2.1 Bayesian Classification
2.2 Support Vector Machine
2.3 Decision Trees Algorithm
2.4 k-Nearest Neighbor (kNN)
3. Unsupervised Classification
3.1 Hierarchical clustering
3.2 k-means clustering
3.3 Fuzzy C-Means (FCM)
4. Performance evaluation
5. Case studies – Implementation of classification and clustering
algorithms for solving problems using R programming.


1. An overview

Data science is an interdisciplinary field that draws on computer science, statistics and machine learning. It adopts scientific methods, processes, algorithms and systems to extract meaningful information from structured and unstructured data for knowledge discovery. Data science has evolved into one of the most attractive and promising research areas, providing data-driven solutions to difficult problems by mining big datasets and discovering new insights, trends, methods and processes. The collection, preparation, analysis, visualization, management and preservation of large collections of information all play key roles. The data science approach closely reflects the way humans solve problems by analyzing data intelligently. Meanwhile, advances in information technology have increased the dimension of the digital data that are generated and collected from different sources in different forms. Data are used to generate information, which is analyzed to extract the knowledge on which decisions are based. The digital revolution has explosively increased the volume, velocity and variety of data and has made the problems much more complex. In real-world scenarios, most decision-making problems eventually reduce to classification problems that humans have limited capacity to solve unaided. Examples include fraud detection, pattern recognition, disease detection and treatment planning, and weather forecasting. Some illustrative questions are listed below:

Educational setting:
• Out of 100 students in the class, who are at risk of dropping out?
• Can an effective curriculum be developed based on students’ feedback and market need?

Credit ratings / targeted marketing:
• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Which customers are likely to respond to sales promotions?

Fraud detection:
• Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
• How can spam mails, terrorists in a crowd, hackers, etc. be recognized?

As a central part of data science, the machine intelligence approach is essential for solving such pattern recognition / identification problems, which lead to the framework of classification problems. Classification techniques are broadly divided into supervised and unsupervised (clustering) categories. Supervised classification refers to data-driven models that are trained with a labelled data set, whereas unsupervised models are trained with unlabeled data.

In this chapter, a few significant and widely used supervised and unsupervised machine learning algorithms are discussed. In addition, these algorithms are implemented using R programming to solve a few problems as case studies.

2. Supervised Classification Algorithms

Supervised learning uses labelled data to predict the outcomes of unseen data, i.e., to assign unknown data to a known class. It analyzes the training data and produces an inferred function that is then used for mapping new examples. Here, a few probabilistic and machine learning algorithms are described.

2.1 Bayesian classification

Bayesian classification is based on probability theory and, more specifically, on Bayes' decision theory (Duda, Hart, & Stork, 2007). The principle of the decision is to choose the most probable or the lowest risk (expected cost) option. Assume that there is a classification task to classify feature vectors (samples) into K different classes. A feature vector is denoted as $X = [x_1, x_2, \ldots, x_n]^T$, where n is the dimension of the vector. The probability that a feature vector X belongs to class $w_k$ is $P(w_k \mid X)$, and it is often referred to as the posterior probability. The classification of the vector is done according to the posterior probabilities or the decision risks calculated from the probabilities. By Bayes' rule, the posterior probability can be written as

$$P(w_k \mid X) = \frac{p(X \mid w_k)\, P(w_k)}{P(X)}$$
where $p(X \mid w_k)$ is the class-conditional probability density function of the feature X given that the class is $w_k$. This function describes the distribution of feature vectors in the feature space inside a particular class, i.e., it describes the class model. $P(w_k)$ is the a priori probability, i.e., the probability of the class before any features are measured. If the prior probabilities are not actually known, they are estimated by the relative occurrences. The divisor

$$P(X) = \sum_{i=1}^{K} p(X \mid w_i)\, P(w_i)$$

is a scaling factor that ensures the posterior probabilities are really probabilities, i.e., that their sum is 1. It can be shown that choosing the class with the highest posterior probability produces the minimum error probability. The prior probability can be estimated from the proportion of each class in the database; for two classes,

$$p(w_i) = \frac{P(w_i)}{P(w_1) + P(w_2)}$$
The key issue in the Bayesian classifier is the class-conditional probability density function $p(X \mid w_k)$. In practice it is always unknown, except in some artificial classification tasks. The distribution can be estimated from the training set with a range of methods. If the patterns X from the various classes can be approximated by a normal distribution, the class-conditional distribution $p(X \mid w_k)$ has the form

$$p(X \mid w_k) = \frac{1}{(2\pi)^{n/2}\, |W|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T W^{-1} (X - \mu)\right)$$

where $|W|$ is the determinant of the covariance matrix W, n is the dimension of the feature vector, M is the number of patterns in the class and $\mu$ is the mean vector. The covariance matrix W is

$$W = \frac{1}{M} \sum_{i=1}^{M} (X^i - \mu)(X^i - \mu)^T$$

where the mean vector $\mu$ can be calculated as

$$\mu = \frac{1}{M} \sum_{i=1}^{M} X^i$$

For two class classification problems, the Bayes’ decision can also be made based on the
following comparison:
$$\frac{p(w_1 \mid X)}{p(w_2 \mid X)} > 1 \;\Rightarrow\; X \in w_1, \qquad \text{else} \qquad \frac{p(w_1 \mid X)}{p(w_2 \mid X)} < 1 \;\Rightarrow\; X \in w_2$$

Input: the feature vectors $\{x_n\}$, $1 \le n \le N$

Output: class label $y_n$

Training:

Compute the mean vector $\mu_k$ and covariance matrix $\Sigma_k$ for every class as

$$\mu_k = \frac{1}{|w_k|} \sum_{x \in w_k} x, \quad 1 \le k \le c$$

$$\Sigma_k = \frac{1}{|w_k|} \sum_{x_i \in w_k} (x_i - \mu_k)(x_i - \mu_k)^T$$

Testing:

• Compute the class-conditional probability density $p(x_n \mid w_k)$,

$$p(x_n \mid w_k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\right)$$

where $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of the kth class respectively, and d is the dimension of the feature vector.

• Compute the posterior probability from the prior probability $P(w_k)$ and the class-conditional probability density $p(x_n \mid w_k)$ as

$$p(w_k \mid x_n) = \frac{P(w_k)\, p(x_n \mid w_k)}{\sum_{j=1}^{c} p(x_n \mid w_j)\, P(w_j)}$$

where the denominator on the right-hand side is the total probability.

• Compute the class label as

$$y_n = \arg\max_{k}\; p(w_k \mid x_n), \quad 1 \le k \le c$$
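The training and testing steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: it assumes a single (one-dimensional) feature, so that the covariance matrix reduces to a variance, and the function names and the small data set are hypothetical.

```python
import math

def fit_gaussian_bayes(samples, labels):
    """Training: estimate the prior, mean and variance of every class (1-D features)."""
    model = {}
    n_total = len(samples)
    for k in set(labels):
        xs = [x for x, y in zip(samples, labels) if y == k]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        # Prior estimated by the relative occurrence of the class
        model[k] = {"prior": len(xs) / n_total, "mu": mu, "var": var}
    return model

def class_conditional(x, mu, var):
    """Univariate normal density p(x | w_k), i.e. the d = 1 case of the formula."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(model, x):
    """Testing: posterior p(w_k | x) obtained with Bayes' rule."""
    joint = {k: m["prior"] * class_conditional(x, m["mu"], m["var"])
             for k, m in model.items()}
    evidence = sum(joint.values())  # total probability (the denominator)
    return {k: v / evidence for k, v in joint.items()}

def predict(model, x):
    """Class label = class with the highest posterior probability."""
    post = posteriors(model, x)
    return max(post, key=post.get)

# Hypothetical training data: two classes with one feature each
samples = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = ["w1"] * 4 + ["w2"] * 4
model = fit_gaussian_bayes(samples, labels)
print(predict(model, 1.05), predict(model, 3.05))
```

For feature vectors with more than one dimension, the variance would be replaced by the covariance matrix and its determinant and inverse, as in the formulas above.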


2.2 Support Vector Machine (SVM)

(a) Linear SVM


SVM is a single-layer, highly nonlinear network that minimizes structural risk and has high generalization ability in the sense that it can classify unseen data correctly. It optimizes the class separation boundary such that the distance (margin) from the features of both classes to the class-separating hyperplane is simultaneously maximum. Suppose $(x_i, y_i)$, $i = 1, \ldots, N$, are the N observations (or patterns), where $x_i$ is the ith input and $y_i$ is the corresponding pattern label. For the two-class pattern classification problem, with $c_+$ and $c_-$ the centroids of the two classes, the classifier response is given by

$$y_i = \operatorname{sgn}\big((x_i - c) \cdot w\big) = \operatorname{sgn}\big(x_i \cdot c_+ - x_i \cdot c_- + b\big)$$

where $c = \frac{1}{2}(c_+ + c_-)$, $w = c_+ - c_-$ and $b = \frac{1}{2}\left(\|c_-\|^2 - \|c_+\|^2\right)$.

The optimal hyperplane separating the data points into two classes is the solution of

$$\min_{w,\, b}\; \frac{1}{2}\|w\|^2$$

such that $y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N$
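As a hedged illustration of the centroid-based classifier response given above (not of the full margin optimization), the following Python sketch computes the two class centroids and evaluates sgn(x·c+ − x·c− + b). The helper names and the toy points are assumptions made for this example, and a response of exactly zero is arbitrarily mapped to +1.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid_classifier(pos, neg):
    """Return a function x -> +1 / -1 built from the two class centroids."""
    c_pos, c_neg = centroid(pos), centroid(neg)
    # b = 1/2 (||c-||^2 - ||c+||^2)
    b = 0.5 * (dot(c_neg, c_neg) - dot(c_pos, c_pos))

    def classify(x):
        response = dot(x, c_pos) - dot(x, c_neg) + b
        return 1 if response >= 0 else -1  # tie at 0 mapped to +1 (assumption)

    return classify

# Hypothetical two-class data
pos = [[2.0, 2.0], [3.0, 3.0]]
neg = [[-2.0, -1.0], [-3.0, -2.0]]
classify = centroid_classifier(pos, neg)
print(classify([2.0, 1.0]), classify([-1.0, -1.0]))
```

Each point is simply assigned to the class whose centroid is closer; the SVM refines this idea by choosing the hyperplane with the largest margin.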


The above is a minimization problem, and more specifically a quadratic optimization problem. To solve it, one must find the saddle point of the Lagrange function

$$L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i \left(w^T x_i + b\right) - 1 \right]$$

where n is the number of training patterns and the $\alpha_i$ are called Lagrange multipliers, under the constraint $\alpha_i \ge 0$. $L_p$ is minimized with respect to the primal variables w and b in order to find the optimal saddle point. This problem is transformed into the dual form by differentiating $L_p$ with respect to w and b and introducing the Karush-Kuhn-Tucker conditions. The transformed dual problem is the maximization of the following objective function:

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
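To make the dual objective concrete, the sketch below evaluates $L_D(\alpha)$ and checks the two constraints for a given vector of multipliers. It does not solve the optimization problem; the two-point data set and the multipliers are hypothetical values chosen so that the result can be checked by hand (here $w = \sum_i \alpha_i y_i x_i = [1, 0]$, so both the primal value $\frac{1}{2}\|w\|^2$ and the dual value equal 0.5).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, y, X):
    """L_D(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(alpha)
    linear = sum(alpha)
    quadratic = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
                    for i in range(n) for j in range(n))
    return linear - 0.5 * quadratic

def is_feasible(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 for all i and sum_i alpha_i y_i = 0."""
    non_negative = all(a >= 0 for a in alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return non_negative and balanced

def weight_vector(alpha, y, X):
    """w = sum_i alpha_i y_i x_i, obtained by setting dL_p/dw = 0."""
    dim = len(X[0])
    return [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(dim)]

# Hypothetical toy problem: one pattern per class
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alpha = [0.5, 0.5]
print(is_feasible(alpha, y), dual_objective(alpha, y, X), weight_vector(alpha, y, X))
```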

The above dual formulation is useful for linearly separable data. When the data are not linearly separable, the kernel trick is used to transform the feature space into a higher-dimensional space in which the data become linearly separable. In practice, the input variables need not be mapped into the high-dimensional space explicitly. Instead, the inner product between the features in the kernel space is used in the optimization problem. The dual problem with kernel transformation for SVM is

$$\text{Maximize} \quad L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

subject to $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$

where $k(x_i, x_j) = \phi^T(x_i) \cdot \phi(x_j)$ is the dot product between the transformed features. In fact, any symmetric function that satisfies the Mercer conditions can be used as a kernel function. Commonly used kernel functions are the polynomial, quadratic and radial basis function (RBF) kernels. The polynomial kernel function of degree d is

$$k(x_i, x_j) = \left(x_i^T x_j + 1\right)^d$$

We have used second and third degree polynomial kernels in our analysis. In the linear kernel transformation, $k(x_i, x_j) = x_i^T x_j$, which is the inner product between the features, as in the dual objective for the linearly separable case. The RBF kernel transformation is

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

where $\sigma$ is the width parameter of the RBF kernel.
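The three kernels described above can be written directly as small Python functions. This is an illustrative sketch: the function names, the default parameter values and the `gram_matrix` helper are assumptions, and the RBF kernel is taken in the common form exp(−‖x − z‖² / (2σ²)).

```python
import math

def linear_kernel(x, z):
    """k(x, z) = x^T z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """k(x, z) = (x^T z + 1)^d"""
    return (linear_kernel(x, z) + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    """Matrix of kernel values k(x_i, x_j) used in the dual problem."""
    return [[kernel(a, b) for b in X] for a in X]

X = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
K = gram_matrix(X, rbf_kernel)
print(linear_kernel(X[0], X[1]), polynomial_kernel(X[0], X[1], d=2))
```

Only these kernel values enter the dual objective, which is why the mapping into the higher-dimensional space never has to be computed explicitly.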

(b) Non-linear SVM


A non-linear mapping of the input vector x into a higher-dimensional space turns input that is not linearly separable into linearly separable vectors (Gunn, 1998). By choosing a non-linear mapping, the SVM constructs an optimal separating hyperplane in this higher-dimensional space. Suppose the data are mapped to some other (possibly infinite-dimensional) Euclidean space H, using a mapping $\phi$ defined by

$$\phi(\cdot) : R^n \to R^{n_h}$$

In this case, the optimal function for the dual Lagrangian $L_D(\alpha)$, with the same constraints, becomes

$$\max_{\alpha}\; L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$$

where

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

is the kernel function performing the non-linear mapping into feature space. The kernel function may be any symmetric function that satisfies the Mercer conditions (Courant, 1953). The most commonly used are the Gaussian radial basis function (RBF) and the polynomial function, given respectively by

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

$$K(x_i, x_j) = \left(x_i^T x_j + 1\right)^d$$

where the parameters, the variance $\sigma^2$ and the degree d, must be preset.

2.3 Decision Tree Algorithm


The decision tree is one of the simplest and most popular supervised classification algorithms. It predicts the class or value of the target variable by learning simple decision rules inferred from prior data (training data), and it can be applied to both classification and regression. For knowledge-based systems, decision trees have the advantage of being comprehensible by human experts and of being directly convertible into production rules. Moreover, when used to handle a given case, a decision tree not only provides the solution for that case, but also states the reasons behind its choice. These features are very important in typical application domains in which human experts seek tools to aid them in their job while remaining “in the driver’s seat.” Another advantage of decision trees is the ease and efficiency of their construction compared to that of other classifiers such as neural networks.


In a decision tree, the task of predicting the class label of an object starts from the root attribute. The value of the root attribute is compared with the object’s value for that attribute. On the basis of this comparison, we follow the branch corresponding to that value and jump to the next node.

Now, let us learn the important terminology related to decision trees:

| Key term | Definition |
|---|---|
| Root Node | It represents the entire population or sample, which is further divided into two or more homogeneous sets. |
| Splitting | The process of dividing a node into two or more sub-nodes. |
| Decision Node | A sub-node that splits into further sub-nodes. |
| Leaf / Terminal Node | A node that does not split. |
| Pruning | The process of removing the sub-nodes of a decision node. |
| Branch / Sub-Tree | A subsection of the entire tree. |
| Parent and Child Node | A node that is divided into sub-nodes is the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node. |

Fig. Schematic diagram of a decision tree (a root node is split into decision nodes, which split further into decision nodes and terminal nodes; a branch / sub-tree is a subsection of the tree)


Decision trees classify the examples by sorting them down the tree from the root to some
leaf/terminal node, with the leaf/terminal node providing the classification of the example. Each
node in the tree acts as a test case for some attribute, and each edge descending from the
node corresponds to the possible answers to the test case. This process is recursive in nature
and is repeated for every subtree rooted at the new node.
CASE-STUDY: CONSTRUCTING DECISION TREE FROM EXAMPLE

A decision tree is used as a classifier for determining an appropriate action (among a predetermined set of actions) for a given case. Consider the task of targeting good candidates to be sent an invitation to apply for a credit card: given certain information about an individual, we need to determine whether or not he or she can be a candidate. In this example, information about an individual is given as a vector of attributes that may include sex (male or female), age, status (student, employee, or unemployed), college grade point average (GPA), annual income, social security number, etc. The allowed actions are viewed as classes, which in this case are to offer or not to offer an invitation. A decision tree that performs this task is sketched in the following figure:

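[Figure: decision tree for the credit card invitation task]

Status
  Student: Age >= 21
    false: Don’t
    true: GPA >= 3.0
      false: Don’t
      true: Invite
  Unemployed: Don’t
  Employee: Income >= 30,000
    false: Don’t
    true: Invite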

As the figure shows, each internal node in the tree is labeled with a “test” defined in terms of the
attributes and has a branch for each possible outcome for that test, and each leaf in the tree is
labeled with a class.


Attributes used for describing cases can be nominal (taking one of a pre-specified set of values)
or continuous. In the above example, Sex and Status are nominal attributes, whereas Age and
GPA are continuous ones. Typically, a test defined on a nominal attribute has one outcome for
each value of the attribute, whereas a test defined on a continuous attribute is based on a fixed
threshold and has two outcomes, one for each interval as imposed by this threshold. The
decision tree in above figure illustrates these tests.

To find the appropriate class for a given case (individual), we start with the test at the root of the
tree and keep following the branches as determined by the values of the attributes of the case
at hand, until a leaf is reached. For example, suppose the attribute values for a given case are
as follows:
Name = Andrew; Social Security No. = 199199; Age = 22; Sex = Male;
Status = Student; Annual Income = 2,000; College GPA = 3.39.

To classify this case, we start at the root of the tree of the above-mentioned figure, which is
labeled Status, and follow the branch labeled Student from there. Then at the test node Age ≥
21, we follow the true branch, and at the test node GPA ≥ 3.0, we again follow the “true” branch.
This leads finally to a leaf labeled “invite”, indicating that this person is to be invited according to
this decision tree.
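The walk through the tree for this case can be mirrored by a few nested conditions. The sketch below hard-codes the credit card invitation tree as described in this section; the dictionary keys and the function name are hypothetical choices made for the example.

```python
def classify(case):
    """Follow the tests of the credit card invitation tree from the root to a leaf."""
    status = case["status"]
    if status == "Student":
        if case["age"] >= 21:                      # test node: Age >= 21
            if case["gpa"] >= 3.0:                 # test node: GPA >= 3.0
                return "invite"
            return "don't"
        return "don't"
    if status == "Employee":
        if case["income"] >= 30000:                # test node: Income >= 30,000
            return "invite"
        return "don't"
    return "don't"                                 # Unemployed

andrew = {
    "name": "Andrew",
    "age": 22,
    "sex": "Male",
    "status": "Student",
    "income": 2000,
    "gpa": 3.39,
}
print(classify(andrew))
```

Attributes such as the name or the social security number are never tested by the tree, so they play no part in the decision.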

Decision tree learning is the task of constructing a decision tree classifier, such as the one in the
above figure, from a collection of historical cases. These are individuals who are already
marked by experts as being good candidates or not. Each historical case is called a training
example, or simply an example, and the collection of such examples from which a decision tree
is to be constructed is called a training sample. A training example is assumed to be
represented as a pair <X, c>, where X is a vector of attribute values describing some case, and
c is the appropriate class for that case. A collection of examples for the credit card task is shown
in the below figure. The following subsections describe how a decision tree can be constructed
from such a collection of training examples.

A General Algorithmic Framework for Constructing Decision Tree


Let $S = \{(X_1, c_1), (X_2, c_2), \ldots, (X_m, c_m)\}$ be a training sample. Constructing a decision tree from S can be done in a divide-and-conquer fashion as follows:


Step 1: If all the examples in S are labeled with the same class, return a leaf labeled with that class.
Step 2: Choose some test t (according to some criterion) that has two or more mutually exclusive outcomes {O1, O2, …, Or}.
Step 3: Partition S into disjoint subsets S1, S2, …, Sr, such that Si consists of those examples having outcome Oi for the test t, for i = 1, 2, …, r.
Step 4: Call this tree-construction procedure recursively on each of the subsets S1, S2, …, Sr, and let the decision trees returned by these recursive calls be T1, T2, …, Tr.
Step 5: Return a decision tree T with a node labeled t as the root and the trees T1, T2, …, Tr as subtrees below that node.
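The five steps above translate almost line by line into a recursive function. The sketch below is only one possible reading: it assumes nominal attributes, it uses a placeholder criterion in Step 2 (the first attribute that takes more than one value in S), and it adds a majority-vote fallback for the case in which no test can separate the examples. These choices, the nested-dictionary tree representation and the tiny data set are all assumptions made for illustration.

```python
from collections import Counter

def build_tree(S, attributes):
    """S is a list of (X, c) pairs, where X maps attribute names to nominal values."""
    classes = [c for _, c in S]
    # Step 1: all examples share one class -> return a leaf
    if len(set(classes)) == 1:
        return {"leaf": classes[0]}
    # Step 2: choose a test (placeholder criterion, not an information-based one)
    test = next((a for a in attributes if len({X[a] for X, _ in S}) > 1), None)
    if test is None:
        # Fallback not listed in the steps: no test splits S, so use the majority class
        return {"leaf": Counter(classes).most_common(1)[0][0]}
    # Step 3: partition S by the outcome of the test
    subsets = {}
    for X, c in S:
        subsets.setdefault(X[test], []).append((X, c))
    # Steps 4 and 5: recurse on each subset and attach the subtrees below the test node
    return {"test": test,
            "children": {outcome: build_tree(Si, attributes)
                         for outcome, Si in subsets.items()}}

def classify(tree, X):
    """Follow the branches determined by X until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["children"][X[tree["test"]]]
    return tree["leaf"]

# Hypothetical training sample with two nominal attributes
S = [
    ({"status": "Student", "adult": "yes"}, "invite"),
    ({"status": "Student", "adult": "no"}, "don't"),
    ({"status": "Unemployed", "adult": "yes"}, "don't"),
    ({"status": "Employee", "adult": "yes"}, "invite"),
]
tree = build_tree(S, ["status", "adult"])
print(classify(tree, {"status": "Student", "adult": "no"}))
```

In practice the quality of the resulting tree depends on the criterion used in Step 2, which this sketch deliberately leaves naive.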
Example: For illustration, let us apply the above procedure on the set of data represented in the
following table. We will use the Case IDs 1–15 (listed in the first column) to refer to each of
these examples.


• S = {1, 2, 3, …, 15} has a mixture of classes, so we proceed to Step 2.
• Suppose we use the attribute Status for our test. This test has three outcomes, “Student”, “Unemployed”, and “Employee”. It partitions S into the subsets S1 = {1, 4, 6, 7, 9, 10, 11}, S2 = {5, 8, 12, 13}, and S3 = {2, 3, 14, 15}, respectively, for these outcomes.
• Note that S1 has a mixture of classes. Suppose we choose the test Age ≥ 21? This test partitions S1 into S11 = {6, 10} for the false outcome and S12 = {1, 4, 7, 9, 11} for the true outcome.
• S11 = {6, 10} has just one class “don’t”, so a leaf labeled with this class is returned for the call on S11.
• For the set S12, which has a mixture of classes, if we choose GPA ≥ 3.0?, then the set will be partitioned into S121 = {7, 9} and S122 = {1, 4, 11}.
• The calls on the sets S121 and S122 will return leaves labeled “don’t” and “invite”, respectively, and thus, the call on the set S12 will return the subtree of Fig. 3a.
• Now that we are done with the recursive calls on S11 and S12, the call on the set S1 will return the subtree of Fig. 3b.
• The call on the set S2 will return a leaf labeled “don’t”.
• For S3, which contains a mixture of classes, suppose we choose the test Income ≥ 30,000? This will partition S3 into S31 = {14, 15} for the false outcome and S32 = {2, 3} for the true outcome.
• The recursive calls on S31 and S32 will return leaves labeled “don’t” and “invite”, respectively, and thus, the call on S3 will return the subtree of Fig. 3c.
• Finally, the call on the entire training sample S will return the tree of Fig. 1.

Obviously, the quality of the tree produced by the above top-down construction procedure of
decision trees depends mainly on how tests are chosen in Step 2.


Fig. Subtrees returned by recursive calls on training examples for credit card invitations: (a) GPA >= 3.0 (false: Don’t, true: Invite); (b) Age >= 21 (false: Don’t, true: subtree (a)); (c) Income >= 30,000 (false: Don’t, true: Invite)

2.4 k-Nearest Neighbor (k-NN)


In classification, the data consist of a training set and a test set. The training set is a set of N
feature vectors and their class labels; and a learning algorithm is used to train a classifier using
the training set. The test set is a set of feature vectors to which the classifier must assign labels.

The idea behind the nearest neighbor classifier can be summarized by the following rule of thumb, illustrated in the diagram below:
{ IF it walks like a duck & quacks like a duck, THEN it is probably a duck! }


An intuitive way to decide how to classify an unlabeled test item is to look at the training data points nearby, and make the classification according to the classes of those nearby labelled data points. This intuition is formalized in a classification approach called k-nearest neighbour (k-NN) classification, which is a supervised classification technique. The k-NN approach looks at the k points in the training set that are closest to the test point; the test point is then classified according to the class to which the majority of the k nearest neighbours belong. Hence, three things are required for k-NN:

• The set of stored records.
• A distance metric to compute the distance between records.
• The value of k, the number of nearest neighbors to retrieve.

The k nearest neighbours of a record x (for example with k = 1, 2 or 3) are the data points that have the k smallest distances to x.

Training a k-NN classifier is simple: we just need to store the whole training set. However, testing is much slower, since it involves measuring the distance between each test point and every training point. The basic principle of k-NN is to store all the available cases and classify new cases based on a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its k nearest neighbors as measured by a distance function. If k = 1, then the case is simply assigned to the class of its nearest neighbor. The most widely used distance metrics are:


Euclidean distance: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Manhattan distance: $\sum_{i=1}^{n} |x_i - y_i|$

Minkowski distance: $\left(\sum_{i=1}^{n} |x_i - y_i|^q\right)^{1/q}$

The above distance functions are applicable only to continuous variables. For categorical variables, the Hamming distance is used. A mixture of numerical and categorical variables in the dataset also raises the issue of standardizing the numerical variables to the range between 0 and 1.
Hamming distance: $D_H = \sum_{i=1}^{n} |x_i - y_i|$, where
x = y => D = 0
x ≠ y => D = 1
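The four distance measures can be sketched in a few lines of Python. The function names and the parameter name `q` for the Minkowski order are assumptions of this example; with q = 1 the Minkowski distance reduces to the Manhattan distance and with q = 2 to the Euclidean distance.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def hamming(x, y):
    """Number of positions at which two categorical records differ."""
    return sum(0 if a == b else 1 for a, b in zip(x, y))

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3))
print(hamming(["red", "small", "round"], ["red", "large", "round"]))
```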

Working principle of k-NN algorithm


We can write the k-NN algorithm precisely as follows, where X is the training data set with class
labels so that X = {(x, c)}, Z is the test set, there are C possible classes, and r is the distance
metric (typically the Euclidean distance):
• For each test example z ∈ Z:
– Compute the distance r(z, x) between z and each training example (x, c) ∈ X
– Select Uk(z) ⊆ X, the set of the k nearest training examples to z.
– Assign test point z to class c, where the majority of the k-nearest neighbours
belongs (by taking majority vote).
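A minimal Python sketch of this procedure is given below, assuming the Euclidean distance as the metric r. The function names and the training data are hypothetical, and ties in the vote are resolved in favour of the class met first among the nearest neighbours, which is one possible convention rather than part of the algorithm as stated.

```python
import math
from collections import Counter

def distance(z, x):
    """Euclidean distance r(z, x)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, x)))

def knn_classify(z, training, k=3):
    """training is a list of (x, c) pairs; returns the majority class of the k nearest."""
    # Compute r(z, x) for every training example and keep the k closest: U_k(z)
    neighbours = sorted(training, key=lambda xc: distance(z, xc[0]))[:k]
    # Majority vote over the classes of the k nearest neighbours
    votes = Counter(c for _, c in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set X = {(x, c)} with two classes
training = [
    ([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([0.9, 1.1], "A"),
    ([4.0, 4.0], "B"), ([4.2, 3.9], "B"), ([3.8, 4.1], "B"),
]
test_set = [[1.1, 1.0], [3.9, 4.0]]
print([knn_classify(z, training, k=3) for z in test_set])
```

Note that there is no training step beyond storing the examples; all the work happens when a test point is classified.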

A few case studies reported in the literature:

• Handwritten character classification using nearest neighbor in large databases. Smith, S.J. et al.; IEEE PAMI, 2004. Classifies handwritten characters into numbers.

• Fast content-based image retrieval based on equal-average K-nearest-neighbor search schemes. Z. Lu, H. Burkhardt, S. Boehmer; LNCS, 2006. CBIR (content-based image retrieval): returns the closest neighbors as the relevant items to a query.

• Use of K-Nearest Neighbor classifier for intrusion detection. Yihua Liao, V. Rao Vemuri; Computers and Security Journal, 2002. Classifies program behavior as normal or intrusive.

• Fault Detection Using the k-Nearest Neighbor Rule for Semiconductor Manufacturing Processes. He, Q.P., Jin Wang; IEEE Transactions in Semiconductor Manufacturing, 2007. Early fault detection in industrial systems.


3. Unsupervised Classification / Clustering Techniques

In many real-life problems, data classification becomes difficult when the collected data are unlabeled. The outcomes are unknown because no supervisor is available to label the data, and hence the task is referred to as unsupervised classification. Clustering is an unsupervised technique that partitions the unlabeled data into a set of (homogeneous) clusters on the basis of similarity amongst the patterns. A few such clustering techniques are discussed here.

3.1 Hierarchical clustering

Hierarchical clustering techniques are among the most commonly used methods of summarizing data or patterns, where the clustering of the data is carried out following a hierarchical structure. Here, clusters are obtained based on the similarity or dissimilarity amongst the patterns. This helps in discovering hidden patterns in the data and in identifying outlier(s) as a preprocessing step for other algorithms.

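Let us consider a dataset of n patterns characterized by p attributes/features, and hence define the ‘Pattern Matrix’ of order (n×p) as follows:

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}$$

Now, compute the proximity matrix (n×n), where each element d(i, j) describes the distance or difference between the ith and jth patterns, as follows:

$$D = \begin{bmatrix} d(1,1) & \cdots & d(1,n) \\ \vdots & \ddots & \vdots \\ d(n,1) & \cdots & d(n,n) \end{bmatrix}$$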
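As a small illustration, the proximity matrix can be computed from a pattern matrix as sketched below, assuming the Euclidean distance as the proximity measure. The function name and the three-pattern example are hypothetical.

```python
import math

def proximity_matrix(patterns):
    """Return the n x n matrix whose (i, j) entry is the Euclidean distance d(i, j)."""
    n = len(patterns)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(patterns[i], patterns[j])))
             for j in range(n)]
            for i in range(n)]

# Hypothetical pattern matrix with n = 3 patterns and p = 2 features
patterns = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = proximity_matrix(patterns)
for row in D:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so in practice only its lower (or upper) triangle needs to be stored.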

The proximity value is usually computed using the Euclidean distance measure, on the basis of which the clusters in the data are formed. There are two types of hierarchical clustering:

(a) Agglomerative clustering; and
(b) Divisive clustering.

Agglomerative clustering algorithm – It is also called a bottom-up approach, where every data point initially forms a separate cluster.

Step 1: n data points are considered as n clusters.

Step 2: Compute the proximity values and generate the proximity matrix.

Step 3: Find the smallest proximity value and merge the corresponding clusters to make a
cluster (minimum two data points).

Step 4: Compute the proximity matrix for all the initial and newly formed clusters in Step 3.

Step 5: Repeat Steps 3 and 4 iteratively.

Step 6: Stop when all converge to a single cluster.

Divisive clustering algorithm – It is also called a top-down approach, where all the data points initially belong to a single cluster.

Step 1: n data points are considered as a single cluster.

Step 2: Compute the proximity values and generate the proximity matrix.

Step 3: Find the largest proximity value and split the corresponding clusters to generate
sub-clusters.

Step 4: Compute the proximity matrix for all the initial and newly formed clusters in Step 3.

Step 5: Repeat Steps 3 and 4 iteratively.

Step 6: Stop when n data points are separated as n clusters.

Both hierarchical algorithms execute iteratively as per the computed proximity values and produce a binary tree, called a dendrogram. The final cluster is the root and each data item is a leaf.


Fig. A schematic dendrogram depicting the clustering as per proximity

Diagrammatically, the flow of agglomerative clustering is shown below:

The proximity / distance between two clusters is calculated from the pairwise distances between the members of the clusters, following various notions:

(a) Complete linkage – largest distance between points
(b) Average linkage – average distance between points
(c) Single linkage – smallest distance between points
(d) Centroid – distance between centroids.

Complete linkage gives preference to compact / spherical clusters. Single linkage can produce long, stretched clusters. An example is given below:
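The agglomerative steps, combined with the linkage notions listed above, can be sketched in Python as follows. This is a naive illustration, not an efficient implementation: it recomputes all cluster distances at every iteration, supports only the single, complete and average linkages (the centroid notion is omitted), and records each merge together with the proximity at which it happened, which is the information a dendrogram displays. All names and the four one-dimensional points are assumptions.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(A, B, points, linkage):
    """Distance between clusters A and B (lists of point indices)."""
    d = [euclidean(points[i], points[j]) for i in A for j in B]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    if linkage == "average":
        return sum(d) / len(d)
    raise ValueError("unknown linkage: " + linkage)

def agglomerative(points, linkage="single"):
    """Merge the two closest clusters until one cluster remains; return the merges."""
    clusters = [[i] for i in range(len(points))]      # Step 1: n clusters
    merges = []
    while len(clusters) > 1:                          # Steps 5 and 6
        best = None
        for a in range(len(clusters)):                # Steps 2 and 4: proximities
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best                                # Step 3: smallest proximity
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

points = [[0.0], [1.0], [5.0], [6.5]]
for linkage in ("single", "complete", "average"):
    print(linkage, [round(d, 2) for _, _, d in agglomerative(points, linkage)])
```

On this data the first two merges are the same for every linkage, and only the height of the final merge differs, which is where the choice of linkage shows up in the dendrogram.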

3.2 K-means clustering

The k-means clustering algorithm was first proposed by MacQueen (1967). The algorithm is an unsupervised classification method that assumes a fixed number of clusters. It belongs to the central clustering category and uses the Euclidean distance as its distance metric. The algorithm minimizes the total squared error between the cluster centroids and the data points, i.e., it minimizes the following objective function:

$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2$$

where $x_j$ and $\mu_i$ represent the jth pattern and the ith cluster center respectively, and $S_i$ is the set of patterns assigned to the ith cluster.

The k-means algorithm is given step by step in the following.


Step 1: Initialization

Step 2: Data assignment

For a data vector $x_n$, set $y_n = \arg\min_k \left\| x_n - \mu_k \right\|^2$

Step 3: Centroid calculation

For each cluster $k$, let $X_k = \{ x_n \mid y_n = k \}$; the centroid is estimated as

$$\mu_k = \frac{1}{|X_k|} \sum_{x \in X_k} x$$

Step 4: Stop the algorithm if $\{ y_n \mid n = 1, 2, \ldots, N \}$ does not change; otherwise go back to
Step 2.
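Steps 1 to 4 can be sketched in base R as below. This is a minimal, unoptimized illustration; the function name `simple_kmeans`, the random choice of seed patterns, the `max_iter` safeguard and the use of the built-in `iris` measurements are all assumptions made for the example.

```r
# Minimal k-means following Steps 1-4 (illustrative sketch, not an optimized implementation)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: initialization with k randomly chosen patterns as seed centroids
  mu <- X[sample(nrow(X), k), , drop = FALSE]
  y <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance of every pattern to every centroid (N x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, mu[j, ])^2))
    y_new <- max.col(-d2, ties.method = "first")   # index of the nearest centroid
    # Step 4: stop when no label changes
    if (all(y_new == y)) break
    y <- y_new
    # Step 3: recompute each centroid as the mean of its assigned patterns
    for (j in seq_len(k)) {
      if (any(y == j)) mu[j, ] <- colMeans(X[y == j, , drop = FALSE])
    }
  }
  J <- sum((X - mu[y, , drop = FALSE])^2)   # value of the objective function
  list(cluster = y, centers = mu, J = J)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], k = 3)
print(table(fit$cluster))
print(fit$J)
```

A cluster that loses all its patterns simply keeps its previous centroid in this sketch; production code would usually re-seed it.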

The k-means algorithm can be initialized by choosing a set of k seed points. Seed points can be
the first k patterns, or k patterns chosen randomly, from the pattern matrix. The first seed point
can also be chosen as the centroid of all the patterns, with successive seed points chosen such
that they are at a certain distance away from the previously chosen seed points. Each pattern is
assigned to a class based on the minimum Euclidean distance criterion. Different initial partitions
can lead to different final clusterings because k-means clustering based on the square error
criterion can converge to a local minimum rather than the global minimum. It is therefore
sometimes necessary to run the k-means algorithm many times with different initializations; if
most of the runs lead to the same result, we have some confidence that the global minimum has
been reached.

In the data assignment step, each pattern is assigned, based on the existing centroids, to the
class whose centroid is at minimum distance from it. In the centroid computation step, the
average of all the patterns assigned to a given class is computed and replaces the previous
centroid. The k-means algorithm terminates when the criterion function cannot be improved, that
is, when the cluster labels of all the patterns do not change between two successive iterations. A
maximum number of iterations can be specified to prevent endless oscillations. The computational
complexity of the k-means algorithm is of the order O(NdkT), where N is the total number of
patterns, d is the number of features, k is the number of clusters, and T is the number of iterations.
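The effect of different initializations can be explored with the `kmeans()` function from R's built-in stats package. The sketch below is only an illustration: the `iris` data, the number of runs and the seed are assumptions, and `algorithm = "Lloyd"` is chosen because it corresponds to the assignment/update iteration described above (the function's default is a different variant).

```r
# Run k-means from several random initializations and compare the objective values
set.seed(42)
X <- iris[, 1:4]
runs <- lapply(1:10, function(i) {
  kmeans(X, centers = 3, nstart = 1, iter.max = 100, algorithm = "Lloyd")
})

# Total within-cluster sum of squares (the objective J) reached by each run
objective <- sapply(runs, function(fit) fit$tot.withinss)
print(round(objective, 2))   # runs may end in different local minima

# Keep the run with the smallest objective value
best <- runs[[which.min(objective)]]
print(best$size)
```

The same restarts can be requested in a single call through the `nstart` argument, for example `kmeans(X, centers = 3, nstart = 10)`, which returns the best of the ten runs.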


3.3 Fuzzy C-means (FCM)

The k-means clustering algorithm assigns each pattern to exactly one cluster, so the patterns are
partitioned into disjoint sets. Patterns in one cluster are supposed to be more similar to each
other than to patterns in different clusters. If the clusters are well separated, there is no
ambiguity or uncertainty in assigning each pattern to one cluster. When the clusters touch or
overlap, the cluster boundaries are not sharp and assigning patterns to clusters is difficult.
Fuzzy clustering is useful for this kind of problem. Here, a pattern belongs to a class with a
grade of membership, which takes a value in the interval [0, 1]. For ordinary clusters, called
crisp clusters, the membership grade for a particular cluster is 1 if the pattern belongs to the
cluster and 0 if it does not. With fuzzy clusters, the pattern $x_i$ has a grade of membership
$u_{ji} \ge 0$, or degree of belonging, to the $j$th cluster, where $\sum_{j=1}^{K} u_{ji} = 1$ and $K$ is
the number of clusters. The larger $u_{ji}$, the more confidence exists that $x_i$ belongs to cluster $j$.
If $u_{ji}$ is 1, pattern $x_i$ belongs to cluster $j$ with absolute certainty. The interpretation of
values like 0.25 is less clear.

Membership grades are subjective and are based on definitions rather than measurements. The
grade of membership is not the same as the probability that the pattern belongs to the cluster,
even though both take values in the range [0, 1]. Under a probabilistic framework a pattern $x_i$
can belong to one and only one cluster, depending on the outcome of a random experiment. In
fuzzy set theory, a pattern $x_i$ can belong to two clusters simultaneously, and the membership
grades determine the degree to which the two cluster labels are applicable. Most of the
algorithms based on fuzzy set theory are partitional.
The fuzzy c-means algorithm aims at minimizing the following objective function:

$$E_{FCM} = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ji}^{m} \left\| p_i - V_j \right\|^2$$

where $u_{ji}$ is the fuzzy membership of pattern $p_i$ in cluster $j$, $m$ is the weighting exponent, and
$V_j$ is the centroid of cluster $j$. The algorithm is as follows.

Step 1: Choose the number of clusters $c$, $2 \le c < N$, and the weighting exponent $m > 1$. Initialize
randomly a cluster membership matrix $U^{(0)} \in \mathbb{R}^{c \times N}$ with entries in $[0, 1]$. Each step of
this algorithm is labeled $r$, where $r = 0, 1, 2, \ldots$

Step 2: Calculate the $c$ fuzzy centers $\{ V_j^{(r)} \}$ as

$$V_j^{(r)} = \frac{\sum_{i=1}^{N} \left( u_{ji}^{(r-1)} \right)^{m} p_i}{\sum_{i=1}^{N} \left( u_{ji}^{(r-1)} \right)^{m}}$$

where $p_i$ is the feature vector.

Step 3: Update $U^{(r)}$ as

$$u_{ji}^{(r)} = \frac{\left( \dfrac{1}{\left\| p_i - V_j^{(r)} \right\|} \right)^{2/(m-1)}}{\sum_{v=1}^{c} \left( \dfrac{1}{\left\| p_i - V_v^{(r)} \right\|} \right)^{2/(m-1)}}$$

Step 4: If $\left\| U^{(r)} - U^{(r-1)} \right\|_F \le \varepsilon$, stop. Otherwise, set $r := r + 1$ and go to Step 2.

Step 5: The clustering decision for the $i$th data point is made by maximization:

$$\hat{\omega}(p_i) = \arg\max_j u_{ji}$$

where $\hat{\omega}(p_i) \in \Omega = \{ \omega_1, \omega_2, \ldots, \omega_c \}$, the set of clusters.
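Steps 1 to 5 can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the function name `simple_fcm`, the default values `m = 2` and `eps = 1e-5`, the `max_iter` safeguard, the small constant that guards against division by zero, and the use of the built-in `iris` measurements are all choices made for the example.

```r
# Minimal fuzzy c-means following Steps 1-5 (illustrative sketch)
simple_fcm <- function(P, n_clust, m = 2, eps = 1e-5, max_iter = 200) {
  P <- as.matrix(P)
  N <- nrow(P)
  # Step 1: random c x N membership matrix whose columns sum to 1
  U <- matrix(runif(n_clust * N), nrow = n_clust)
  U <- sweep(U, 2, colSums(U), "/")
  for (r in seq_len(max_iter)) {
    U_old <- U
    # Step 2: centers V_j as membership-weighted means of the patterns (c x d matrix)
    W <- U_old^m
    V <- (W %*% P) / rowSums(W)
    # Step 3: distances ||p_i - V_j|| arranged as a c x N matrix, then new memberships
    D <- sapply(seq_len(N), function(i) sqrt(colSums((t(V) - P[i, ])^2)))
    D <- pmax(D, 1e-12)                  # guard against division by zero
    G <- (1 / D)^(2 / (m - 1))
    U <- sweep(G, 2, colSums(G), "/")
    # Step 4: stop when the Frobenius norm of the change is at most eps
    if (sqrt(sum((U - U_old)^2)) <= eps) break
  }
  # Step 5: crisp decision by maximum membership
  list(U = U, centers = V, cluster = apply(U, 2, which.max))
}

set.seed(1)
fit <- simple_fcm(iris[, 1:4], n_clust = 3)
print(round(fit$U[, 1:5], 3))              # memberships of the first five patterns
print(table(fit$cluster, iris$Species))
```

Each column of `fit$U` sums to 1, which is the constraint on the membership grades stated above.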

4. k-fold Cross Validation & Performance Evaluation

Cross-validation is a process of splitting the total dataset into k folds such that (k-1) folds
together constitute the training set and the remaining fold serves as the testing set. The
procedure is repeated so that each fold is used once for testing. Although there is no fixed rule
for choosing the value of k, in practice it is often chosen as 4 or 5, so that the ratio of training
to testing data is 3:1 or 4:1. The observations are randomized when generating the training and
testing samples in order to provide the least biased results. For each train/test combination,
the performance of the classifier is evaluated using various measures, as described below.
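The fold construction can be sketched in base R as below. The nearest-centroid classifier is only a simple stand-in chosen so that the example needs no extra packages; it, the `iris` data, k = 5 and the seed are assumptions made for illustration.

```r
# k-fold cross-validation with a simple nearest-centroid classifier (illustrative stand-in)
set.seed(123)
k <- 5
X <- as.matrix(iris[, 1:4])
y <- iris$Species

# Randomly assign every observation to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = nrow(X)))

accuracy <- numeric(k)
for (f in seq_len(k)) {
  test_idx <- which(fold == f)               # one fold is the testing set
  train_X <- X[-test_idx, , drop = FALSE]    # the other k - 1 folds form the training set
  train_y <- y[-test_idx]
  # "Training": one centroid per class (rows = classes, columns = features)
  centroids <- apply(train_X, 2, function(col) tapply(col, train_y, mean))
  # Prediction: class of the nearest centroid
  pred <- apply(X[test_idx, , drop = FALSE], 1, function(x) {
    rownames(centroids)[which.min(colSums((t(centroids) - x)^2))]
  })
  accuracy[f] <- mean(pred == as.character(y[test_idx]))
}
print(accuracy)
print(mean(accuracy))
```

With k = 5, each iteration trains on 120 observations and tests on 30, which is the 4:1 ratio mentioned above.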

Confusion matrix: A confusion matrix shows the number of correct and incorrect predictions
made by the classification model compared to the actual outcomes (target value) in the data.
The matrix is NxN, where N is the number of target values (classes). Performance of such
models is commonly evaluated using the data in the matrix. The following table displays a 2x2
confusion matrix for two classes (Positive and Negative).


Confusion Matrix      Target: Positive        Target: Negative
Model: Positive       a                       b                       Positive Predictive Value = a/(a+b)
Model: Negative       c                       d                       Negative Predictive Value = d/(c+d)
                      Sensitivity = a/(a+c)   Specificity = d/(b+d)   Accuracy = (a+d)/(a+b+c+d)

where a, b, c and d denote the following counts:
a = True Positive (TP): an actual positive is predicted by the model as positive.
d = True Negative (TN): an actual negative is predicted by the model as negative.
b = False Positive (FP): an actual negative is predicted by the model as positive.
c = False Negative (FN): an actual positive is predicted by the model as negative.

Sensitivity of a classifier refers to the true positive rate, which is computed as

$$\text{Sensitivity (\%)} = \frac{TP}{TP + FN} \times 100$$

Specificity of a classifier indicates the true negative rate, which is mathematically defined as

$$\text{Specificity (\%)} = \frac{TN}{FP + TN} \times 100$$
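As a small worked example, the measures defined from the confusion matrix can be computed in R as follows. The four counts are made-up values chosen only for illustration.

```r
# Hypothetical counts for a 2x2 confusion matrix (illustrative values only)
TP <- 40   # actual positive, predicted positive
FP <- 10   # actual negative, predicted positive
FN <- 5    # actual positive, predicted negative
TN <- 45   # actual negative, predicted negative

sensitivity <- TP / (TP + FN) * 100                    # true positive rate
specificity <- TN / (FP + TN) * 100                    # true negative rate
accuracy    <- (TP + TN) / (TP + FP + FN + TN) * 100   # overall accuracy
ppv         <- TP / (TP + FP) * 100                    # positive predictive value
npv         <- TN / (FN + TN) * 100                    # negative predictive value

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, ppv = ppv, npv = npv), 2))
```

For these counts the sensitivity is about 88.89%, the specificity about 81.82% and the accuracy 85%.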

Ideally, both the sensitivity and the specificity of a classifier should be high. In practice,
however, sensitivity may be high while specificity is low, which makes it difficult to decide
whether one classifier is superior to another. To address such conflicting scenarios, the trade-off
between sensitivity and specificity is modeled as a receiver operating characteristic (ROC)
curve. The area under the ROC curve, known as the AUC (area under the curve), is considered a
single summary indicator of accuracy: the higher the AUC, the better the classifier.
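One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below applies this rank-based formulation; the function name `auc_rank`, the scores and the labels are made-up for illustration.

```r
# AUC via the rank (Mann-Whitney) formulation; scores and labels are made-up values
auc_rank <- function(score, label) {
  r <- rank(score)                  # average ranks are used for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c(1, 1, 1, 0, 1, 0, 0, 0)                       # 1 = positive, 0 = negative
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)      # classifier scores
auc <- auc_rank(score, label)
print(auc)
```

Here 15 of the 16 positive/negative pairs are ordered correctly, so the AUC is 15/16 = 0.9375.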

5. CASE STUDIES: Implementation of classification and clustering algorithms for solving
problems using R programming

(a) CASE STUDY-1: Application of Bayesian Classification for predicting students' admission

Admission Data Set: A dataset is considered in which a student's admission depends upon
three features, viz., GRE score, GPA score and rank.


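R-CODE:
# R libraries
library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)
# Data
data <- read.csv('binary.csv', header = TRUE)
str(data)
xtabs(~ admit + rank, data = data)
# Treat the response and rank as categorical variables
data$admit <- as.factor(data$admit)
data$rank <- as.factor(data$rank)
# Train/test partition (assumption: the 80/20 ratio and the seed are illustrative;
# the split used for the reported results is not given in the text)
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]
test <- data[ind == 2, ]
# Naive Bayes model with kernel density estimates for the numeric features
model <- naive_bayes(admit ~ ., data = train, usekernel = TRUE)
model
# Mean and standard deviation of GRE among admitted students
train %>%
  filter(admit == "1") %>%
  summarise(mean(gre), sd(gre))
plot(model)
# Predicted class probabilities for the test set
p <- predict(model, test, type = 'prob')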
Results:


Prediction: The following table provides the predicted probabilities of admission based on
X = (gre, gpa, rank) for students in the testing data set.

SL   P(admit=0|X)   P(admit=1|X)   Predicted Class   gre   gpa    rank
5    0.8986069      0.1013931      0                 520   2.93   4
14   0.6992535      0.3007465      0                 700   3.08   2
16   0.8099691      0.1900309      0                 480   3.44   3
26   0.2164892      0.7835108      1                 800   3.66   1
28   0.7768089      0.2231911      0                 520   3.74   4
29   0.3652448      0.6347552      1                 780   3.22   2

(b) CASE STUDY-2: Application of SVM for IRIS data classification

IRIS data: The IRIS data set is a well-known benchmark dataset that contains 3 classes of
50 instances each (150 instances in total), where each class refers to a type of iris plant
(Setosa, Versicolor, Virginica). There are four features, viz., length and width of sepal and
petal. [Downloadable from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/iris ].

R-CODE:
# SVM in R
install.packages("e1071")
library(e1071)
data("iris")
# Fit an SVM with a radial (RBF) kernel
mymodel <- svm(Species ~ ., data = iris,
               kernel = "radial")
summary(mymodel)

# Confusion matrix and misclassification error
pred <- predict(mymodel, iris)
tab <- table(Predicted = pred, Actual = iris$Species)
tab
1 - sum(diag(tab)) / sum(tab)


Results:
Parameters:
SVM-Kernel: radial

Number of Support Vectors: 51 (8 22 21)

Number of Classes: 3 (setosa versicolor virginica)

Prediction:
Actual
Predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 2
virginica 0 2 48

Sensitivity (setosa) = 100%
Sensitivity (versicolor) = 96%
Sensitivity (virginica) = 96%
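The per-class sensitivities follow directly from the confusion matrix reported above: each diagonal entry is divided by the total of its "Actual" column. The sketch below re-enters the reported counts by hand, so it does not depend on refitting the SVM; the variable names are illustrative.

```r
# Confusion matrix copied from the reported SVM results (rows = predicted, columns = actual)
classes <- c("setosa", "versicolor", "virginica")
tab <- matrix(c(50,  0,  0,
                 0, 48,  2,
                 0,  2, 48),
              nrow = 3, byrow = TRUE,
              dimnames = list(Predicted = classes, Actual = classes))

# Sensitivity per class: correctly predicted instances / actual instances of that class
sensitivity <- diag(tab) / colSums(tab) * 100
print(sensitivity)

# Overall accuracy: correctly predicted instances / all instances
accuracy <- sum(diag(tab)) / sum(tab) * 100
print(round(accuracy, 2))
```

This reproduces the 100%, 96% and 96% figures; the overall accuracy is 146/150, about 97.33%.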

GUI of SVM using R showing the Visualization of classes (Setosa, Versicolor, Virginica)


(c) CASE STUDY-3: Application of Decision Tree for Reading Skill Classification

Data Set: A dataset named readingSkills, which ships with the R package party, is considered
here to create a decision tree. It consists of three features, viz., age, shoeSize and score, on
the basis of which a person is classified as a native speaker or not using a decision tree.

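R-CODE:
install.packages("party")
library(party)
print(head(readingSkills))
# Input data (assumption: the full readingSkills data set is used;
# the subset behind the reported tree is not stated in the text)
input.dat <- readingSkills
# Create the tree
output.tree <- ctree(
  nativeSpeaker ~ age + shoeSize + score,
  data = input.dat)
# Plot the tree
plot(output.tree)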

Results:

Decision Tree for Classification of Native Speaker

From the decision tree shown above we can conclude that anyone whose reading skills score
is less than 38.3 and whose age is more than 6 is not a native speaker.


(d) CASE STUDY-4: Application of Hierarchical Clustering for US Arrests Data Set

Description: The US Arrests data set (USArrests in R) provides information on the attributes
Murder, Assault, UrbanPop and Rape for 50 states. Using these data, can we find the states
that have a similar crime profile? Here, hierarchical clustering based on the complete linkage
method is used in the R programming environment.

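R-CODE:
install.packages("cluster")
library(cluster)
df <- USArrests
# Euclidean distance matrix between the states
d <- dist(df, method = "euclidean")
# Agglomerative hierarchical clustering with complete linkage
hc1 <- hclust(d, method = "complete")
# Dendrogram with boxes drawn around four clusters
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k = 4, border = 2:5)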

Results: Dendrogram plot

Dendrogram showing the visualization of four clusters

The algorithm produces a dendrogram in which four clusters, marked with differently colored
boxes, group the states that have similar crime records with respect to the proximity value
'3'.


References:
• H. Almuallim, S. Kaneda and Y. Akiba, 3 - Development and Applications of Decision Trees,
Editor(s): Cornelius T. Leondes, Expert Systems, Academic Press, 2002, Pages 53-77.
• Duda, Hart and Stork, ………..
• A. Webb, Statistical Pattern Recognition, ……………
• C. Heumann, M. Schomaker and Shalabh, Introduction to Statistics and Data Analysis,
Springer, 2016.
