AI-unit-5

K-means Clustering

Ke Chen

COMP24111 Machine Learning


Outline
• Introduction

• K-means Algorithm

• Example

• K-means Demo

• Relevant Issues

• Conclusion

Introduction
• Partitioning Clustering Approach
– a typical clustering analysis approach that partitions the data set iteratively
– constructs a partition of the data set to produce several non-empty clusters
(usually, the number of clusters is given in advance)
– in principle, partitions are obtained by minimising the sum of squared distances within each cluster:

E = Σ_{i=1..K} Σ_{x ∈ C_i} ‖x − m_i‖²

• Given K, find a partition into K clusters that optimises the chosen partitioning criterion
– global optimal: exhaustively enumerate all partitions
– heuristic method: K-means algorithm
K-means algorithm (MacQueen’67): each cluster is represented by the centre
of the cluster and the algorithm converges to stable centres of clusters.

K-means Algorithm
• Given the number of clusters K, the K-means algorithm is carried out in three steps after initialisation:

Initialisation: set K seed points
1) Assign each object to the cluster with the nearest seed point
2) Compute the seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster)
3) Go back to Step 1); stop when no new assignments are made
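
A minimal NumPy sketch of these three steps (the function name kmeans and the seeded random initialisation are illustrative choices, not from the slides; it assumes no cluster becomes empty during the iterations):

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Plain K-means on an (n, d) array X; returns (labels, centres)."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=K, replace=False)]   # initialisation: seed points
        for _ in range(max_iter):
            # Step 1: assign each object to the cluster with the nearest centre
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: recompute each centre as the centroid (mean) of its cluster
            # (assumes no cluster becomes empty)
            new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            # Step 3: stop when the centres no longer move, i.e. no new assignments
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        return labels, centres

On the four-medicine example that follows, kmeans(X, 2) converges to the clusters {A, B} and {C, D}.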

Example
• Problem
Suppose we have 4 types of medicines, each with two attributes (weight index and pH). Our goal is to group these objects into K = 2 groups of medicines.

Medicine   Weight index   pH
A          1              1
B          2              1
C          4              3
D          5              4

(figure: scatter plot of the four medicines; A and B lie close together in the lower left, C and D in the upper right)

Example
• Step 1: Use initial seed points for partitioning
c1 = A, c2 = B

Euclidean distance:

d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) = 4.24

Assign each object to the cluster with the nearest seed point.
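
A short sketch of this assignment step on the example data (the dictionary points is illustrative):

    import numpy as np

    points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
    c1, c2 = np.array(points["A"]), np.array(points["B"])        # initial seeds

    for name, p in points.items():
        d1 = np.linalg.norm(np.array(p) - c1)
        d2 = np.linalg.norm(np.array(p) - c2)
        print(name, "-> cluster", 1 if d1 <= d2 else 2, f"(d1={d1:.2f}, d2={d2:.2f})")
    # D: d1 = 5.00, d2 = 4.24, so D joins cluster 2, as on the slide.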

Example
• Step 2: Compute new centroids of the current partition

Knowing the members of each cluster, we compute the new centroid of each group based on these memberships:

c1 = (1, 1)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3) ≈ (3.67, 2.67)
Example
• Step 2: Renew membership based on new centroids

Compute the distance of all objects to the new centroids, then assign each object to the cluster with the nearest new centroid.

Example
• Step 3: Repeat the first two steps until convergence

Knowing the members of each cluster, we compute the new centroid of each group based on these memberships:

c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)

Example
• Step 3: Repeat the first two steps until convergence

Compute the distance of all objects to the new centroids.

Stop: no assignments change, so the algorithm has converged.

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster centre locations

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster centre locations
3. Each data point finds out which centre it's closest to (thus each centre “owns” a set of data points)

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster centre locations
3. Each data point finds out which centre it's closest to (thus each centre “owns” a set of data points)
4. Each centre finds the centroid of the points it owns

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster centre locations
3. Each data point finds out which centre it's closest to (thus each centre “owns” a set of data points)
4. Each centre finds the centroid of the points it owns
5. …and jumps there

K-means Demo
1. The user sets the number of clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster centre locations
3. Each data point finds out which centre it's closest to (thus each centre “owns” a set of data points)
4. Each centre finds the centroid of the points it owns
5. …and jumps there
6. …Repeat until terminated!

K-means Demo

Relevant Issues
• Efficient in computation
– O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
• Local optimum
– sensitive to initial seed points
– may converge to a local optimum that is an unwanted solution
• Other problems
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers (K-Medoids algorithm)
– Not suitable for discovering clusters with non-convex shapes
– Applicable only when a mean is defined; what about categorical data? (K-modes algorithm)

Relevant Issues
• Cluster Validity
– With different initial conditions, the K-means algorithm may result
in different partitions for a given data set.
– Which partition is the “best” one for the given data set?
– In theory, no answer to this question as there is no ground-truth
available in unsupervised learning
– Nevertheless, there are several cluster validity criteria to assess the
quality of clustering analysis from different perspectives
– A common cluster validity criterion is the ratio of the total
between-cluster to the total within-cluster distances
• Between-cluster distance (BCD): the distance between means of two clusters
• Within-cluster distance (WCD): sum of all distances between data points and the mean within a specific cluster
• A large ratio of BCD:WCD suggests good compactness inside clusters and
good separability among different clusters!
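
A rough sketch of this BCD:WCD criterion for two clusters (the function name validity_ratio is illustrative; with more clusters the between-cluster term is usually summed over all pairs of cluster means):

    import numpy as np

    def validity_ratio(X, labels):
        """Ratio of between-cluster distance to total within-cluster distance (two clusters)."""
        m0 = X[labels == 0].mean(axis=0)
        m1 = X[labels == 1].mean(axis=0)
        bcd = np.linalg.norm(m0 - m1)
        wcd = (np.linalg.norm(X[labels == 0] - m0, axis=1).sum()
               + np.linalg.norm(X[labels == 1] - m1, axis=1).sum())
        return bcd / wcd                            # larger is better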
Conclusion
• The K-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by initialisation and
appropriate distance measure
• There are several variants of K-means to overcome its
weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: dealing with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters

Introduction to Pattern Recognition

Machine Perception
• Build a machine that can recognize
patterns:
– Speech recognition

– Fingerprint identification

– OCR (Optical Character Recognition)

– DNA sequence identification



An Example
• “Sorting incoming Fish on a conveyor
according to species using optical
sensing”
(figure: the species to be separated, sea bass and salmon)

• Problem Analysis

– Set up a camera and take some sample images to extract features

• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc…

• This is the set of all suggested features to explore for use in our classifier!

• Preprocessing

– Use a segmentation operation to isolate fish from one another and from the background

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features

• The features are passed to a classifier



• Classification

– Select the length of the fish as a possible feature for discrimination

The length is a poor feature alone!

Select the lightness as a possible feature.

• Threshold decision boundary and cost relationship
• Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)

Task of decision theory



• Adopt the lightness and add the width of the fish
Fish: x^T = [x1, x2]   (x1 = lightness, x2 = width)

• We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such “noisy features”

• Ideally, the best decision boundary should be the one which provides optimal performance, such as in the following figure:

• However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input
Issue of generalization!

Pattern Recognition Systems


• Sensing

– Use of a transducer (camera or microphone)
– The PR system depends on the bandwidth, resolution, sensitivity and distortion of the transducer

• Segmentation and grouping

– Patterns should be well separated and should not overlap

• Feature extraction
– Discriminative features
– Invariant features with respect to translation, rotation and
scale.

• Classification
– Use a feature vector provided by a feature extractor to
assign the object to a category

• Post Processing
– Exploit context (input-dependent information other than the target pattern itself) to improve performance (e.g., reading T/-\E as THE and C/-\T as CAT)

The Design Cycle


• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity

• Data Collection

– How do we know when we have collected an adequately large and representative set of examples for training and testing the system?

• Feature Choice

– Depends on the characteristics of the problem domain.
– Simple to extract, invariant to irrelevant
transformation, insensitive to noise.

• Model Choice

– Unsatisfied with the performance of our fish classifier, we may want to jump to another class of model

• Training

– Use data to determine the classifier.
– Many different procedures for training classifiers and choosing models exist.

• Evaluation

– Measure the error rate (or performance) and switch from one set of features to another
Supervised vs. Unsupervised Learning

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem: Basics

 Let X be a data sample (“evidence”): class label is unknown


 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H), the probability of observing the sample X, given
that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be written as
posterior = likelihood × prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
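
A tiny numeric illustration of the theorem; the prior and likelihood values below are made up purely for illustration:

    # Hypothetical inputs: prior P(H), likelihood P(X|H), and P(X|not H) for the evidence term.
    p_h, p_x_given_h, p_x_given_not_h = 0.3, 0.6, 0.2

    p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability: P(X)
    p_h_given_x = p_x_given_h * p_h / p_x                   # Bayes' theorem
    print(round(p_h_given_x, 3))                            # 0.562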
Towards Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only
P(Ci | X) ∝ P(X | Ci) P(Ci)
needs to be maximized

Derivation of Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
 This greatly reduces the computation cost: Only counts
the class distribution

Naive Bayes: Example

 Consider PlayTennis, and a new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
 Want to compute:

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005


P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
 vNB = n   (i.e., predict PlayTennis = no, since .021 > .005)
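
A sketch of how the two products above are formed; the conditional probabilities below are the usual relative-frequency estimates from the standard 14-example PlayTennis table (9 yes / 5 no) and should be treated as assumed inputs here:

    p_yes, p_no = 9/14, 5/14
    cond_yes = {"sun": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
    cond_no  = {"sun": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

    x = ["sun", "cool", "high", "strong"]
    score_yes, score_no = p_yes, p_no
    for value in x:
        score_yes *= cond_yes[value]
        score_no *= cond_no[value]

    print(round(score_yes, 3), round(score_no, 3))               # ~0.005 and ~0.021
    print("predict:", "yes" if score_yes > score_no else "no")   # -> no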

Nearest Neighbor

 Among the various methods of supervised statistical pattern recognition, the Nearest Neighbor rule
achieves consistently high performance, without a
priori assumptions about the distributions from which
the training examples are drawn.
 Training set involves both positive and negative cases.
 A new sample is classified by calculating the distance
to the nearest training case; the sign of that point
then determines the classification of the sample.

Contd.

 The k-NN classifier extends this idea by taking the k nearest points and assigning the sign of the
majority.
 It is common to select k small and odd to break
ties (typically 1, 3 or 5).
 Larger k values help reduce the effects of noisy
points within the training data set, and the choice
of k is often performed through cross-validation.

The k-Nearest Neighbor Algorithm

 All instances correspond to points in the n-D space


 The nearest neighbors are defined in terms of
Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most common
value among the k training examples nearest to xq

(figure: query point xq among training points labelled + and −; its k nearest neighbours vote on the label)
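
A minimal sketch of the discrete-valued k-NN rule described above (the function name knn_predict is illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, xq, k=3):
        """Return the majority label among the k training points nearest to xq."""
        dists = np.linalg.norm(X_train - xq, axis=1)       # Euclidean distances
        nearest = np.argsort(dists)[:k]                    # indices of the k nearest
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]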
Discussion on the k-NN Algorithm

 k-NN for real-valued prediction for a given unknown tuple


 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weight the contribution of each of the k neighbors according to their distance to the query xq:
w = 1 / d(xq, xi)²
 Give greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
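
A sketch of the distance-weighted variant for real-valued prediction; the small eps term is an added assumption to avoid division by zero when xq coincides with a training point:

    import numpy as np

    def weighted_knn_regress(X_train, y_train, xq, k=3, eps=1e-9):
        """Weighted mean of the k nearest targets, with weights w = 1 / d(xq, xi)^2."""
        dists = np.linalg.norm(X_train - xq, axis=1)
        nearest = np.argsort(dists)[:k]
        w = 1.0 / (dists[nearest] ** 2 + eps)
        return np.sum(w * y_train[nearest]) / np.sum(w)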

Example
(worked k-NN example shown as a figure in the original slides)
SVM—Support Vector Machines
 A new classification method for both linear and nonlinear
data
 It uses a nonlinear mapping to transform the original
training data into a higher dimension
 With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
 With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
 SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
 Vapnik and colleagues (1995)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
 Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
 Used both for classification and prediction
 Applications:
 handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy

(figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors lie on the margin boundaries)
SVM—Margins and Support Vectors



SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)

Why Is SVM Effective on High Dimensional Data?

 The complexity of a trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high

Dimensionality Reduction Using
PCA/LDA
Dimensionality Reduction
• One approach to deal with high dimensional data is by reducing
their dimensionality.
• Project high dimensional data onto a lower dimensional sub-space
using linear or non-linear transformations.

Dimensionality Reduction
• Linear transformations are simple to compute and tractable.

Y U X (bi  u a )t
i i

kx1 kxd dx1 (k<<d)

• Classical (linear) approaches:


– Principal Component Analysis (PCA)
– Fisher Discriminant Analysis (FDA)

Principal Component Analysis (PCA)

• Each dimensionality reduction technique finds an appropriate transformation by satisfying certain criteria
(e.g., information loss, data discrimination, etc.)

• The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the
variation present in the dataset.

Principal Component Analysis (PCA)
• Find a basis in a low dimensional sub-space:
− Approximate vectors by projecting them in a low dimensional
sub-space:
(1) Original space representation:

x = a1 v1 + a2 v2 + … + aN vN

where v1, v2, …, vN is a basis in the original N-dimensional space

(2) Lower-dimensional sub-space representation:

x̂ = b1 u1 + b2 u2 + … + bK uK

where u1, u2, …, uK is a basis in the K-dimensional sub-space (K < N)

• Note: if K = N, then x̂ = x


Principal Component Analysis (PCA)
• Example (K=N):

Principal Component Analysis (PCA)
• Information loss
− Dimensionality reduction implies information loss !!
− PCA preserves as much information as possible:

min ‖x − x̂‖   (reconstruction error)


• What is the “best” lower dimensional sub-space?
The “best” low-dimensional space is centered at the sample mean
and has directions determined by the “best” eigenvectors of the
covariance matrix of the data x.

− By “best” eigenvectors we mean those corresponding to the largest eigenvalues (i.e., “principal components”).
− Since the covariance matrix is real and symmetric, these
eigenvectors are orthogonal and form a set of basis vectors.
Principal Component Analysis (PCA)
• Methodology
− Suppose x1, x2, ..., xM are N x 1 vectors

Principal Component Analysis (PCA)
• Methodology – cont.

b_i = u_i^T (x − x̄)
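
A compact sketch of the PCA methodology implied by these formulas: centre the data at the sample mean, take the eigenvectors of the covariance matrix with the largest eigenvalues, and project (the function name pca_fit_transform is illustrative):

    import numpy as np

    def pca_fit_transform(X, K):
        """X is (M, N); returns the (M, K) coefficients b_i = u_i^T (x - x_bar)."""
        x_bar = X.mean(axis=0)
        Xc = X - x_bar                               # centre at the sample mean
        cov = np.cov(Xc, rowvar=False)               # N x N covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)       # real symmetric -> orthogonal eigenvectors
        order = np.argsort(eigvals)[::-1][:K]        # "best" = largest eigenvalues
        U = eigvecs[:, order]                        # N x K matrix of principal components
        return Xc @ U                                # projections onto the sub-space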

Principal Component Analysis (PCA)
• Linear transformation implied by PCA
− The linear transformation R^N → R^K that performs the dimensionality reduction is

y = U^T (x − x̄),   where U = [u1 u2 … uK]
Principal Component Analysis (PCA)
• Geometric interpretation
− PCA projects the data along the directions where the data varies the
most.
− These directions are determined by the eigenvectors of the
covariance matrix corresponding to the largest eigenvalues.
− The magnitude of the eigenvalues corresponds to the variance of
the data along the eigenvector directions.

Principal Component Analysis (PCA)
• PCA and classification
− PCA is not always an optimal dimensionality-reduction procedure
for classification purposes.
• Multiple classes and PCA
− Suppose there are C classes in the training data.
− PCA is based on the sample covariance which characterizes the
scatter of the entire data set, irrespective of class-membership.
− The projection axes chosen by PCA might not provide good
discrimination power.

Linear Discriminant Analysis (LDA)

• What is the goal of LDA?


− Perform dimensionality reduction “while preserving as much of the
class discriminatory information as possible”.
− Seeks to find directions along which the classes are best separated.
− Takes into consideration the scatter within-classes but also the
scatter between-classes.
− More capable of distinguishing image variation due to identity from
variation due to other sources such as illumination and expression.

LDA

Linear Discriminant Analysis (LDA)
• Notation

S_w = Σ_{i=1}^{C} Σ_{j=1}^{M_i} (x_j − μ_i)(x_j − μ_i)^T

(each sub-matrix has rank 1 or less, i.e., it is an outer product of two vectors)
(S_b has at most rank C − 1)

Linear Discriminant Analysis (LDA)
• Methodology
projection matrix U:

y = U^T x
− LDA computes a transformation that maximizes the between-class
scatter while minimizing the within-class scatter:

max_U |U^T S_b U| / |U^T S_w U| = max |S̃_b| / |S̃_w|    (products of eigenvalues!)

S̃_b, S̃_w: scatter matrices of the projected data y
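
A sketch of this criterion in code: build S_w and S_b from labelled data and take the leading eigenvectors of S_w^-1 S_b (it assumes S_w is invertible; the function name lda_directions is illustrative):

    import numpy as np

    def lda_directions(X, y, K):
        """Return the K most discriminative projection directions (columns of U)."""
        overall_mean = X.mean(axis=0)
        Sw = np.zeros((X.shape[1], X.shape[1]))
        Sb = np.zeros_like(Sw)
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            Sw += (Xc - mu).T @ (Xc - mu)            # within-class scatter
            diff = (mu - overall_mean)[:, None]
            Sb += len(Xc) * (diff @ diff.T)          # between-class scatter
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
        order = np.argsort(eigvals.real)[::-1][:K]   # largest generalized eigenvalues
        return eigvecs[:, order].real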

Linear Discriminant Analysis (LDA)
• Is LDA always better than PCA?

− There has been a tendency in the computer vision community to prefer LDA over PCA.
− This is mainly because LDA deals directly with discrimination
between classes while PCA does not pay attention to the underlying
class structure.
− Main results of comparative studies:
1. When the training set is small, PCA can outperform LDA.
2. When the number of samples is large and representative for
each class, LDA outperforms PCA.

Support Vector Machine
Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: Safe to walk or not!!!
Classification Problem?
• The goal of classification is to organize and
categorize data into distinct classes
– A model is first created based on the previous
data (training samples)
– This model is then used to classify new data
(unseen samples)
• A sample is characterized by a set of features
• Classification is essentially finding the best
boundary between classes
Classification Formulation
• Given
– an input space 
– a set of classes  ={ 1 , 2 ,..., c }
• the Classification Problem is
– to define a mapping f: g  where each x in
 is assigned to one class
• This mapping function is called a Decision Function
SVM
• An SVM model is a representation of the
examples as points in space, mapped so
that the examples of the separate
categories are divided by a clear gap that
is as wide as possible. New examples are
then mapped into that same space and
predicted to belong to a category based on
which side of the gap they fall on.
Linear Classifiers
f(x, w, b) = sign(w · x + b)
(figure: data points of the two classes, +1 and −1, with w · x + b > 0 on one side of a candidate line and w · x + b < 0 on the other)

How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
(figure: several candidate separating lines)

Any of these would be fine.. but which is best?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
(figure: a poorly chosen line leaves a point misclassified into the +1 class)

How would you classify this data?
Classifier Margin
f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.

Support vectors are those datapoints that the margin pushes up against.
Support Vector Machine (SVM)
(figure: separating hyperplane with the support vectors on the margin boundaries; the margin is maximized)
• SVMs were introduced by Vapnik (1995).
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
Types of SVM
• Linear SVM
Used when the datasets are linearly separable
• Non-Linear SVM
Used when the datasets are not linearly separable
(figures: a 1-D number line of linearly separable points, and one of points that are not linearly separable)
Linear SVM Mathematically
(figure: the margin hyperplanes pass through support vectors x⁺ and x⁻; M = margin width)

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• w · (x⁺ − x⁻) = 2

Therefore M = (x⁺ − x⁻) · w / ‖w‖ = 2 / ‖w‖
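
A hedged sketch using scikit-learn's linear SVC to recover w and b and the resulting margin 2/‖w‖; the toy data is made up for illustration:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])   # toy, linearly separable
    y = np.array([-1, -1, +1, +1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))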
Non-linear SVMs
• Solution: mapping data to a higher-dimensional
space:

(figure: 1-D data that is not linearly separable on the x axis becomes separable after mapping each point x to (x, x²))
Non-linear SVMs: Feature spaces
• General idea: the original input space can always
be mapped to some higher-dimensional feature
space where the training set is separable:

Φ: x → φ(x)
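
A small sketch of this idea using the explicit map φ(x) = (x, x²) from the earlier figure; in practice the same effect is obtained with a non-linear kernel (e.g. SVC(kernel='rbf')):

    import numpy as np
    from sklearn.svm import SVC

    # 1-D points: the inner points form one class, the outer points the other.
    x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
    y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])

    phi_x = np.column_stack([x, x ** 2])             # explicit feature map x -> (x, x^2)
    clf = SVC(kernel="linear").fit(phi_x, y)         # linearly separable in feature space

    new_x = np.array([-2.5, 0.5])
    print(clf.predict(np.column_stack([new_x, new_x ** 2])))   # -> [ 1 -1]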
Properties of SVM
• Ability to handle large feature spaces
• Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution. So it is a deterministic algorithm.
SVM Applications
• SVM has been used successfully in
many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Weakness of SVM
• It is sensitive to noise
- A relatively small number of mislabeled examples can
dramatically decrease the performance

• It only considers two classes


- how to do multi-class classification with SVM?
- Answer:
1) For m output classes, learn m SVMs
– SVM 1 learns “Output==1” vs “Output != 1”
– SVM 2 learns “Output==2” vs “Output != 2”
– :
– SVM m learns “Output==m” vs “Output != m”
2) To predict the output for a new input, run each SVM and choose the class whose SVM gives the most confident positive score (see the sketch below).
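
A sketch of this one-vs-rest scheme; scikit-learn also offers it directly via OneVsRestClassifier, so the manual loop here is purely illustrative:

    import numpy as np
    from sklearn.svm import SVC

    def one_vs_rest_fit(X, y):
        """Train one binary SVM per class: 'Output == m' vs 'Output != m'."""
        return {m: SVC(kernel="linear").fit(X, (y == m).astype(int)) for m in np.unique(y)}

    def one_vs_rest_predict(models, X_new):
        """Predict with each SVM and pick the class with the highest decision value."""
        classes = list(models)
        scores = np.stack([models[m].decision_function(X_new) for m in classes])
        return np.array(classes)[np.argmax(scores, axis=0)]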
