AI-unit-5
Ke Chen
• K-means Algorithm
• Example
• K-means Demo
• Relevant Issues
• Conclusion
Introduction
• Partitioning Clustering Approach
– a typical clustering analysis approach that partitions the data set iteratively
– constructs a partition of the data set to produce several non-empty clusters
(usually, the number of clusters is given in advance)
– in principle, partitions are obtained by minimising the sum of squared distances
within each cluster
E = Σ_{i=1}^{K} Σ_{x∈C_i} ||x − m_i||^2
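For concreteness, a minimal NumPy sketch of this objective (not from the lecture); X, labels, and centroids are illustrative names for the data matrix, the cluster assignments, and the cluster means.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """E = sum over clusters i of sum over x in C_i of ||x - m_i||^2."""
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(centroids))
```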
K-means Algorithm
• Given the cluster number K, the K-means algorithm is
carried out in three steps:
Initialisation: set K seed points
1) Assign each object to the cluster with the nearest seed point
2) Compute the seed points as the centroids of the clusters of the current
partition (the centroid is the centre, i.e., the mean point, of the cluster)
3) Go back to Step 1); stop when no more new assignments are made
A minimal sketch of this loop is given below.
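The following is an illustrative NumPy implementation, not the course's reference code; the function name, arguments, and the random seed-point initialisation are assumptions.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: X is an (n, d) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick K data points as the initial seed points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute seed points as centroids of the current partition
        # (empty clusters are not handled in this minimal sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 3: stop when the assignments no longer change the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```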
Example
• Problem
Suppose we have 4 types of medicines, each with two attributes (pH and weight
index). Our goal is to group these objects into K = 2 groups of medicines.
Medicine   Weight Index   pH
A          1              1
B          2              1
C          4              3
D          5              4
Example
• Step 1: Use initial seed points for partitioning
c1 = A, c2 = B
Euclidean distance:
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
Example
• Step 2: Compute new centroids of the current partition
Example
• Step 3: Repeat the first two steps until convergence
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
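The arithmetic above can be checked with a short script; the data come from the example table, everything else (variable names, printing) is illustrative.

```python
import numpy as np

# Medicines A-D as (weight index, pH), taken from the example table
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Step 1 with seeds c1 = A and c2 = B: distances from D
c1, c2 = X[0], X[1]
print(np.linalg.norm(X[3] - c1))   # 5.0
print(np.linalg.norm(X[3] - c2))   # ~4.24

# Final partition {A, B} and {C, D}: centroids (1.5, 1) and (4.5, 3.5)
print(X[:2].mean(axis=0), X[2:].mean(axis=0))
```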
K-means Demo
1. The user sets the number of clusters they would like (e.g., K = 5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it is closest to (thus each
centre “owns” a set of data points).
4. Each centre finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
Relevant Issues
• Efficient in computation
– O(tKn), where n is the number of objects, K the number of clusters,
and t the number of iterations. Normally, K, t << n.
• Local optimum
– sensitive to the initial seed points
– may converge to a local optimum that is an unwanted solution
• Other problems
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers (K-Medoids algorithm)
– Not suitable for discovering clusters with non-convex shapes
– Applicable only when the mean is defined; what about categorical
data? (K-Modes algorithm)
Relevant Issues
• Cluster Validity
– With different initial conditions, the K-means algorithm may result
in different partitions for a given data set.
– Which partition is the “best” one for the given data set?
– In theory, there is no answer to this question, as no ground truth is
available in unsupervised learning
– Nevertheless, there are several cluster validity criteria to assess the
quality of clustering analysis from different perspectives
– A common cluster validity criterion is the ratio of the total
between-cluster to the total within-cluster distances
• Between-cluster distance (BCD): the distance between means of two clusters
• Within-cluster distance (WCD): the sum of the distances between the data points
and the mean within a specific cluster
• A large BCD:WCD ratio suggests good compactness inside clusters and
good separability among different clusters! (A small sketch of this criterion follows.)
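A minimal sketch of this BCD:WCD criterion, assuming the clustering is given as a label vector (all names are illustrative):

```python
import numpy as np

def bcd_wcd_ratio(X, labels):
    """Ratio of total between-cluster to total within-cluster distance."""
    means = {k: X[labels == k].mean(axis=0) for k in np.unique(labels)}
    keys = list(means)
    # BCD: sum of distances between the means of every pair of clusters
    bcd = sum(np.linalg.norm(means[a] - means[b])
              for i, a in enumerate(keys) for b in keys[i + 1:])
    # WCD: sum of distances from each point to its own cluster mean
    wcd = sum(np.linalg.norm(X[labels == k] - means[k], axis=1).sum() for k in keys)
    return bcd / wcd
```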
Conclusion
• The K-means algorithm is a simple yet popular method for
clustering analysis
• Its performance is determined by the initialisation and by the choice of
an appropriate distance measure
• There are several variants of K-means to overcome its
weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: dealing with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters
Introduction
to
Pattern Recognition
Machine Perception
• Build a machine that can recognize
patterns:
– Speech recognition
– Fingerprint identification
An Example
• “Sorting incoming Fish on a conveyor
according to species using optical
sensing”
Two species: sea bass and salmon.
• Problem Analysis
• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc…
• Preprocessing
• Classification
(Figure: training samples plotted in the lightness–width feature space.)
Issue of generalization!
• Feature extraction
– Discriminative features
– Invariant features with respect to translation, rotation and
scale.
• Classification
– Use a feature vector provided by a feature extractor to
assign the object to a category
• Post-processing
– Exploit context (input-dependent information other than from
the target pattern itself) to improve performance; e.g., the same
ambiguous character reads as H in T/-\E but as A in C/-\T
• Data Collection
• Feature Choice
• Model Choice
• Training
• Evaluation
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)
This greatly reduces the computation cost: only the class distributions
need to be counted.
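As an illustrative sketch only (not the lecture's code), the independence assumption reduces training to counting per-attribute, per-class frequencies; the class name, the add-one smoothing, and the log-space scoring below are assumptions.

```python
import numpy as np
from collections import defaultdict

class NaiveBayesSketch:
    """Categorical naive Bayes: P(X | Ci) is approximated by the product of
    per-attribute terms P(x_k | Ci), assuming conditional independence."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n = len(y)
        self.priors = {c: sum(1 for yi in y if yi == c) / n for c in self.classes}
        # counts[c][k][v]: rows of class c whose k-th attribute equals v
        self.counts = {c: defaultdict(lambda: defaultdict(int)) for c in self.classes}
        self.class_n = {c: 0 for c in self.classes}
        for xi, yi in zip(X, y):
            self.class_n[yi] += 1
            for k, v in enumerate(xi):
                self.counts[yi][k][v] += 1
        return self

    def predict_one(self, x):
        def log_score(c):
            # log P(Ci) + sum_k log P(x_k | Ci), with simplified add-one smoothing
            s = np.log(self.priors[c])
            for k, v in enumerate(x):
                s += np.log((self.counts[c][k][v] + 1) / (self.class_n[c] + 2))
            return s
        return max(self.classes, key=log_score)
```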
Naïve Bayes: Example
Nearest Neighbor
Contd.
The k-Nearest Neighbor Algorithm
(Figure: a query point xq surrounded by training examples labelled + and −;
its class is decided by a vote among its k nearest neighbours.)
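A minimal illustrative sketch of the k-NN decision rule with Euclidean distance (names and the default k are assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3):
    """Classify the query point x_q by a majority vote among its k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```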
Discussion on the k-NN Algorithm
Example
SVM—Support Vector Machines
A new classification method for both linear and nonlinear
data
It uses a nonlinear mapping to transform the original
training data into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
Vapnik and colleagues (1995)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in the 1960s
Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
Used both for classification and prediction
Applications:
handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy
SVM—Margins and Support Vectors
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple
associated with the class label yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
Why Is SVM Effective on High Dimensional Data?
Dimensionality Reduction Using
PCA/LDA
Dimensionality Reduction
• One approach to deal with high dimensional data is by reducing
their dimensionality.
• Project high dimensional data onto a lower dimensional sub-space
using linear or non-linear transformations.
Dimensionality Reduction
• Linear transformations are simple to compute and tractable.
Y = U^T X   (b_i = u_i^T a)
Principal Component Analysis (PCA)
Principal Component Analysis (PCA)
• Find a basis in a low dimensional sub-space:
− Approximate vectors by projecting them in a low dimensional
sub-space:
(1) Original space representation:
Principal Component Analysis (PCA)
• Information loss
− Dimensionality reduction implies information loss !!
− PCA preserves as much information as possible:
Principal Component Analysis (PCA)
• Methodology – cont.
b_i = u_i^T (x − x̄)
Principal Component Analysis (PCA)
• Linear transformation implied by PCA
− The linear transformation R^N → R^K that performs the dimensionality
reduction is y = U^T (x − x̄), where the columns of U are the chosen eigenvectors.
Principal Component Analysis (PCA)
• Geometric interpretation
− PCA projects the data along the directions where the data varies the
most.
− These directions are determined by the eigenvectors of the
covariance matrix corresponding to the largest eigenvalues.
− The magnitude of the eigenvalues corresponds to the variance of
the data along the eigenvector directions.
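A minimal NumPy sketch of this procedure, assuming the rows of X are the data points; it keeps the K eigenvectors of the sample covariance with the largest eigenvalues and projects onto them (names are illustrative):

```python
import numpy as np

def pca_project(X, K):
    """Project the rows of X onto the K eigenvectors of the sample covariance
    matrix with the largest eigenvalues."""
    x_bar = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                    # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    U = eigvecs[:, np.argsort(eigvals)[::-1][:K]]    # top-K directions u_i
    return (X - x_bar) @ U                           # rows: b_i = u_i^T (x - x_bar)
```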
Principal Component Analysis (PCA)
• PCA and classification
− PCA is not always an optimal dimensionality-reduction procedure
for classification purposes.
• Multiple classes and PCA
− Suppose there are C classes in the training data.
− PCA is based on the sample covariance which characterizes the
scatter of the entire data set, irrespective of class-membership.
− The projection axes chosen by PCA might not provide good
discrimination power.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)
• Notation (C classes; class i has M_i samples x_j with mean μ_i); the
within-class scatter matrix is
S_w = Σ_{i=1}^{C} Σ_{j=1}^{M_i} (x_j − μ_i)(x_j − μ_i)^T
Linear Discriminant Analysis (LDA)
• Methodology
− projection: y = U^T x, where U is the projection matrix
− LDA computes a transformation that maximizes the between-class
scatter while minimizing the within-class scatter:
max_U |U^T S_b U| / |U^T S_w U|   (determinants: products of eigenvalues!)
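A minimal illustrative LDA sketch under the usual formulation, in which the optimal U consists of the leading eigenvectors of inv(S_w) S_b; the names and the between-class scatter definition below are assumptions, not the lecture's code.

```python
import numpy as np

def lda_project(X, y, K):
    """Project onto the leading eigenvectors of inv(S_w) @ S_b, which maximise
    the ratio |U^T S_b U| / |U^T S_w U|."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        S_w += (X_c - mu_c).T @ (X_c - mu_c)          # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)             # between-class scatter (assumed form)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:K]
    return X @ eigvecs[:, order].real                 # y = U^T x for each row
```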
Linear Discriminant Analysis (LDA)
• Is LDA always better than PCA?
Support Vector Machine
Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: Safe to walk or not!!!
Classification Problem?
• The goal of classification is to organize and
categorize data into distinct classes
– A model is first created based on the previous
data (training samples)
– This model is then used to classify new data
(unseen samples)
• A sample is characterized by a set of features
• Classification is essentially finding the best
boundary between classes
Classification Formulation
• Given
– an input space X
– a set of classes Ω = {ω_1, ω_2, …, ω_c}
• the Classification Problem is
– to define a mapping f: X → Ω where each x in X
is assigned to one class
• This mapping function is called a Decision Function
SVM
• An SVM model is a representation of the
examples as points in space, mapped so
that the examples of the separate
categories are divided by a clear gap that
is as wide as possible. New examples are
then mapped into that same space and
predicted to belong to a category based on
which side of the gap they fall on.
Linear Classifiers
f(x, w, b) = sign(w x + b)
• points with w x + b > 0 are classified as +1
• points with w x + b < 0 are classified as −1
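As a tiny illustrative snippet of this decision rule (names are assumptions):

```python
import numpy as np

def linear_classify(x, w, b):
    """f(x, w, b) = sign(w . x + b): +1 on one side of the boundary, -1 on the other."""
    return 1 if np.dot(w, x) + b > 0 else -1
```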
Linear Classifiers
f(x, w, b) = sign(w x + b)
Any of these separating lines would be fine…
…but which is best?
Linear Classifiers
f(x, w, b) = sign(w x + b)
A poorly placed boundary can leave points misclassified to the +1 class.
Classifier Margin
f(x, w, b) = sign(w x + b)
Define the margin of a linear classifier as the width that the boundary
could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w x + b)
The maximum margin linear classifier is the linear classifier with the
maximum margin. This is the simplest kind of SVM (called an LSVM):
the linear SVM.
Support vectors are those datapoints that the margin pushes up against.
Support Vector Machine (SVM)
• SVM was introduced by Vapnik (1995).
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training
samples, the support vectors.
• Solving an SVM is a quadratic programming problem.
Types of SVM
• Linear SVM: used when the data sets are linearly separable
• Non-Linear SVM: used when the data sets are not linearly separable
Linear SVM Mathematically
x+ and x− are support vectors on the two margin boundaries; M = margin width.
What we know:
• w . x+ + b = +1
• w . x− + b = −1
• w . (x+ − x−) = 2
Therefore M = (x+ − x−) . w / ||w|| = 2 / ||w||
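This result can be checked numerically; the sketch below assumes scikit-learn is available and uses made-up toy data, with a large C to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w = clf.coef_[0]
print("margin width M = 2 / ||w|| =", 2 / np.linalg.norm(w))   # ~2.83 here
print("support vectors:", clf.support_vectors_)
```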
Non-linear SVMs
• Solution: mapping data to a higher-dimensional
space:
(Figure: 1-D data that are not separable along x become separable after
mapping each point x to (x, x²).)
Non-linear SVMs: Feature spaces
• General idea: the original input space can always
be mapped to some higher-dimensional feature
space where the training set is separable:
Φ: x → φ(x)
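A tiny illustration of such a mapping, echoing the 1-D picture above: data that are not linearly separable in x become separable after mapping x → (x, x²). The data points and the threshold are made up for illustration.

```python
import numpy as np

# 1-D points: the +1 class lies between two groups of the -1 class,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, -1, -1])

# Map each point to a higher-dimensional feature space: phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])

# In feature space a linear classifier sign(w . phi(x) + b) with
# w = (0, -1), b = 2.5 (i.e. the line x2 = 2.5) separates the classes.
w, b = np.array([0.0, -1.0]), 2.5
pred = np.sign(phi @ w + b)
print(pred)                # [-1. -1.  1.  1. -1. -1.]
print(np.all(pred == y))   # True
```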
Properties of SVM
• Ability to handle large feature spaces
• Nice math property: training is a simple convex optimization problem that is
guaranteed to converge to a single global solution, so SVM is a
deterministic algorithm
SVM Applications
• SVM has been used successfully in
many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Weakness of SVM
• It is sensitive to noise
- A relatively small number of mislabeled examples can
dramatically decrease the performance