CCST9017 (2023-24 Lecture 11, Printed Version) Machine Learning
Unsupervised learning
Clustering: k-means,
Probability distribution estimation: Naïve Bayes,
Hidden Markov models (HMM).
Reinforcement learning
Decision making (robot, chess machine)
The data and the goal
[Figure: two classes of points, Class 1 and Class 2, with several candidate separating lines drawn between them]
Many decision boundaries can separate these two classes.
Which one should we choose?
Bad Decision Boundaries
[Figure: two examples of poorly placed decision boundaries between Class 1 and Class 2]
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1,\ i = 1, 2, \ldots, r$ summarizes:
$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$ for $y_i = 1$,
$\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ for $y_i = -1$.
Lagrangian of Original Problem
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{r} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], \qquad \alpha_i \ge 0$
The Dual Optimization Problem
We can transform the problem to its dual, in which the data appear only through dot products of the $\mathbf{x}_i$.
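The printed slide does not reproduce the dual itself; for reference, a standard statement of it (reconstructed from the usual SVM derivation, consistent with the $\alpha$ values on the next slide) is:

$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)$$
$$\text{subject to}\quad \alpha_i \ge 0,\ i = 1, \ldots, r, \qquad \sum_{i=1}^{r} \alpha_i y_i = 0.$$

Because the data enter only through the dot products $\mathbf{x}_i \cdot \mathbf{x}_j$, the dot product can later be replaced by a kernel function.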
Support vectors
The $\alpha_i$'s with values different from zero mark the support vectors (they hold up the separating plane)!
[Figure: separating plane between Class 1 and Class 2; in this example $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and all other $\alpha_i = 0$.]
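A standard consequence, not printed on this slide but following directly from the dual above: the resulting classifier depends only on the support vectors, since

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{r} \alpha_i y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$$

and every term with $\alpha_i = 0$ drops out.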
Non-Linear SVM
How could we generalize this procedure to non-linear data?
Idea: map the data into a higher-dimensional feature space H and separate them there with a linear boundary.
Why does this help? A linear operation in H is equivalent to a non-linear operation in the input space.
Non-linear SVMs: Feature Space
General idea: the original input space ($\mathbf{x}$) can be mapped to some higher-dimensional feature space ($\varphi(\mathbf{x})$) where the training set is separable:
$\Phi: \mathbf{x} \mapsto \varphi(\mathbf{x})$
For example, for $\mathbf{x} = (x_1, x_2)$: $\varphi(\mathbf{x}) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$
[Figure: the 2-D input space mapped to a 3-D feature space with axes $x_1^2$, $x_2^2$, $\sqrt{2}\,x_1 x_2$, where the two classes become linearly separable.]
If data are mapped into a space of sufficiently high dimension, then they will in general be linearly separable:
N data points are in general separable in a space of N-1 dimensions or more!
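As a quick check of the mapping above, the following sketch (plain NumPy; the helper names `phi` and `poly_kernel` are mine, not from the lecture) verifies that the dot product in the feature space $(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$ equals the squared dot product $(\mathbf{x} \cdot \mathbf{z})^2$ in the input space, so the feature map never has to be computed explicitly:

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both values agree, so the kernel computes the feature-space
# dot product without ever forming phi explicitly.
print(np.dot(phi(x), phi(z)))  # 1.0
print(poly_kernel(x, z))       # 1.0
```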
Choosing the Kernel Function
Probably the trickiest part of using an SVM.
The kernel function is important because it creates the kernel
matrix, which summarizes all the data
Many principles have been proposed (diffusion kernel, Fisher
kernel, string kernel, …)
There is even research to estimate the kernel matrix from
available information
In practice, a low-degree polynomial kernel or an RBF kernel with a
reasonable width is a good initial try (a sketch follows below)
Note that SVM with RBF kernel is closely related to RBF neural
networks, with the centers of the radial basis functions
automatically chosen for SVM
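As a concrete illustration of the "good initial try" above, here is a minimal sketch using scikit-learn (my choice of library; the lecture does not prescribe one) that fits an SVM with an RBF kernel to toy non-linear data:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that are not linearly separable: two concentric circles.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF kernel with a "reasonable width": gamma controls the kernel width,
# C controls the penalty for margin violations.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
clf.fit(X, y)

# The support vectors are exactly the points with non-zero alpha's.
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```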
Applications of SVMs
Bioinformatics
Machine Vision
Text Categorization
Handwritten Character Recognition
Time series analysis
Lots of very successful applications!!!
Unsupervised Learning
Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
These patterns are then utilized to predict the values
of the target attribute in future data instances.
Unsupervised learning: The data have no target
attribute.
We want to explore the data to find some intrinsic
structures (hidden knowledge) in them.
Clustering
Clustering is a technique for finding similarity groups in data,
called clusters. That is,
it groups data instances that are similar to (near) each other into
one cluster, and data instances that are very different (far
away) from each other into different clusters.
Clustering is often called an unsupervised learning task
because no class values denoting an a priori grouping of the data
instances are given, unlike in supervised learning.
Clustering is one of the most widely used data mining
techniques. It has a long history and is used in almost every
field, e.g., medicine, psychology, botany, sociology, biology,
archeology, marketing, insurance, libraries, etc.
What is clustering for?
Let us see some real-life examples
Example 1: group people of similar sizes together to make
“small”, “medium”, and “large” T-shirts.
Example 2: In marketing, segment customers according to
their similarities to do targeted marketing. This helps marketers
discover distinct groups in their customer bases.
Example 3: Given a collection of text documents, we want
to organize them according to their content similarities, to
produce a topic hierarchy.
In recent years, due to the rapid increase of online documents, text
clustering has become increasingly important.
What Is a Good Clustering?
A good clustering method will produce clusters
with
High intra-class similarity
Low inter-class similarity
Minimal domain knowledge required to determine input
parameters
Discovery of clusters with arbitrary shape
Ability to deal with noise and outliers
Interpretability and usability
Similarity and Dissimilarity
Between Objects: distance metrics
• Minkowski distance: for $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $X_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$,
$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
• Euclidean distance (q = 2):
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• Manhattan distance (q = 1):
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
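A minimal sketch of the three distances above in Python (the function name is mine, for illustration):

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance of order q between two p-dimensional points."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])

print(minkowski(xi, xj, 1))  # Manhattan:  |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(xi, xj, 2))  # Euclidean:  sqrt(9 + 16 + 0)      = 5.0
```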
When to use what distance
• The choice of distance measure should be based on
the particular application : What sort of similarities
would you like to detect?
• Euclidean distance – takes into account the magnitude
of the differences of the expression levels.
• In many cases it is necessary to normalize and/or
standardize the genes or arrays in order to compare the
amount of variation of two different genes or arrays
from their respective central locations.
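For instance, standardization is often done with a z-score transform. A minimal sketch, assuming the rows of the matrix are genes and the columns are arrays (this layout is my assumption, not stated in the lecture):

```python
import numpy as np

def standardize_rows(expr):
    """Z-score each row: subtract its mean and divide by its standard
    deviation, so every gene varies around 0 with unit spread."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    return (expr - mean) / std

expr = np.array([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])

# After standardization both genes have the same profile, so Euclidean
# distance now reflects the shape of variation, not its magnitude.
print(standardize_rows(expr))
```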
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same 2-D point set (axes from 0 to 10) shown in several panels grouped in different ways; one panel is titled "Six Clusters".]
Weaknesses of k-means
The algorithm is only applicable if the mean is
defined.
For categorical data, k-modes is used instead: the centroid is
represented by the most frequent value (the mode) of each attribute
within the cluster.
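A minimal sketch of such a mode-based centroid in Python (my own illustration, not code from the lecture):

```python
from collections import Counter

def mode_centroid(cluster):
    """Centroid of categorical records: the most frequent value (mode)
    of each attribute across the cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_centroid(cluster))  # ('red', 'small')
```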