
CCST9017
Hidden Order in Daily Life: A Mathematical Perspective

Lecture 11: AI and Machine Learning

Dr. Zhiwen Zhang
Department of Mathematics, HKU
Contents
• Types of machine learning
  - Supervised learning, unsupervised learning, reinforcement learning
• Deep learning
  - Artificial neural networks, deep neural networks, stochastic gradient descent
• Applications and artificial intelligence (AI)
  - Automatic speech recognition, image recognition, drug discovery, medical image analysis, mobile advertising, financial transactions, etc.
What is Learning?
• "Learning denotes changes in a system that ... enable a system to do the same task ... more efficiently the next time." - Herbert Simon
• "Learning is making useful changes in our minds." - Marvin Minsky
• "Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge."
Machine learning
• Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
• Machine learning algorithms build a mathematical model based on sample data, known as "training data", to make predictions or decisions without being explicitly programmed to do so.
• Machine learning is closely related to computational statistics, which focuses on making predictions using computers.
• The study of mathematical optimization delivers methods, theory, and application domains to the field of machine learning.
Why Machine Learning?
• No human experts
  - industrial/manufacturing control
  - mass spectrometer analysis, drug design, astronomical discovery
• Black-box human expertise
  - face/handwriting/speech recognition
  - driving a car, flying a plane
• Rapidly changing phenomena
  - credit scoring, financial modeling
  - diagnosis, fraud detection
• Need for customization/personalization
  - personalized news readers
  - movie/book recommendation
Example: Spam Filter
Example: Digit Recognition
Related Fields
[Diagram: machine learning at the center, connected to decision theory, game theory, control theory, AI, information theory, biological evolution, probability & statistics, philosophy, optimization, data mining, statistical mechanics, psychology, computational complexity theory, and neurophysiology.]
Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
Machine learning and our focus
• Machine learning is like human learning from past experiences; a computer, however, does not have "experiences".
• A computer system learns from data, which represent some "past experiences" of an application domain.
• Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, high risk or low risk.
• For example, a credit card company receives thousands of applications for new cards. Each application contains information about an applicant, including age, marital status, annual salary, etc. Problem: should an application be approved?
Machine Learning Problems
• Supervised learning: data and corresponding labels are given.
• Unsupervised learning: only data is given; no labels are provided.
• Semi-supervised learning: some (but not all) labels are present.
• Reinforcement learning: an agent interacting with the world makes observations, takes actions, and is rewarded or punished; it should learn to choose actions in such a way as to obtain a lot of reward.
Supervised vs. unsupervised learning
• Supervised learning: classification is seen as supervised learning from examples.
  - Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes, as if a "teacher" gave the classes (supervision).
  - Test data are classified into these classes too.
• Unsupervised learning (clustering)
  - Class labels of the data are unknown.
  - Given a set of data, the task is to establish the existence of classes or clusters in the data.
Algorithms
• Supervised learning
  - Classification (discrete labels): linear classifiers (e.g., support vector machines), decision tree algorithms.
  - Regression (real values)
• Unsupervised learning
  - Clustering: k-means
  - Probability distribution estimation: naive Bayes, hidden Markov models (HMM)
• Reinforcement learning
  - Decision making (robots, chess machines)
The data and the goal
• Data: a set of data records (also called examples, instances, or cases) described by
  - k attributes: A1, A2, ..., Ak
  - a class: each example is labelled with a pre-defined class.
• Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Example: data (loan application)
[Table: loan application records, with the class label "approved or not".]
An example: the learning task
• Learn a classification model from the data.
• Use the model to classify future loan applications into
  - Yes (approved) and
  - No (not approved)
• What is the class for the following case/instance?
Decision tree
• Decision tree learning is one of the most widely used techniques for classification.
  - Its classification accuracy is competitive with other methods, and it is very efficient.
• The classification model is a tree, called a decision tree.
• C4.5, an algorithm for generating decision trees developed by Ross Quinlan, is ranked #1 in the Top 10 Algorithms in Data Mining.
A decision tree from the loan data
[Figure: a decision tree with decision nodes and leaf nodes (classes).]
• Is the decision tree unique? No; a simpler tree is possible.
• We want a tree that is both small and accurate: easy to understand, and it performs better.
• Finding the best tree is NP-hard; all current tree-building algorithms are heuristic.
Choose an attribute to partition data
• The key to building a decision tree is which attribute to choose in order to branch.
• The objective is to reduce impurity or uncertainty in the data as much as possible.
  - A subset of data is pure if all instances belong to the same class.
• The heuristic in C4.5 is to choose the attribute with the maximum information gain or gain ratio, based on information theory.
Another example for decision trees
Decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute (feature)-based representations
• Examples are described by feature (attribute) values (Boolean, discrete, continuous).
• E.g., situations where I will/won't wait for a table:
[Table: 12 example situations with their attribute values; each example is classified positive (T) or negative (F).]
Decision trees
• One possible representation for hypotheses.
• E.g., here is the "true" tree for deciding whether to wait:
[Figure: the "true" decision tree for the restaurant problem.]
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
[Figure: candidate splits on Patrons and on Type.]
• Patrons? is a better choice.
Attribute Selection Measure: Information Gain (C4.5)
• Select the attribute with the highest information gain.
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σ_{i=1}^{m} p_i log₂(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
Information Gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too):
  IG(Patrons) = 1 - [ (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits
  IG(Type) = 1 - [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits
Patrons has the highest IG of all attributes and so is chosen by the decision tree algorithm as the root.
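These numbers are easy to verify. Below is a short Python check, assuming only the entropy formula from the previous slide; the split counts for Patrons and Type are taken directly from the expressions above.

```python
# Verify the information-gain arithmetic on this slide.
from math import log2

def I(*ps):
    """Entropy of a class distribution given as probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Whole training set: p = n = 6 out of 12 examples
print(I(6/12, 6/12))                        # 1.0 bit

# Patrons splits the 12 examples into None(0+,2-), Some(4+,0-), Full(2+,4-)
ig_patrons = 1 - (2/12*I(0/2, 2/2) + 4/12*I(4/4, 0/4) + 6/12*I(2/6, 4/6))
print(round(ig_patrons, 3))                 # 0.541 bits

# Type splits into French(1+,1-), Italian(1+,1-), Thai(2+,2-), Burger(2+,2-)
ig_type = 1 - (2/12*I(1/2, 1/2) + 2/12*I(1/2, 1/2)
               + 4/12*I(2/4, 2/4) + 4/12*I(2/4, 2/4))
print(ig_type)                              # 0.0 bits
```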
One typical decision tree
• Decision tree learned from the 12 examples:
[Figure: the learned tree, with Patrons at the root.]
• It is substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data.
Decision Tree Based Classification
• Advantages:
  - Easy to construct/implement
  - Extremely fast at classifying unknown records
  - Models are easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
• Disadvantages:
  - Computationally expensive to train
  - Some decision trees can be overly complex and fail to generalise the data well
  - Overfitting: a decision tree may overfit the training data and give wrong results on test data
Support vector machine
• The support vector machine (SVM) was invented by V. Vapnik and his co-workers in the 1970s in Russia. SVM is one of the most popular supervised learning algorithms, used for classification as well as regression problems.
• SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative.
• Kernel functions are used for nonlinear separation.
• SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data.
• It is perhaps the best classifier for text classification. SVMs can also be applied to the classification of images, satellite data, etc.
Basic concepts
• Let the set of training examples D be
  {(x_1, y_1), (x_2, y_2), ..., (x_r, y_r)},
  where x_i = (x_{i1}, x_{i2}, ..., x_{in}) is an input vector in a real-valued space X ⊆ Rⁿ and y_i is its class label (output value), y_i ∈ {1, -1}.
  1: positive class; -1: negative class.
• SVM finds a linear function of the form (w: weight vector)
  f(x) = ⟨w · x⟩ + b,
  which is called a support vector machine, and classifies by
  y_i = +1 if ⟨w · x_i⟩ + b ≥ 0, and y_i = -1 if ⟨w · x_i⟩ + b < 0.
The hyperplane
• The hyperplane that separates positive and negative training data is
  ⟨w · x⟩ + b = 0.
• It is also called the decision boundary (surface).
• There are many possible hyperplanes. Which one should we choose?
An example: two-class problem
[Figure: points of Class 1 and Class 2 in the plane.]
• Many decision boundaries can separate these two classes.
• Which one should we choose?
Bad Decision Boundaries
[Figure: two decision boundaries that pass very close to the training points of Class 1 and Class 2.]
SVM looks for the separating hyperplane with the largest margin.
Optimal decision boundary: margin should be maximized
• The decision boundary should be as far away from the data of both classes as possible.
• We should maximize the margin
  m = 2 / √(w · w) = 2 / ||w||.
• Support vectors: the data points that the margin pushes up against.
[Figure: Class 1 and Class 2 separated by a hyperplane with margin m.]
• The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).
The Optimization Problem
• Let {x_1, ..., x_r} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
• The decision boundary should classify all points correctly:
  ⟨w · x_i⟩ + b ≥ 1 for y_i = 1,
  ⟨w · x_i⟩ + b ≤ -1 for y_i = -1,
  which a constrained optimization problem summarizes as
  y_i (⟨w · x_i⟩ + b) ≥ 1, i = 1, 2, ..., r.
Lagrangian of the Original Problem
• The Lagrangian is (α_i: Lagrangian multipliers)
  L(w, b, α) = (1/2) ||w||² - Σ_{i=1}^{r} α_i [ y_i (⟨w · x_i⟩ + b) - 1 ],  α_i ≥ 0.
• Note that ||w||² = wᵀw.
• Setting the gradient of L with respect to w and b to zero, we have
  w = Σ_{i=1}^{r} α_i y_i x_i and Σ_{i=1}^{r} α_i y_i = 0.
The Dual Optimization Problem
• We can transform the problem to its dual. The data appear only through the dot products x_i · x_j, and the α's are the new variables (Lagrangian multipliers):
  max_α  Σ_{i=1}^{r} α_i - (1/2) Σ_{i=1}^{r} Σ_{j=1}^{r} α_i α_j y_i y_j (x_i · x_j)
  subject to α_i ≥ 0 and Σ_{i=1}^{r} α_i y_i = 0.
• This is a convex quadratic programming (QP) problem, so the global maximum over the α_i can always be found.
• There are well-established tools for solving this optimization problem (e.g., CPLEX).
• Note: the weight vector is recovered as w = Σ_i α_i y_i x_i.
A Geometrical Interpretation
[Figure: Class 1 and Class 2 separated by the maximum-margin plane. The α's with values different from zero (e.g., α₁ = 0.8, α₆ = 1.4, α₈ = 0.6) belong to the support vectors; they "hold up" the separating plane. All other points have α_i = 0.]
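For illustration, here is a hedged scikit-learn sketch of the same picture: after fitting a linear SVM on toy data (made up here), only the support vectors carry nonzero multipliers. scikit-learn exposes the products α_i · y_i as `dual_coef_`:

```python
# Toy two-class data, invented for illustration only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points that "hold up" the plane
print(clf.dual_coef_)             # alpha_i * y_i, support vectors only
print(clf.coef_, clf.intercept_)  # w = sum_i alpha_i y_i x_i, and b
```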
Non-Linear SVM
• How could we generalize this procedure to non-linear data?
• Vapnik showed in 1992 that transforming the input data x_i into a higher-dimensional space makes the problem easier.
• We know that the data appear only as dot products (x_i · x_j).
• Suppose we transform the data to some (possibly infinite-dimensional) space H via a mapping function Φ, such that the data appear in the form Φ(x_i) · Φ(x_j).
• Why? A linear operation in H is equivalent to a non-linear operation in the input space.
Non-linear SVMs: Feature Space
General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
  Φ: x → φ(x),  x = (x₁, x₂),  φ(x) = (x₁², x₂², √2 x₁x₂)
If data are mapped into a space of sufficiently high dimension, then they will in general be linearly separable: N data points are in general separable in a space of N-1 dimensions or more!
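This feature map can be checked numerically. The following short snippet (a standard identity, not taken from the slides) verifies that the dot product in the feature space equals the squared dot product in the input space, i.e., φ(x) · φ(z) = (x · z)², so the mapping never needs to be computed explicitly:

```python
# Verify phi(x).phi(z) = (x.z)^2 for the quadratic feature map above.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))  # 1.0, via the explicit feature map
print(np.dot(x, z) ** 2)       # 1.0, via the kernel: same number
```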
Choosing the Kernel Function
• Probably the trickiest part of using an SVM.
• The kernel function is important because it creates the kernel matrix, which summarizes all the data.
• Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...).
• There is even research on estimating the kernel matrix from available information.
• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try; see the sketch after this list.
• Note that an SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen by the SVM.
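A minimal sketch of that "initial try" advice, using scikit-learn on a toy nonlinear data set (two concentric rings, generated here purely for illustration):

```python
# Compare a low-degree polynomial kernel and an RBF kernel on ring data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel, params in [("poly", {"degree": 2}), ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    # Training accuracy; both kernels should separate the rings well here
    print(kernel, clf.score(X, y))
```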
Applications of SVMs
• Bioinformatics
• Machine vision
• Text categorization
• Handwritten character recognition
• Time series analysis
Lots of very successful applications!
Unsupervised Learning
• Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute.
  - These patterns are then utilized to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
  - We want to explore the data to find some intrinsic structures (hidden knowledge) in them.
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given (which is the case in supervised learning).
• Clustering is one of the most utilized data mining techniques. It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
What is clustering for?
Let us see some real-life examples:
• Example 1: group people of similar sizes together to make "small", "medium", and "large" T-shirts.
• Example 2: in marketing, segment customers according to their similarities, to do targeted marketing and help marketers discover distinct groups in their customer bases.
• Example 3: given a collection of text documents, organize them according to their content similarities to produce a topic hierarchy.
  - In recent years, due to the rapid increase in online documents, text clustering has become important.
What Is a Good Clustering?
A good clustering method will produce clusters with:
• High intra-class similarity
• Low inter-class similarity
• Minimal domain knowledge required to determine input parameters
• Discovery of clusters with arbitrary shape
• Ability to deal with noise and outliers
• Interpretability and usability
Similarity and Dissimilarity Between Objects: Distance Metrics
Let x_i = (x_{i1}, x_{i2}, ..., x_{ip}) and x_j = (x_{j1}, x_{j2}, ..., x_{jp}) be two objects.
• Minkowski distance:
  d(i, j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q )^{1/q}
• Euclidean distance (q = 2):
  d(i, j) = √( |x_{i1} - x_{j1}|² + |x_{i2} - x_{j2}|² + ... + |x_{ip} - x_{jp}|² )
• Manhattan distance (q = 1):
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|
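These formulas translate directly into code. A small sketch, assuming only the definitions above:

```python
# Minkowski family of distances; q=1 gives Manhattan, q=2 gives Euclidean.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

xi = [1.0, 2.0, 3.0]
xj = [4.0, 6.0, 3.0]

print(minkowski(xi, xj, 1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```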
When to use what distance
• The choice of distance measure should be based on the particular application: what sort of similarities would you like to detect?
• Euclidean distance takes into account the magnitude of the differences of the expression levels.
• In many cases it is necessary to normalize and/or standardize genes or arrays in order to compare the amount of variation of two different genes or arrays from their respective central locations.
Notion of a Cluster can be Ambiguous
[Figure: the same set of points grouped as two clusters, four clusters, or six clusters.] How many clusters? The answer can be ambiguous.
K-means clustering algorithm
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic method: the k-means algorithm (MacQueen, 1967), where each cluster is represented by the center of the cluster.
• Given k, the k-means algorithm consists of four steps:
  1. Select initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Compute each centroid as the mean of the objects assigned to it.
  4. Repeat the previous two steps until no change.
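A minimal NumPy sketch of these four steps. The toy data and the simple stopping rule are assumptions for illustration; a production implementation would also handle empty clusters and multiple restarts.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select initial centroids at random (without replacement)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its objects
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: repeat until the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)     # two groups: the three small points, the three large ones
print(centroids)  # roughly (1.17, 1.17) and (8.5, 8.5)
```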
K-means clustering algorithm: example
[Figure: four panels showing k-means iterations on a 2-D data set with both axes running from 0 to 10: initial centroids, first assignment, centroid update, and the final clustering after convergence.]
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
  - For categorical data, use k-modes: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  - Outliers are data points that are very far away from other data points.
  - Outliers could be errors in the data recording or some special data points with very different values.
Weaknesses of k-means: problems with outliers
[Figure: clustering results distorted by outliers.]
Weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Some Comments
• Despite its weaknesses, k-means is still the most popular algorithm, due to its simplicity and efficiency; other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
• Clustering methods are descriptive techniques, not interpretative, let alone predictive.
  "It is a long way from clustering genes to finding their functional roles and, moreover, to understanding the underlying biological process."
