Lecture 01 - Machine Learning Basics Revision
Learning: Why?
Learning: When?
Learning: What is?
What is Machine Learning?
▪ As intelligence requires knowledge, it is necessary for computers to acquire knowledge.
▪ A branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to:
– Evolve behaviors based on empirical data.
– Autonomously acquire and integrate knowledge.
– Given some model structure (prior) and a performance criterion, automatically improve their performance within it through example data or experience.
– Integrate human knowledge and empirical evidence.
What is Machine Learning?
▪ Automating automation
▪ Getting computers to program themselves
▪ Writing software is the bottleneck
▪ Let the data do the work instead!
What is Machine Learning?
▪ Traditional Programming: Data + Program → Computer → Output
▪ Machine Learning: Data + Output → Computer → Program
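To make the contrast concrete, here is a minimal sketch on a toy temperature-conversion task; the function names and data are illustrative, not from the lecture:

```python
# Traditional programming: a human writes the rule.
def fahrenheit_hand_coded(celsius):
    return celsius * 9 / 5 + 32  # the "program" is supplied by the programmer

# Machine learning: the rule (a line's slope and intercept) is inferred
# from (input, output) example pairs instead of being written by hand.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the learned "program"

celsius = [0.0, 10.0, 20.0, 30.0]      # Data
fahrenheit = [32.0, 50.0, 68.0, 86.0]  # Output (observed)
print(fit_line(celsius, fahrenheit))   # ≈ (1.8, 32.0): the rule was learned
```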
What is Machine Learning?
Machine Learning vs Statistics
▪ It is similar to statistics...
– Both fields try to uncover patterns in data
– Both fields draw heavily on calculus, probability, and
linear algebra, and share many of the same core
algorithms
Motivating Example: Learning to Filter Spam
The Learning Process
The analysis step turns each email from the email server ("real world" results) into a feature vector, for example (a feature-extraction sketch follows the list):
• Number of recipients
• Size of message
• Number of attachments
• Number of "re"s in the subject line
• …
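A minimal sketch of how such features might be computed, assuming each email is a plain Python dict; the field names (recipients, subject, body, attachments) are illustrative assumptions, not part of the lecture:

```python
def email_features(email):
    """Turn one email into the feature vector described above."""
    subject = email.get("subject", "")
    return [
        len(email.get("recipients", [])),   # number of recipients
        len(email.get("body", "")),         # size of message (characters)
        len(email.get("attachments", [])),  # number of attachments
        subject.lower().count("re:"),       # number of "re"s in the subject line
    ]

email = {"recipients": ["a@x.com", "b@x.com"],
         "subject": "Re: Re: meeting",
         "body": "see attached",
         "attachments": ["agenda.pdf"]}
print(email_features(email))  # [2, 12, 1, 2]
```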
Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E
T : Playing checkers
P : Percentage of games won against an arbitrary opponent
E : Playing practice games against itself
Why use Machine Learning?
Tasks that are Best Solved by using a Learning Algorithm
A classic example: it is very hard to write down explicit rules for what makes a handwritten digit a "2".
Tasks that are Best Solved by using a Learning Algorithm
Some more examples
▪ Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
– Objects in real scenes
▪ Generating patterns:
– Generating images or motion sequences
▪ Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
▪ Prediction:
– Future stock prices or currency exchange rates
– Which movies will a person like?
Sample Applications / Tasks That Use Machine Learning
Training and Testing
Loss
[Diagram: an input (example) is fed to the ML model (hypothesis), which produces a predicted example; the loss function evaluates the prediction to give a calculated loss, which is used to update the model.]
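A minimal sketch of this predict–evaluate–update loop, assuming a one-parameter linear model, squared-error loss, and plain gradient descent; all names and data are illustrative:

```python
# Model: y ≈ w * x; loss: squared error; update: gradient descent.
w = 0.0                                       # the hypothesis's single parameter
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (input, target) examples
lr = 0.01                                     # learning rate

for epoch in range(200):
    for x, y in data:
        y_pred = w * x                # model produces a predicted example
        loss = (y_pred - y) ** 2      # loss function evaluates the prediction
        grad = 2 * (y_pred - y) * x   # gradient of the loss w.r.t. w
        w -= lr * grad                # calculated loss drives the update

print(w)  # ≈ 2.0, the slope that generated the data
```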
Learning
Statistical Inference
▪ Inductive learning
– Involves the creation of a generalized rule from all the data.
– It uses a bottom-up approach.
– Example: Every time you eat peanuts, you start to cough. You conclude that you are allergic to peanuts.
– Example: Jennifer always leaves for school at 7:00 a.m. and is always on time. Jennifer assumes, then, that if she leaves at 7:00 a.m. for school today, she will be on time.
– Flow: Specific Observation → Pattern Recognition → General Conclusion
▪ Deductive learning
– Uses already available facts and information in order to reach a valid conclusion.
– It uses a top-down approach.
– Example: All numbers ending in 0 or 5 are divisible by 5. The number 35 ends with a 5, so it must be divisible by 5.
– Example: Red meat has iron in it, and beef is red meat. Therefore, beef has iron in it.
– Flow: Specific Observation → Formulate Hypothesis → Collect Data → Analyze Data → Do/Don't Reject Hypothesis
Statistical Inference
▪ Transductive learning
– Used in the field of statistical learning theory to refer to predicting specific examples given specific examples from a domain.
– Unlike induction, specific examples are used directly for reasoning, and no generalization is required.
– This can be a more specific, and therefore simpler, problem to solve than induction.
– A classic example of a transductive algorithm is the k-nearest neighbor algorithm, which does not model the training data but uses it directly each time a prediction is required (see the sketch below).
– It is an interesting framing of supervised learning in which the classical problem of "approximating a mapping function from data and using it to make a prediction" is seen as more difficult than required. Instead, specific predictions are made directly from the real samples in the domain; no function approximation is needed.
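A minimal 1-nearest-neighbor sketch of this idea, assuming numeric feature vectors and squared Euclidean distance; the tiny dataset is made up for illustration:

```python
# Transduction in miniature: the training data is never modeled;
# it is consulted directly for every individual prediction.
def nearest_neighbor_predict(train, query):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pair: dist2(pair[0], query))
    return label

train = [([1.0, 1.0], "spam"), ([5.0, 5.0], "ham"), ([1.2, 0.8], "spam")]
print(nearest_neighbor_predict(train, [0.9, 1.1]))  # "spam"
```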
Types of Learning
Types of Learning: Supervised Learning
▪ Given examples of feature vectors together with their desired outputs (training examples), learn a function that predicts the output for a new feature vector.
Types of Learning: Supervised Learning: Classification
▪ 𝑥 can be multi-dimensional
▪ Each dimension corresponds to an attribute
▪ Example 1: Breast cancer
– Clump Thickness
– Uniformity of Cell Size
– Uniformity of Cell Shape
Types of Learning: Supervised Learning: Classification
Classification: Applications
▪ Aka pattern recognition, e.g., recognizing a face from training examples of a person.
Types of Learning: Supervised Learning: Classification
Other Supervised Learning Settings
▪ Multi-Class Classification
– There are more than two classes into which instances can be classified.
▪ Multi-Label Classification
– Multi-label learning refers to the classification problem where each example can be assigned multiple class labels simultaneously (see the label-shape sketch after this list).
▪ Semi-supervised classification
– Makes use of both labeled and unlabeled data.
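A minimal sketch contrasting the label shapes in these settings, assuming binary indicator vectors for multi-label targets; the class names and data are illustrative:

```python
# Multi-class: each example carries exactly one of K labels.
y_multiclass = ["cat", "dog", "bird"]   # one label per example

# Multi-label: each example may carry several labels at once,
# often encoded as one binary indicator vector per example.
classes = ["news", "sports", "politics"]
y_multilabel = [
    [1, 0, 1],   # example 1 is both "news" and "politics"
    [0, 1, 0],   # example 2 is only "sports"
]
for row in y_multilabel:
    print([c for c, flag in zip(classes, row) if flag])
```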
Types of Learning: Supervised Learning: Regression
Types of Learning: Supervised Learning: Regression
Regression Applications
▪ Navigating a car: Angle of the steering wheel (CMU NavLab)
▪ Kinematics of a robot arm
Types of Learning: Supervised Learning: Uses
Types of Learning: Supervised Learning: Multi-instance learning
Types of Learning: Unsupervised Learning
[Diagram: unlabeled feature vectors (training examples) are fed to a machine learning algorithm.]
▪ For about 40 years, unsupervised learning was largely ignored by the machine learning community
– Some widely used definitions of machine learning actually excluded it.
– Many researchers thought that clustering was the only form of unsupervised learning.
Types of Learning: Unsupervised Learning: Clustering
Types of Learning: Unsupervised Learning: Learning Associations
▪ Basket analysis:
– 𝑃(𝑌|𝑋): the probability that somebody who buys 𝑋 also buys 𝑌, where 𝑋 and 𝑌 are products/services.
– Example: 𝑃(𝑐ℎ𝑖𝑝𝑠|𝑏𝑒𝑒𝑟) = 0.7, i.e., 70% of customers who buy beer also buy chips (see the sketch below).
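A minimal sketch of estimating such a conditional probability from transaction data, assuming each basket is a set of product names; the baskets are made up for illustration:

```python
def confidence(transactions, x, y):
    """Estimate P(y | x): the fraction of baskets containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"milk", "chips"},
]
print(confidence(baskets, "beer", "chips"))  # 2/3 ≈ 0.67
```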
Types of Learning: Semi-Supervised Learning
Types of Learning: Reinforcement Learning
▪ Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand
– Multiple agents, partial observability, ...
Types of Learning: Reinforcement Learning
The Agent-Environment Interface
Learning Paradigms
Learning Paradigms: Transfer learning
▪ Transfer learning settings:
– Inductive transfer learning: labelled data is available in the target domain (the source domain may contain labelled data or none).
– Transductive transfer learning: labelled data is available only in the source domain.
– Unsupervised transfer learning: no labelled data in either the source or the target domain.
Learning Paradigms: Active learning
▪ Active Learning
– Learner can query an oracle about class of an
unlabeled example in the environment.
– Learner can construct an arbitrary example and
query an oracle for its label.
Learning Paradigms: Active learning
▪ Pool-based sampling
– When there is a small set of labelled data and a large pool of unlabelled data, queries are selectively drawn from that pool.
– Based on some "informativeness measure", all candidate queries are first ranked and then the best one is selected (see the sketch below).
– This is the main difference between stream-based and pool-based sampling, i.e., individual querying versus group querying.
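A minimal sketch of pool-based uncertainty sampling, assuming a binary scorer returning probabilities in [0, 1] and using distance from 0.5 as the informativeness measure; the scorer and pool are illustrative assumptions:

```python
import math

def most_informative(pool, predict_proba):
    """Rank the unlabelled pool and return the example whose predicted
    probability is closest to 0.5 (i.e., the model is least certain)."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))

def predict_proba(x):
    # Toy scorer standing in for the current model's probability estimate.
    return 1 / (1 + math.exp(-(x - 2.0)))

pool = [0.2, 1.4, 2.9, 4.1]              # unlabelled examples
print(most_informative(pool, predict_proba))  # 1.4 — the query sent to the oracle
```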
Learning Paradigms: Ensemble Learning
Learning Paradigms: Ensemble methods
The Wisdom of Crowds
▪ Under certain controlled conditions, the aggregation of information in groups results in decisions that are often superior to those that could be made by any single member – even an expert.
▪ Imitates our second nature of seeking several opinions before making any crucial decision: we weigh the individual opinions and combine them to reach a final decision (see the voting sketch below).
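A minimal majority-voting sketch of this idea, assuming three hand-made base classifiers; the rules and inputs are illustrative, not a real ensemble method from the lecture:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine the votes of several base classifiers into one decision."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three weak "experts", each a different rule over a 2-feature input
# (number of recipients, number of "re:"s).
classifiers = [
    lambda x: "spam" if x[0] > 3 else "ham",
    lambda x: "spam" if x[1] > 2 else "ham",
    lambda x: "spam" if x[0] + x[1] > 4 else "ham",
]
print(majority_vote(classifiers, (5, 1)))  # "spam" (2 of 3 votes)
```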
Learning Paradigms: Ensemble methods
Committees of Experts
▪ "… a medical school that has the objective that all students, given a problem, come up with an identical solution."
▪ There is not much point in setting up a committee of experts from such a group – such a committee will not improve on the judgment of an individual.
▪ Consider: there needs to be disagreement for the committee to have the potential to be better than an individual.
Design
Designing a Learning System
[Diagram: the environment/experience supplies training data and testing data; the learner produces knowledge used by the performance element.]
▪ Choose the training experience
▪ Choose exactly what is to be learned
– i.e. the target function
▪ Choose how to represent the target function
▪ Choose a learning algorithm to infer the target function from the experience
Training vs. Test Distribution
ML in a Nutshell
ML in a Nutshell: Representations
▪ Model ensembles
ML in a Nutshell: Evaluation
▪ Entropy
▪ K-L divergence
▪ Etc.
▪ Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set (a computation sketch follows):
R = Recall = True Positives / (True Positives + False Negatives) = TP / (TP + FN)
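A minimal sketch computing recall exactly as defined above, assuming parallel lists of true and predicted labels with 1 as the positive class; the data is illustrative:

```python
def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN) over the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1]
print(recall(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67
```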
ML in a Nutshell: Various Search/Optimization Algorithms
ML in Practice
1. Understanding domain, prior knowledge, and goals
2. Data integration, selection, cleaning, pre-processing, etc.
3. Learning models
4. Interpreting results
5. Consolidating and deploying discovered knowledge
6. Loop
Linear Classifiers
No Linear Classifier can cover all instances
Decision tree
Top Down Induction of Decision Trees
Which One?
Issues and Solutions
Overfitting and Underfitting
▪ A: Generalization error
▪ B: Training error
▪ C: Stopping point (see the early-stopping sketch below)
▪ D: Underfitting: the model is too simple, so both training and test errors are large
▪ E: Overtraining: the model learns the training set too well – it overfits the training set and therefore performs poorly on the test set
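A minimal early-stopping sketch for finding stopping point C, assuming a precomputed list of per-epoch validation errors; the numbers are made up for illustration:

```python
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]  # per-epoch validation error
patience = 2                  # how many worsening epochs we tolerate
best, best_epoch, worse = float("inf"), 0, 0

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, worse = loss, epoch, 0  # new best: reset the counter
    else:
        worse += 1                                # validation error is rising
        if worse >= patience:                     # stop before overtraining (E)
            break

print(best_epoch, best)  # epoch 3, loss 0.50 — the analogue of point C
```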
Occam's razor (14th-century)
▪ The Occam Dilemma: Sadly, in ML, accuracy and simplicity (interpretability) are in
conflict.
No Free Lunch Theorem in Machine Learning
▪ "For any two learning algorithms, there are just as many situations (appropriately weighted) in which algorithm one is superior to algorithm two as vice versa, according to any of the measures of 'superiority'."
▪ Hume (1739–1740) pointed out that 'even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience'.
▪ More recently, and with increasing rigour, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile.
▪ Wolpert (1996) shows that in a noise-free scenario where the loss function is the misclassification rate, if one is interested in off-training-set error, then there are no a priori distinctions between learning algorithms.
No Free Lunch Theorem in Machine Learning
▪ More formally, where
– d = training set;
– m = number of elements in training set;
– f = ‘target’ input-output relationships;
– h = hypothesis (the algorithm's guess for f made in response to d); and
– C = off-training-set ‘loss’ associated with f and h (‘generalization error’)
▪ All algorithms are equivalent, on average, by any of the following measures of risk:
𝐸(𝐶|𝑑), 𝐸(𝐶|𝑚), 𝐸(𝐶|𝑓, 𝑑), or 𝐸(𝐶|𝑓, 𝑚).
– How well you do is determined by how ‘aligned’ your learning algorithm 𝑃(ℎ|𝑑) is with the actual
posterior, 𝑃(𝑓|𝑑).
▪ Wolpert's result, in essence, formalizes Hume, extends him and calls the whole of
science into question.
No Free Lunch Theorem in Machine Learning
So why Develop New Algorithms?
▪ Practitioners are mostly concerned with choosing the most appropriate algorithm for the problem at hand.
▪ This requires some a priori knowledge – data distribution, prior probabilities, complexity of the problem, the physics of the underlying phenomenon, etc.
▪ The No Free Lunch theorem tells us that – unless we have some a priori knowledge – simple classifiers (or complex ones, for that matter) are not necessarily better than others. However, given some a priori information, certain classifiers may better MATCH the characteristics of certain types of problems.
▪ The main challenge for the practitioner is then to identify the correct match between the problem and the classifier – which is yet another reason to arm yourself with a diverse arsenal of learners!
Less is More: The Curse of Dimensionality (Bellman, 1961)
The Problem with Dimensionality Reduction
Instability and the Rashomon Effect
Summary
▪ Task (from an AI perspective): Perception, Induction, Generation
▪ Data (complexity of data): Binary, category, continuous, scale, vector, graph, natural object
▪ Learning Structure: Functions, Networks
▪ Learning Algorithm (supervision): Supervised learning, Unsupervised learning, Semi-supervised learning, Reinforcement learning
References
▪ Introduction to Machine Learning by Ethem Alpaydın
▪ Machine Learning by Pedro Domingos
▪ Overview of Machine Learning by Ἀπφία Ελευθερόπουλος
▪ Introduction to Machine Learning by Eric Eaton
▪ Machine Learning by Rajat Sharma
▪ Introduction to Machine Learning by Lior Rokach
▪ Introduction to Reinforcement Learning for Beginners by Prathima Kadari
▪ Statistics in Marketing - Discrete Probability Distributions by Bernard Yeo
▪ Examples of Occam's Razor: Principle Simply Explained by Jennifer Gunner
▪ No Free Lunch Theorems
▪ A Treatise of Human Nature: Being an Attempt to Introduce the Experimental Method of Reasoning into Moral Subjects by David Hume
▪ The Danger of Dimensionality Reduction by Jessica Lin
▪ How was the voting in ancient Greece by greecehighdefinition
▪ No, Machine Learning is not just glorified Statistics by Joe Davison
▪ Chapter 1 - Introduction by National Cheng Kung University – NCKU
▪ 8 Inspirational Applications of Deep Learning by Jason Brownlee
▪ The world of loss function by 홍배 김 (Hongbae Kim)
▪ Introduction to Semi-Supervised Learning by TEKSANDS
▪ Gentle Introduction to Transduction in Machine Learning by Jason Brownlee
▪ Anomaly Detection and Complex Event Processing over IoT Data Streams by Patrick Schneider and Fatos Xhafa
▪ Let Your Model Select Which Data to Learn From — Basics of Active Learning by Parthvi Shah