Machine Learning - Classifiers and Boosting: Reading CH 18.6-18.12, 20.1-20.3.2
Reading
Ch 18.6-18.12, 20.1-20.3.2
Outline
Different types of learning problems
Different types of learning algorithms
Supervised learning
Decision trees
Naïve Bayes
Perceptrons, Multi-layer Neural Networks
Boosting
Decision trees
K-nearest neighbors
Naïve Bayes
Perceptrons, Support vector Machines (SVMs), Neural Networks
Inductive learning
Let x represent the input vector of attributes
x_j is the jth component of the vector x
x_j is the value of the jth attribute, j = 1, …, d
[Figure: scatter plot of the training data in the FEATURE 1 vs. FEATURE 2 plane]
Decision Boundaries
[Figure: a decision boundary splitting the FEATURE 1 / FEATURE 2 plane into Decision Region 1 and Decision Region 2]
[Figure sequence: building a decision tree on Income and Debt — split first on Income > t1, then on Debt > t2, then on Income > t3; each split adds an axis-parallel boundary in the Income/Debt plane]
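One possible reading of this figure sequence, sketched as code: the finished tree is just nested threshold tests on Income and Debt. The branch structure, threshold values, and the "low risk"/"high risk" class labels below are illustrative assumptions, not values from the slides.

```python
# Hypothetical decision tree matching the Income > t1, Debt > t2, Income > t3 splits
# in the figure; thresholds and the class labels are made up for illustration.
def classify_applicant(income: float, debt: float,
                       t1: float = 30_000, t2: float = 10_000, t3: float = 60_000) -> str:
    if income > t1:
        if debt > t2:
            # The high-income, high-debt region is split again on income.
            return "low risk" if income > t3 else "high risk"
        return "low risk"
    return "high risk"

print(classify_applicant(income=75_000, debt=20_000))  # -> low risk
print(classify_applicant(income=25_000, debt=5_000))   # -> high risk
```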
Nearest-mean classifier
Learning: compute the mean of each class, μ_k, k = 1, …, m
Prediction:
Compute the closest mean to a test vector x (using Euclidean distance)
Predict the corresponding class
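A minimal sketch of this nearest-class-mean classifier, assuming NumPy arrays for the data; the function names and the toy data are illustrative.

```python
import numpy as np

def fit_class_means(X: np.ndarray, y: np.ndarray) -> dict:
    """Learning: compute the mean vector of each class, k = 1, ..., m."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_nearest_mean(x: np.ndarray, means: dict):
    """Prediction: return the class whose mean is closest to x (Euclidean distance)."""
    return min(means, key=lambda label: np.linalg.norm(x - means[label]))

# Tiny illustrative example (made-up data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [6.0, 5.0], [7.0, 6.0]])
y = np.array([0, 0, 1, 1])
means = fit_class_means(X, y)
print(predict_nearest_mean(np.array([2.0, 1.5]), means))  # -> 0
```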
[Figure: the training data plotted in the FEATURE 1 / FEATURE 2 plane]
[Figure sequence: toy two-feature example — a query point '?' among training examples of classes 1 and 2, and the resulting decision regions for Class 1 and Class 2]
[Figure: classifier decision regions on the training data in the FEATURE 1 / FEATURE 2 plane — one region predicts blue, the other predicts red]
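The outline lists K-nearest neighbors, which the figures above illustrate; here is a minimal k-NN prediction sketch, assuming NumPy arrays (the toy data, labels, and function names are illustrative).

```python
import numpy as np
from collections import Counter

def knn_predict(x: np.ndarray, X_train: np.ndarray, y_train: np.ndarray, k: int = 3):
    """Predict the majority class among the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative example (made-up data).
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.5]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(np.array([1.5, 1.0]), X_train, y_train, k=3))  # -> red
```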
Linear Classifiers
Compute a weighted sum of the inputs: Σ_j w_j x_j
Classify x into one class if Σ_j w_j x_j > 0, and into the other class otherwise
The decision boundary is where Σ_j w_j x_j = w^T x = 0
Learning consists of searching in the d-dimensional weight space for the set of weights (the
linear boundary) that minimizes an error measure
A threshold can be introduced by a dummy feature that is always one; its weight corresponds
to (the negative of) the threshold
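A small sketch of that dummy-feature trick: append a constant 1 to each input so the last weight plays the role of the (negative) threshold. The weights and inputs below are made-up numbers.

```python
import numpy as np

def linear_decision(x: np.ndarray, w: np.ndarray) -> int:
    """Classify by the sign of the weighted sum w . x (decision boundary: w . x = 0)."""
    return 1 if w @ x > 0 else -1

# Dummy-feature trick: w.x + w_last * 1 > 0 is the same as w.x > threshold,
# with w_last = -(threshold).
x = np.array([2.0, 3.0])
x_aug = np.append(x, 1.0)           # [x1, x2, 1]
w = np.array([0.5, -0.1, -0.4])     # last weight = -(threshold)
print(linear_decision(x_aug, w))    # -> 1, since 0.5*2 - 0.1*3 - 0.4 = 0.3 > 0
```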
[Figures: three candidate linear decision boundaries on the same data — "A Possible Decision Boundary", "Another Possible Decision Boundary", and the "Minimum Error Decision Boundary" (FEATURE 1 vs. FEATURE 2)]
Learning the weights by minimizing the squared error on the training data:

E[w] = Σ_i [ f(x(i)) - y(i) ]²

where x(i) is the ith input vector in the training data, i = 1, …, N
y(i) is the ith target value (-1 or 1)
f(x(i)) = Σ_j w_j x_j(i) is the weighted sum computed by the classifier

Notes:
Gradient descent adjusts each weight in the direction of the negative gradient of E[w]:
w_j = w_j + c * error * x_j(i)
where error = y(i) - f(x(i)) and c is a small positive learning-rate constant.
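A minimal sketch of this gradient-descent weight update, assuming NumPy and the constant-one dummy feature from above; the toy data, learning rate c, and epoch count are illustrative choices.

```python
import numpy as np

def train_linear(X: np.ndarray, y: np.ndarray, c: float = 0.01, epochs: int = 100) -> np.ndarray:
    """Learn weights by repeatedly applying w_j <- w_j + c * error * x_j(i),
    where error = y(i) - f(x(i)) and f(x(i)) = sum_j w_j x_j(i)."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])  # dummy feature of 1s for the threshold
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X_aug, y):
            error = y_i - w @ x_i                 # y(i) - f(x(i))
            w += c * error * x_i                  # gradient-descent step on squared error
    return w

# Tiny illustrative example (made-up linearly separable data, targets -1 / +1).
X = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
y = np.array([-1, -1, 1, 1])
w = train_linear(X, y)
print(np.sign(X @ w[:-1] + w[-1]))                # predicted classes
```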
[Figure: naive Bayes model — class variable C as parent of the feature nodes X1, X2, X3, …, Xn]
Bayes Rule: P(C | X1, …, Xn) is proportional to P(C) Π_i P(Xi | C)
[note: the denominator P(X1, …, Xn) is constant across classes and may be ignored.]
Features Xi are conditionally independent given the class variable C
Choose the class value ci with the highest P(ci | x1, …, xn)
Simple to implement, often works very well
e.g., spam email classification: the Xi are counts of words in emails
Conditional probabilities P(Xi | C) can easily be estimated from labeled data
Problem: Need to avoid zeroes, e.g., from limited training data
Solutions: Pseudo-counts, beta[a,b] distribution, etc.
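A minimal sketch of the word-count (spam) example with pseudo-counts, assuming NumPy; the toy emails, labels, and the pseudo-count value of 1 are made up for illustration.

```python
import numpy as np

def train_naive_bayes(X: np.ndarray, y: np.ndarray, pseudo: float = 1.0):
    """Estimate P(C) and multinomial P(word | C) from word-count data,
    adding a pseudo-count to every word so no probability is exactly zero."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}
    for c in classes:
        counts = X[y == c].sum(axis=0) + pseudo   # pseudo-counts avoid zeros
        cond[c] = counts / counts.sum()           # P(word | C = c)
    return priors, cond

def predict(x: np.ndarray, priors, cond):
    """Choose the class maximizing log P(C) + sum_i x_i * log P(word_i | C)."""
    scores = {c: np.log(priors[c]) + x @ np.log(cond[c]) for c in priors}
    return max(scores, key=scores.get)

# Toy example: rows = emails, columns = word counts (made-up data).
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y = np.array(["ham", "ham", "spam", "spam"])
priors, cond = train_naive_bayes(X, y)
print(predict(np.array([1, 3, 0]), priors, cond))  # -> spam
```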
Summary
Learning
Given a training data set, a class of models, and an error function,
learning is essentially a search or optimization problem: find the model in the class that minimizes the error on the training data