CS 446: Machine Learning
Dan Roth
University of Illinois, Urbana-Champaign
[email protected]
http://L2R.cs.uiuc.edu/~danr
3322 SC
x ∈ X  →  System: y = f(x)  →  y ∈ Y
An item x drawn from an input space X; an item y drawn from an output space Y.
x ∈ X  →  Learned Model: y = g(x)  →  y ∈ Y
An item x drawn from an instance space X; an item y drawn from a label space Y.
Supervised learning: Training
Labeled training data:
D_train = (x1, y1), (x2, y2), …, (xN, yN)
Learning Algorithm: D_train → learned model g(x)
Give the learner the examples in D_train; the learner returns a model g(x).
Supervised learning: Testing
Labeled test data:
D_test = (x’1, y’1), (x’2, y’2), …, (x’M, y’M)
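In code, this training/testing protocol is just an interface (a minimal sketch; the trivial majority-class learner and the tiny data set are made up for illustration):

from collections import Counter

def learn_majority(D_train):
    """A trivial learner: the returned model g(x) ignores x and
    predicts the most common training label."""
    most_common = Counter(y for _, y in D_train).most_common(1)[0][0]
    return lambda x: most_common

def evaluate(g, D_test):
    """Accuracy of the learned model g on held-out labeled test data."""
    return sum(1 for x, y in D_test if g(x) == y) / len(D_test)

D_train = [((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]
D_test = [((1, 1), 1), ((0, 0), 0)]
g = learn_majority(D_train)   # the learner returns a model g(x)
print(evaluate(g, D_test))    # 0.5

Any real learner fits the same interface: consume D_train, return a model g, and measure success on D_test.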
Representation
What functions should we learn (hypothesis spaces)?
How do we map raw input to an instance space?
Is there a rigorous way to find these? Any general approach?
Algorithms
What are good algorithms?
How do we define success?
Generalization vs. overfitting
The computational problem
Using supervised learning
What is our instance space?
Gloss: What kind of features are we using?
What is our label space?
Gloss: What kind of learning task are we dealing with?
What is our hypothesis space?
Gloss: What kind of functions (models) are we learning?
What learning algorithm do we use?
Gloss: How do we learn the model from the labeled data?
What is our loss function/evaluation metric?
Gloss: How do we measure success? What drives learning?
(Figure: example points plotted in a two-dimensional instance space with axes x1 and x2.)
Good features are essential
We could be wrong!
Our prior knowledge might be wrong:
y = x4 ∧ one-of(x1, x3) is also consistent with the data.
Our guess of the hypothesis space could be wrong
What function?
What’s best?
A possibility: Define the learning problem to be:
A (linear) function that best separates the data:
y = sgn{wᵀx}
Expressivity
f(x) = sgn{x · w − θ} = sgn{∑_{i=1..n} w_i x_i − θ}
Many functions are linear (probabilistic classifiers as well):
Conjunctions:
y = x1 ∧ x3 ∧ x5
y = sgn{1·x1 + 1·x3 + 1·x5 − 3};  w = (1, 0, 1, 0, 1), θ = 3
At least m of n:
y = at least 2 of {x1, x3, x5}
y = sgn{1·x1 + 1·x3 + 1·x5 − 2};  w = (1, 0, 1, 0, 1), θ = 2
Many functions are not:
Xor: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
But they can be made linear (see the input transformation below).
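A quick check of these claims (a sketch in plain Python; the convention sgn(z) = 1 iff z ≥ 0 is assumed, so that meeting the threshold exactly counts as positive):

from itertools import product

def sgn(z):
    return 1 if z >= 0 else 0   # threshold met counts as positive

def ltu(w, theta):
    """Linear threshold unit: x -> sgn(w·x − θ)."""
    return lambda x: sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)

conj = ltu((1, 0, 1, 0, 1), 3)         # x1 ∧ x3 ∧ x5
at_least_2 = ltu((1, 0, 1, 0, 1), 2)   # at least 2 of {x1, x3, x5}

for x in product((0, 1), repeat=5):
    x1, _, x3, _, x5 = x
    assert conj(x) == (x1 and x3 and x5)
    assert at_least_2(x) == (x1 + x3 + x5 >= 2)

No choice of w and θ passes the same kind of test for Xor, which is what motivates the input transformation below.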
(Figure: the Xor function over x_i ∈ {0,1} plotted in the (x1, x2) plane; no straight line separates the positive from the negative points.)
Input Transformation
New Space: Y = {y1, y2, …} = {x_i, x_i x_j, x_i x_j x_k, …}
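For example, a sketch (the weights are an illustrative choice, found by inspection) showing that the Xor function above becomes a single linear threshold function once the product feature x1·x2 is added to the input:

from itertools import product

def sgn(z):
    return 1 if z >= 0 else 0

def target(x1, x2):
    # y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2), the "Xor" from the Expressivity slide
    return int((x1 and x2) or (not x1 and not x2))

def g(x1, x2):
    phi = (x1, x2, x1 * x2)   # new space: original features plus x1·x2
    w = (-1, -1, 2)           # one workable choice of weights; θ = 0
    return sgn(sum(wi * pi for wi, pi in zip(w, phi)))

for x1, x2 in product((0, 1), repeat=2):
    assert g(x1, x2) == target(x1, x2)

The target function is unchanged; only the representation of the input changed, which is the whole point of the transformation.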
Third Step: How to Learn?
A possibility: Local search
Start with a linear threshold function.
See how well you are doing.
Correct; repeat.
A Learning Algorithm
Define the misclassification error of w:
Err(w) = fraction of examples d ∈ D on which sgn(w · x_d) disagrees with the label y_d
We would like to minimize Err(w) directly, but we cannot do it: it is a discrete count that is hard to optimize. Instead we minimize a smooth error J(w) that we can decrease step by step.
(Figure: the error J(w) as a function of the weights w; successive weight vectors w1, w2, w3, w4 descend toward the minimum.)
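As a sketch, the quantity we would like to minimize but cannot attack directly (assuming labels in {0,1} and the sgn convention used earlier):

def misclassification_error(w, D):
    """Err(w): fraction of examples (x, y) in D with sgn(w·x) != y."""
    def sgn(z):
        return 1 if z >= 0 else 0
    mistakes = sum(1 for x, y in D
                   if sgn(sum(wi * xi for wi, xi in zip(w, x))) != y)
    return mistakes / len(D)

Changing w slightly usually leaves this count unchanged, so it gives no direction to follow; hence the smooth J(w) that comes next.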
LMS: An Optimization Algorithm
Let w^(j) be the current weight vector.
Our prediction on the d-th example x_d of the data is:
o_d = w^(j) · x_d = ∑_i w_i^(j) x_id
Let t_d be the target value for this example (a real value; it represents u · x_d).
The error the current hypothesis makes on the data set is:
J(w) = Err(w^(j)) = ½ ∑_{d∈D} (t_d − o_d)²
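These definitions transcribe directly (a sketch; D is a list of (x_d, t_d) pairs):

def predict(w, x):
    """o_d = w · x_d"""
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_error(w, D):
    """J(w) = ½ Σ_{d∈D} (t_d − o_d)²"""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in D)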
Gradient Descent
To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w:
∇E(w) = [∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_n]
Therefore:
∂E/∂w_i = ∂/∂w_i ½ ∑_{d∈D} (t_d − o_d)²
= ½ ∑_{d∈D} ∂/∂w_i (t_d − o_d)²
= ½ ∑_{d∈D} 2(t_d − o_d) ∂/∂w_i (t_d − w · x_d)
= ∑_{d∈D} (t_d − o_d)(−x_id)
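The last line is all the code needs; the finite-difference check below (reusing predict and squared_error from the previous sketch) confirms the derivation numerically:

def gradient(w, D):
    """∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_id)"""
    return [sum((t - predict(w, x)) * (-x[i]) for x, t in D)
            for i in range(len(w))]

def numeric_gradient(w, D, eps=1e-6):
    """Central-difference estimate of ∂E/∂w_i, for checking the algebra."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((squared_error(wp, D) - squared_error(wm, D)) / (2 * eps))
    return g

On any small data set the two agree up to tiny numerical error.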
Gradient Descent: LMS
Weight update rule:
Δw_i = R ∑_{d∈D} (t_d − o_d) x_id
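Combined with the gradient above, this gives batch LMS in a few lines (a sketch; R and the step count are illustrative, and predict comes from the earlier sketch):

def lms_train(D, n, R=0.05, steps=200):
    """Batch LMS: repeat  w_i <- w_i + R · Σ_{d∈D} (t_d − o_d) · x_id."""
    w = [0.0] * n
    for _ in range(steps):
        errs = [(x, t - predict(w, x)) for x, t in D]   # (x_d, t_d − o_d)
        w = [wi + R * sum(e * x[i] for x, e in errs)
             for i, wi in enumerate(w)]
    return w

Note the sign: moving against the gradient, −∇E(w), turns the minus in (t_d − o_d)(−x_id) into the plus in the update rule. For instance, lms_train([((1, 0), 1.0), ((0, 1), -1.0)], n=2) should return weights close to (1, −1).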
Winnow/Perceptron
A multiplicative/additive update algorithm with some sparsity properties in the function space (a large number of irrelevant attributes) or in the feature space (sparse examples).
Logistic Regression, SVM… many other algorithms.
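The flavor of the two update styles, as a sketch (these are the standard textbook forms, not code from the lecture):

def perceptron_update(w, x, y, eta=1.0):
    """Additive update: on a mistake, w <- w + η·y·x  (labels y in {−1, +1})."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

def winnow_update(w, x, y, alpha=2.0):
    """Multiplicative update: on a mistake, promote (y = 1) or demote (y = 0)
    exactly the weights of the active features x_i = 1."""
    factor = alpha if y == 1 else 1.0 / alpha
    return [wi * factor if xi else wi for wi, xi in zip(w, x)]

Multiplicative updates converge quickly when only a few attributes are relevant; additive updates shine when the examples themselves are sparse, matching the sparsity remark above.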