CS 446: Machine Learning
Dan Roth
University of Illinois, Urbana-Champaign
[email protected]
http://L2R.cs.uiuc.edu/~danr
3322 SC
x ∈ X  →  System: y = f(x)  →  y ∈ Y
An item x drawn from an input space X; an item y drawn from an output space Y.
x ∈ X  →  Learned Model: y = g(x)  →  y ∈ Y
An item x drawn from an instance space X; an item y drawn from a label space Y.
Supervised learning: Training
Labeled training data:
D_train = (x1, y1), (x2, y2), …, (xN, yN)
Learning Algorithm: D_train → learned model g(x)
Give the learner the examples in D_train; the learner returns a model g(x).
Supervised learning: Testing
Labeled test data:
D_test = (x’1, y’1), (x’2, y’2), …, (x’M, y’M)
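In code, this training/testing protocol is just an interface (a minimal sketch; the trivial majority-class learner and the tiny data set are made up for illustration):

from collections import Counter

def learn_majority(D_train):
    """A trivial learner: the returned model g(x) ignores x and
    predicts the most common training label."""
    most_common = Counter(y for _, y in D_train).most_common(1)[0][0]
    return lambda x: most_common

def evaluate(g, D_test):
    """Accuracy of the learned model g on held-out labeled test data."""
    return sum(1 for x, y in D_test if g(x) == y) / len(D_test)

D_train = [((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]
D_test = [((1, 1), 1), ((0, 0), 0)]
g = learn_majority(D_train)   # the learner returns a model g(x)
print(evaluate(g, D_test))    # 0.5

Any real learner fits the same interface: consume D_train, return a model g, and measure success on D_test.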
Representation
What functions should we learn (hypothesis spaces)?
How do we map raw input to an instance space?
Is there a rigorous way to find these? Any general approach?
Algorithms
What are good algorithms?
How do we define success?
Generalization vs. overfitting
The computational problem
Using supervised learning
What is our instance space?
Gloss: What kind of features are we using?
What is our label space?
Gloss: What kind of learning task are we dealing with?
What is our hypothesis space?
Gloss: What kind of functions (models) are we learning?
What learning algorithm do we use?
Gloss: How do we learn the model from the labeled data?
What is our loss function/evaluation metric?
Gloss: How do we measure success? What drives learning?
(Figure: example points plotted in a two-dimensional instance space with axes x1 and x2.)
Good features are essential
We could be wrong!
Our prior knowledge might be wrong:
y = x4 ∧ one-of(x1, x3) is also consistent with the data.
Our guess of the hypothesis space could be wrong
What function?
What’s best?
A possibility: Define the learning problem to be:
A (linear) function that best separates the data:
y = sgn{wᵀx}
Expressivity
f(x) = sgn{x · w − θ} = sgn{∑_{i=1..n} w_i x_i − θ}
Many functions are linear (probabilistic classifiers as well):
Conjunctions:
y = x1 ∧ x3 ∧ x5
y = sgn{1·x1 + 1·x3 + 1·x5 − 3};  w = (1, 0, 1, 0, 1), θ = 3
At least m of n:
y = at least 2 of {x1, x3, x5}
y = sgn{1·x1 + 1·x3 + 1·x5 − 2};  w = (1, 0, 1, 0, 1), θ = 2
Many functions are not:
Xor: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
But they can be made linear (see the input transformation below).
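A quick check of these claims (a sketch in plain Python; the convention sgn(z) = 1 iff z ≥ 0 is assumed, so that meeting the threshold exactly counts as positive):

from itertools import product

def sgn(z):
    return 1 if z >= 0 else 0   # threshold met counts as positive

def ltu(w, theta):
    """Linear threshold unit: x -> sgn(w·x − θ)."""
    return lambda x: sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)

conj = ltu((1, 0, 1, 0, 1), 3)         # x1 ∧ x3 ∧ x5
at_least_2 = ltu((1, 0, 1, 0, 1), 2)   # at least 2 of {x1, x3, x5}

for x in product((0, 1), repeat=5):
    x1, _, x3, _, x5 = x
    assert conj(x) == (x1 and x3 and x5)
    assert at_least_2(x) == (x1 + x3 + x5 >= 2)

No choice of w and θ passes the same kind of test for Xor, which is what motivates the input transformation below.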
(Figure: the Xor function over x_i ∈ {0,1} plotted in the (x1, x2) plane; no straight line separates the positive from the negative points.)
Input Transformation
New Space: Y = {y1, y2, …} = {x_i, x_i x_j, x_i x_j x_k, …}
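For example, a sketch (the weights are an illustrative choice, found by inspection) showing that the Xor function above becomes a single linear threshold function once the product feature x1·x2 is added to the input:

from itertools import product

def sgn(z):
    return 1 if z >= 0 else 0

def target(x1, x2):
    # y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2), the "Xor" from the Expressivity slide
    return int((x1 and x2) or (not x1 and not x2))

def g(x1, x2):
    phi = (x1, x2, x1 * x2)   # new space: original features plus x1·x2
    w = (-1, -1, 2)           # one workable choice of weights; θ = 0
    return sgn(sum(wi * pi for wi, pi in zip(w, phi)))

for x1, x2 in product((0, 1), repeat=2):
    assert g(x1, x2) == target(x1, x2)

The target function is unchanged; only the representation of the input changed, which is the whole point of the transformation.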
Third Step: How to Learn?
A possibility: Local search
Start with a linear threshold function.
See how well you are doing.
Correct; repeat.
A Learning Algorithm
Define the misclassification error of w:
Err(w) = fraction of examples d ∈ D on which sgn(w · x_d) disagrees with the label y_d
We would like to minimize Err(w) directly, but we cannot do it: it is a discrete count that is hard to optimize. Instead we minimize a smooth error J(w) that we can decrease step by step.
(Figure: the error J(w) as a function of the weights w; successive weight vectors w1, w2, w3, w4 descend toward the minimum.)
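As a sketch, the quantity we would like to minimize but cannot attack directly (assuming labels in {0,1} and the sgn convention used earlier):

def misclassification_error(w, D):
    """Err(w): fraction of examples (x, y) in D with sgn(w·x) != y."""
    def sgn(z):
        return 1 if z >= 0 else 0
    mistakes = sum(1 for x, y in D
                   if sgn(sum(wi * xi for wi, xi in zip(w, x))) != y)
    return mistakes / len(D)

Changing w slightly usually leaves this count unchanged, so it gives no direction to follow; hence the smooth J(w) that comes next.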
LMS: An Optimization Algorithm
Let w^(j) be the current weight vector.
Our prediction on the d-th example x_d of the data is:
o_d = w^(j) · x_d = ∑_i w_i^(j) x_id
Let t_d be the target value for this example (a real value; it represents u · x_d).
The error the current hypothesis makes on the data set is:
J(w) = Err(w^(j)) = ½ ∑_{d∈D} (t_d − o_d)²
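These definitions transcribe directly (a sketch; D is a list of (x_d, t_d) pairs):

def predict(w, x):
    """o_d = w · x_d"""
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_error(w, D):
    """J(w) = ½ Σ_{d∈D} (t_d − o_d)²"""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in D)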
Gradient Descent
To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w:
∇E(w) = [∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_n]
Therefore:
∂E/∂w_i = ∂/∂w_i ½ ∑_{d∈D} (t_d − o_d)²
= ½ ∑_{d∈D} ∂/∂w_i (t_d − o_d)²
= ½ ∑_{d∈D} 2(t_d − o_d) ∂/∂w_i (t_d − w · x_d)
= ∑_{d∈D} (t_d − o_d)(−x_id)
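The last line is all the code needs; the finite-difference check below (reusing predict and squared_error from the previous sketch) confirms the derivation numerically:

def gradient(w, D):
    """∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_id)"""
    return [sum((t - predict(w, x)) * (-x[i]) for x, t in D)
            for i in range(len(w))]

def numeric_gradient(w, D, eps=1e-6):
    """Central-difference estimate of ∂E/∂w_i, for checking the algebra."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((squared_error(wp, D) - squared_error(wm, D)) / (2 * eps))
    return g

On any small data set the two agree up to tiny numerical error.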
Gradient Descent: LMS
Weight update rule:
Δw_i = R ∑_{d∈D} (t_d − o_d) x_id
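Combined with the gradient above, this gives batch LMS in a few lines (a sketch; R and the step count are illustrative, and predict comes from the earlier sketch):

def lms_train(D, n, R=0.05, steps=200):
    """Batch LMS: repeat  w_i <- w_i + R · Σ_{d∈D} (t_d − o_d) · x_id."""
    w = [0.0] * n
    for _ in range(steps):
        errs = [(x, t - predict(w, x)) for x, t in D]   # (x_d, t_d − o_d)
        w = [wi + R * sum(e * x[i] for x, e in errs)
             for i, wi in enumerate(w)]
    return w

Note the sign: moving against the gradient, −∇E(w), turns the minus in (t_d − o_d)(−x_id) into the plus in the update rule. For instance, lms_train([((1, 0), 1.0), ((0, 1), -1.0)], n=2) should return weights close to (1, −1).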
Winnow/Perceptron
A multiplicative/additive update algorithm with some sparsity properties in the function space (a large number of irrelevant attributes) or in the feature space (sparse examples).
Logistic Regression, SVM… many other algorithms.
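The flavor of the two update styles, as a sketch (these are the standard textbook forms, not code from the lecture):

def perceptron_update(w, x, y, eta=1.0):
    """Additive update: on a mistake, w <- w + η·y·x  (labels y in {−1, +1})."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

def winnow_update(w, x, y, alpha=2.0):
    """Multiplicative update: on a mistake, promote (y = 1) or demote (y = 0)
    exactly the weights of the active features x_i = 1."""
    factor = alpha if y == 1 else 1.0 / alpha
    return [wi * factor if xi else wi for wi, xi in zip(w, x)]

Multiplicative updates converge quickly when only a few attributes are relevant; additive updates shine when the examples themselves are sparse, matching the sparsity remark above.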