CS229 Lecture 3 PDF
Chris Ré
April 2, 2023
Supervised Learning and Classification
We assume the noise terms satisfy

    E[ε^{(i)}] = 0  and  E[(ε^{(i)})^2] = σ^2.

Here σ^2 is some measure of how noisy the data are. It turns out this effectively defines the Gaussian or Normal distribution.
Notation for the Gaussian
We write z ∼ N(µ, σ^2) and read these symbols as

    z is distributed as a normal with mean µ and variance σ^2,

or equivalently

    P(z) = 1/(σ√(2π)) · exp(−(z − µ)^2 / (2σ^2)).
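As a quick sanity check on this formula, here is a minimal NumPy sketch (the function name is ours, not from the lecture) that evaluates the density and verifies numerically that it integrates to 1:

```python
import numpy as np

# Density of N(mu, sigma^2) evaluated at z; a sketch, names are ours.
def gaussian_pdf(z, mu=0.0, sigma=1.0):
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The density should integrate to 1 over the real line; the mass outside
# [-10, 10] is negligible for the standard normal, so a grid suffices.
zs = np.linspace(-10.0, 10.0, 100001)
area = np.trapz(gaussian_pdf(zs), zs)
```

At z = µ the density takes its maximum value 1/(σ√(2π)) ≈ 0.399 for σ = 1.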
Notation for Gaussians in our Problem
Recall in our model, y^{(i)} = θ^T x^{(i)} + ε^{(i)} with ε^{(i)} ∼ N(0, σ^2); equivalently,

    P(y^{(i)} | x^{(i)}; θ) = 1/(σ√(2π)) · exp(−(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2)).

- We condition on x^{(i)}.
- In contrast, θ parameterizes or "picks" a distribution.

We use bar (|) versus semicolon (;) notation above to distinguish the two.
(Log) Likelihoods!
Intuition: among many distributions, pick the one that agrees with the data the most (is most "likely").

    L(θ) = p(y | X; θ) = ∏_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)        (iid assumption)

         = ∏_{i=1}^{n} 1/(σ√(2π)) · exp(−(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2))
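The product above makes the connection to least squares concrete: maximizing L(θ) is the same as minimizing the sum of squared residuals. A small NumPy sketch on synthetic data (all names and the data-generating choices below are ours, not the lecture's):

```python
import numpy as np

# Synthetic data from the model y = theta^T x + noise, noise ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n, sigma = 50, 0.5
X = rng.normal(size=(n, 2))
theta_true = np.array([1.0, -2.0])
y = X @ theta_true + sigma * rng.normal(size=n)

def log_likelihood(theta):
    # log L(theta) = -n log(sigma sqrt(2 pi)) - sum_i (y_i - theta^T x_i)^2 / (2 sigma^2)
    resid = y - X @ theta
    return -n * np.log(sigma * np.sqrt(2.0 * np.pi)) - np.sum(resid ** 2) / (2.0 * sigma ** 2)

# The least-squares fit minimizes the sum of squared residuals,
# hence it also maximizes the log-likelihood above.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the log is monotone and the leading constant does not depend on θ, the least-squares solution `theta_ls` attains a log-likelihood at least as large as any other θ, including `theta_true`.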
Logistic Regression: Link Functions
Given a training set {(x^{(i)}, y^{(i)}) for i = 1, . . . , n}, let y^{(i)} ∈ {0, 1}.
We want hθ(x) ∈ [0, 1]. Let's pick a smooth function:

    hθ(x) = g(θ^T x)

Here, g is a link function. There are many. . . but we'll pick one!

    g(z) = 1 / (1 + e^{−z})

We interpret the output as a probability:

    P(y = 1 | x; θ) = hθ(x)
    P(y = 0 | x; θ) = 1 − hθ(x)
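A minimal sketch of this link function in NumPy (the stable two-case form is our implementation choice, not from the lecture — naive `np.exp(-z)` overflows for large negative z):

```python
import numpy as np

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + e^{-z}), computed stably."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as e^z / (1 + e^z) so the exponential stays bounded.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```

Note g(0) = 1/2, g(z) → 1 as z → ∞, and g(z) → 0 as z → −∞, so the output always lands in [0, 1] as required.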
Logistic Regression: Link Functions
Let's write the likelihood function. Recall:

    P(y = 1 | x; θ) = hθ(x)
    P(y = 0 | x; θ) = 1 − hθ(x)

Then,

    L(θ) = P(y | X; θ) = ∏_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)

         = ∏_{i=1}^{n} hθ(x^{(i)})^{y^{(i)}} (1 − hθ(x^{(i)}))^{1−y^{(i)}}        (exponents encode "if-then")

    ℓ(θ) = log L(θ) = ∑_{i=1}^{n} y^{(i)} log hθ(x^{(i)}) + (1 − y^{(i)}) log(1 − hθ(x^{(i)}))
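The log-likelihood ℓ(θ) translates directly into code. A sketch (function names and the toy data are ours), assuming the naive sigmoid is fine for moderate inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# At theta = 0 every prediction is h = 1/2, so ell = n * log(1/2),
# and any theta that pushes h toward the true labels increases ell.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
ll_zero = log_likelihood(np.zeros(2), X, y)
```

The "exponents encode if-then" trick is visible here: when y_i = 1 only the log hθ(x^{(i)}) term survives, and when y_i = 0 only the log(1 − hθ(x^{(i)})) term does.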
Newton's method finds a root of f via the update

    x^{(t+1)} = x^{(t)} − f(x^{(t)}) / f'(x^{(t)})

- It may converge very fast (quadratic local convergence!).
- For the likelihood, i.e., f(θ) = ∇θ ℓ(θ), we need to generalize to a vector-valued function, which gives:

    θ^{(t+1)} = θ^{(t)} − H(θ^{(t)})^{−1} ∇θ ℓ(θ^{(t)}),

in which H_{i,j}(θ) = ∂²ℓ(θ) / (∂θ_i ∂θ_j) is the Hessian.
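The scalar update is a few lines of code. A sketch of the root-finding form (the example function x² − 2 is ours, chosen so the answer, √2, is known):

```python
def newton(f, fprime, x0, steps=10):
    """Scalar Newton's method: x_{t+1} = x_t - f(x_t) / f'(x_t)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Solving x^2 - 2 = 0 starting from x0 = 1 recovers sqrt(2);
# quadratic convergence roughly doubles the correct digits each step.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0, steps=8)
```

Applying the same idea to f(θ) = ∇θ ℓ(θ) (a root of the gradient is a stationary point of ℓ) yields the Hessian-based update above, with the division replaced by multiplication by H(θ)^{−1}.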
Optimization Method Summary