CS229 Lecture 3 PDF
Chris Ré
April 2, 2023
Supervised Learning and Classification
We assume the noise terms satisfy

    E[ε^{(i)}] = 0  and  E[(ε^{(i)})^2] = σ^2.

Here σ^2 is some measure of how noisy the data are. It turns out this effectively defines the Gaussian or Normal distribution.
Notation for the Gaussian
We write z ∼ N(µ, σ^2) and read these symbols as

    z is distributed as a normal with mean µ and variance σ^2,

or equivalently

    P(z) = 1/(σ√(2π)) · exp(−(z − µ)^2 / (2σ^2)).
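As a quick sanity check on this formula, here is a minimal NumPy sketch (the function name is ours, not from the lecture) that evaluates the density and verifies numerically that it integrates to 1:

```python
import numpy as np

# Density of N(mu, sigma^2) evaluated at z; a sketch, names are ours.
def gaussian_pdf(z, mu=0.0, sigma=1.0):
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The density should integrate to 1 over the real line; the mass outside
# [-10, 10] is negligible for the standard normal, so a grid suffices.
zs = np.linspace(-10.0, 10.0, 100001)
area = np.trapz(gaussian_pdf(zs), zs)
```

At z = µ the density takes its maximum value 1/(σ√(2π)) ≈ 0.399 for σ = 1.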
Notation for Gaussians in our Problem
Recall in our model, y^{(i)} = θ^T x^{(i)} + ε^{(i)} with ε^{(i)} ∼ N(0, σ^2); equivalently,

    P(y^{(i)} | x^{(i)}; θ) = 1/(σ√(2π)) · exp(−(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2)).

- We condition on x^{(i)}.
- In contrast, θ parameterizes or "picks" a distribution.

We use bar (|) versus semicolon (;) notation above to distinguish the two.
(Log) Likelihoods!
Intuition: among many distributions, pick the one that agrees with the data the most (is most "likely").

    L(θ) = p(y | X; θ) = ∏_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)        (iid assumption)

         = ∏_{i=1}^{n} 1/(σ√(2π)) · exp(−(y^{(i)} − θ^T x^{(i)})^2 / (2σ^2))
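The product above makes the connection to least squares concrete: maximizing L(θ) is the same as minimizing the sum of squared residuals. A small NumPy sketch on synthetic data (all names and the data-generating choices below are ours, not the lecture's):

```python
import numpy as np

# Synthetic data from the model y = theta^T x + noise, noise ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n, sigma = 50, 0.5
X = rng.normal(size=(n, 2))
theta_true = np.array([1.0, -2.0])
y = X @ theta_true + sigma * rng.normal(size=n)

def log_likelihood(theta):
    # log L(theta) = -n log(sigma sqrt(2 pi)) - sum_i (y_i - theta^T x_i)^2 / (2 sigma^2)
    resid = y - X @ theta
    return -n * np.log(sigma * np.sqrt(2.0 * np.pi)) - np.sum(resid ** 2) / (2.0 * sigma ** 2)

# The least-squares fit minimizes the sum of squared residuals,
# hence it also maximizes the log-likelihood above.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the log is monotone and the leading constant does not depend on θ, the least-squares solution `theta_ls` attains a log-likelihood at least as large as any other θ, including `theta_true`.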
Logistic Regression: Link Functions
Given a training set {(x^{(i)}, y^{(i)}) for i = 1, . . . , n}, let y^{(i)} ∈ {0, 1}.
We want hθ(x) ∈ [0, 1]. Let's pick a smooth function:

    hθ(x) = g(θ^T x)

Here, g is a link function. There are many. . . but we'll pick one!

    g(z) = 1 / (1 + e^{−z})

We interpret the output as a probability:

    P(y = 1 | x; θ) = hθ(x)
    P(y = 0 | x; θ) = 1 − hθ(x)
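A minimal sketch of this link function in NumPy (the stable two-case form is our implementation choice, not from the lecture — naive `np.exp(-z)` overflows for large negative z):

```python
import numpy as np

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + e^{-z}), computed stably."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as e^z / (1 + e^z) so the exponential stays bounded.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```

Note g(0) = 1/2, g(z) → 1 as z → ∞, and g(z) → 0 as z → −∞, so the output always lands in [0, 1] as required.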
Logistic Regression: Link Functions
Let's write the likelihood function. Recall:

    P(y = 1 | x; θ) = hθ(x)
    P(y = 0 | x; θ) = 1 − hθ(x)

Then,

    L(θ) = P(y | X; θ) = ∏_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)

         = ∏_{i=1}^{n} hθ(x^{(i)})^{y^{(i)}} (1 − hθ(x^{(i)}))^{1−y^{(i)}}        (exponents encode "if-then")

    ℓ(θ) = log L(θ) = ∑_{i=1}^{n} y^{(i)} log hθ(x^{(i)}) + (1 − y^{(i)}) log(1 − hθ(x^{(i)}))
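The log-likelihood ℓ(θ) translates directly into code. A sketch (function names and the toy data are ours), assuming the naive sigmoid is fine for moderate inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# At theta = 0 every prediction is h = 1/2, so ell = n * log(1/2),
# and any theta that pushes h toward the true labels increases ell.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
ll_zero = log_likelihood(np.zeros(2), X, y)
```

The "exponents encode if-then" trick is visible here: when y_i = 1 only the log hθ(x^{(i)}) term survives, and when y_i = 0 only the log(1 − hθ(x^{(i)})) term does.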
Newton's method finds a root of f via the update

    x^{(t+1)} = x^{(t)} − f(x^{(t)}) / f'(x^{(t)})

- It may converge very fast (quadratic local convergence!).
- For the likelihood, i.e., f(θ) = ∇θ ℓ(θ), we need to generalize to a vector-valued function, which gives:

    θ^{(t+1)} = θ^{(t)} − H(θ^{(t)})^{−1} ∇θ ℓ(θ^{(t)}),

in which H_{i,j}(θ) = ∂²ℓ(θ) / (∂θ_i ∂θ_j) is the Hessian.
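The scalar update is a few lines of code. A sketch of the root-finding form (the example function x² − 2 is ours, chosen so the answer, √2, is known):

```python
def newton(f, fprime, x0, steps=10):
    """Scalar Newton's method: x_{t+1} = x_t - f(x_t) / f'(x_t)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Solving x^2 - 2 = 0 starting from x0 = 1 recovers sqrt(2);
# quadratic convergence roughly doubles the correct digits each step.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0, steps=8)
```

Applying the same idea to f(θ) = ∇θ ℓ(θ) (a root of the gradient is a stationary point of ℓ) yields the Hessian-based update above, with the division replaced by multiplication by H(θ)^{−1}.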
Optimization Method Summary