lec22-ML III
[These slides were created by Dan Klein, Pieter Abbeel, Anca Dragan, Sergey Levine. All CS188 materials are at http://ai.berkeley.edu.]
Last Time: Perceptron
§ If the activation is:
§ Positive, output +1
§ Negative, output -1

[Figure: perceptron diagram. Features f1, f2, f3 are weighted by w1, w2, w3, summed (Σ), and the sum is tested: >0?]
Binary Decision Rule
§ In the space of feature vectors
§ Examples are points
§ Any weight vector is a hyperplane
§ One side corresponds to Y=+1
§ Other corresponds to Y=-1

Example weight vector:
BIAS : -3
free : 4
money : 2
...

[Figure: the (free, money) feature plane with the decision boundary; points on one side are +1 = SPAM, points on the other are -1 = HAM]
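To make this concrete, here is a minimal sketch in plain Python (the example emails and the feature extractor are hypothetical illustrations) of how the slide's weight vector scores an input:

```python
# Hedged sketch: scoring the slide's weight vector on a hypothetical email.
weights = {"BIAS": -3, "free": 4, "money": 2}

def features(words):
    """Feature vector: an always-on BIAS feature plus word counts."""
    f = {"BIAS": 1}
    for w in words:
        f[w] = f.get(w, 0) + 1
    return f

def classify(words):
    f = features(words)
    # Activation: dot product of weights and features (unknown words score 0).
    activation = sum(weights.get(k, 0) * v for k, v in f.items())
    return "+1 = SPAM" if activation > 0 else "-1 = HAM"

print(classify(["free", "money"]))   # -3 + 4 + 2 = 3 > 0, so SPAM
print(classify(["hello", "there"]))  # only BIAS fires: -3 < 0, so HAM
```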
Learning: Binary Perceptron
§ Start with weights w = 0
§ For each training instance f(x), y*:
§ Classify with current weights
§ If correct: no change!
§ If wrong: lower score of wrong answer, raise score of right answer (see the code sketch below)
Score of wrong class before update: $w_y \cdot f$
Score of wrong class after update: $(w_y - f) \cdot f = w_y \cdot f - f \cdot f$
Since $f \cdot f > 0$, the update strictly lowers the score of the wrong class.
$\#\text{ of mistakes during training} < \dfrac{\#\text{ of features}}{(\text{width of margin})^2}$
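Here is a minimal sketch of that update loop in Python with numpy (the toy training set is hypothetical; labels are y* in {+1, -1} and the first feature is an always-on bias):

```python
import numpy as np

# Toy, linearly separable data: (f(x), y*) pairs. Purely illustrative.
data = [
    (np.array([1.0, 1.0, 1.0]), +1),
    (np.array([1.0, 2.0, 1.0]), +1),
    (np.array([1.0, 0.0, 0.0]), -1),
    (np.array([1.0, 0.5, 0.0]), -1),
]

w = np.zeros(3)                                # start with weights w = 0
for _ in range(20):                            # a few passes over the data
    for f, y_star in data:
        y_pred = +1 if w.dot(f) > 0 else -1    # classify with current weights
        if y_pred != y_star:                   # if wrong...
            w = w + y_star * f                 # ...raise right score, lower wrong score
print(w)
```

On separable data this loop stops changing w once everything is classified correctly, consistent with the mistake bound above.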
Problems with the Perceptron
[Figure: a weight vector w and a point with score z < 0]
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) very positive → want probability of + going to 1
§ If z = w · f(x) very negative → want probability of + going to 0
§ Sigmoid function:

$\phi(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{e^z}{e^z + 1}$

$P(y = +1 \mid x; w) = \dfrac{1}{1 + e^{-w \cdot f(x)}}$

$P(y = -1 \mid x; w) = 1 - \dfrac{1}{1 + e^{-w \cdot f(x)}}$

= Logistic Regression
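A minimal numpy sketch of these formulas (the weight vector and feature values are hypothetical):

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^{-z}), squashing any score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and features, just to exercise the formulas.
w = np.array([-3.0, 4.0, 2.0])
f_x = np.array([1.0, 1.0, 1.0])

z = w.dot(f_x)          # perceptron score z = w · f(x)
p_pos = sigmoid(z)      # P(y = +1 | x; w)
p_neg = 1.0 - p_pos     # P(y = -1 | x; w)
print(z, p_pos, p_neg)  # 3.0, ~0.953, ~0.047
```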
A 1D Example

[Figure: P(red | x) plotted against f(x): a sigmoid curve rising from 0 to 1]

$P(\text{red} \mid x; w) = \phi(w \cdot f(x)) = \dfrac{1}{1 + e^{-w \cdot f(x)}}$
A 1D Example: varying w

[Figure: P(red | x) against f(x) for w = 1, w = 10, and w = ∞; larger w makes the sigmoid steeper, and w = ∞ gives a step function]

$P(\text{red} \mid x; w) = \phi(w \cdot f(x)) = \dfrac{1}{1 + e^{-w \cdot f(x)}}$
Best w?
§ Recall maximum likelihood estimation: Choose the w value that maximizes the probability of the observed (training) data
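The slides leave the maximization itself implicit; as a hedged sketch, one standard way to carry it out for binary logistic regression is gradient ascent on the log-likelihood (the 1D data set below is hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1D training data: feature values with labels y* in {+1, -1}.
F = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([-1, -1, +1, +1])

w = np.zeros(1)
lr = 0.5
for _ in range(200):
    # ll(w) = sum_i log sigmoid(y_i * w·f_i), so its gradient is
    # sum_i (1 - sigmoid(y_i * w·f_i)) * y_i * f_i.
    p = sigmoid(Y * F.dot(w))
    w = w + lr * (((1.0 - p) * Y) @ F)   # ascend the log-likelihood
print(w)   # keeps growing on separable data, echoing the w = ∞ slide
```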
Separable Case: Deterministic Decision – Many Options
Separable Case: Probabilistic Decision – Clear Preference

[Figure: two candidate separators, each drawn with probability contours 0.7 | 0.3, 0.5 | 0.5, and 0.3 | 0.7; unlike the deterministic case, the likelihood now clearly prefers one separator]
Multiclass Logistic Regression
§ Recall Perceptron: a weight vector for each class, $w_y$, scoring each class as $w_y \cdot f(x)$
§ Turn the scores into probabilities with the softmax:

$P(y \mid x; w) = \dfrac{e^{w_y \cdot f(x)}}{\sum_{y'} e^{w_{y'} \cdot f(x)}}$
[Figure: features f1, f2, f3, …, fK feed linear score units z1, z2, z3, followed by a softmax layer]

$z_y = w_y \cdot f = \sum_i w_{i,y} \cdot f_i$
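A minimal numpy sketch of this score-then-softmax computation (the weight matrix and feature vector are hypothetical):

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize: positive outputs that sum to 1."""
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical weights for K = 4 features and 3 classes: W[i, y] = w_{i,y}.
W = np.array([[ 0.5, -0.2,  0.1],
              [ 1.0,  0.3, -0.4],
              [-0.3,  0.8,  0.2],
              [ 0.2, -0.5,  0.9]])
f = np.array([1.0, 2.0, 0.5, 1.5])

z = f @ W                       # z_y = w_y · f = sum_i w_{i,y} f_i
print(z, softmax(z))            # class scores and class probabilities
```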
Logistic Regression for 3-way classification

[Figure: the same network with the weights into z1 labeled: $w_{1,1}, w_{2,1}, w_{3,1}, w_{4,1}$]

$z_y = w_y \cdot f = \sum_i w_{i,y} \cdot f_i$
Logistic Regression for 3-way classification

[Figure: raw inputs x1, x2, x3, …, xd pass through feature extraction code to produce f1, …, fK, which feed the scores z1, z2, z3 and the softmax]
Deep Neural Network for 3-way classification

[Figure: raw inputs x1, x2, x3, …, xd feed hidden Layer 1, Layer 2, …, Layer L, whose outputs feed the scores z1, z2, z3 and the softmax; learned layers take the place of the hand-written feature extraction code]
Deep Neural Network for 3-way classification

[Figure: the same network with hidden unit 1 in layer 1, $h_1^{(1)}$, highlighted]
Deep Neural Network for 3-way classification

[Figure: hidden unit 1 in layer 1 receives the inputs x1, …, xd through weights $w_{1,1}^{(1)}, w_{2,1}^{(1)}, \dots, w_{d,1}^{(1)}$]

$h_1^{(1)} = \phi\big(w_1^{(1)} \cdot x\big) = \phi\Big(\sum_i w_{i,1}^{(1)} \cdot x_i\Big)$

$\phi$ = activation function
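A minimal numpy sketch of that hidden-unit computation, vectorized over a whole layer (the weights, the input, and the choice of sigmoid for φ are hypothetical):

```python
import numpy as np

def phi(z):
    """Activation function; sigmoid here, but tanh or ReLU are also common."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer-1 weights, W1[i, j] = w_{i,j}^{(1)}: 3 inputs, 2 hidden units.
W1 = np.array([[ 0.5, -1.0],
               [ 0.3,  0.8],
               [-0.7,  0.2]])
x = np.array([1.0, 2.0, 0.5])

h1 = phi(x @ W1)    # h_j^{(1)} = phi(sum_i w_{i,j}^{(1)} x_i) for all j at once
print(h1)           # h1[0] is the slide's h_1^{(1)}
```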
Deep Neural Network for 3-way classification

[Figure: all of layer 1: hidden units $h_1^{(1)}, h_2^{(1)}, h_3^{(1)}, \dots, h_K^{(1)}$ computed from the inputs]
Deep Neural Network for 3-way classification

[Figure: hidden unit 1 in layer 2, $h_1^{(2)}$, computed from the layer-1 activations]
Deep Neural Network for 3-way classification

[Figure: hidden unit 1 in layer 2 receives the layer-1 activations through weights $w_{1,1}^{(2)}, w_{2,1}^{(2)}, \dots, w_{K,1}^{(2)}$]

By the same pattern as layer 1: $h_1^{(2)} = \phi\Big(\sum_i w_{i,1}^{(2)} \cdot h_i^{(1)}\Big)$
[Figure: the full network: layer-1 activations $h_1^{(1)}, \dots, h_K^{(1)}$ feed layer-2 activations $h_1^{(2)}, \dots, h_K^{(2)}$, which feed the scores z1, z2, z3 and the softmax]
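Putting the layers together, a minimal numpy sketch of the full forward pass (all sizes, the random weights, and the tanh activation are hypothetical choices):

```python
import numpy as np

def phi(z):
    return np.tanh(z)                         # hypothetical activation choice

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k, n_classes = 4, 5, 3                     # hypothetical layer sizes
W1 = rng.normal(size=(d, k))                  # layer-1 weights w^{(1)}
W2 = rng.normal(size=(k, k))                  # layer-2 weights w^{(2)}
W_out = rng.normal(size=(k, n_classes))       # weights into the scores z

def forward(x):
    h1 = phi(x @ W1)        # layer-1 activations h^{(1)}
    h2 = phi(h1 @ W2)       # layer-2 activations h^{(2)}
    z = h2 @ W_out          # class scores z1, z2, z3
    return softmax(z)       # P(y | x; w) over the 3 classes

print(forward(rng.normal(size=d)))
```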
Deep Neural Network for 3-way classification
§ As before, choose the weights w that maximize the log-likelihood of the training data:

$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
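As a closing sketch (with a hypothetical model and data), the quantity being maximized is just the sum of log-probabilities the model assigns to the true labels; any of the networks above could stand in for the scoring function:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A linear scorer keeps the sketch short; a deep network would only change
# how the scores are computed from x. All data here is hypothetical.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))          # 4 features, 3 classes
X = rng.normal(size=(6, 4))          # six examples x^(i)
Y = rng.integers(0, 3, size=6)       # their labels y^(i)

def log_likelihood(W):
    """ll(w) = sum_i log P(y^(i) | x^(i); w)."""
    return sum(np.log(softmax(x @ W)[y]) for x, y in zip(X, Y))

print(log_likelihood(W))             # training adjusts W to push this up
```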