
lec22-ML III

The document includes announcements for a project due date and a guest lecture on large model development. It covers topics related to logistic regression and neural networks, detailing the perceptron model, learning rules for binary and multiclass perceptrons, and introduces logistic regression as a probabilistic approach. Additionally, it discusses the properties, challenges, and improvements of perceptrons, as well as the application of deep neural networks for classification tasks.


Announcements

§ Project 4 due today (Thursday, Nov 14) at 11:59pm PT

§ Catherine Olsson (Anthropic) giving guest lecture next Tuesday
  (Nov 19) on large model development and interpretability
  § Come in person and ask questions!
CS 188: Artificial Intelligence
Logistic Regression and Neural Networks

[These slides were created by Dan Klein, Pieter Abbeel, Anca Dragan, Sergey Levine. All CS188 materials are at http://ai.berkeley.edu.]
Last Time: Perceptron

§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation:  activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
§ If the activation is:
  § Positive, output +1
  § Negative, output -1

[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]
Last Time: Perceptron
§ Originated from computationally modeling neurons
[Image: biological neuron alongside the perceptron diagram above]
Binary Decision Rule
§ In the space of feature vectors
  § Examples are points
  § Any weight vector is a hyperplane
  § One side corresponds to Y = +1
  § Other corresponds to Y = -1

Example weights:  BIAS : -3,  free : 4,  money : 2,  ...
[Plot: axes "free" and "money"; the weight vector's hyperplane separates +1 = SPAM from -1 = HAM]
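As a quick concrete check of this decision rule, here is a minimal Python sketch using the example weights from the slide (BIAS: -3, free: 4, money: 2); the feature counts for the two emails are made up for illustration.

```python
# Binary perceptron decision rule with the slide's example weights.
weights = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

def classify(features):
    """Return +1 (SPAM) if the activation w . f(x) is positive, else -1 (HAM)."""
    activation = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return +1 if activation > 0 else -1

print(classify({"BIAS": 1, "free": 2, "money": 2}))  # -3 + 4*2 + 2*2 = 9 > 0  -> +1 (SPAM)
print(classify({"BIAS": 1}))                         # -3 < 0                  -> -1 (HAM)
```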
Learning: Binary Perceptron
§ Start with weights w = 0
§ For each training instance f(x), y*:
  § Classify with current weights:  y = +1 if w · f(x) > 0, else y = -1
  § If correct (i.e., y = y*): no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature
    vector. Subtract if y* is -1:  w ← w + y* · f(x)
Learning: Binary Perceptron
§ Start with weights w = 0
§ For each training instance f(x), y*:
  § Classify with current weights
  § If correct (i.e., y = y*): no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature
    vector. Subtract if y* is -1.

  Before update:  score = w · f
  After update:   score = (w + y* · f) · f  =  w · f  +  y* · (f · f)

Since f · f ≥ 0, the update always moves the score in the direction of the correct label y*.
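A minimal sketch of the binary perceptron learning rule above, in Python; the function names and the tiny [BIAS, free, money] data set are illustrative, not from the slides.

```python
def predict(w, f):
    """Classify with current weights: +1 if the activation is positive, else -1."""
    activation = sum(wi * fi for wi, fi in zip(w, f))
    return +1 if activation > 0 else -1

def train_perceptron(data, num_features, passes=10):
    """data: list of (f(x), y*) pairs with y* in {+1, -1}."""
    w = [0.0] * num_features                   # start with weights w = 0
    for _ in range(passes):
        for f, y_star in data:
            if predict(w, f) != y_star:        # if wrong: w <- w + y* . f(x)
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
    return w

data = [([1, 2, 1], +1), ([1, 0, 0], -1)]      # [BIAS, free, money] toy emails
print(train_perceptron(data, num_features=3))  # weights that separate the toy data
```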
Multiclass Decision Rule

§ If we have multiple classes:
  § A weight vector for each class:  w_y
  § Score (activation) of a class y:  w_y · f(x)
  § Prediction: highest score wins,  y = argmax_y  w_y · f(x)

Binary = multiclass where the negative class has weight zero


Learning: Multiclass Perceptron

§ Start with all weights = 0

§ Pick up training examples f(x), y* one by one
  § Predict with current weights:  y = argmax_y  w_y · f(x)
  § If correct: no change!
  § If wrong: lower the score of the wrong answer, raise the score of the right answer:
      w_y  ← w_y  − f(x)     (wrong predicted class)
      w_y* ← w_y* + f(x)     (true class)
Learning: Multiclass Perceptron
§ Start with all weights = 0
§ Pick up training examples f(x), y* one by one
  § Predict with current weights
  § If correct: no change!
  § If wrong: lower score of wrong answer, raise score of right answer

  Score of wrong class y:   before:  w_y · f      after:  (w_y − f) · f   =  w_y · f   −  f · f
  Score of right class y*:  before:  w_y* · f     after:  (w_y* + f) · f  =  w_y* · f  +  f · f
Example: Multiclass Perceptron
Features: [BIAS, win, game, vote, the]

  Iteration 0:  x: "win the vote"       f(x): [1 1 0 1 1]   y*: politics
  Iteration 1:  x: "win the election"   f(x): [1 1 0 0 1]   y*: politics
  Iteration 2:  x: "win the game"       f(x): [1 1 1 0 1]   y*: sports

Weight vectors (initial value, then after iterations 0, 1, 2):

           w_sports           w_politics         w_(other class)
  BIAS     1   0   0   1      0   1   1   0      0   0   0   0
  win      0  -1  -1   0      0   1   1   0      0   0   0   0
  game     0   0   0   1      0   0   0  -1      0   0   0   0
  vote     0  -1  -1  -1      0   1   1   1      0   0   0   0
  the      0  -1  -1   0      0   1   1   0      0   0   0   0

  w · f(x) at iterations 0, 1, 2 (using the weights before each update):
           1  -2  -2           0   3   3          0   0   0
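The trace above can be reproduced with a short multiclass perceptron sketch; the third class is never named on the slide, so "other" below is just a placeholder.

```python
# Multiclass perceptron run on the slide's three examples
# (features: [BIAS, win, game, vote, the]; the sports BIAS weight starts at 1).
examples = [
    ([1, 1, 0, 1, 1], "politics"),   # "win the vote"
    ([1, 1, 0, 0, 1], "politics"),   # "win the election"
    ([1, 1, 1, 0, 1], "sports"),     # "win the game"
]
weights = {
    "sports":   [1, 0, 0, 0, 0],
    "politics": [0, 0, 0, 0, 0],
    "other":    [0, 0, 0, 0, 0],
}

def score(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

for f, y_star in examples:
    y_pred = max(weights, key=lambda y: score(weights[y], f))   # highest score wins
    if y_pred != y_star:
        weights[y_pred] = [wi - fi for wi, fi in zip(weights[y_pred], f)]  # lower wrong class
        weights[y_star] = [wi + fi for wi, fi in zip(weights[y_star], f)]  # raise right class

print(weights)  # matches the last column of the table: sports [1,0,1,-1,0], politics [0,0,-1,1,0]
```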
Properties of Perceptrons
§ Separability: true if some parameters get the training set perfectly correct

§ Convergence: if the training set is separable, the perceptron will eventually converge (binary case)

§ Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability:

      # of mistakes during training  <  (# of features) / (width of margin)²

[Figures: a separable training set and a non-separable training set]
Problems with the Perceptron

§ Noise: if the data isn't separable, weights might thrash
  § Averaging weight vectors over time can help (averaged perceptron); see the sketch below

§ Mediocre generalization: finds a "barely" separating solution

§ Overtraining: test / held-out accuracy usually rises, then falls
  § Overtraining is a kind of overfitting
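As a hedged sketch of the averaged perceptron mentioned above: train exactly like the binary perceptron, but return the average of all intermediate weight vectors rather than the final one. The function name and toy data are illustrative, not from the slides.

```python
def train_averaged_perceptron(data, num_features, passes=10):
    """data: list of (f(x), y*) pairs with y* in {+1, -1}."""
    w = [0.0] * num_features
    total = [0.0] * num_features              # running sum of weight vectors
    count = 0
    for _ in range(passes):
        for f, y_star in data:
            activation = sum(wi * fi for wi, fi in zip(w, f))
            if (+1 if activation > 0 else -1) != y_star:
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
            total = [ti + wi for ti, wi in zip(total, w)]
            count += 1
    return [ti / count for ti in total]       # averaging dampens the thrashing

data = [([1, 2, 1], +1), ([1, 0, 0], -1)]
print(train_averaged_perceptron(data, num_features=3))
```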
Improving the Perceptron
Non-Separable Case: Deterministic Decision
Even the best linear boundary makes at least one mistake
Non-Separable Case: Probabilistic Decision
[Figure: the same non-separable data with probability bands 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, 0.1 | 0.9 instead of a hard boundary]
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → want probability of + going to 1
§ If z = w · f(x) is very negative → want probability of + going to 0

[Diagram: weight vector w and the regions z > 0, z = 0, z < 0]
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → want probability of + going to 1
§ If z = w · f(x) is very negative → want probability of + going to 0

§ Sigmoid function:

    φ(z) = 1 / (1 + e^(−z)) = e^z / (e^z + 1)
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → want probability of + going to 1
§ If z = w · f(x) is very negative → want probability of + going to 0

§ Sigmoid function:  φ(z) = 1 / (1 + e^(−z))

    P(y = +1 | x; w) = 1 / (1 + e^(−w · f(x)))
    P(y = −1 | x; w) = 1 − 1 / (1 + e^(−w · f(x)))

= Logistic Regression
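A minimal Python sketch of this probabilistic decision; the numerically stable branch in sigmoid() and the example weights are assumptions for illustration.

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (ez + 1.0)

def prob_positive(w, f):
    """P(y = +1 | x; w) = phi(w . f(x))."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    return sigmoid(z)

w = [-3.0, 4.0, 2.0]               # [BIAS, free, money] example weights
f = [1.0, 2.0, 2.0]                # a "free money"-heavy email: z = 9
print(prob_positive(w, f))         # close to 1: very likely spam
print(1.0 - prob_positive(w, f))   # P(y = -1 | x; w)
```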
A 1D Example
P(red | x)

[Plot: P(red | x) as a function of f(x), ranging from "definitely blue" through "not sure" to "definitely red"]

    P(red | x; w) = φ( w · f(x) ) = 1 / (1 + e^(−w · f(x)))
A 1D Example: varying w
P(red | x)

[Plot: sigmoid curves of P(red | x) vs f(x) for w = 1, w = 10, and w = ∞; larger w gives a sharper transition]

    P(red | x; w) = φ( w · f(x) ) = 1 / (1 + e^(−w · f(x)))
Best w?
§ Recall maximum likelihood estimation: Choose the w value that
maximizes the probability of the observed (training) data
Separable Case: Deterministic Decision – Many Options
Separable Case: Probabilistic Decision – Clear Preference

[Figures: for separable data, many boundaries are equally good under the deterministic decision, but the probabilistic decision (bands 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7) clearly prefers one]
Multiclass Logistic Regression
Multiclass Logistic Regression
§ Recall Perceptron:
  § A weight vector for each class:  w_y
  § Score (activation) of a class y:  z_y = w_y · f(x)
  § Prediction: highest score wins

§ How to make the scores into probabilities?

    z_1, z_2, z_3  →  e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3))

    original activations                              softmax activations

§ In general:  softmax(z_1, ..., z_k) = [ e^(z_1) / Σ_i e^(z_i),  ...,  e^(z_k) / Σ_i e^(z_i) ]
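A minimal softmax sketch in Python; subtracting the maximum activation first is a standard numerical-stability trick and an addition here, not something from the slide.

```python
import math

def softmax(zs):
    """Turn activations z_1..z_k into probabilities that sum to 1."""
    m = max(zs)                              # shifting by a constant leaves the
    exps = [math.exp(z - m) for z in zs]     # ratios e^(z_i) / sum_j e^(z_j) unchanged
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, -2.0, -2.0]))        # e.g. scores from the perceptron example
print(sum(softmax([1.0, -2.0, -2.0])))   # 1.0
```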
Multiclass Logistic Regression
§ Recall Perceptron:
  § A weight vector for each class:  w_y
  § Score (activation) of a class y:  z_y = w_y · f(x)
  § Prediction: highest score wins

§ How to make the scores into probabilities?

    P(y | x; w) = e^(w_y · f(x)) / Σ_y' e^(w_y' · f(x))

= Multi-Class Logistic Regression


Best w?
§ Maximum likelihood estimation:

    max_w  ll(w)  =  max_w  Σ_i  log P(y^(i) | x^(i); w)

    with:  P(y^(i) | x^(i); w) = e^(w_{y^(i)} · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))

= Multi-Class Logistic Regression
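A minimal sketch of evaluating this objective; the toy weights, class names, and data are illustrative assumptions, and a real learner would maximize this quantity over w.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(weights, data):
    """ll(w) = sum_i log P(y_i | x_i; w); weights: dict class -> weight vector."""
    classes = list(weights)
    ll = 0.0
    for f, y_star in data:
        zs = [sum(wi * fi for wi, fi in zip(weights[c], f)) for c in classes]
        probs = softmax(zs)                           # P(y | x; w) for every class
        ll += math.log(probs[classes.index(y_star)])  # log-probability of the true label
    return ll

weights = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "tech": [0.0, 0.0]}
data = [([1.0, 0.0], "sports"), ([0.0, 1.0], "politics")]
print(log_likelihood(weights, data))   # larger (closer to 0) means a better fit
```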


Logistic Regression for 3-way classification

[Diagram: features f_1, f_2, f_3, ..., f_d feed activations z_1, z_2, z_3, which go through a softmax]

    z_y = w_y · f = Σ_i w_{i,y} · f_i
Logistic Regression for 3-way classification

[Diagram: the same network, with the weights w_{1,1}, w_{2,1}, w_{3,1}, ..., w_{d,1} into activation z_1 labeled]

    z_y = w_y · f = Σ_i w_{i,y} · f_i
Logistic Regression for 3-way classification

[Diagram: raw inputs x_1, x_2, x_3, ..., x_d pass through feature extraction code to give f_1, ..., f_d, which feed the activations z_1, z_2, z_3 and the softmax]
Deep Neural Network for 3-way classification
[Diagram: inputs x_1, ..., x_d pass through hidden layers 1, 2, ..., L, then activations z_1, z_2, z_3 and a softmax]
Deep Neural Network for 3-way classification
Hidden unit 1 in layer 1
[Diagram: inputs x_1, ..., x_d connect to hidden unit h_1^(1) in layer 1; the remaining layers, activations z_1, z_2, z_3, and softmax follow]
Deep Neural Network for 3-way classification
Hidden unit 1 in layer 1
[Diagram: weights w_{1,1}^(1), w_{2,1}^(1), ..., w_{d,1}^(1) connect inputs x_1, ..., x_d to hidden unit h_1^(1)]

    h_1^(1) = φ( w_1^(1) · x ) = φ( Σ_i w_{i,1}^(1) · x_i )

    φ = activation function
Deep Neural Network for 3-way classification

[Diagram: inputs x_1, ..., x_d feed every layer-1 hidden unit h_1^(1), h_2^(1), h_3^(1), ...]
Deep Neural Network for 3-way classification
Hidden unit 1 in layer 2
[Diagram: the layer-1 activations h_1^(1), h_2^(1), ... connect to hidden unit h_1^(2) in layer 2]
Deep Neural Network for 3-way classification
Hidden unit 1 in layer 2
[Diagram: weights w_{1,1}^(2), w_{2,1}^(2), ... connect the layer-1 activations to hidden unit h_1^(2)]

    h_1^(2) = φ( w_1^(2) · h^(1) ) = φ( Σ_i w_{i,1}^(2) · h_i^(1) )

    φ = activation function
Deep Neural Network for 3-way classification

[Diagram: the layer-1 activations h^(1) feed every layer-2 hidden unit h_1^(2), h_2^(2), ...]
Deep Neural Network for 3-way classification

[Diagram: inputs x_1, ..., x_d, hidden layers h^(1), h^(2), ..., h^(L), then activations z_1, z_2, z_3 and a softmax]

    h_j^(l) = φ( Σ_i w_{i,j}^(l) · h_i^(l-1) )

  • Neural network with L layers
  • h^(l): activations at layer l
  • w^(l): weights taking activations from layer l-1 to layer l
    φ = activation function
Deep Neural Network for 3-way classification
[Diagram: the same network, with weight matrices W^(1), W^(2), ... between layers]

    h^(l) = φ( h^(l-1) × W^(l) )

    φ = activation function
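A minimal forward-pass sketch of h^(l) = φ(h^(l-1) × W^(l)) for a 3-way classifier, using NumPy; the layer sizes, random weights, and the choice of sigmoid for φ are illustrative assumptions, not values from the slides.

```python
import numpy as np

def phi(z):
    """Activation function, applied elementwise (sigmoid here)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

def forward(x, hidden_Ws, W_out):
    h = x
    for W in hidden_Ws:          # h^(l) = phi(h^(l-1) @ W^(l))
        h = phi(h @ W)
    z = h @ W_out                # final activations z_1, z_2, z_3
    return softmax(z)            # class probabilities

rng = np.random.default_rng(0)
d, hidden, num_classes = 5, 4, 3
hidden_Ws = [rng.normal(size=(d, hidden)), rng.normal(size=(hidden, hidden))]  # W^(1), W^(2)
W_out = rng.normal(size=(hidden, num_classes))

x = rng.normal(size=d)               # one input x_1..x_d
print(forward(x, hidden_Ws, W_out))  # three probabilities summing to 1
```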
Deep Neural Network for 3-way classification
[Diagram: the same deep network with weight matrices W^(1), W^(2), ...]

• Sometimes also called Multi-Layer Perceptron (MLP) or Feed-Forward Network (FFN)
• It is a component of larger Transformer models*

    * "Attention Is All You Need", Vaswani et al., 2017
Common Activation Functions φ

[Figure: common choices of activation function φ; source: MIT 6.S191 introtodeeplearning.com]
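The referenced figure is not reproduced in this text version; as a hedged sketch, these are three activation functions commonly shown in such charts (the exact set on the original slide may differ).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return math.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return max(0.0, z)                  # 0 for negative z, identity otherwise

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```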


Deep Neural Network Training
Training the deep neural network is just like logistic regression:
just w tends to be a much, much larger vector

How do we maximize functions?

    max_w  ll(w)  =  max_w  Σ_i  log P(y^(i) | x^(i); w)

In general, we cannot always take the derivative and set it to 0.
Use numerical optimization!
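One common numerical approach is gradient ascent on ll(w); this sketch does it for binary logistic regression with labels in {0, 1}. The learning rate, step count, and toy data are illustrative assumptions, and the gradient formula is the standard one for this model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def gradient_ascent(data, num_features, learning_rate=0.1, steps=1000):
    """Maximize ll(w) = sum_i log P(y_i | x_i; w) for binary logistic regression."""
    w = [0.0] * num_features
    for _ in range(steps):
        grad = [0.0] * num_features
        for f, y_star in data:                # y_star is 1 for +, 0 for -
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))   # P(y = +1 | x; w)
            for i, fi in enumerate(f):
                grad[i] += (y_star - p) * fi                    # d ll / d w_i
        w = [wi + learning_rate * gi for wi, gi in zip(w, grad)]
    return w

data = [([1.0, 2.0, 1.0], 1), ([1.0, 0.0, 0.0], 0)]   # [BIAS, free, money] toy emails
print(gradient_ascent(data, num_features=3))
```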


Hill Climbing
Recall from CSPs lecture: simple, general idea
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit

What's particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
Next Time: Optimization and more Neural Networks!
Naïve Bayes vs Logistic Regression
                           Naïve Bayes                              Logistic Regression

Model                      Joint over all features and label:       Conditional:
                           P(Y, F_1, F_2, ...)                      P(y | f_1, f_2, ...; w)

Predicted class            Inference in a Bayes net:                Directly output label:
probabilities              P(Y | f) ∝ P(Y) P(f_1 | Y) ...           P(y = +1 | f; w) = 1 / (1 + e^(−w · f))

Features                   Discrete                                 Discrete or Continuous

Parameters                 Entries of probability tables            Weight vector w
                           P(Y) and P(F_i | Y)

Learning                   Counting occurrences of events           Iterative numerical optimization
