01 - Neural Network Basics
“Best” Function f*
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
8 What is the Model?
9 Training Procedure Outline
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
10 Classification Task
◉ Sentiment Analysis
“這規格有誠意!” +
Binary Classification
“太爛了吧~” -
Class A (yes)
input
◉ Speech Phoneme Recognition object
Class B (no)
/h/
◉ Handwritten Recognition Multi-class Classification
Class A
2 input
object Class B
Class C
f (x ) = y f :R →RN M
f (x ) = y f :R →RN M
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
16 A Single Neuron
[Figure: inputs x1, x2, …, xN with weights w1, w2, …, wN, plus a bias b whose input is fixed to 1, feed a single neuron: z = w1·x1 + w2·x2 + … + wN·xN + b, and the output is y = σ(z).]
◉ Activation function: the sigmoid function σ(z) = 1 / (1 + e^(−z))
◉ The bias term is an "always on" feature
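To make the diagram concrete, here is a minimal sketch of a single sigmoid neuron in NumPy (the code and all values are illustrative additions, not part of the original slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def single_neuron(x, w, b):
    """One neuron: weighted sum of the inputs plus an 'always on' bias, then sigmoid."""
    z = np.dot(w, x) + b          # z = w1*x1 + ... + wN*xN + b
    return sigmoid(z)             # y = sigma(z)

# Made-up example values, just to show the computation.
x = np.array([0.5, -1.0, 2.0])    # inputs x1..xN
w = np.array([0.8,  0.2, -0.3])   # weights w1..wN
b = 0.1                           # bias
print(single_neuron(x, w, b))     # a value in (0, 1)
```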
18 Why Bias?
[Figure: the bias b shifts the sigmoid curve σ(z) along the z-axis.]
The bias term gives a class prior.
19 Model Parameters of A Single Neuron
[Figure: the same single neuron; its model parameters are the weights w1, …, wN and the bias b.]
f: R^N → R^M
[Figure: using the neuron as a binary classifier for the digit "2": if y ≥ 0.5 the input is classified as "2", otherwise as not "2".]
[Figure: the inputs x1, …, xN (plus a bias input 1) feed one neuron per class: y1 answers "1" or not, y2 answers "2" or not, y3 answers "3" or not, …; 10 neurons ↔ 10 classes, and the question is which output is the maximum.]
A layer of neurons can handle multiple possible outputs; the predicted result is the class whose neuron gives the maximum output.
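A rough sketch of such a layer, assuming a made-up 784-dimensional input and 10 classes (the sizes and random weights are only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_layer(x, W, b):
    """A layer of neurons: each row of W holds the weights of one neuron."""
    return sigmoid(W @ x + b)          # one output per neuron

rng = np.random.default_rng(0)
n_inputs, n_classes = 784, 10          # e.g., pixels of a digit image, 10 digit classes
W = rng.normal(size=(n_classes, n_inputs)) * 0.01
b = np.zeros(n_classes)

x = rng.random(n_inputs)               # a made-up input "image"
y = neuron_layer(x, W, b)              # 10 outputs, one per class
predicted_class = int(np.argmax(y))    # the class whose neuron fires strongest
print(y, predicted_class)
```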
22 Training Procedure Outline
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
23 A Layer of Neurons – Perceptron
◉ Output units all operate separately – no shared weights
[Figure: inputs x1, x2, …, xN (plus a bias input 1) are each connected to every output unit y1, y2, y3, ….]
A perceptron can represent AND, OR, NOT, etc., but not XOR, because it can only realize a linear separator.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
25 How to Implement XOR?
◉ XOR = AB' + A'B

  A | B | Output
  0 | 0 |   0
  0 | 1 |   1
  1 | 0 |   1
  1 | 1 |   0

[Figure: a two-stage circuit over the inputs A, B and their complements A', B' that produces AB' + A'B; a single linear unit is not enough, so XOR needs a second stage.]
A minimal sketch of the same idea as a two-layer network follows.
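One way to realize XOR with two layers, using hand-picked weights and a hard-threshold activation so that each unit behaves like a logic gate (the particular weights are just one possible choice, not taken from the slides):

```python
import numpy as np

def step(z):
    """Hard-threshold activation, so each unit acts like a logic gate."""
    return (z > 0).astype(float)

def xor(a, b):
    x = np.array([a, b], dtype=float)
    # Hidden layer: h1 = A AND (NOT B) = AB', h2 = (NOT A) AND B = A'B.
    W1 = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])
    b1 = np.array([-0.5, -0.5])
    h = step(W1 @ x + b1)
    # Output layer: OR of the two hidden units -> AB' + A'B.
    W2 = np.array([[1.0, 1.0]])
    b2 = np.array([-0.5])
    return step(W2 @ h + b2)[0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor(a, b)))   # prints 0, 1, 1, 0
```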
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
27 Neural Networks – Multi-Layer Perceptron
[Figure: inputs x1, x2 (plus a bias input 1) feed hidden units a1 = σ(z1) and a2 = σ(z2); the hidden units (plus a bias input 1) feed the output y = σ(z).]
28 Expression of Multi-Layer Perceptron
◉ Continuous function with 2 layers        ◉ Continuous function with 3 layers
[Figure: an input vector x = (x1, …, xN) is mapped through the layers to an output vector y = (y1, …, yM).]
Output of a neuron: a_i^l = the output of neuron i in layer l
The outputs of one layer form a vector a^l
[Figure: layer l−1 with N^(l−1) nodes (a_1^(l−1), a_2^(l−1), …, a_j^(l−1), …) fully connected to layer l with N^l nodes (a_1^l, a_2^l, …, a_i^l, …).]
31 Notation Definition
◉ Weight w_ij^l : connects layer l−1 to layer l, from neuron j (layer l−1) to neuron i (layer l); the weights between two layers form a matrix W^l
◉ Bias b_i^l : the bias of neuron i in layer l; the biases of all neurons in a layer form a vector b^l (fed by a constant input 1)
[Figure: layer l−1 with N^(l−1) nodes fully connected to layer l with N^l nodes.]
33 Notation Definition
◉ a_i^l : the output of a neuron        ◉ w_ij^l : a weight
◉ z_i^l : the input to the activation function of neuron i in layer l; the inputs of one layer form a vector z^l
[Figure: the outputs a^(l−1) = (a_1^(l−1), …, a_j^(l−1), …) of layer l−1 produce the pre-activations z^l = (z_1^l, …, z_i^l, …) of layer l, which in turn give the outputs a^l = (a_1^l, …, a_i^l, …).]
36 Layer Output Relation – from a to z
1 l −1 1
a zl
a1l
… …
1 1
2 l −1
2
a 2 z l
2
a2l
…
…
…
…
j i
a lj−1 z l
i ail
…
…
…
…
a l −1 zl al
Layer l − 1 Layer l
N l −1 nodes N l nodes
37 Layer Output Relation – from z to a
a_i^l = σ(z_i^l)      →      a^l = σ(z^l)
[Figure: each pre-activation z_i^l in layer l is passed through the activation function σ to give the output a_i^l.]
38 Layer Output Relation
z^l = W^l a^(l−1) + b^l
a^l = σ(z^l)
[Figure: layer l−1 (N^(l−1) nodes) connected to layer l (N^l nodes), combining the two relations above.]
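A small sketch of this layer computation with made-up layer sizes; W[i, j] plays the role of w_ij^l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Compute one layer: z^l = W^l a^(l-1) + b^l, then a^l = sigma(z^l)."""
    z = W @ a_prev + b
    a = sigmoid(z)
    return z, a

rng = np.random.default_rng(1)
n_prev, n_curr = 4, 3                      # N^(l-1) = 4 nodes, N^l = 3 nodes (made up)
W = rng.normal(size=(n_curr, n_prev))      # W^l is N^l x N^(l-1); W[i, j] = w_ij^l
b = rng.normal(size=n_curr)                # b^l has one bias per neuron in layer l
a_prev = rng.random(n_prev)                # a^(l-1), outputs of the previous layer

z, a = layer_forward(a_prev, W, b)
print(z.shape, a.shape)                    # (3,) (3,)
```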
39 Neural Network Formulation
◉ Fully connected feedforward network f: R^N → R^M
[Figure: the input vector x = (x1, …, xN) passes through the hidden layers to the output vector y = (y1, …, yM).]
a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
…
y = σ(W^L a^(L−1) + b^L)
40 Neural Network Formulation
◉ Fully connected feedforward network f: R^N → R^M
y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
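Stacking the layer computation gives the whole forward pass; the layer sizes below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Fully connected feedforward network:
    a^1 = sigma(W^1 x + b^1), a^l = sigma(W^l a^(l-1) + b^l), ..., y = a^L."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(2)
layer_sizes = [3, 5, 4, 2]                # N inputs, two hidden layers, M outputs (made up)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.random(layer_sizes[0])
y = forward(x, weights, biases)           # f: R^N -> R^M
print(y)
```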
41 Activation Function
◉ A bounded function
42 Activation Function
◉ Boolean        ◉ Linear        ◉ Non-linear
43 Non-Linear Activation Function
◉ Sigmoid
◉ Tanh
http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
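For reference, a tiny sketch of the two activation functions named above, using their standard definitions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Equivalent to np.tanh(z); written out to match the usual definition.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.linspace(-4, 4, 9)
print(sigmoid(z))   # squashes values to (0, 1)
print(tanh(z))      # squashes values to (-1, 1), zero-centered
```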
45 What does a "Good" Function Mean?
46 Training Procedure Outline
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
48 Function = Model Parameters
◉ Formal definition: the network with parameter set θ = {W^1, b^1, W^2, b^2, …} defines a function f(x; θ), so picking a function means picking a parameter set θ
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
50 Model Parameter Measurement
◉ Define a function to measure the quality of a parameter set θ
○ Evaluate with a loss/cost/error function C(θ) → how bad θ is
○ Best model parameter set: θ* = arg min_θ C(θ)
A "good" function is one whose parameter set has a small loss. Common classification losses (a sketch of both follows):
◉ Hinge loss
◉ Logistic loss
https://en.wikipedia.org/wiki/Loss_functions_for_classification
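A small sketch of these two losses, under the common convention that labels are in {−1, +1} and the model produces a raw score f(x); the scores and labels below are made up:

```python
import numpy as np

def hinge_loss(score, label):
    """Hinge loss for a label in {-1, +1} and a raw score f(x): max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - label * score)

def logistic_loss(score, label):
    """Logistic loss for a label in {-1, +1}: log(1 + exp(-y * f(x)))."""
    return np.log1p(np.exp(-label * score))

scores = np.array([2.0, 0.3, -1.5])   # made-up model outputs f(x)
labels = np.array([1.0, -1.0, -1.0])  # made-up ground-truth labels
print(hinge_loss(scores, labels).mean())     # average "how bad" over the samples
print(logistic_loss(scores, labels).mean())
```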
53 How can we Pick the "Best" Function?
54 Training Procedure Outline
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
55 Problem Statement
◉ Given a loss function and several model parameter sets
○ Loss function: C(θ)
○ Model parameter sets: candidate values of θ = {W^1, b^1, W^2, b^2, …}
◉ Find the model parameter set θ* that minimizes C(θ)
How do we solve this optimization problem?
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
57 Gradient Descent for Optimization
◉ Assume that θ has only one variable
[Figure: a loss curve C(θ) with successive points θ^0, θ^1, θ^2, θ^3 moving downhill.]
Idea: drop a ball and find the position where the ball stops rolling (a local minimum).
58 Gradient Descent for Optimization
◉ Assume that θ has only one variable
Randomly start at θ^0
Compute dC(θ^0)/dθ:   θ^1 = θ^0 − η · dC(θ^0)/dθ
Compute dC(θ^1)/dθ:   θ^2 = θ^1 − η · dC(θ^1)/dθ
…
η is the "learning rate"
[Figure: the curve C(θ) showing the step from θ^0 to θ^1.]
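A minimal sketch of this one-variable update rule on a made-up loss C(θ) = (θ − 3)², whose minimum is at θ = 3:

```python
# theta <- theta - eta * dC/dtheta, repeated until (approximate) convergence.
def dC_dtheta(theta):
    return 2.0 * (theta - 3.0)     # derivative of the made-up loss (theta - 3)^2

theta = 10.0          # arbitrary starting point theta^0
eta = 0.1             # learning rate
for step in range(50):
    theta = theta - eta * dC_dtheta(theta)
print(theta)          # close to 3.0, the minimum of the loss
```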
59 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}
60 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}
• Randomly start at θ^0, then repeatedly move against the gradient: θ^(i+1) = θ^i − η ∇C(θ^i)
[Figure: contours of C over (θ1, θ2); at each point the gradient ∇C(θ^i) is computed and the movement goes in the opposite direction, tracing θ^0 → θ^1 → θ^2 → θ^3.]
Algorithm
    Initialization: start at θ^0
    while(θ^(i+1) ≠ θ^i)
    {
        compute gradient at θ^i
        update parameters: θ^(i+1) = θ^i − η ∇C(θ^i)
    }
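The same idea with two variables, sketched on a made-up quadratic loss and using the "stop when θ no longer changes" criterion from the algorithm above:

```python
import numpy as np

# Made-up loss with two variables: C(theta) = (theta1 - 1)^2 + 4 * (theta2 + 2)^2.
def grad_C(theta):
    t1, t2 = theta
    return np.array([2.0 * (t1 - 1.0), 8.0 * (t2 + 2.0)])

theta = np.array([5.0, 5.0])       # arbitrary start theta^0
eta = 0.1
for i in range(200):
    new_theta = theta - eta * grad_C(theta)   # move against the gradient
    if np.allclose(new_theta, theta):         # stop when theta no longer changes
        break
    theta = new_theta
print(theta)                        # approximately [1, -2], the minimum
```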
62 Revisit Neural Network Formulation
◉ Fully connected feedforward network f: R^N → R^M
[Figure: the network diagram from slide 39; the parameters to be optimized are θ = {W^1, b^1, W^2, b^2, …}.]
63 Gradient Descent for Neural Network
Algorithm
    Initialization: start at θ^0
    while(θ^(i+1) ≠ θ^i)
    {
        compute gradient at θ^i
        update parameters
    }
64 Gradient Descent for Optimization
Simple Case
[Figure: a single neuron with inputs x1, x2, weights w1, w2, and bias b; z = w1·x1 + w2·x2 + b, y = σ(z).]
Algorithm
    Initialization: start at θ^0
    while(θ^(i+1) ≠ θ^i)
    {
        compute gradient at θ^i
        update parameters
    }
65 Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
◉ Update the three parameters (w1, w2, b) at the t-th iteration
○ Each partial derivative is obtained with the chain rule through the sigmoid function:
  ∂C/∂w_i = ∂C/∂y · σ'(z) · x_i,   ∂C/∂b = ∂C/∂y · σ'(z),   where σ'(z) = σ(z)(1 − σ(z))
68 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Square error loss: C(θ) = (y − ŷ)²
69 Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
◉ Update the three parameters at the t-th iteration:
  w1 ← w1 − η ∂C/∂w1,   w2 ← w2 − η ∂C/∂w2,   b ← b − η ∂C/∂b
[Figure: the single neuron with inputs x1, x2, weights w1, w2, bias b, and output y = σ(z).]
A minimal code sketch of these updates follows.
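A minimal sketch of one such update step for the single sigmoid neuron with square error, following the chain rule above (the initial parameters, the training sample, and the learning rate are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_step(x, y_hat, w, b, eta):
    """One gradient step for a single sigmoid neuron with square error C = (y - y_hat)^2."""
    z = np.dot(w, x) + b
    y = sigmoid(z)
    # Chain rule: dC/dw_i = dC/dy * dy/dz * dz/dw_i
    dC_dy = 2.0 * (y - y_hat)
    dy_dz = y * (1.0 - y)              # derivative of the sigmoid
    dC_dw = dC_dy * dy_dz * x          # dz/dw_i = x_i
    dC_db = dC_dy * dy_dz              # dz/db = 1
    return w - eta * dC_dw, b - eta * dC_db

w, b = np.array([0.5, -0.5]), 0.0      # made-up initial parameters
x, y_hat = np.array([1.0, 2.0]), 1.0   # one made-up training sample
for _ in range(100):
    w, b = update_step(x, y_hat, w, b, eta=0.5)
print(sigmoid(np.dot(w, x) + b))        # moves toward the target 1.0
```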
70 Optimization Algorithm
Algorithm
    Initialization: set the parameters w, b at random
    while(stopping criteria not met)
    {
        for each training sample (x, ŷ), compute the gradient and update the parameters
    }
71 Gradient Descent for Neural Network
Algorithm
    Initialization: start at θ^0
    while(θ^(i+1) ≠ θ^i)
    {
        compute gradient at θ^i over all training data
        update parameters
    }
Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …
The model is updated only after seeing all training samples → slow
73 Training Procedure Outline
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
74 Stochastic Gradient Descent (SGD)
◉ Gradient Descent computes the gradient over all training samples before each update
◉ SGD: the model can be updated after seeing one training sample → faster
75 Epoch Definition
◉ When running SGD, the model starts at θ^0
Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …
pick x^1 → update, pick x^2 → update, …, pick x^k → update, …
Seeing all training samples once = one epoch; then start over and pick x^1 again.
76 Gradient Descent v.s. SGD
◉ Gradient Descent: update after seeing all examples → 1 update per epoch
◉ Stochastic Gradient Descent: if there are 20 examples, update 20 times in one epoch
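A sketch contrasting the two update schedules on a made-up linear-regression-style problem (the data, model, and learning rate are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((20, 2))                       # 20 made-up examples
y = X @ np.array([2.0, -1.0])                 # targets from a made-up "true" model

def grad(theta, X, y):
    """Gradient of the mean square error for a linear model."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

eta = 0.1
theta_gd = np.zeros(2)
theta_sgd = np.zeros(2)
for epoch in range(100):
    # Gradient Descent: one update per epoch, using all 20 examples.
    theta_gd -= eta * grad(theta_gd, X, y)
    # SGD: 20 updates per epoch, one per example.
    for i in rng.permutation(len(y)):
        theta_sgd -= eta * grad(theta_sgd, X[i:i+1], y[i:i+1])
print(theta_gd, theta_sgd)                    # both approach [2, -1]
```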
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
78 Mini-Batch SGD
◉ Batch Gradient Descent uses the full training set per update; SGD uses a single example; mini-batch SGD uses something in between
Why is mini-batch faster than SGD?
Batch size:   1 (SGD) … 10 … 100 … 1000 … 10000 … full (Gradient Descent), with Mini-Batch covering the middle of the range
82 SGD v.s. Mini-Batch
◉ Stochastic Gradient Descent (SGD): z^1 = W^1 x is computed separately for each example, one matrix–vector product at a time
◉ Mini-Batch SGD: the examples of a batch are stacked into a matrix, so z^1 = W^1 X is a single matrix–matrix product → the same arithmetic is done in fewer, better-parallelized operations, which is why mini-batches run faster in practice
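A sketch of the point above: both ways compute the same numbers, but the mini-batch version does it in a single matrix–matrix product (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
W1 = rng.normal(size=(100, 50))          # weights of the first layer (made-up sizes)
batch = rng.random((50, 32))             # 32 examples stacked as columns of a matrix

# SGD style: one matrix-vector product per example.
z_one_by_one = np.stack([W1 @ batch[:, i] for i in range(batch.shape[1])], axis=1)

# Mini-batch style: a single matrix-matrix product for the whole batch.
z_batched = W1 @ batch

print(np.allclose(z_one_by_one, z_batched))   # True: same numbers, but the batched
                                              # product is done in one optimized call
```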
Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
85 Initialization
◉ Different initialization parameters may result in different trained models
[Figure: cost vs. number of parameter updates on the error surface, with curves for different learning rates (very large, large, small, …).]
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-
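The cost-vs-updates picture can be reproduced with a tiny experiment; this sketch reuses the made-up 1-D loss from the earlier gradient-descent sketch, and the learning-rate values are arbitrary:

```python
# Effect of the learning rate on the made-up loss C(theta) = (theta - 3)^2.
def C(theta):
    return (theta - 3.0) ** 2

def run(eta, steps=20, theta=10.0):
    """Run gradient descent and return the cost after each update."""
    costs = []
    for _ in range(steps):
        theta = theta - eta * 2.0 * (theta - 3.0)
        costs.append(C(theta))
    return costs

for eta in [0.01, 0.1, 0.9, 1.1]:        # small, reasonable, large, too large
    print(eta, run(eta)[-1])             # too small: cost still high; too large: diverges
```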
88 Learning Recipe
[Figure: Training Data (x, ŷ) is used to learn the "best" function f*, which is then applied to the Dev/Validation Data and the Testing Data.]
89 Learning Recipe
[Figure: the same pipeline; on the dev/validation set we immediately know the performance, while on the testing set we do not know the performance until submission.]
90 Learning Recipe
[Flowchart: if the results on the training set are not good → modify the training process.]
◉ Possible reasons
○ No good function exists: bad hypothesis function set → reconstruct the model architecture
○ Cannot find a good function: stuck in local optima → change the training strategy
91 Learning Recipe
[Flowchart: good results on the training set? → if yes, good results on the dev/validation set? → if yes, done; if the dev/validation results are poor, the model is likely overfitting.]
◉ Possible solutions
○ More training samples
○ Some tips: dropout, etc. (a minimal sketch of dropout follows)
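The slides only name dropout; as one common formulation, here is a minimal sketch of (inverted) dropout applied to a vector of activations:

```python
import numpy as np

def dropout(a, p_drop, rng):
    """Randomly zero a fraction p_drop of the activations during training
    (inverted dropout: the survivors are rescaled so the expected value is unchanged)."""
    mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

rng = np.random.default_rng(5)
a = rng.random(10)                      # activations of some hidden layer (made up)
print(dropout(a, p_drop=0.5, rng=rng))  # roughly half the entries are zeroed
```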
93 Concluding Remarks
◉ Q1. What is the model? Model Architecture
“Best” Function f*