Deep Learning Basics Lecture 1 Feedforward
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
Pipeline: input 𝑥 → extract features 𝜙(𝑥) (e.g., color histogram) → build hypothesis 𝑦 = 𝑤ᵀ𝜙(𝑥) (linear model)
Example: Polynomial kernel SVM
Inputs 𝑥1, 𝑥2 → 𝑦 = sign(𝑤ᵀ𝜙(𝑥) + 𝑏), with a fixed feature map 𝜙(𝑥)
Motivation: representation learning
• Why don’t we also learn 𝜙(𝑥)?
Diagram: 𝑥 → learn 𝜙(𝑥) → learn 𝑤 → 𝑦 = 𝑤ᵀ𝜙(𝑥)
Feedforward networks
• View each dimension of 𝜙(𝑥) as something to be learned
Diagram: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥)
Feedforward networks
• Linear functions 𝜙𝑖(𝑥) = 𝜃𝑖ᵀ𝑥 don’t work: need some nonlinearity (a linear 𝜙 followed by a linear output is still just a linear function of 𝑥, so nothing is gained)
Diagram: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥)
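A minimal numerical check of this point (illustrative NumPy sketch; the matrices here are arbitrary, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 5))   # "feature" weights: phi(x) = Theta @ x
w = rng.normal(size=3)            # output weights
x = rng.normal(size=5)

# Two stacked linear layers ...
y_two_layer = w @ (Theta @ x)

# ... collapse into one linear map with weights w_eff = Theta^T w
w_eff = Theta.T @ w
y_one_layer = w_eff @ x

print(np.isclose(y_two_layer, y_one_layer))  # True: no extra expressive power
```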
Feedforward networks
• Typically, set 𝜙𝑖(𝑥) = 𝑟(𝜃𝑖ᵀ𝑥) where 𝑟(⋅) is some nonlinear function
Diagram: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥)
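A minimal sketch of this one-hidden-layer computation in NumPy, assuming 𝑟 is the ReLU used later in the lecture (dimensions and values are illustrative):

```python
import numpy as np

def relu(z):
    # r(z) = max(z, 0), applied elementwise
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, k = 5, 4                      # input dim, number of hidden units
Theta = rng.normal(size=(k, d))  # row i holds theta_i
w = rng.normal(size=k)
x = rng.normal(size=d)

phi = relu(Theta @ x)            # phi_i(x) = r(theta_i^T x)
y = w @ phi                      # y = w^T phi(x)
print(y)
```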
Feedforward deep networks
• What if we go deeper?
Diagram: 𝑥 → ℎ1 → ℎ2 → ⋯ → ℎ𝐿 → 𝑦
(Figure from Deep Learning, by Goodfellow, Bengio, and Courville. Dark boxes are things to be learned.)
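A sketch of the deeper forward pass, again with ReLU hidden layers and a linear output (all sizes and weights here are illustrative, not the lecture's):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
sizes = [5, 8, 8, 8, 1]          # input dim, three hidden layers, scalar output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

h = rng.normal(size=sizes[0])    # h^0 = x
for W, b in zip(Ws[:-1], bs[:-1]):
    h = relu(W @ h + b)          # h^l = r(W h^{l-1} + b)
y = Ws[-1] @ h + bs[-1]          # linear output layer
print(y)
```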
Motivation II: neurons
Motivation: neurons
Figure from Wikipedia
Motivation: abstract neuron model
• A neuron is activated when the correlation between the input and a pattern 𝜃 exceeds some threshold 𝑏
• 𝑦 = threshold(𝜃ᵀ𝑥 − 𝑏), or 𝑦 = 𝑟(𝜃ᵀ𝑥 − 𝑏)
• 𝑟(⋅) is called the activation function
Diagram: inputs 𝑥1, 𝑥2, …, 𝑥𝑑 → output 𝑦
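A tiny sketch of this abstract neuron, using a hard threshold for 𝑟 (the numbers are made up for illustration):

```python
import numpy as np

def neuron(x, theta, b):
    # Fires (outputs 1) when the correlation theta^T x exceeds the threshold b
    return 1.0 if theta @ x - b > 0 else 0.0

theta = np.array([0.5, -0.2, 0.8])
x = np.array([1.0, 0.0, 1.0])
print(neuron(x, theta, b=1.0))   # 0.5 + 0.8 = 1.3 > 1.0, so the neuron fires: 1.0
```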
Motivation: artificial neural networks
• Put into layers: feedforward deep networks
Diagram: 𝑥 → ℎ1 → ℎ2 → ⋯ → ℎ𝐿 → 𝑦
Components in feedforward networks
Components
• Representations:
• Input
• Hidden variables
• Layers/weights:
• Hidden layers
• Output layer
Components
Diagram: 𝑥 → first layer → hidden layers ℎ → output layer → 𝑦
Output layers
Output layer
• Multi-dimensional regression: 𝑦 = 𝑊ᵀℎ + 𝑏
• Linear units: no nonlinearity
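A minimal sketch of such a linear output layer on top of a hidden representation ℎ (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8)           # hidden representation from the previous layer
W = rng.normal(size=(8, 3))      # 3-dimensional regression target
b = rng.normal(size=3)

y = W.T @ h + b                  # y = W^T h + b, no nonlinearity
print(y.shape)                   # (3,)
```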
Output layers
Output layer
• Binary classification: 𝑦 = 𝜎(𝑤ᵀℎ + 𝑏)
• Corresponds to using logistic regression on ℎ
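A sketch of this sigmoid output unit, where 𝜎 is the logistic function (weights and ℎ are illustrative):

```python
import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=8)
w = rng.normal(size=8)
b = 0.1

y = sigmoid(w @ h + b)           # probability of the positive class
print(y)
```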
Output layers
Output layer
• Multi-class classification: 𝑦 = softmax(𝑧) where 𝑧 = 𝑊ᵀℎ + 𝑏
• Corresponds to using multi-class logistic regression on ℎ
Diagram: ℎ → 𝑧 → 𝑦
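A sketch of the softmax output layer (with the usual max-subtraction for numerical stability; shapes are illustrative):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=8)
W = rng.normal(size=(8, 4))      # 4 classes
b = rng.normal(size=4)

z = W.T @ h + b                  # class scores (logits)
y = softmax(z)                   # class probabilities
print(y, y.sum())                # probabilities sum to 1
```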
Hidden layers
• Each neuron takes a weighted linear combination of the previous layer
• So we can think of it as outputting one value for the next layer
Diagram: layer ℎ𝑖 → layer ℎ𝑖+1
Hidden layers
• 𝑦 = 𝑟(𝑤ᵀ𝑥 + 𝑏)
Plot of the activation 𝑟(⋅) mapping 𝑥 to 𝑦; regions where the gradient is 0 or too small are marked.
Hidden layers
• Generalizations of ReLU: gReLU(𝑧) = max{𝑧, 0} + 𝛼 min{𝑧, 0}
• Leaky-ReLU(𝑧) = max{𝑧, 0} + 0.01 min{𝑧, 0}
• Parametric-ReLU(𝑧): 𝛼 learnable
Plot of gReLU(𝑧)
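A sketch of these ReLU variants (the function names are illustrative; in the parametric version 𝛼 would be learned along with the other weights):

```python
import numpy as np

def g_relu(z, alpha):
    # gReLU(z) = max{z, 0} + alpha * min{z, 0}
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def leaky_relu(z):
    # fixed small slope 0.01 on the negative side
    return g_relu(z, alpha=0.01)

# Parametric ReLU: same formula, but alpha is a learnable parameter
alpha = 0.25                      # would be updated by gradient descent during training
z = np.array([-2.0, -0.5, 0.0, 1.5])
print(g_relu(z, 0.0))             # plain ReLU
print(leaky_relu(z))              # Leaky-ReLU
print(g_relu(z, alpha))           # Parametric-ReLU with the current alpha
```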