Lecture 9. Neural Networks
Andrew Ng
One hidden layer Neural Network
Neural Networks Overview

What is a Neural Network?
[Diagram: logistic regression drawn as a single unit, and a one-hidden-layer network, both taking inputs x1, x2, x3 and producing ŷ.]

Logistic regression (parameters w, b):
z = w^T x + b  →  a = σ(z)  →  ℒ(a, y)

Neural network (parameters W^[1], b^[1], W^[2], b^[2]):
z^[1] = W^[1] x + b^[1]  →  a^[1] = σ(z^[1])  →  z^[2] = W^[2] a^[1] + b^[2]  →  a^[2] = σ(z^[2])  →  ℒ(a^[2], y)
One hidden layer Neural Network
Neural Network Representation

[Diagram: a 2-layer network with inputs x1, x2, x3, one hidden layer of four units, and output ŷ; the layers are the input layer, hidden layer, and output layer.]
One hidden layer Neural Network
Computing a Neural Network's Output
Neural Network Representation
[Diagram: zooming in on a single unit of the network: given inputs x1, x2, x3 with parameters w and b, the unit computes z = w^T x + b and then a = σ(z), so that ŷ = a. Every node in the hidden layer performs exactly this two-step computation.]
z = w^T x + b
a = σ(z)
Neural Network Representation
𝑎 [11 ] )
𝑥1 𝑎 [21 ] )
𝑥2 ^
𝑦
)
𝑎 [31 ]
𝑥3
𝑎 [41 ] )
Andrew Ng
Neural Network Representation

Given input x:
z^[1] = W^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = W^[2] a^[1] + b^[2]
a^[2] = σ(z^[2])
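A minimal numpy sketch of this forward pass for a single example, assuming 3 inputs and 4 hidden units (all names and sizes here are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.standard_normal((3, 1))          # one example as a column vector (n_x = 3)
W1 = rng.standard_normal((4, 3)) * 0.01   # W^[1]: 4 hidden units, 3 inputs
b1 = np.zeros((4, 1))                     # b^[1]
W2 = rng.standard_normal((1, 4)) * 0.01   # W^[2]: 1 output unit, 4 hidden units
b2 = np.zeros((1, 1))                     # b^[2]

z1 = W1 @ x + b1      # z^[1] = W^[1] x + b^[1]
a1 = sigmoid(z1)      # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2     # z^[2] = W^[2] a^[1] + b^[2]
a2 = sigmoid(z2)      # a^[2] = y-hat, the network's output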
One hidden layer Neural Network
Vectorizing across multiple examples
For a single training example x:
z^[1] = W^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = W^[2] a^[1] + b^[2]
a^[2] = σ(z^[2])
Vectorizing across multiple examples
for i = 1 to m:
    z^[1](i) = W^[1] x^(i) + b^[1]
    a^[1](i) = σ(z^[1](i))
    z^[2](i) = W^[2] a^[1](i) + b^[2]
    a^[2](i) = σ(z^[2](i))
One hidden layer Neural Network
Explanation for vectorized implementation

Justification for vectorized implementation
Recap of vectorizing across multiple examples
for i = 1 to m:
    z^[1](i) = W^[1] x^(i) + b^[1]
    a^[1](i) = σ(z^[1](i))
    z^[2](i) = W^[2] a^[1](i) + b^[2]
    a^[2](i) = σ(z^[2](i))

With the examples stacked as columns, X = [x^(1) x^(2) … x^(m)] and A^[1] = [a^[1](1) a^[1](2) … a^[1](m)], the loop becomes:
Z^[1] = W^[1] X + b^[1]
A^[1] = σ(Z^[1])
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = σ(Z^[2])
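A hedged numpy sketch of the vectorized version, assuming X has shape (n_x, m) with one example per column (function and variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2):
    # X: (n_x, m), one training example per column.
    # b1 and b2 are column vectors and broadcast across the m columns.
    Z1 = W1 @ X + b1       # Z^[1], shape (n_h, m)
    A1 = sigmoid(Z1)       # A^[1]
    Z2 = W2 @ A1 + b2      # Z^[2], shape (1, m)
    A2 = sigmoid(Z2)       # A^[2] = Y-hat for all m examples
    return A2, (Z1, A1, Z2, A2)   # cache intermediates for backprop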
One hidden layer Neural Network
Activation functions

Activation functions
Given x:
z^[1] = W^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = W^[2] a^[1] + b^[2]
a^[2] = σ(z^[2])
(So far σ has been used everywhere; any of these σ's can be replaced by a different activation function g.)
Pros and cons of activation functions
[Plots of activation functions a = g(z): sigmoid, tanh, ReLU, Leaky ReLU.]
sigmoid: a = 1 / (1 + e^(−z))
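For reference, the four activations as numpy one-liners (a sketch; the 0.01 slope for Leaky ReLU follows the convention used in the lecture):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))            # output in (0, 1); mainly useful for the output layer

def tanh(z):
    return np.tanh(z)                      # zero-centered output in (-1, 1)

def relu(z):
    return np.maximum(0, z)                # common default for hidden layers

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small negative slope avoids "dead" units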
One hidden layer Neural Network
Why do you need non-linear activation functions?

Activation function
Given x:
z^[1] = W^[1] x + b^[1]
a^[1] = g^[1](z^[1])
z^[2] = W^[2] a^[1] + b^[2]
a^[2] = g^[2](z^[2])
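A sketch of the argument this slide makes: if the hidden activation is the identity, g^[1](z) = z, the two layers collapse into a single linear function of x.

\begin{aligned}
a^{[1]} &= z^{[1]} = W^{[1]}x + b^{[1]} \\
a^{[2]} &= W^{[2]}a^{[1]} + b^{[2]}
         = W^{[2]}\bigl(W^{[1]}x + b^{[1]}\bigr) + b^{[2]}
         = \underbrace{W^{[2]}W^{[1]}}_{W'}\,x + \underbrace{W^{[2]}b^{[1]} + b^{[2]}}_{b'}
\end{aligned}

So with linear activations the network computes only a linear function of x no matter how many hidden layers it has; the non-linearity is what makes hidden layers useful. (A linear activation is reasonable mainly in the output layer of a regression problem.)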
One hidden layer Neural Network
Derivatives of activation functions

Sigmoid activation function
g(z) = 1 / (1 + e^(−z)),  g′(z) = g(z)(1 − g(z))

Tanh activation function
g(z) = tanh(z),  g′(z) = 1 − tanh(z)²

ReLU and Leaky ReLU
ReLU: g(z) = max(0, z),  g′(z) = 0 if z < 0, 1 if z > 0
Leaky ReLU: g(z) = max(0.01z, z),  g′(z) = 0.01 if z < 0, 1 if z > 0
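The corresponding derivatives as a small numpy sketch (names are illustrative):

import numpy as np

def sigmoid_grad(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                  # g'(z) = g(z)(1 - g(z))

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2          # g'(z) = 1 - tanh(z)^2

def relu_grad(z):
    return (z > 0).astype(float)        # 0 for z < 0, 1 for z > 0 (the value at z = 0 rarely matters)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)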
One hidden layer Neural Network

Formulas for computing derivatives
One hidden layer Neural Network
Backpropagation intuition (Optional)

Computing gradients
Logistic regression
[Computation graph: x, w, b → z = w^T x + b → a = σ(z) → ℒ(a, y); the derivatives da, dz, dw, db are computed by walking the graph backwards.]
Neural network gradients
[Computation graph: x, W^[1], b^[1] → z^[1], a^[1]; then W^[2], b^[2] → z^[2], a^[2] → ℒ(a^[2], y); the gradients summarized below come from walking this graph backwards.]
Summary of gradient descent
dz^[2] = a^[2] − y
dW^[2] = dz^[2] a^[1]T
db^[2] = dz^[2]
dz^[1] = W^[2]T dz^[2] ∗ g^[1]′(z^[1])
dW^[1] = dz^[1] x^T
db^[1] = dz^[1]
Summary of gradient descent
Single example:                          Vectorized over m examples:
dz^[2] = a^[2] − y                       dZ^[2] = A^[2] − Y
dW^[2] = dz^[2] a^[1]T                   dW^[2] = (1/m) dZ^[2] A^[1]T
db^[2] = dz^[2]                          db^[2] = (1/m) np.sum(dZ^[2], axis=1, keepdims=True)
dz^[1] = W^[2]T dz^[2] ∗ g^[1]′(z^[1])   dZ^[1] = W^[2]T dZ^[2] ∗ g^[1]′(Z^[1])
dW^[1] = dz^[1] x^T                      dW^[1] = (1/m) dZ^[1] X^T
db^[1] = dz^[1]                          db^[1] = (1/m) np.sum(dZ^[1], axis=1, keepdims=True)
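A hedged numpy sketch of the vectorized equations, written to pair with the forward_pass sketch above (so the hidden layer is assumed to use sigmoid, giving g^[1]′(Z^[1]) = A^[1](1 − A^[1]); names are illustrative):

import numpy as np

def backward_pass(X, Y, cache, W2):
    # X: (n_x, m) inputs, Y: (1, m) labels, cache = (Z1, A1, Z2, A2) from forward_pass above.
    Z1, A1, Z2, A2 = cache
    m = X.shape[1]

    dZ2 = A2 - Y                                         # dZ^[2] = A^[2] - Y
    dW2 = (1 / m) * dZ2 @ A1.T                           # dW^[2]
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)   # db^[2]

    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))                 # dZ^[1]; A1(1 - A1) is g^[1]'(Z^[1]) for sigmoid
    dW1 = (1 / m) * dZ1 @ X.T                            # dW^[1]
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)   # db^[1]
    return dW1, db1, dW2, db2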
One hidden layer Neural Network
Random Initialization

What happens if you initialize weights to zero?
[Diagram: a network with inputs x1, x2, hidden units a^[1]_1, a^[1]_2, and output a^[2]_1 = ŷ.]
If W^[1] starts at all zeros, both hidden units compute the same function and receive identical gradient updates, so they stay identical ("symmetric") no matter how long you train; the extra hidden units add nothing.
Random initialization
[Same network as above.]
Break the symmetry by initializing the weights to small random values; the biases can safely start at zero.
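A minimal sketch of this initialization for layer sizes n_x, n_h, n_y (the 0.01 scale keeps z small so sigmoid/tanh units do not start out saturated):

import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    W1 = np.random.randn(n_h, n_x) * scale   # small random values break the symmetry
    b1 = np.zeros((n_h, 1))                  # biases can start at zero
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2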
Multi-class classification
Softmax regression

Recognizing cats, dogs, and baby chicks
[Images labeled with classes 0–3 (other = 0, cat = 1, dog = 2, baby chick = 3); the example labels shown are 3 1 2 0 3 2 0 1.]
The network maps input X to ŷ, a vector of C = 4 class probabilities produced by a softmax output layer.
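A sketch of the softmax output layer for C-class classification (the max-subtraction is a standard numerical-stability trick, not something from the slide; the example values are illustrative):

import numpy as np

def softmax(z):
    # z: (C, m) pre-activations; each column becomes C probabilities that sum to 1.
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # C = 4 classes, one example
print(softmax(z).ravel())                     # 4 probabilities summing to 1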
Deep Neural Networks
Deep L-layer neural network
Deep Neural Networks
Getting your matrix dimensions right

Parameters W^[l] and b^[l]
[Diagram: a deep network with inputs x1, x2 and output ŷ. For layer l with n^[l] units, W^[l] has shape (n^[l], n^[l−1]) and b^[l] has shape (n^[l], 1).]
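A short sketch of building the parameters for a deep network and asserting the dimensions above (the layer sizes are illustrative):

import numpy as np

layer_dims = [2, 5, 5, 3, 1]   # n^[0] = 2 inputs, three hidden layers, n^[L] = 1 output

parameters = {}
for l in range(1, len(layer_dims)):
    parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    # W^[l] must be (n^[l], n^[l-1]) and b^[l] must be (n^[l], 1):
    assert parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
    assert parameters["b" + str(l)].shape == (layer_dims[l], 1)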
Vectorized implementation
[Same network. With m examples stacked as columns, Z^[l] and A^[l] have shape (n^[l], m), while W^[l] and b^[l] keep the shapes above (b^[l] broadcasts across the m columns).]
Deep Neural Networks
Why deep representations?

Intuition about deep representation
[Diagram: earlier layers detect simple features (e.g. edges in a face image) and later layers compose them into more complex ones (facial parts, then whole faces) before the network outputs ŷ.]
Circuit theory and deep learning
Informally: there are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute. (Example: the parity/XOR of n inputs needs only O(log n) layers of small XOR gates, but a single hidden layer needs on the order of 2^n units.)
Deep Neural Networks
Building blocks of deep neural networks

Forward and backward functions
[Diagram: a deep network with inputs x1 … x4 and output ŷ. For each layer l there is a forward function that takes a^[l−1] and outputs a^[l] (caching z^[l]), and a backward function that takes da^[l] plus the cache and outputs da^[l−1], dW^[l], db^[l]; chaining these blocks gives one iteration of gradient descent.]
Deep Neural Networks
Forward and backward propagation

Forward propagation for layer l
Input a^[l−1]; output a^[l], cache z^[l]:
Z^[l] = W^[l] A^[l−1] + b^[l]
A^[l] = g^[l](Z^[l])

Backward propagation for layer l
Input da^[l]; output da^[l−1], dW^[l], db^[l]:
dZ^[l] = dA^[l] ∗ g^[l]′(Z^[l])
dW^[l] = (1/m) dZ^[l] A^[l−1]T
db^[l] = (1/m) np.sum(dZ^[l], axis=1, keepdims=True)
dA^[l−1] = W^[l]T dZ^[l]

Summary
[Diagram: the forward blocks chain from X = A^[0] to ŷ = A^[L]; the backward blocks chain the gradients from dA^[L] back to layer 1.]
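A hedged sketch of one layer's forward and backward building blocks; g and g_prime are the layer's activation and its derivative (e.g. relu and relu_grad from the earlier sketches), and the cache layout is illustrative:

import numpy as np

def layer_forward(A_prev, W, b, g):
    # Forward function for layer l: input A^[l-1], output A^[l] plus a cache for backprop.
    Z = W @ A_prev + b
    A = g(Z)
    return A, (A_prev, W, Z)

def layer_backward(dA, cache, g_prime):
    # Backward function for layer l: input dA^[l] and the cache, output dA^[l-1], dW^[l], db^[l].
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db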
Deep Neural Networks
Parameters vs Hyperparameters

What are hyperparameters?
Parameters: W^[1], b^[1], W^[2], b^[2], …
Hyperparameters: learning rate α, number of iterations, number of hidden layers L, number of hidden units n^[1], n^[2], …, choice of activation function (and later: momentum, mini-batch size, regularization parameters, …).

Applied deep learning is a very empirical process
[Diagram: the Idea → Code → Experiment loop; you try hyperparameter values, watch the cost over iterations, and refine the idea.]
Deep Neural Networks
What does this have to do with the brain?

Forward and backward propagation
Forward:
Z^[1] = W^[1] X + b^[1]
A^[1] = g^[1](Z^[1])
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = g^[2](Z^[2])
…
A^[L] = g^[L](Z^[L]) = Ŷ

Backward:
dZ^[L] = A^[L] − Y
dW^[L] = (1/m) dZ^[L] A^[L−1]T
db^[L] = (1/m) np.sum(dZ^[L], axis=1, keepdims=True)
dZ^[L−1] = W^[L]T dZ^[L] ∗ g^[L−1]′(Z^[L−1])
…
dZ^[1] = W^[2]T dZ^[2] ∗ g^[1]′(Z^[1])
dW^[1] = (1/m) dZ^[1] X^T
db^[1] = (1/m) np.sum(dZ^[1], axis=1, keepdims=True)

[Image: a biological neuron shown alongside a single logistic unit.]
Setting up your ML application
Train/dev/test sets

Applied ML is a highly iterative process
Choices to make: # layers, # hidden units, learning rates, activation functions, …
[Diagram: the Idea → Code → Experiment loop.]

Train/dev/test sets

Mismatched train/test distribution
Bias/Variance

Bias and Variance
[Plots: high-bias (underfitting), "just right", and high-variance (overfitting) classifiers on a 2D dataset.]

Bias and Variance
Cat classification

High bias and high variance
[Plot: a classifier in (x1, x2) that both underfits the overall shape of the data and overfits a few individual points.]
Setting up your ML application
Basic "recipe" for machine learning

Basic recipe for machine learning
Regularizing your neural network
Regularization

Logistic regression
min over w, b of J(w, b)
J(w, b) = (1/m) Σ_{i=1}^{m} ℒ(ŷ^(i), y^(i)) + (λ/2m) ‖w‖₂²

Neural network
J(W^[1], b^[1], …, W^[L], b^[L]) = (1/m) Σ_{i=1}^{m} ℒ(ŷ^(i), y^(i)) + (λ/2m) Σ_{l=1}^{L} ‖W^[l]‖_F²   (Frobenius norm)
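A small sketch of L2 (Frobenius-norm) regularization in code, assuming a cross-entropy cost has already been computed; the weight-decay term added to each dW^[l] is shown as a comment:

import numpy as np

def cost_with_l2(cross_entropy_cost, weight_matrices, lambd, m):
    # Adds (lambda / 2m) * sum_l ||W^[l]||_F^2 to the unregularized cost.
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost + l2_term

# In backprop, each gradient picks up an extra "weight decay" term:
#   dWl = (1 / m) * dZl @ A_prev.T + (lambd / m) * Wl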
How does regularization prevent overfitting?
[Diagram: the network with inputs x1, x2, x3 and output ŷ.]
Intuition: a large λ pushes the weights W^[l] toward zero, effectively reducing the influence of many hidden units and moving the network toward a simpler function that is less prone to overfit.
Regularizing your neural network
Why regularization reduces overfitting

How does regularization prevent overfitting?
[Diagram: the same network.]
A second intuition: with λ large the weights are small, so z^[l] stays in the roughly linear region of tanh; every layer is then nearly linear, and the whole network cannot fit a very complicated, overfit decision boundary.
Regularizing your neural network
Dropout regularization

Dropout regularization
[Diagram: the original network with inputs x1 … x4 and output ŷ (left), and the same network after dropout (right), where each hidden unit has been kept or removed at random, leaving a smaller "thinned" network that is trained on this example.]
Implementing dropout ("Inverted dropout")
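A sketch of the inverted-dropout step for a hypothetical layer 3 with activations a3 of shape (n^[3], m), following the keep_prob convention used in the lecture:

import numpy as np

keep_prob = 0.8                               # probability of keeping each unit in this layer
a3 = np.random.rand(50, 100)                  # stand-in for the layer-3 activations, shape (n^[3], m)

d3 = np.random.rand(*a3.shape) < keep_prob    # dropout mask: True with probability keep_prob
a3 = a3 * d3                                  # zero out the dropped units
a3 = a3 / keep_prob                           # "inverted" step: rescale so E[a3] is unchanged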
Making predictions at test time
Do not apply dropout at test time; because of the inverted-dropout scaling during training, the expected activations are unchanged and no extra scaling is needed.
Regularizing your neural network
Understanding dropout

Why does drop-out work?
Intuition: a unit can't rely on any one feature, because any of its inputs could be dropped, so it has to spread out its weights. Spreading the weights shrinks their squared norm, giving an effect similar to L2 regularization.
[Diagram: the network with inputs x1, x2, x3 and output ŷ.]
Regularizing your neural network
Other regularization methods

Data augmentation
[Images: extra training examples created by flipping, randomly cropping, or slightly distorting existing images (e.g. a mirrored cat photo, a distorted digit 4).]

Early stopping
[Plot: training error keeps decreasing with # iterations while dev-set error starts rising; stop training around the point where the dev error is lowest.]
Setting up your optimization problem
Normalizing inputs

Normalizing training sets
Subtract the mean, then divide each feature by its standard deviation: x := (x − μ)/σ. Use the same μ and σ, computed on the training set, to normalize the dev and test sets.
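A minimal sketch of the normalization step, assuming X has shape (n_x, m); the small epsilon guards against constant features and is not part of the slide:

import numpy as np

def normalize_train(X):
    mu = np.mean(X, axis=1, keepdims=True)            # per-feature mean over the m examples
    sigma = np.std(X, axis=1, keepdims=True) + 1e-8   # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Use the training-set mu and sigma for dev/test data too:
#   X_dev = (X_dev - mu) / sigma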
Why normalize inputs?
J(w, b) = (1/m) Σ_{i=1}^{m} ℒ(ŷ^(i), y^(i))
[Contour plots of J over (w, b): with unnormalized inputs the contours are elongated, forcing a small learning rate and many oscillating steps; with normalized inputs the contours are more symmetric and gradient descent can take larger, more direct steps.]
Setting up your optimization problem
Vanishing/exploding gradients

Vanishing/exploding gradients
[Diagram: a very deep network computing ŷ. With near-linear activations, ŷ is roughly the product W^[L] W^[L−1] … W^[1] x; weights slightly larger than the identity make the activations and gradients grow exponentially with depth, and weights slightly smaller make them shrink exponentially.]

Single neuron example
a = g(z),  z = w₁x₁ + w₂x₂ + … + wₙxₙ  (b = 0)
The larger the number of inputs n, the smaller each wᵢ should be; this motivates scaling the random initialization by the fan-in, e.g. Var(wᵢ) = 1/n (or 2/n for ReLU).
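A tiny numeric illustration of the effect (my own, not from the slide), plus the fan-in-scaled initialization the lecture points to as a partial fix:

import numpy as np

L = 50
print(1.5 ** L, 0.5 ** L)   # ~6.4e8 vs ~8.9e-16: per-layer factors slightly above or below 1 explode or vanish

def fan_in_init(n_out, n_in):
    # Var(w_i) = 2/n ("He" initialization), suited to ReLU layers; use 1/n for tanh.
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)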
Setting up your optimization problem
Numerical approximation of gradients

Checking your derivative computation
Two-sided difference: g(θ) ≈ (f(θ + ε) − f(θ − ε)) / (2ε), with approximation error O(ε²), versus O(ε) for the one-sided difference (f(θ + ε) − f(θ)) / ε.
[Plot: f(θ) = θ³ near θ = 1; with ε = 0.01 the two-sided estimate 3.0001 is much closer to the true derivative 3 than the one-sided estimate 3.0301.]
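The two-sided difference in code, reproducing the θ³ example (printed values are approximate):

def two_sided_diff(f, theta, eps=1e-2):
    # Numerical approximation of f'(theta) with error O(eps^2).
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

f = lambda t: t ** 3
print(two_sided_diff(f, 1.0))             # ~3.0001, close to the true derivative 3
print((f(1.0 + 1e-2) - f(1.0)) / 1e-2)    # one-sided estimate: ~3.0301, noticeably worse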
Setting up your optimization problem
Gradient Checking

Gradient check for a neural network
Reshape and concatenate all parameters W^[1], b^[1], …, W^[L], b^[L] into one big vector θ, and all gradients dW^[1], db^[1], … into dθ, so that J = J(θ).

Gradient checking (Grad check)
For each i: dθ_approx[i] = (J(θ₁, …, θᵢ + ε, …) − J(θ₁, …, θᵢ − ε, …)) / (2ε).
Then check ‖dθ_approx − dθ‖₂ / (‖dθ_approx‖₂ + ‖dθ‖₂); with ε = 10⁻⁷, a value around 10⁻⁷ is great, around 10⁻⁵ deserves a closer look, and around 10⁻³ or larger suggests a bug.
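A hedged sketch of grad check, assuming the cost J can be evaluated as a function of the flattened parameter vector θ (the flattening helpers are left out):

import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    # Compare the analytic gradient dtheta against a numerical estimate, component by component.
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den    # ~1e-7 is great, ~1e-5 is worth a look, >= 1e-3 is probably a bug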
Setting up your optimization problem
Gradient Checking implementation notes

Gradient checking implementation notes
- Don't use in training – only to debug (it is far too slow to run on every iteration).
- Remember regularization: the numerical check is against J including the regularization term, so the analytic gradients must include it too.