Lecture 9. Neural Networks

The document provides an overview of neural networks, detailing their structure, including one hidden layer, and the computation of outputs using activation functions. It discusses the importance of non-linear activation functions, gradient descent, and backpropagation for training neural networks. Additionally, it touches on deep neural networks, their advantages, and the distinction between parameters and hyperparameters in the context of deep learning.


Neural Networks

Assoc. Prof. Dr. Nguyễn Phương Thái


NLP Laboratory, Institute of Artificial Intelligence, VNU-UET

Adapted from slides of Andrew Ng

One hidden layer Neural Network
Neural Networks Overview

What is a Neural Network?
[Figure: a logistic regression unit with inputs x1, x2, x3 and output ŷ, and a network with one hidden layer built by stacking such units]

Logistic regression computes
$z = w^T x + b$, $a = \sigma(z)$, $\mathcal{L}(a, y)$
with parameters $w$ and $b$.

A neural network with one hidden layer stacks this computation twice:
$z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, $a^{[2]} = \sigma(z^{[2]})$, $\mathcal{L}(a^{[2]}, y)$
with parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$.
One hidden layer Neural Network
Neural Network Representation

Neural Network Representation
[Figure: a 2-layer network with input layer x1, x2, x3, one hidden layer, and an output layer producing ŷ]
One hidden layer Neural Network
Computing a Neural Network’s Output

Neural Network Representation
[Figure: a single neuron takes inputs x1, x2, x3 and computes $z = w^T x + b$ followed by $a = \sigma(z) = \hat{y}$]

Each node performs two steps:
$z = w^T x + b$
$a = \sigma(z)$
Every node in the hidden layer repeats this same two-step computation on the inputs x1, x2, x3.
Neural Network Representation
[Figure: network with inputs x1, x2, x3, four hidden units $a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4$, and output ŷ]
Neural Network Representation
[Figure: the same network with four hidden units]

Given input x:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
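A minimal NumPy sketch of this forward pass for a single example, assuming a network with 3 inputs, 4 hidden units, and 1 output; the helper names (`sigmoid`, `forward_single`) are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_single(x, W1, b1, W2, b2):
    """One-hidden-layer forward pass for a single example x of shape (n_x, 1)."""
    z1 = W1 @ x + b1      # shape (n_h, 1)
    a1 = sigmoid(z1)      # hidden activations
    z2 = W2 @ a1 + b2     # shape (1, 1)
    a2 = sigmoid(z2)      # output, i.e. y-hat
    return a2

# Illustrative shapes: n_x = 3 inputs, n_h = 4 hidden units
rng = np.random.default_rng(0)
x  = rng.normal(size=(3, 1))
W1 = rng.normal(size=(4, 3)) * 0.01
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)) * 0.01
b2 = np.zeros((1, 1))
print(forward_single(x, W1, b1, W2, b2))
```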
One hidden layer Neural Network
Vectorizing across multiple examples

Vectorizing across multiple examples
For a single example x:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
Vectorizing across multiple examples
for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$
One hidden layer Neural Network
Explanation for vectorized implementation

Justification for vectorized implementation
[Derivation worked on the slide: stacking the columns $x^{(i)}$ into a matrix X shows that $W^{[1]} X + b^{[1]}$ computes all of the $z^{[1](i)}$ at once]
Recap of vectorizing across multiple examples
Loop version, one example at a time:
for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Stacking the training examples as columns, $X = [x^{(1)}\; x^{(2)}\; \dots\; x^{(m)}]$ and $A^{[1]} = [a^{[1](1)}\; a^{[1](2)}\; \dots\; a^{[1](m)}]$, the loop becomes:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$
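A hedged NumPy sketch of the vectorized forward pass over an m-column data matrix X; the shapes and names below are illustrative assumptions:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_vectorized(X, W1, b1, W2, b2):
    """Forward pass for all m examples at once; X has shape (n_x, m)."""
    Z1 = W1 @ X + b1      # (n_h, m); b1 broadcasts across the m columns
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2     # (1, m)
    A2 = sigmoid(Z2)
    return A2

# Illustrative shapes: n_x = 3, n_h = 4, m = 5 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
W1, b1 = rng.normal(size=(4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))
print(forward_vectorized(X, W1, b1, W2, b2).shape)  # (1, 5)
```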
One hidden layer Neural Network
Activation functions

Activation functions
Given x:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
Pros and cons of activation functions
[Figure: plots of the sigmoid, tanh, ReLU, and Leaky ReLU activation functions]

sigmoid: $a = \dfrac{1}{1 + e^{-z}}$
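A small NumPy sketch of these four activation functions; the 0.01 slope for Leaky ReLU follows the common default and should be treated as an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)
```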
One hidden layer Neural Network
Why do you need non-linear activation functions?

Activation function
Given x:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = g^{[2]}(z^{[2]})$
If the activations $g$ are linear (the identity), the composition collapses to a single linear function of x, so the hidden layer adds no expressive power; this is why non-linear activation functions are needed.
One hidden layer Neural Network
Derivatives of activation functions

Sigmoid activation function
$g(z) = \dfrac{1}{1 + e^{-z}}$, with derivative $g'(z) = g(z)\,(1 - g(z))$

Tanh activation function
$g(z) = \tanh(z)$, with derivative $g'(z) = 1 - \tanh^2(z)$

ReLU and Leaky ReLU
ReLU: $g(z) = \max(0, z)$, with $g'(z) = 0$ for $z < 0$ and $1$ for $z > 0$
Leaky ReLU: $g(z) = \max(0.01z, z)$, with $g'(z) = 0.01$ for $z < 0$ and $1$ for $z > 0$
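A hedged NumPy sketch of the corresponding derivatives; taking the ReLU derivative to be 1 at z = 0 is a convention, not from the slides:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu_prime(z):
    return (z >= 0).astype(float)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)
```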
One hidden layer Neural Network
Gradient descent for neural networks

Gradient descent for neural networks
Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$. Each iteration of gradient descent computes the predictions, computes the gradients $dW^{[1]}, db^{[1]}, dW^{[2]}, db^{[2]}$ of the cost, and updates every parameter, e.g. $W^{[1]} := W^{[1]} - \alpha\, dW^{[1]}$.

Formulas for computing derivatives
The derivative formulas are collected in the summary of gradient descent below.
One hidden layer Neural Network
Backpropagation intuition (Optional)

Computing gradients
Logistic regression
[Figure: computation graph for logistic regression with inputs x, w, b, worked backwards to obtain the gradients]

Neural network gradients
[Figure: computation graph for the one-hidden-layer network with parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$, worked backwards layer by layer]
Summary of gradient descent
$dz^{[2]} = a^{[2]} - y$
$dW^{[2]} = dz^{[2]}\, a^{[1]T}$
$db^{[2]} = dz^{[2]}$
$dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})$   (element-wise product)
$dW^{[1]} = dz^{[1]}\, x^T$
$db^{[1]} = dz^{[1]}$
Summary of gradient descent, vectorized over m examples
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$
$db^{[2]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[2]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[1]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$
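A hedged NumPy sketch of one gradient-descent step for the one-hidden-layer network, using the vectorized formulas above with a sigmoid hidden layer for concreteness; the learning rate and the params-dictionary layout are assumptions:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradient_step(X, Y, params, lr=0.01):
    """One forward + backward pass and parameter update over m examples."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward pass
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backward pass (binary cross-entropy loss, sigmoid activations)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))   # g'(Z1) for a sigmoid hidden layer
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    params["W1"] -= lr * dW1
    params["b1"] -= lr * db1
    params["W2"] -= lr * dW2
    params["b2"] -= lr * db2
    return params
```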
One hidden layer Neural Network
Random Initialization

What happens if you initialize weights to zero?
[Figure: network with two inputs x1, x2, two hidden units $a^{[1]}_1, a^{[1]}_2$, and output $a^{[2]}_1 = \hat{y}$]
If all weights start at zero, both hidden units compute the same function and receive identical gradients, so they remain identical after every update; this symmetry means the hidden layer never learns distinct features.
Random initialization
[Figure: the same two-hidden-unit network]
Initialize the weights to small random values (e.g. Gaussian noise scaled by 0.01) to break the symmetry; the biases can safely be initialized to zero.
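A minimal sketch of random initialization along these lines, assuming NumPy and a 0.01 scale factor; the helper name `initialize_parameters` is illustrative:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=1):
    """Small random weights break symmetry; zero biases are fine."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=2, n_h=2, n_y=1)
```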
Multi-class classification
Softmax regression

Recognizing cats, dogs, and baby chicks
[Figure: example images labeled with one of four classes (3 1 2 0 3 2 0 1); the network maps an input X to a vector ŷ of class probabilities]
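The softmax slides themselves did not survive extraction; as a hedged sketch, the softmax activation used for multi-class output layers can be computed as follows (the max-subtraction is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m): each column sums to 1."""
    Z_shift = Z - np.max(Z, axis=0, keepdims=True)  # numerical stability
    expZ = np.exp(Z_shift)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # C = 4 classes, m = 1 example
print(softmax(Z).ravel())                    # probabilities summing to 1
```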
Deep Neural Networks
Deep L-layer neural network

[Figure: networks of increasing depth, from logistic regression and one hidden layer up to deeper networks; $L$ denotes the number of layers, $n^{[l]}$ the number of units in layer $l$, and $a^{[l]}$ the activations of layer $l$]
Deep Neural Networks
Getting your matrix dimensions right

Parameters $W^{[l]}$ and $b^{[l]}$
[Figure: a small network used to check dimensions]
For a single example, $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ and $b^{[l]}$ has shape $(n^{[l]}, 1)$.

Vectorized implementation
[Figure: the same network with m examples]
With m examples, $Z^{[l]}$ and $A^{[l]}$ have shape $(n^{[l]}, m)$, while the shapes of $W^{[l]}$ and $b^{[l]}$ are unchanged ($b^{[l]}$ is broadcast across the m columns).
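A small sketch that checks these dimension rules for an assumed list of layer sizes; the `layer_dims` values are illustrative:

```python
import numpy as np

layer_dims = [3, 5, 4, 1]          # n[0]=3 inputs, two hidden layers, one output
m = 10                             # number of examples

params = {}
for l in range(1, len(layer_dims)):
    params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    # W[l]: (n[l], n[l-1]),  b[l]: (n[l], 1)
    assert params[f"W{l}"].shape == (layer_dims[l], layer_dims[l - 1])
    assert params[f"b{l}"].shape == (layer_dims[l], 1)

# With m examples, Z[l] and A[l] have shape (n[l], m)
A = np.random.randn(layer_dims[0], m)
for l in range(1, len(layer_dims)):
    Z = params[f"W{l}"] @ A + params[f"b{l}"]
    assert Z.shape == (layer_dims[l], m)
    A = np.maximum(0, Z)           # ReLU, just for the sketch
```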
Deep Neural Networks
Why deep representations?

Intuition about deep representation
[Figure: a deep network computing increasingly complex features of the input layer by layer, ending in the prediction ŷ]
Circuit theory and deep learning
Informally: there are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
Deep Neural Networks
Building blocks of deep neural networks

Forward and backward functions
[Figure: for layer $l$, the forward function takes $a^{[l-1]}$, produces $a^{[l]}$, and caches $z^{[l]}$; the backward function takes $da^{[l]}$ and the cache and produces $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$]
[Figure: chaining these forward and backward blocks for layers 1 through L gives one iteration of training]
Deep Neural Networks
Forward and backward propagation

Forward propagation for layer l
Input $a^{[l-1]}$; output $a^{[l]}$, cache $(z^{[l]})$:
$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
$a^{[l]} = g^{[l]}(z^{[l]})$

Backward propagation for layer l
Input $da^{[l]}$; output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$:
$dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$
$dW^{[l]} = dz^{[l]}\, a^{[l-1]T}$
$db^{[l]} = dz^{[l]}$
$da^{[l-1]} = W^{[l]T} dz^{[l]}$

Summary
[Figure: the chain of forward blocks for layers 1..L followed by the chain of backward blocks, connected by the caches]
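A hedged sketch of these per-layer building blocks in NumPy, assuming ReLU hidden layers and averaging over the m examples as in the vectorized formulas; the function names are illustrative:

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation="relu"):
    """Forward step for one layer: returns A[l] and a cache for backprop."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z) if activation == "relu" else 1 / (1 + np.exp(-Z))
    cache = (A_prev, W, b, Z)
    return A, cache

def linear_activation_backward(dA, cache, activation="relu"):
    """Backward step for one layer: returns dA[l-1], dW[l], db[l]."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    if activation == "relu":
        dZ = dA * (Z > 0)
    else:                                   # sigmoid
        s = 1 / (1 + np.exp(-Z))
        dZ = dA * s * (1 - s)
    dW = (dZ @ A_prev.T) / m                # averaged over the m examples
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db
```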
Deep Neural Networks
Parameters vs Hyperparameters

What are hyperparameters?
Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$
Hyperparameters: the learning rate $\alpha$, the number of iterations, the number of hidden layers L, the number of hidden units $n^{[l]}$, the choice of activation function, and so on; these are settings that control how the parameters are learned.

Applied deep learning is a very empirical process
The cycle Idea → Code → Experiment is repeated many times.
[Figure: cost versus # of iterations for different hyperparameter choices]
Deep Neural Networks
What does this have to do with the brain?

Forward and backward propagation
Forward:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]})$
$\dots$
$A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$

Backward:
$dZ^{[L]} = A^{[L]} - Y$
$dW^{[L]} = \frac{1}{m}\, dZ^{[L]} A^{[L-1]T}$
$db^{[L]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[L]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$
$dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g^{[L-1]\prime}(Z^{[L-1]})$
$\dots$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m}\, \mathrm{np.sum}(dZ^{[1]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$
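A hedged sketch of how the forward and backward passes chain across all L layers, reusing the per-layer helpers sketched earlier (assumed names; ReLU hidden layers with a sigmoid output, binary cross-entropy loss):

```python
import numpy as np

def model_forward(X, params, L):
    """Run forward propagation through layers 1..L and keep the caches."""
    caches = []
    A = X
    for l in range(1, L):
        A, cache = linear_activation_forward(A, params[f"W{l}"], params[f"b{l}"], "relu")
        caches.append(cache)
    AL, cache = linear_activation_forward(A, params[f"W{L}"], params[f"b{L}"], "sigmoid")
    caches.append(cache)
    return AL, caches

def model_backward(AL, Y, caches):
    """Run backward propagation through layers L..1 and collect the gradients."""
    grads = {}
    L = len(caches)
    # Derivative of the binary cross-entropy loss with respect to AL
    dA = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dA, grads[f"dW{L}"], grads[f"db{L}"] = linear_activation_backward(dA, caches[L - 1], "sigmoid")
    for l in reversed(range(1, L)):
        dA, grads[f"dW{l}"], grads[f"db{l}"] = linear_activation_backward(dA, caches[l - 1], "relu")
    return grads
```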
Setting up your ML application
Train/dev/test sets

Applied ML is a highly iterative process
Choices to make: # layers, # hidden units, learning rates, activation functions, ...
The cycle Idea → Code → Experiment is repeated many times.
Train/dev/test sets
Split the data into a training set (used to fit the parameters), a dev (hold-out / cross-validation) set (used to compare models and tune hyperparameters), and a test set (used for an unbiased estimate of final performance). With very large datasets, the dev and test sets can be a much smaller fraction of the data than the traditional 60/20/20 split.
Mismatched train/test distribution
Training set: cat pictures from webpages. Dev/test sets: cat pictures from users using your app.
Rule of thumb: make sure the dev and test sets come from the same distribution.
Not having a test set might be okay (only a dev set).
Setting up your ML application
Bias/Variance

Bias and Variance
[Figure: three classifiers fit to 2-D data: an underfit boundary (high bias), a “just right” boundary, and an overfit boundary (high variance)]
Bias and Variance
Cat classification: compare the train set error with the dev set error. A low train error but a much higher dev error indicates high variance; a high train error (relative to human or Bayes error) indicates high bias; both can occur at the same time.

High bias and high variance
[Figure: a decision boundary in the (x1, x2) plane that is mostly too simple for the data yet also bends to fit a few individual points, showing high bias and high variance at once]
Setting up your ML application
Basic “recipe” for machine learning

Basic “recipe” for machine learning
High bias (poor performance on the training set)? Try a bigger network, train longer, or try a different architecture. High variance (poor performance on the dev set)? Get more data, add regularization, or try a different architecture. Iterate until both are acceptable.
Regularizing your neural network
Regularization

Logistic regression
$\min_{w,b} J(w, b)$ with L2 regularization:
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w \rVert_2^2$

Neural network
$J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2$
where $\lVert \cdot \rVert_F$ is the Frobenius norm; each gradient then gains an extra $\frac{\lambda}{m} W^{[l]}$ term, which is why L2 regularization is also called “weight decay”.
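A hedged sketch of adding L2 (Frobenius-norm) regularization to the cost and to the weight gradients, assuming the params/grads dictionary layout used in the earlier sketches:

```python
import numpy as np

def l2_cost_term(params, lambd, m, L):
    """Extra cost: (lambda / 2m) * sum of squared Frobenius norms of the W[l]."""
    return (lambd / (2 * m)) * sum(
        np.sum(np.square(params[f"W{l}"])) for l in range(1, L + 1))

def add_l2_to_grads(grads, params, lambd, m, L):
    """Weight decay: each dW[l] gains an extra (lambda / m) * W[l] term."""
    for l in range(1, L + 1):
        grads[f"dW{l}"] += (lambd / m) * params[f"W{l}"]
    return grads
```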
Regularizing your neural network
Why regularization reduces overfitting

How does regularization prevent overfitting?
[Figure: network with inputs x1, x2, x3 and output ŷ, shown against the high bias / “just right” / high variance fits]
A large $\lambda$ pushes the weights $W^{[l]}$ toward zero, reducing the influence of many hidden units and moving the model from the high-variance fit back toward a simpler, “just right” fit.

How does regularization prevent overfitting?
With tanh activations, small weights keep $z^{[l]}$ in the roughly linear region of the activation, so every layer behaves almost linearly and the network cannot fit an overly complicated decision boundary.
Regularizing your neural network
Dropout regularization

Dropout regularization
[Figure: the full network with inputs x1..x4 next to a thinned version in which some units have been removed]
For each training example, each unit is kept with probability keep_prob and dropped otherwise, so a different thinned network is trained on every example.

Implementing dropout (“Inverted dropout”)
For layer 3, for example: generate a random mask d3 whose entries are 1 with probability keep_prob, multiply a3 by d3 to zero out the dropped units, then divide a3 by keep_prob so the expected value of the activations is unchanged.

Making predictions at test time
Do not apply dropout at test time; because of the inverted-dropout scaling during training, no extra scaling is needed when making predictions.
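A hedged NumPy sketch of inverted dropout applied to one layer's activations during training; keep_prob = 0.8 and the helper name are illustrative:

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8, rng=None):
    """Apply inverted dropout to activations A during training only."""
    rng = np.random.default_rng() if rng is None else rng
    D = rng.random(A.shape) < keep_prob   # keep each unit with prob keep_prob
    A = A * D                             # zero out the dropped units
    A = A / keep_prob                     # scale up so E[A] is unchanged
    return A, D                           # D is reused in the backward pass

A3 = np.random.randn(4, 5)
A3_dropped, D3 = inverted_dropout(A3, keep_prob=0.8)
```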
Regularizing your neural network
Understanding dropout

Why does drop-out work?
Intuition: a unit can’t rely on any one feature, since that feature may be dropped, so it has to spread out its weights; spreading the weights shrinks their squared norm, giving an effect similar to L2 regularization.
[Figure: network with inputs x1, x2, x3 and output ŷ]
Regularizing your neural network
Other regularization methods

Data augmentation
[Figure: augmenting a cat image by flipping and random cropping, and augmenting a handwritten “4” with small rotations and distortions, to create extra training examples]

Early stopping
[Figure: training error keeps decreasing with # iterations while dev set error eventually starts rising; stop training around the point where the dev error is lowest]
Setting up your optimization problem
Normalizing inputs

Normalizing training sets
Subtract the mean, $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, and divide by the per-feature standard deviation: $x := (x - \mu)/\sigma$. Use the same $\mu$ and $\sigma$, computed on the training set, to normalize the test set.

Why normalize inputs?
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
[Figure: contours of J over (w, b): unnormalized inputs give elongated contours that force a small learning rate and a zig-zag descent path; normalized inputs give round contours that gradient descent can cross directly]
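A minimal sketch of input normalization, fitting μ and σ on the training set and reusing them for other data; the variable and helper names are illustrative:

```python
import numpy as np

def fit_normalizer(X_train):
    """X_train has shape (n_x, m); statistics are computed per feature."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    return (X - mu) / sigma

X_train = np.random.randn(3, 100) * 5 + 2
mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
# The same mu and sigma must be applied to the dev and test data.
```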
Setting up your optimization problem
Vanishing/exploding gradients

Vanishing/exploding gradients
In a very deep network, $\hat{y}$ is essentially a product of many weight matrices; weights slightly larger than the identity make the activations and gradients grow exponentially with depth, and weights slightly smaller make them shrink exponentially, which makes training very difficult.

Single neuron example
$z = w_1 x_1 + \dots + w_n x_n$, $a = g(z)$. The more inputs n a neuron has, the smaller each $w_i$ should be; a common choice is to scale the random initialization by $\sqrt{1/n^{[l-1]}}$ (or $\sqrt{2/n^{[l-1]}}$ for ReLU, the “He” initialization), which partially mitigates vanishing and exploding gradients.
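A hedged sketch of variance-scaled initialization along these lines; the factor of 2 follows the He initialization for ReLU, and the helper name is illustrative:

```python
import numpy as np

def he_initialize(layer_dims, seed=0):
    """Scale each W[l] by sqrt(2 / n[l-1]) to keep activations from exploding or vanishing."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])) * np.sqrt(2.0 / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = he_initialize([3, 5, 4, 1])
```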
Setting up your optimization problem
Numerical approximation of gradients

Checking your derivative computation
[Figure: a function f(θ) with a two-sided difference taken around θ]
The two-sided difference $\dfrac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$ approximates $f'(\theta)$ with error $O(\varepsilon^2)$, which is much more accurate than the $O(\varepsilon)$ error of the one-sided difference, so it is the estimate used for gradient checking.
Setting up your optimization problem
Gradient Checking

Gradient check for a neural network
Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape them into a big vector $\theta$.
Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape them into a big vector $d\theta$.

Gradient checking (Grad check)
For each i, compute $d\theta_{\text{approx}}[i] = \dfrac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$, then check that $\dfrac{\lVert d\theta_{\text{approx}} - d\theta \rVert_2}{\lVert d\theta_{\text{approx}} \rVert_2 + \lVert d\theta \rVert_2}$ is small (on the order of $10^{-7}$ when $\varepsilon = 10^{-7}$).
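A hedged sketch of this check for a generic cost function J(theta) and an analytically computed gradient supplied by the caller; the epsilon value follows the note above, and the function name is illustrative:

```python
import numpy as np

def grad_check(J, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    diff = np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))
    return diff  # roughly 1e-7 or smaller suggests the gradient is correct

# Tiny usage example: J(theta) = sum(theta^2), so dJ/dtheta = 2*theta
theta = np.array([1.0, -2.0, 0.5])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
```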
Setting up your optimization problem
Gradient Checking implementation notes

Gradient checking implementation notes
- Don’t use it in training; use it only to debug.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Remember regularization: include the regularization term in J when checking.
- It doesn’t work with dropout (turn dropout off while checking).
- Run it at random initialization, and perhaps again after some training.
