
Deep Learning Book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

This document summarizes key aspects of deep feedforward neural networks discussed in Chapter 6 of the Deep Learning book. It first compares linear classifiers, shallow neural networks with one hidden layer, and deep neural networks, noting that deep networks can represent functions using relatively small hidden layers compared to shallow networks. It then discusses hyperparameters like depth, hidden layer sizes, and activation functions that define a network architecture. Finally, it discusses training parameters like weights and biases that must be optimized to learn functions from data.


Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville


Chapter 6: Deep Feedforward Networks

Benoit Massé, Dionyssos Kounades-Bastian



Linear regression (and classification)

Input vector x
Output vector y
Parameters: weight W and bias b

Prediction: y = Wᵀx + b

[Diagram: the inputs x1, x2, x3 are connected to the outputs y1, y2 through the linear map u = Wᵀx + b, with weights W11, ..., W23 and biases b1, b2.]
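As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of the prediction y = Wᵀx + b with three inputs and two outputs; the particular values are arbitrary.

```python
import numpy as np

# Linear prediction y = W^T x + b, with 3 inputs and 2 outputs.
W = np.array([[ 0.2, -0.1],
              [ 0.5,  0.3],
              [-0.4,  0.7]])   # shape (3, 2): one column per output
b = np.array([0.1, -0.2])      # one bias per output

x = np.array([1.0, 2.0, 3.0])
y = W.T @ x + b                # shape (2,)
print(y)
```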
Linear regression (and classification)

Advantages
Easy to use
Easy to train, low risk of overfitting

Drawbacks
Some problems are inherently non-linear


Solving XOR

Linear regressor

[Diagram: inputs x1, x2 connected to the output y through the linear map (W, b).]

There is no value of W and b such that, for all (x1, x2) ∈ {0, 1}²,

    Wᵀ(x1, x2)ᵀ + b = xor(x1, x2)


Solving XOR

What about...?

[Diagram: inputs x1, x2 connected to hidden units u1, u2 through (W, b), then to the output y through (V, c).]

Strictly equivalent: the composition of two linear operations is still a linear operation.


Solving XOR

And about...?

[Diagram: inputs x1, x2 connected to units u1, u2 through (W, b), then h1 = φ(u1), h2 = φ(u2), then to the output y through (V, c).]

in which φ(x) = max{0, x}

It is possible! With

    W = [[1, 1], [1, 1]],  b = [0, −1]ᵀ,  V = [1, −2],  c = 0,

we get

    Vφ(Wx + b) = xor(x1, x2)
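A minimal NumPy sketch (not from the slides) checking that these particular weights indeed compute XOR with a ReLU hidden layer:

```python
import numpy as np

W = np.array([[1, 1],
              [1, 1]])       # hidden-layer weights
b = np.array([0, -1])        # hidden-layer biases
V = np.array([1, -2])        # output weights
c = 0                        # output bias

def relu(u):
    return np.maximum(0, u)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2])
    y = V @ relu(W @ x + b) + c
    print(f"xor({x1}, {x2}) = {y}")   # prints 0, 1, 1, 0
```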
Neural network with one hidden layer

Compact representation
[Diagram: x → h through (W, b, φ), then h → y through (V, c).]

Neural network
Hidden layer with non-linearity
→ can represent a broader class of functions


Universal approximation theorem

Theorem
A neural network with one hidden layer can approximate any continuous function.

More formally, given a continuous function f : Cn → Rᵐ, where Cn is a compact subset of Rⁿ,

    ∀ε > 0, ∃ f_NN^ε : x ↦ Σ_{i=1}^{K} v_i φ(w_iᵀx + b_i) + c

such that

    ∀x ∈ Cn, ‖f(x) − f_NN^ε(x)‖ < ε
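A minimal NumPy sketch (not from the slides) illustrating this form: one hidden ReLU layer approximating sin on [−π, π]. To keep it short, the hidden weights are drawn at random and only the output weights (v, c) are fitted by least squares; the unit count K = 200 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 200                                    # number of hidden units (arbitrary)
x = np.linspace(-np.pi, np.pi, 400)
f = np.sin(x)                              # target continuous function on a compact set

w = rng.normal(size=K)                     # random hidden weights w_i
b = rng.uniform(-np.pi, np.pi, size=K)     # random hidden biases b_i
H = np.maximum(0.0, np.outer(x, w) + b)    # hidden activations phi(w_i * x + b_i), shape (400, K)

# Fit only the output weights v and bias c by least squares.
A = np.hstack([H, np.ones((x.size, 1))])
coef, *_ = np.linalg.lstsq(A, f, rcond=None)
v, c = coef[:-1], coef[-1]

f_nn = H @ v + c
print("max |f - f_nn| on the grid:", np.max(np.abs(f - f_nn)))
```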




Problems

Obtaining the network
The universal approximation theorem gives no information about HOW to obtain such a network:
the size of the hidden layer h,
the values of W and b.

Using the network
Even if we find a way to obtain the network, the size of the hidden layer may be prohibitively large.


Deep neural network

Why deep?
Let's stack l hidden layers one after the other; l is called the depth of the network.

[Diagram: x → h1 through (W1, b1, φ), → ... → hl through (Wl, bl, φ), then → y through (V, c).]

Properties of DNNs
The universal approximation theorem also applies.
Some functions can be approximated by a DNN with N hidden units, but would require O(e^N) hidden units to be represented by a shallow network.
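A minimal NumPy sketch (not from the slides) of the forward pass of such a deep feedforward network; the layer sizes and random weights below are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, hidden_params, V, c, phi=relu):
    """hidden_params is a list [(W1, b1), ..., (Wl, bl)]; returns y = V @ h_l + c."""
    h = x
    for W, b in hidden_params:
        h = phi(W @ h + b)       # one hidden layer: affine map followed by phi
    return V @ h + c             # linear output layer

# Tiny example: 3 inputs, two hidden layers of sizes 4 and 5, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 5]
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
V, c = rng.normal(size=(2, sizes[-1])), rng.normal(size=2)
print(forward(rng.normal(size=3), params, V, c))
```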




Summary

Comparison

Linear classifier
− Limited representational power
+ Simple

Shallow neural network (exactly one hidden layer)
+ Unlimited representational power
− Sometimes prohibitively wide

Deep neural network
+ Unlimited representational power
+ Relatively small number of hidden units needed

Remaining problem
How to get this DNN?




On the path to getting my own DNN

Hyperparameters
First, we need to define the architecture of the DNN:
the depth l,
the sizes of the hidden layers n1, ..., nl,
the activation function φ,
the output unit.

Parameters
Once the architecture is defined, we need to train the DNN:
W1, b1, ..., Wl, bl






Hyperparameters

The depth l and the sizes of the hidden layers n1, ..., nl
strongly depend on the problem to solve.

The activation function φ
ReLU: g : x ↦ max{0, x}
Sigmoid: σ : x ↦ 1 / (1 + e⁻ˣ)
Many others: tanh, RBF, softplus...

The output unit
Linear output E[y] = Vᵀhl + c,
for regression with a Gaussian distribution y ∼ N(E[y], I).
Sigmoid output ŷ = σ(wᵀhl + b),
for classification with a Bernoulli distribution P(y = 1 | x) = ŷ.
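A minimal NumPy sketch (not from the slides) of the activation functions and output units listed above, written as plain functions for illustration:

```python
import numpy as np

def relu(x):               # g(x) = max{0, x}
    return np.maximum(0.0, x)

def sigmoid(x):            # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Output units applied to the last hidden layer h_l:
def linear_output(V, h_l, c):      # E[y] = V^T h_l + c   (regression)
    return V @ h_l + c

def sigmoid_output(w, h_l, b):     # P(y = 1 | x) = sigmoid(w^T h_l + b)   (binary classification)
    return sigmoid(w @ h_l + b)
```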


Parameters Training

Objective
Let's define θ = (W1, b1, ..., Wl, bl).
We suppose we have a set of inputs X = (x1, ..., xN) and a set of expected outputs Y = (y1, ..., yN). The goal is to find a neural network fNN such that

    ∀i, fNN(xi, θ) ≈ yi.


Parameters Training

Cost function
To evaluate the error that our current network makes, let's define a cost function L(X, Y, θ). The goal becomes to find

    argmin_θ L(X, Y, θ)

Loss function
Should represent a combination of the distances between every yi and the corresponding fNN(xi, θ):
mean squared error (rare),
cross-entropy.
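A minimal NumPy sketch (not from the slides) of the two loss functions mentioned above, for a batch of predictions y_hat and targets y; the eps clipping is an added safeguard against log(0):

```python
import numpy as np

def mean_squared_error(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # y in {0, 1}, y_hat = predicted P(y = 1 | x)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```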


Parameters Training

Find the minimum
The basic idea consists in computing θ̂ such that ∇θ L(X, Y, θ̂) = 0.
This is difficult to solve analytically, e.g. when θ has millions of degrees of freedom.


Parameters Training

Gradient descent
Let's use a numerical way to optimize θ, called gradient descent (section 4.3). The idea is that

    f(θ − εu) ≈ f(θ) − ε uᵀ∇f(θ).

So if we take u = ∇f(θ), we have uᵀu > 0 and then

    f(θ − εu) ≈ f(θ) − ε uᵀu < f(θ).

If f is a function to minimize, we have an update rule that improves our estimate.




Parameters Training

Gradient descent algorithm
1. Have an estimate θ̂ of the parameters.
2. Compute ∇θ L(X, Y, θ̂).
3. Update θ̂ ← θ̂ − ε ∇θ L.
4. Repeat steps 2-3 until ‖∇θ L‖ < threshold.

Problem
How to efficiently estimate ∇θ L(X, Y, θ̂)?
→ Back-propagation algorithm
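A minimal sketch (not from the slides) of this loop; grad_L stands in for whatever routine returns the gradient of the loss at θ, and the learning rate, threshold and iteration cap are arbitrary choices:

```python
import numpy as np

def gradient_descent(theta, grad_L, lr=1e-2, threshold=1e-6, max_iter=100_000):
    """Iterate theta <- theta - lr * gradient until the gradient norm is small."""
    for _ in range(max_iter):
        g = grad_L(theta)                   # step 2: compute the gradient
        theta = theta - lr * g              # step 3: update the estimate
        if np.linalg.norm(g) < threshold:   # step 4: stop when the gradient is small
            break
    return theta

# Toy usage: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(np.array([3.0, -2.0]), lambda t: 2.0 * t))
```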


Back-propagation for Parameter Learning

Consider the architecture:

[Diagram: a scalar input x passes through weight w1 and activation φ, then weight w2 and activation φ, producing the output y.]

with function

    y = φ(w2 φ(w1 x)),

some training pairs T = {(x̂n, ŷn)}, n = 1, ..., N, and an activation function φ(·).

Learn w1, w2 so that feeding x̂n yields ŷn.


Prerequisite: differentiable activation function

For learning to be possible, φ(·) has to be differentiable.
Let φ′(x) = ∂φ(x)/∂x denote the derivative of φ(x).
For example, when φ(x) = ReLU(x), we have φ′(x) = 0 for x < 0 and φ′(x) = 1 for x > 0.


Gradient-based Learning

Minimize the loss function L(w1, w2, T).

We will learn the weights by iterating:

    (w1, w2)_updated = (w1, w2) − γ (∂L/∂w1, ∂L/∂w2),        (1)

L is the loss function (must be differentiable): in detail it is L(w1, w2, T), and we want to compute the gradient(s) at w1, w2.

γ is the learning rate (a scalar, typically known).


Back-propagation

Calculate intermediate values on all units:
1. a = w1 x̂n
2. b = φ(w1 x̂n) = φ(a)
3. c = w2 φ(w1 x̂n) = w2 b
4. d = φ(w2 φ(w1 x̂n)) = φ(c)
5. L(d) = L(φ(w2 φ(w1 x̂n)))

The partial derivatives are:
6. ∂L(d)/∂d = L′(d)
7. ∂d/∂c = φ′(w2 φ(w1 x̂n))
8. ∂c/∂b = w2
9. ∂b/∂a = φ′(w1 x̂n)




Calculating the Gradients I

Apply the chain rule:

    ∂L/∂w1 = (∂L/∂d) (∂d/∂c) (∂c/∂b) (∂b/∂a) (∂a/∂w1),

    ∂L(d)/∂w2 = (∂L(d)/∂d) (∂d/∂c) (∂c/∂w2).

Start the calculation from left to right.

We propagate the gradients (partial products) from the last layer towards the input.


Calculating the Gradients

And because we have N training pairs:

    ∂L/∂w1 = Σ_{n=1}^{N} (∂L(dn)/∂dn) (∂dn/∂cn) (∂cn/∂bn) (∂cn/∂bn is read as the nth pair's ∂c/∂b) (∂bn/∂an) (∂an/∂w1),

    ∂L/∂w2 = Σ_{n=1}^{N} (∂L(dn)/∂dn) (∂dn/∂cn) (∂cn/∂w2).
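A minimal NumPy sketch (not from the slides) of the forward and backward passes for this two-weight scalar network, accumulating the gradients over the N training pairs. A squared loss L(d) = (d − ŷn)² and the toy data at the bottom are assumptions added for illustration:

```python
import numpy as np

def phi(z):                         # ReLU activation
    return np.maximum(0.0, z)

def dphi(z):                        # its derivative: 0 for z < 0, 1 for z > 0
    return np.where(z > 0, 1.0, 0.0)

def gradients(w1, w2, xs, ys):
    """Accumulate dL/dw1 and dL/dw2 over the N training pairs."""
    g1 = g2 = 0.0
    for x_n, y_n in zip(xs, ys):
        # Forward pass: intermediate values on all units.
        a = w1 * x_n
        b = phi(a)
        c = w2 * b
        d = phi(c)
        # Backward pass: chain rule, from the last layer towards the input.
        dL_dd = 2.0 * (d - y_n)     # L'(d) for the assumed squared loss
        dd_dc = dphi(c)
        dc_db = w2
        db_da = dphi(a)
        da_dw1 = x_n
        dc_dw2 = b
        g1 += dL_dd * dd_dc * dc_db * db_da * da_dw1
        g2 += dL_dd * dd_dc * dc_dw2
    return g1, g2

# Toy usage with made-up data.
xs, ys = np.array([0.5, 1.0, -0.5]), np.array([0.2, 0.4, 0.0])
print(gradients(0.3, 0.7, xs, ys))
```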


Thank you!
