CS7015 (Deep Learning) : Lecture 9
Mitesh M. Khapra
Module 9.1 : A quick recap of training deep neural
networks
[Figure: a single sigmoid neuron with input x, weight w and output y = f(x)]

We already saw how to train this network:

w = w − η∇w where,

∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

[Figure: a wider sigmoid neuron with inputs x1, x2, x3, weights w1, w2, w3 and output y = f(x)]

What about a wider network with more inputs:

w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3

where, ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
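To make the update rule concrete, here is a small NumPy sketch (an illustration, not code from the lecture) that applies w = w − η∇w to a single sigmoid neuron with a squared-error loss; the data point and learning rate are made-up values.

```python
import numpy as np

def f(x, w):                      # f(x) = σ(wx)
    return 1.0 / (1.0 + np.exp(-w * x))

x, y = 0.5, 1.0                   # one training example (illustrative)
w, eta = 0.1, 1.0                 # initial weight and learning rate (illustrative)

for _ in range(100):
    grad_w = (f(x, w) - y) * f(x, w) * (1 - f(x, w)) * x   # ∇w from the slide
    w = w - eta * grad_w                                   # w = w − η∇w

print(w, f(x, w))                 # f(x) moves towards y as training proceeds
```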
[Figure: a deep network with one neuron per layer: x = h0, pre-activations a1, a2, a3, hidden outputs h1, h2, output y, and weights w1, w2, w3]

What if we have a deeper network?

We can now calculate ∇w1 using chain rule:

∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1

          = ∂L(w)/∂y ∗ ............... ∗ h0

In general,

∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1

Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)

Here, ai = wi hi−1 ; hi = σ(ai) and a1 = w1 ∗ x = w1 ∗ h0
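The following NumPy sketch (with made-up values, not taken from the lecture) implements exactly this chain: ai = wi hi−1, hi = σ(ai), and the backward pass shows how each ∇wi picks up a factor of hi−1.

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

w = [0.5, -0.3, 0.8]             # w1, w2, w3 (illustrative)
x, y = 1.0, 0.0                  # one training example (illustrative)

# forward pass: a_i = w_i * h_{i-1}, h_i = sigma(a_i)
h = [x]                          # h[0] = h0 = x
for wi in w:
    h.append(sigma(wi * h[-1]))

# backward pass for L = 1/2 (h_last - y)^2
grad_a = (h[-1] - y) * h[-1] * (1 - h[-1])       # dL/da for the last layer
grads = [0.0] * len(w)
for i in reversed(range(len(w))):
    grads[i] = grad_a * h[i]                     # grad of w_i = (...) * h_{i-1}
    grad_a = grad_a * w[i] * h[i] * (1 - h[i])   # push the chain rule one layer down

print(grads)                     # each gradient carries the factor h_{i-1}
```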
[Figure: a deep and wide network with inputs x1, x2, x3, two hidden layers of sigmoid neurons and output y]

What happens if we have a network which is deep and wide?

How do you calculate ∇w2?

It will be given by chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
Things to remember
Training Neural Networks is a Game of Gradients (played using any of the
existing gradient based approaches that we discussed)
The gradient tells us the responsibility of a parameter towards the loss
The gradient w.r.t. a parameter is proportional to the input connected to that parameter (recall the “..... ∗ x” term or the “..... ∗ hi−1” term in the formula for ∇wi)
[Figure: a deep and wide network with inputs x1, x2, x3, sigmoid hidden layers and output y]

Backpropagation was made popular by Rumelhart et al. in 1986

However, when used for really deep networks it was not very successful

In fact, till 2006 it was very hard to train very deep networks

Typically, even after a large number of epochs the training did not converge
Module 9.2 : Unsupervised pre-training
What has changed now? How did Deep Learning become so popular despite
this problem with training large networks?
Well, until 2006 it wasn’t so popular
The field got revived after the seminal work of Hinton and Salakhutdinov in
2006
1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let’s look at the idea of unsupervised pre-training introduced in this paper ...
(note that in this paper they introduced the idea in the context of RBMs but we
will discuss it in the context of Autoencoders)
[Figure: the first two layers of the network viewed as an autoencoder: input x, hidden representation h1, reconstruction x̂]

Consider the deep neural network shown in this figure

Let us focus on the first two layers of the network (x and h1)

We will first train the weights between these two layers using an unsupervised objective

Note that we are trying to reconstruct the input (x) from the hidden representation (h1)

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂_ij − x_ij)²

We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x)
[Figure: the next pair of layers viewed as an autoencoder: h1, hidden representation h2, reconstruction ĥ1]

At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x

We now fix the weights in layer 1 and repeat the same process with layer 2

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (ĥ1_ij − h1_ij)²

At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1

We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer
[Figure: the full network with inputs x1, x2, x3, the pre-trained hidden layers and the newly added output layer]

After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective

min_θ (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective
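As a concrete (and deliberately simplified) picture of this procedure, here is a minimal PyTorch sketch of greedy layerwise pre-training followed by supervised fine-tuning. The layer sizes, learning rates, epoch counts and the regression-style output are illustrative assumptions, not taken from the lecture.

```python
import torch
import torch.nn as nn

m, n = 1000, 500
x = torch.randn(m, n)                    # unlabelled inputs (illustrative)
y = torch.randn(m, 1)                    # labels, used only during fine-tuning

def pretrain_layer(inputs, in_dim, hidden_dim, epochs=50, lr=0.01):
    """Train one layer to reconstruct its own input (the unsupervised objective)."""
    encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
    decoder = nn.Linear(hidden_dim, in_dim)              # discarded after pre-training
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(encoder(inputs))
        loss = ((recon - inputs) ** 2).mean()            # (1/m) ΣΣ (x̂_ij − x_ij)²
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder

enc1 = pretrain_layer(x, n, 256)                         # layer 1: reconstruct x
h1 = enc1(x).detach()                                    # fix layer 1, compute h1
enc2 = pretrain_layer(h1, 256, 64)                       # layer 2: reconstruct h1

# add the output layer and fine-tune the whole network with the supervised objective
model = nn.Sequential(enc1, enc2, nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(50):
    loss = ((y - model(x)) ** 2).mean()                  # (1/m) Σ (y_i − f(x_i))²
    opt.zero_grad(); loss.backward(); opt.step()
```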
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
Let’s see what these two questions mean and try to answer them based on some
(among many) existing studies1,2
1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
2 Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What is the optimization problem that we are trying to solve?
minimize L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²
Is it the case that in the absence of unsupervised pre-training we are not able
to drive L (θ) to 0 even for the training data (hence poor optimization) ?
Let us see this in more detail ...
The error surface of the supervised objective of a Deep Neural Network is highly non-convex, with many hills, plateaus and valleys

Given the large capacity of DNNs it is still easy to land in one of these 0-error regions

Indeed, Larochelle et al.1 show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training

However, if the capacity of the network is small, unsupervised pre-training helps
1 Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What does regularization do? It con-
strains the weights to certain regions
of the parameter space
L-1 regularization: constrains most
weights to be 0
L-2 regularization: prevents most
weights from taking large values
1 Image Source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
Unsupervised objective:

Ω(θ) = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x_ij − x̂_ij)²

We can think of this unsupervised objective as an additional constraint on the optimization problem

Supervised objective:

L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space

Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
Some other experiments have also
shown that pre-training is more ro-
bust to random initializations
One accepted hypothesis is that pre-
training leads to better weight ini-
tializations (so that the layers cap-
ture the internal characteristics of the
data)
1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
So what has happened since 2006-2009?
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Module 9.3 : Better activation functions
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Before we look at activation functions, let’s try to answer the following question:
“What makes Deep Neural Networks powerful ?”
[Figure: a deep network with one neuron per layer: h0 = x, pre-activations a1, a2, a3, hidden outputs h1, h2 and output y]

Consider this deep neural network

Imagine if we replace the sigmoid in each layer by a simple linear transformation

y = (w4 ∗ (w3 ∗ (w2 ∗ (w1 x))))

Then we will just learn y as a linear transformation of x

In other words we will be constrained to learning linear decision boundaries

We cannot learn arbitrary decision boundaries
In particular, a deep linear neural network cannot learn such (non-linear) boundaries

But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
Now let’s look at some non-linear activation functions that are typically used in
deep neural networks (Much of this material is taken from Andrej Karpathy’s
lecture notes 1 )
1 http://cs231n.github.io
Sigmoid

σ(x) = 1/(1 + e^(−x))

[Figure: plot of the sigmoid function]

As is obvious, the sigmoid function compresses all its inputs to the range [0,1]

Since we are always interested in gradients, let us find the gradient of this function:

∂σ(x)/∂x = σ(x)(1 − σ(x))

(you can easily derive it)

Let us see what happens if we use sigmoid in a deep network
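A quick NumPy check of this formula (illustrative, not from the lecture) also previews the problem discussed next: the gradient is at most 0.25 (at x = 0) and is essentially 0 once |x| is large.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)                      # sigma(x) * (1 - sigma(x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(x, sigmoid(x), d_sigmoid(x))      # gradient is 0.25 at x = 0 and ~0 for large |x|
```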
[Figure: a deep network with one neuron per layer: h0 = x, pre-activations a1, a2, a3, a4 and hidden outputs h1, h2, h3, h4]

While calculating ∇w2, at some point in the chain rule we will encounter

∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))    [where a3 = w3 h2 and h3 = σ(a3)]

What is the consequence of this?

To answer this question let us first understand the concept of a saturated neuron
[Figure: plot of y = σ(Σ_{i=1}^{4} w_i x_i), which flattens out towards 0 and 1]

A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0

What would the gradient be at saturation?

Well, it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons cause the gradient to vanish

Sigmoids are not zero centered

Why is this a problem?
Saturated neurons cause the gradient to vanish

Sigmoids are not zero centered (since the outputs hi of a sigmoid layer are always positive, the gradients ∇w1 and ∇w2 of the next layer always share the same sign)

And lastly, sigmoids are computationally expensive (because of exp(x))

[Figure: the (∇w1, ∇w2) plane; starting from this initial position, the only way to reach the optimum is by taking a zigzag path]
tanh(x)

[Figure: plot of f(x) = tanh(x)]

Compresses all its inputs to the range [-1,1]

Zero centered

What is the derivative of this function?

∂tanh(x)/∂x = (1 − tanh²(x))

The gradient still vanishes at saturation

Also computationally expensive
ReLU
f(x) = max(0, x)
f(x) = max(0, x + 1) − max(0, x − 1)
ReLU
Advantages of ReLU
Does not saturate in the positive re-
gion
Computationally efficient
In practice converges much faster
than sigmoid/tanh1
f(x) = max(0, x)
1 ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
In practice there is a caveat

Let's see what is the derivative of ReLU(x):

∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0

[Figure: a small network with inputs x1, x2, bias b, pre-activation a1, ReLU output h1, weight w3 and output y]
Now consider what happens if the bias b takes a large negative value, so that

w1 x1 + w2 x2 + b < 0    [if b << 0]

The output h1 = ReLU(a1) will be 0, and so will the gradient ∂h1/∂a1

The weights w1, w2 and b will not get updated [∵ there will be a zero term in the chain rule]

The neuron is now dead
[Figure: the same small ReLU network with inputs x1, x2, bias b and output y]

In practice a large fraction of ReLU units can die if the learning rate is set too high

It is advised to initialize the bias to a positive value (0.01)

Use other variants of ReLU (as we will soon see)
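The following NumPy sketch (all numbers made up) shows the dead-unit behaviour described above: once w1 x1 + w2 x2 + b < 0 for every input, the unit outputs 0 and the gradients of w1, w2 and b are 0, so they never recover.

```python
import numpy as np

relu = lambda a: np.maximum(0.0, a)
d_relu = lambda a: (a > 0).astype(float)

X = np.abs(np.random.randn(100, 2))       # positive inputs (illustrative)
w = np.array([0.5, 0.5])
b = -100.0                                # bias pushed to a large negative value

a1 = X @ w + b                            # a1 < 0 for every input
h1 = relu(a1)                             # all outputs are 0 -> the unit is dead

upstream = np.ones_like(a1)               # stand-in for the rest of the chain rule
grad_w = X.T @ (upstream * d_relu(a1))    # = [0, 0]
grad_b = np.sum(upstream * d_relu(a1))    # = 0
print(h1.sum(), grad_w, grad_b)           # nothing flows, nothing gets updated
```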
Leaky ReLU

f(x) = max(0.01x, x)

[Figure: plot of Leaky ReLU]

No saturation

Will not die (0.01x ensures that at least a small gradient will flow through)

Computationally efficient

Close to zero centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model

α will get updated during backpropagation
Exponential Linear Unit

f(x) = x            if x > 0
     = a(e^x − 1)   if x ≤ 0
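For reference, here are illustrative NumPy versions of these variants (the 0.01 slope and the particular α and a values are conventional defaults assumed here, not prescribed by the lecture).

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)                    # f(x) = max(0.01x, x)

def parametric_relu(x, alpha):
    return np.maximum(alpha * x, x)                    # f(x) = max(alpha*x, x); alpha is learned

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))   # f(x) = x if x > 0, else a(e^x - 1)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))
print(parametric_relu(x, alpha=0.1))
print(elu(x))
```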
Maxout Neuron
Things to Remember
Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU/Maxout/ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
Module 9.4 : Better initialization strategies
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
[Figure: a network with inputs x1, x2, a hidden layer with neurons h11, h12, h13 (pre-activations a11, a12, a13), a second hidden layer neuron h21 (pre-activation a21) and output y]

What happens if we initialize all weights to 0?

a11 = w11 x1 + w12 x2
a12 = w21 x1 + w22 x2
∴ a11 = a12 = 0
∴ h11 = h12

All neurons in layer 1 will get the same activation

Now what will happen during back propagation?

∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1

but h11 = h12 and a11 = a12

∴ ∇w11 = ∇w21

Hence both weights will get the same update and will remain equal
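A small PyTorch check of this symmetry argument (the architecture and data below are illustrative): with everything initialized to 0, the rows of the first-layer weight matrix receive identical gradients and therefore stay identical to each other no matter how long we train.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1), nn.Sigmoid())
for p in net.parameters():
    nn.init.zeros_(p)                        # initialize all weights and biases to 0

x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
opt = torch.optim.SGD(net.parameters(), lr=0.5)

for _ in range(100):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(net[0].weight)                         # the three rows are still identical to each other
```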
We will now consider a feedforward network with:

input: 1000 points, each ∈ R^500

input data is drawn from a unit Gaussian

[Figure: the standard normal density from which the inputs are drawn]

Suppose we first initialize the weights to small random numbers
What will happen during back
propagation?
Recall that ∇w1 is proportional to
the activation passing through it
If all the activations in a layer are
very close to 0, what will happen to
the gradient of the weights connect-
ing this layer to the next layer?
They will all be close to 0 (vanishing
gradient problem)
Let us try to initialize the weights to
large random numbers
Most activations have saturated
What happens to the gradients at sat-
uration?
They will all be close to 0 (vanishing
gradient problem)
Let us try to arrive at a more principled way of initializing weights

[Figure: one layer with inputs x1, x2, x3, ..., xn and pre-activations s11, ..., s1n]

s11 = Σ_{i=1}^{n} w1i xi

Var(s11) = Var(Σ_{i=1}^{n} w1i xi) = Σ_{i=1}^{n} Var(w1i xi)

         = Σ_{i=1}^{n} [ (E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i) ]

         = Σ_{i=1}^{n} Var(xi) Var(w1i)        [Assuming 0 mean inputs and weights]

         = (n Var(w)) (Var(x))                 [Assuming Var(xi) = Var(x) ∀i and Var(w1i) = Var(w) ∀i]
In general,

Var(s1i) = (n Var(w)) (Var(x))
Let us see what happens if we add one more layer

[Figure: two layers: inputs x1, x2, x3, ..., xn, first-layer pre-activations s11, ..., s1n and a second-layer pre-activation s21]

Using the same procedure as above we will arrive at

Var(s21) = Σ_{i=1}^{n} Var(s1i) Var(w2i)
         = n Var(s1i) Var(w2)

and since Var(s1i) = n Var(w1) Var(x),

Var(s21) ∝ [n Var(w2)][n Var(w1)] Var(x)

To stop the variance from growing (or shrinking) as we go deeper we would like n Var(w) = 1, which we get by initializing each weight as w = z/√n with z drawn from a unit Gaussian:

n Var(w) = n Var(z/√n) = n ∗ (1/n) Var(z) = 1    ← (Unit Gaussian)
Let's see what happens if we use this initialization

[Figure: histograms of the activations at each layer, with sigmoid activations]
However this does not work for ReLU neurons

Why?

Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons

Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space
Indeed when we account for this
factor of 2 we see better performance
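A short NumPy sketch of the two rules discussed here, in their commonly used forms (assumed, not quoted from the lecture): the 1/√n scaling derived above, and the ReLU version with the extra factor of 2 argued for by He et al.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)   # n Var(w) = 1

def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # factor of 2 for ReLU

W = he_init(500, 500)
print(W.var() * 500)   # ~2: the extra factor compensates for ReLU zeroing out half the space
```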
Module 9.5 : Batch Normalization
We will now see a method called batch normalization which allows us to be less
careful about initialization
[Figure: a deep network with inputs x1, x2, x3 and layers h0, h1, h2, h3, h4]

To understand the intuition behind Batch Normalization let us consider a deep network

Let us focus on the learning process for the weights between layers h3 and h4

Typically we use mini-batch algorithms

What would happen if there is a constant change in the distribution of h3?

In other words, what would happen if across mini-batches the distribution of h3 keeps changing?

Would the learning process be easy or hard?
It would help if the pre-activations at each layer were unit Gaussians

Why not explicitly ensure this by standardizing the pre-activation?

ŝ_ik = (s_ik − E[s_ik]) / √(Var[s_ik])

But how do we compute E[s_ik] and Var[s_ik]?

We compute it from a mini-batch

Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches
This is what the deep network will look like with
Batch Normalization
Is this legal ?
Yes, it is because just as the tanh layer is dif-
ferentiable, the Batch Normalization layer is also
differentiable
Hence we can backpropagate through this layer
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer?

Why not let the network learn what is best for it?

After the Batch Normalization step add the following step:

y(k) = γ(k) ŝ_ik + β(k)

(γ(k) and β(k) are additional parameters of the network)

What happens if the network learns:

γ(k) = √(Var[x(k)])
β(k) = E[x(k)]

We will recover s_ik

In other words, by adjusting these additional parameters the network can learn to recover s_ik if that is more favourable
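To make the two-step computation concrete, here is a minimal NumPy sketch of the batch normalization forward pass for one layer's pre-activations (the shapes, ε and test values are illustrative assumptions, not the paper's reference code).

```python
import numpy as np

def batch_norm_forward(s, gamma, beta, eps=1e-5):
    mean = s.mean(axis=0)                     # E[s_k], estimated from the mini-batch
    var = s.var(axis=0)                       # Var[s_k], estimated from the mini-batch
    s_hat = (s - mean) / np.sqrt(var + eps)   # standardize each unit
    return gamma * s_hat + beta               # y(k) = gamma(k) * s_hat(k) + beta(k)

s = np.random.randn(32, 100) * 3.0 + 5.0      # a mini-batch of pre-activations (illustrative)
gamma = np.ones(100)                          # if gamma = sqrt(Var[s]) and beta = E[s], we recover s
beta = np.zeros(100)
y = batch_norm_forward(s, gamma, beta)
print(y.mean(), y.std())                      # ~0 and ~1
```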
We will now compare the performance with and without batch normalization on
MNIST data using 2 layers....
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization