
CS7015 (Deep Learning) : Lecture 9

Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 9.1 : A quick recap of training deep neural networks

We already saw how to train this network: a single input x connected through a weight w to a sigmoid neuron σ that produces the output y = f(x). The update rule is

w = w − η∇w, where
∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x
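As a quick concreteness check, here is a minimal NumPy sketch of this update rule for a single sigmoid neuron; the toy values of x, y, w and η below are assumptions for illustration, not values from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, y = 0.5, 1.0    # one input and its target (toy values)
w, eta = 0.1, 0.5  # weight and learning rate (toy values)

for _ in range(100):
    f_x = sigmoid(w * x)                      # forward pass
    grad_w = (f_x - y) * f_x * (1 - f_x) * x  # dL/dw for the squared-error loss
    w -= eta * grad_w                         # w = w - eta * grad_w
```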

What about a wider network with more inputs x1, x2, x3 connected to the sigmoid neuron through weights w1, w2, w3?

wi = wi − η∇wi for i = 1, 2, 3
where ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi

What if we have a deeper network? Consider the chain

a1 = w1 ∗ x = w1 ∗ h0,  ai = wi hi−1,  hi = σ(ai)

with output y at the top. We can now calculate ∇w1 using the chain rule:

∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
          = ∂L(w)/∂y ∗ ............... ∗ h0

In general,

∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1

Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later).
What happens if we have a network which is deep and wide? How do we calculate ∇w2 then? It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation).
Things to remember

Training Neural Networks is a Game of Gradients (played using any of the existing gradient based approaches that we discussed).
The gradient tells us the responsibility of a parameter towards the loss.
The gradient w.r.t. a parameter is proportional to the input connected to that parameter (recall the “..... ∗ x” term or the “..... ∗ hi−1” term in the formula for ∇wi).
Backpropagation was made popular by Rumelhart et al. in 1986. However, when used for really deep networks it was not very successful. In fact, till 2006 it was very hard to train very deep networks. Typically, even after a large number of epochs the training did not converge.
Module 9.2 : Unsupervised pre-training

What has changed now? How did Deep Learning become so popular despite this problem with training large networks? Well, until 2006 it wasn’t so popular. The field got revived after the seminal work of Hinton and Salakhutdinov in 2006.1

1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let’s look at the idea of unsupervised pre-training introduced in this paper ... (note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders).
Consider the deep neural network shown in this figure. Let us focus on the first two layers of the network (x and h1). We will first train the weights between these two layers using an unsupervised objective: reconstruct the input x from the hidden representation h1,

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂_ij − x_ij)²

We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x).
At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x. We now fix the weights in layer 1 and repeat the same process with layer 2, using the analogous reconstruction objective

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (ĥ1_ij − h1_ij)²

At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1. We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer.
After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective

min_θ (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective.
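To make the procedure concrete, here is a minimal PyTorch sketch of greedy layerwise pre-training followed by supervised fine-tuning; the layer sizes, learning rate and epoch counts are illustrative assumptions, not values from the lecture:

```python
import torch
import torch.nn as nn

sizes = [784, 256, 64]            # input dim, then hidden dims (assumed)
X = torch.randn(1000, sizes[0])   # stand-in for the real unlabeled inputs

encoders, inp = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)  # throwaway decoder used only for pre-training
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)
    for _ in range(50):           # unsupervised objective: reconstruct inp
        loss = ((dec(enc(inp)) - inp) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    encoders.append(enc)
    inp = enc(inp).detach()       # the next layer reconstructs this representation

# add the output layer; fine-tune all weights with the supervised objective
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
```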
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?

Let’s see what these two questions mean and try to answer them based on some (among many) existing studies.1,2

1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
2 Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
What is the optimization problem that we are trying to solve?

minimize L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Is it the case that in the absence of unsupervised pre-training we are not able to drive L(θ) to 0 even for the training data (hence poor optimization)? Let us see this in more detail ...
The error surface of the supervised objective of a Deep Neural Network is highly non-convex, with many hills, plateaus and valleys. Given the large capacity of DNNs, it is still easy to land in one of these 0-error regions. Indeed, Larochelle et al.1 show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training. However, if the capacity of the network is small, unsupervised pre-training helps.

1 Exploring Strategies for Training Deep Neural Networks, Larochelle et al, 2009
So optimization alone does not explain it. That brings us to the remaining question: is it because of better regularization?
What does regularization do? It constrains the weights to certain regions of the parameter space.
L1 regularization: constrains most weights to be 0.
L2 regularization: prevents most weights from taking large values.

1 Image Source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
Unsupervised objective:

Ω(θ) = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x_ij − x̂_ij)²

Supervised objective:

L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

We can think of the unsupervised objective as an additional constraint on the optimization problem. Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space; specifically, in regions where the characteristics of the data are captured well (as governed by the unsupervised objective). This ensures that the learning is not greedy w.r.t. the supervised objective (the weights must also satisfy the unsupervised objective).
Some other experiments have also shown that pre-training is more robust to random initializations. One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data).1

1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al, 2009
So what has happened since 2006-2009?

Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies

Module 9.3 : Better activation functions

Before we look at activation functions, let’s try to answer the following question: “What makes Deep Neural Networks powerful?”
Consider this deep neural network. Imagine if we replace the sigmoid in each layer by a simple linear transformation. Then the output is

y = w4 ∗ (w3 ∗ (w2 ∗ (w1 ∗ x)))

and we will just learn y as a linear transformation of x. In other words, we will be constrained to learning linear decision boundaries. We cannot learn arbitrary decision boundaries.
In particular, a deep linear neural network cannot learn such boundaries. But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem).
Now let’s look at some non-linear activation functions that are typically used in deep neural networks (much of this material is taken from Andrej Karpathy’s lecture notes1).

1 http://cs231n.github.io
Sigmoid

σ(x) = 1/(1 + e^(−x))

As is obvious, the sigmoid function compresses all its inputs to the range [0,1]. Since we are always interested in gradients, let us find the gradient of this function:

∂σ(x)/∂x = σ(x)(1 − σ(x))

(you can easily derive it). Let us see what happens if we use sigmoid in a deep network.
Consider a deep network of sigmoid neurons, with h0 = x, pre-activations ai and activations hi = σ(ai), so that a3 = w2 h2 and h3 = σ(a3). While calculating ∇w2, at some point in the chain rule we will encounter

∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))

What is the consequence of this? To answer this question let us first understand the concept of a saturated neuron.
A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0. What would the gradient be at saturation? Well, it would be 0 (you can see it from the plot or from the formula that we derived). Saturated neurons thus cause the gradient to vanish.
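A small NumPy check of how quickly this gradient vanishes away from 0 (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)          # dσ/dx = σ(x)(1 − σ(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))   # 0.25, ~0.105, ~0.0066, ~0.000045
```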
Saturated neurons thus cause the gradient to vanish. But why would the neurons saturate? Consider what would happen if we use sigmoid neurons and initialize the weights to very high values. The pre-activation Σ_{i=1}^{4} wi xi will take values far from 0, so σ(Σ_{i=1}^{4} wi xi) will be driven towards 0 or 1: the neurons will saturate very quickly, the gradients will vanish and the training will stall (more on this later).
Saturated neurons cause the gradient to vanish. In addition, sigmoids are not zero centered. Why is this a problem? Consider the gradients w.r.t. w1 and w2, where a3 = w1 ∗ h21 + w2 ∗ h22:

∇w1 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h21
∇w2 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h22

Note that h21 and h22 are between [0, 1] (i.e., they are both positive). So if the common first term is positive (negative) then both ∇w1 and ∇w2 are positive (negative). Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative.
Saturated neurons cause the gradient to vanish, and sigmoids are not zero centered. This restricts the possible update directions: since all gradients at a layer share the same sign, the update (∇w1, ∇w2) can only lie in the quadrant in which all gradients are +ve or the quadrant in which all gradients are -ve; the other two quadrants are not possible. Now imagine the optimal w lies in one of the disallowed quadrants.
Starting from a given initial position, the only way to reach such an optimal w is by taking a zigzag path that alternates between the two allowed quadrants. And lastly, sigmoids are computationally expensive (because of exp(x)).
tanh

f(x) = tanh(x) compresses all its inputs to the range [-1,1] and is zero centered. What is the derivative of this function?

∂tanh(x)/∂x = 1 − tanh²(x)

The gradient still vanishes at saturation, and tanh is also computationally expensive.
ReLU

f(x) = max(0, x)

Is this a non-linear function? Indeed it is! In fact, we can combine two ReLU units to recover a piecewise linear approximation of the sigmoid function:

f(x) = max(0, x + 1) − max(0, x − 1)
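A quick NumPy check that this combination indeed gives a sigmoid-like ramp (the evaluation points are chosen arbitrarily):

```python
import numpy as np

relu = lambda x: np.maximum(0, x)

def ramp(x):
    # piecewise linear "sigmoid": 0 for x <= -1, x + 1 in between, 2 for x >= 1
    return relu(x + 1) - relu(x - 1)

xs = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])
print(ramp(xs))  # [0. 0. 1. 1.5 2. 2.]
```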
Advantages of ReLU (f(x) = max(0, x)):
Does not saturate in the positive region
Computationally efficient
In practice, converges much faster than sigmoid/tanh1

1 ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
In practice there is a caveat. Let’s see what the derivative of ReLU(x) is:

∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0

Now consider a network with a ReLU neuron h1 = ReLU(a1), where a1 = w1 x1 + w2 x2 + b. What would happen if at some point a large gradient causes the bias b to be updated to a large negative value?
If b << 0, then w1 x1 + w2 x2 + b < 0 and the neuron would output 0 [dead neuron]. Not only would the output be 0, but during backpropagation even the gradient ∂h1/∂a1 would be zero. The weights w1, w2 and b will not get updated [∵ there will be a zero term in the chain rule]:

∇w1 = ∂L(θ)/∂y · ∂y/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1

The neuron will now stay dead forever!!
In practice, a large fraction of ReLU units can die if the learning rate is set too high. It is advised to initialize the bias to a positive value (0.01), or to use other variants of ReLU (as we will soon see).
Leaky ReLU

f(x) = max(0.01x, x)

No saturation. Will not die (the 0.01x ensures that at least a small gradient will flow through). Computationally efficient. Close to zero centered outputs.

Parametric ReLU

f(x) = max(αx, x)

where α is a parameter of the model and will get updated during backpropagation.
Exponential Linear Unit (ELU)

f(x) = x            if x > 0
     = a(e^x − 1)   if x ≤ 0

All benefits of ReLU. The a(e^x − 1) term ensures that at least a small gradient will flow through in the negative region. Close to zero centered outputs. However, ELU is expensive (it requires computation of exp(x)).
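For reference, here is a minimal NumPy sketch of these ReLU variants side by side; the 0.01 slope and the form a(e^x − 1) follow the slides, while everything else is illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def parametric_relu(x, alpha):
    # alpha is a learned parameter; here it is simply passed in
    return np.maximum(alpha * x, x)

def elu(x, a=1.0):
    # x for x > 0, a(e^x - 1) otherwise; expm1(x) computes e^x - 1
    return np.where(x > 0, x, a * np.expm1(x))
```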
Maxout Neuron

max(w1^T x + b1, w2^T x + b2)

Generalizes ReLU and Leaky ReLU. No saturation! No death! However, it doubles the number of parameters.
Things to Remember

Sigmoids are bad.
ReLU is more or less the standard unit for Convolutional Neural Networks.
Can explore Leaky ReLU/Maxout/ELU.
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later).
Module 9.4 : Better initialization strategies

What happens if we initialize all weights to 0? All neurons in layer 1 will get the same activation:

a11 = w11 x1 + w12 x2
a12 = w21 x1 + w22 x2
∴ a11 = a12 = 0 and hence h11 = h12

Now what will happen during backpropagation?

∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1

But h11 = h12 and a11 = a12, so ∇w11 = ∇w21. Both weights therefore receive identical updates and remain equal; the neurons stay identical throughout training (the symmetry is never broken).
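A tiny NumPy sketch of this symmetry problem, with assumed toy dimensions: with zero initialization, every hidden neuron computes the same activation and (assuming a unit upstream error signal, for simplicity) receives the same gradient, so the neurons can never diverge:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([0.7, -0.3])       # one 2-d input (toy values)
W1 = np.zeros((3, 2))           # 3 hidden neurons, all weights = 0
v = np.ones(3)                  # fixed output weights, for illustration

a = W1 @ x                      # all pre-activations identical (0)
h = sigmoid(a)                  # all activations identical (0.5)
g = (v * h * (1 - h))[:, None] * x[None, :]  # simplified per-weight gradient
print(h)                        # [0.5 0.5 0.5]
print(g)                        # every row is the same -> identical updates
```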
We will now consider a feedforward network with:
input: 1000 points, each ∈ R^500
input data drawn from a unit Gaussian
the network has 5 layers
each layer has 500 neurons

We will run forward propagation on this network with different weight initializations.
Let’s try to initialize the weights to small random numbers. We will see what happens to the activations across different layers (the figure shows histograms of the tanh activations per layer).
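A minimal NumPy sketch of this experiment; the dimensions follow the slide, while the 0.01 scale for “small random numbers” is an assumption:

```python
import numpy as np

np.random.seed(0)
D, n_layers = 500, 5
X = np.random.randn(1000, D)          # 1000 points drawn from a unit Gaussian

h = X
for layer in range(n_layers):
    W = 0.01 * np.random.randn(D, D)  # small random initialization
    h = np.tanh(h @ W)                # pre-activation -> tanh activation
    print(layer, h.mean(), h.std())   # std shrinks towards 0 layer by layer
```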
What will happen during backpropagation? Recall that ∇w is proportional to the activation passing through it. If all the activations in a layer are very close to 0, what will happen to the gradients of the weights connecting this layer to the next layer? They will all be close to 0 (vanishing gradient problem).
Let us now try to initialize the weights to large random numbers (the figure shows the resulting sigmoid activations). Most activations have saturated. What happens to the gradients at saturation? They will all be close to 0 (vanishing gradient problem).
Let us try to arrive at a more principled way of initializing weights. Consider the pre-activation of a neuron in the first layer:

s11 = Σ_{i=1}^{n} w1i xi

Var(s11) = Var(Σ_{i=1}^{n} w1i xi) = Σ_{i=1}^{n} Var(w1i xi)
         = Σ_{i=1}^{n} [(E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i)]
         = Σ_{i=1}^{n} Var(xi) Var(w1i)    [assuming zero-mean inputs and weights]
         = (n Var(w)) (Var(x))    [assuming Var(xi) = Var(x) ∀i and Var(w1i) = Var(w) ∀i]
In general,

Var(s1i) = (n Var(w)) (Var(x))

What would happen if n Var(w) ≫ 1? The variance of s1i will be large. What would happen if n Var(w) → 0? The variance of s1i will be small.
Let us see what happens if we add one more layer. Using the same procedure as above we arrive at

Var(s21) = Σ_{i=1}^{n} Var(s1i) Var(w2i) = n Var(s1i) Var(w2)

so that

Var(s21) ∝ [n Var(w2)][n Var(w1)] Var(x) ∝ [n Var(w)]² Var(x)

assuming weights across all layers have the same variance.
In general,

Var(ski) = [n Var(w)]^k Var(x)

To ensure that the variance in the output of any layer does not blow up or shrink we want:

n Var(w) = 1

If we draw the weights from a unit Gaussian and scale them by 1/√n then, using Var(az) = a² Var(z), we have:

n Var(w) = n Var(z/√n) = n · (1/n) Var(z) = 1    (z is a unit Gaussian)
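A sketch of this 1/√n initialization with an empirical check of the resulting variance (the dimensions and sample sizes are assumptions):

```python
import numpy as np

np.random.seed(0)
n = 500
W = np.random.randn(n, n) / np.sqrt(n)  # unit Gaussian scaled by 1/sqrt(n)

x = np.random.randn(10000, n)           # unit-Gaussian inputs
s = x @ W                               # pre-activations
print(x.var(), s.var())                 # both close to 1: variance preserved
```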
Let’s see what happens if we use this initialization (the figure shows the resulting distribution of sigmoid activations across layers).
However, this does not work for ReLU neurons. Why? Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons. Intuitively, this happens because the range of ReLU neurons is restricted only to the positive half of the space.
Indeed, when we account for this factor of 2 we see better performance.
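The corresponding He initialization is a one-line change of the scaling; a sketch, with assumed dimensions, comparing how activation magnitudes now stay stable through ReLU layers:

```python
import numpy as np

np.random.seed(0)
D, n_layers = 500, 5
h = np.random.randn(1000, D)
for layer in range(n_layers):
    # variance 2/D compensates for ReLU zeroing out (on average)
    # half of the pre-activations
    W = np.random.randn(D, D) * np.sqrt(2.0 / D)
    h = np.maximum(0, h @ W)          # ReLU layer
    print(layer, h.std())             # stays stable instead of shrinking
```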
Module 9.5 : Batch Normalization

We will now see a method called batch normalization which allows us to be less careful about initialization.
To understand the intuition behind Batch Normalization, let us consider a deep network with layers h0 (= x) through h4, and focus on the learning process for the weights between two of these layers, say h3 and h4. Typically we use mini-batch algorithms. What would happen if there is a constant change in the distribution of h3? In other words, what would happen if across mini-batches the distribution of h3 keeps changing? Would the learning process be easy or hard? It would be hard: the weights between h3 and h4 would constantly have to adapt to a shifting input distribution.
It would help if the pre-activations at each layer were unit Gaussians. Why not explicitly ensure this by standardizing the pre-activations?

ŝik = (sik − E[sik]) / √(Var[sik])

But how do we compute E[sik] and Var[sik]? We compute them from a mini-batch. Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches.
This is what the deep network will look like with Batch Normalization. Is this legal? Yes, it is: just as the tanh layer is differentiable, the Batch Normalization layer is also differentiable. Hence we can backpropagate through this layer.
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer? Why not let the network learn what is best for it? After the Batch Normalization step, add the following step:

y(k) = γ(k) ŝik + β(k)

where γ(k) and β(k) are additional parameters of the network. What happens if the network learns

γ(k) = √(Var[x(k)]) and β(k) = E[x(k)] ?

We will recover sik. In other words, by adjusting these additional parameters the network can learn to recover sik if that is more favourable.
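A minimal NumPy sketch of this batch-normalization step in the forward pass; the small epsilon for numerical stability is standard practice, though not shown on the slide, and the toy mini-batch is an assumption:

```python
import numpy as np

def batchnorm_forward(s, gamma, beta, eps=1e-5):
    # s: (batch_size, k) pre-activations from one mini-batch
    mean = s.mean(axis=0)                  # E[s_k], estimated per feature
    var = s.var(axis=0)                    # Var[s_k], estimated per feature
    s_hat = (s - mean) / np.sqrt(var + eps)
    return gamma * s_hat + beta            # y(k) = gamma(k) * s_hat + beta(k)

s = np.random.randn(64, 500) * 3 + 2       # toy mini-batch, shifted and scaled
y = batchnorm_forward(s, gamma=np.ones(500), beta=np.zeros(500))
print(y.mean(), y.std())                   # ~0 and ~1 after normalization
```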
We will now compare the performance with and without batch normalization on MNIST data using 2 layers ...
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization

