CS7015 (Deep Learning) : Lecture 9
Mitesh M. Khapra
Module 9.1 : A quick recap of training deep neural
networks
[Figure: a single sigmoid neuron with input x, weight w and output y = f(x)]

We already saw how to train this network:

w = w − η∇w where,

∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

[Figure: a wider sigmoid neuron with inputs x1, x2, x3, weights w1, w2, w3 and output y = f(x)]

What about a wider network with more inputs:

w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3

where, ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
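To make the update rule concrete, here is a small NumPy sketch (an illustration, not code from the lecture) that applies w = w − η∇w to a single sigmoid neuron with a squared-error loss; the data point and learning rate are made-up values.

```python
import numpy as np

def f(x, w):                      # f(x) = σ(wx)
    return 1.0 / (1.0 + np.exp(-w * x))

x, y = 0.5, 1.0                   # one training example (illustrative)
w, eta = 0.1, 1.0                 # initial weight and learning rate (illustrative)

for _ in range(100):
    grad_w = (f(x, w) - y) * f(x, w) * (1 - f(x, w)) * x   # ∇w from the slide
    w = w - eta * grad_w                                   # w = w − η∇w

print(w, f(x, w))                 # f(x) moves towards y as training proceeds
```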
[Figure: a deep network with one neuron per layer: x = h0, pre-activations a1, a2, a3, hidden outputs h1, h2, output y, and weights w1, w2, w3]

What if we have a deeper network?

We can now calculate ∇w1 using chain rule:

∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1

          = ∂L(w)/∂y ∗ ............... ∗ h0

In general,

∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1

Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)

Here, ai = wi hi−1 ; hi = σ(ai) and a1 = w1 ∗ x = w1 ∗ h0
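The following NumPy sketch (with made-up values, not taken from the lecture) implements exactly this chain: ai = wi hi−1, hi = σ(ai), and the backward pass shows how each ∇wi picks up a factor of hi−1.

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

w = [0.5, -0.3, 0.8]             # w1, w2, w3 (illustrative)
x, y = 1.0, 0.0                  # one training example (illustrative)

# forward pass: a_i = w_i * h_{i-1}, h_i = sigma(a_i)
h = [x]                          # h[0] = h0 = x
for wi in w:
    h.append(sigma(wi * h[-1]))

# backward pass for L = 1/2 (h_last - y)^2
grad_a = (h[-1] - y) * h[-1] * (1 - h[-1])       # dL/da for the last layer
grads = [0.0] * len(w)
for i in reversed(range(len(w))):
    grads[i] = grad_a * h[i]                     # grad of w_i = (...) * h_{i-1}
    grad_a = grad_a * w[i] * h[i] * (1 - h[i])   # push the chain rule one layer down

print(grads)                     # each gradient carries the factor h_{i-1}
```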
[Figure: a deep and wide network with inputs x1, x2, x3, two hidden layers of sigmoid neurons and output y]

What happens if we have a network which is deep and wide?

How do you calculate ∇w2?

It will be given by chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
Things to remember
Training Neural Networks is a Game of Gradients (played using any of the
existing gradient based approaches that we discussed)
The gradient tells us the responsibility of a parameter towards the loss
The gradient w.r.t. a parameter is proportional to the input connected to that parameter (recall the “..... ∗ x” term or the “..... ∗ hi−1” term in the formula for ∇wi)
[Figure: a deep and wide network with inputs x1, x2, x3, sigmoid hidden layers and output y]

Backpropagation was made popular by Rumelhart et al. in 1986

However, when used for really deep networks it was not very successful

In fact, till 2006 it was very hard to train very deep networks

Typically, even after a large number of epochs the training did not converge
Module 9.2 : Unsupervised pre-training
What has changed now? How did Deep Learning become so popular despite
this problem with training large networks?
Well, until 2006 it wasn’t so popular
The field got revived after the seminal work of Hinton and Salakhutdinov in
2006
1 G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
Let’s look at the idea of unsupervised pre-training introduced in this paper ...
(note that in this paper they introduced the idea in the context of RBMs but we
will discuss it in the context of Autoencoders)
[Figure: the first two layers of the network viewed as an autoencoder: input x, hidden representation h1, reconstruction x̂]

Consider the deep neural network shown in this figure

Let us focus on the first two layers of the network (x and h1)

We will first train the weights between these two layers using an unsupervised objective

Note that we are trying to reconstruct the input (x) from the hidden representation (h1)

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂_ij − x_ij)²

We refer to this as an unsupervised objective because it does not involve the output label (y) and only uses the input data (x)
[Figure: the next pair of layers viewed as an autoencoder: h1, hidden representation h2, reconstruction ĥ1]

At the end of this step, the weights in layer 1 are trained such that h1 captures an abstract representation of the input x

We now fix the weights in layer 1 and repeat the same process with layer 2

min (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (ĥ1_ij − h1_ij)²

At the end of this step, the weights in layer 2 are trained such that h2 captures an abstract representation of h1

We continue this process till the last hidden layer (i.e., the layer before the output layer) so that each successive layer captures an abstract representation of the previous layer
[Figure: the full network with inputs x1, x2, x3, the pre-trained hidden layers and the newly added output layer]

After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective

min_θ (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective
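As a concrete (and deliberately simplified) picture of this procedure, here is a minimal PyTorch sketch of greedy layerwise pre-training followed by supervised fine-tuning. The layer sizes, learning rates, epoch counts and the regression-style output are illustrative assumptions, not taken from the lecture.

```python
import torch
import torch.nn as nn

m, n = 1000, 500
x = torch.randn(m, n)                    # unlabelled inputs (illustrative)
y = torch.randn(m, 1)                    # labels, used only during fine-tuning

def pretrain_layer(inputs, in_dim, hidden_dim, epochs=50, lr=0.01):
    """Train one layer to reconstruct its own input (the unsupervised objective)."""
    encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
    decoder = nn.Linear(hidden_dim, in_dim)              # discarded after pre-training
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(encoder(inputs))
        loss = ((recon - inputs) ** 2).mean()            # (1/m) ΣΣ (x̂_ij − x_ij)²
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder

enc1 = pretrain_layer(x, n, 256)                         # layer 1: reconstruct x
h1 = enc1(x).detach()                                    # fix layer 1, compute h1
enc2 = pretrain_layer(h1, 256, 64)                       # layer 2: reconstruct h1

# add the output layer and fine-tune the whole network with the supervised objective
model = nn.Sequential(enc1, enc2, nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(50):
    loss = ((y - model(x)) ** 2).mean()                  # (1/m) Σ (y_i − f(x_i))²
    opt.zero_grad(); loss.backward(); opt.step()
```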
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
Let’s see what these two questions mean and try to answer them based on some
(among many) existing studies1,2
1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
2 Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What is the optimization problem that we are trying to solve?
minimize L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²
Is it the case that in the absence of unsupervised pre-training we are not able
to drive L (θ) to 0 even for the training data (hence poor optimization) ?
Let us see this in more detail ...
The error surface of the supervised objective of a Deep Neural Network is highly non-convex, with many hills, plateaus and valleys

Given the large capacity of DNNs it is still easy to land in one of these 0-error regions

Indeed, Larochelle et al.1 show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training

However, if the capacity of the network is small, unsupervised pre-training helps
1 Exploring Strategies for Training Deep Neural Networks, Larochelle et al., 2009
Why does this work better?
Is it because of better optimization?
Is it because of better regularization?
What does regularization do? It con-
strains the weights to certain regions
of the parameter space
L-1 regularization: constrains most
weights to be 0
L-2 regularization: prevents most
weights from taking large values
1 Image Source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71
Unsupervised objective:

Ω(θ) = (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x_ij − x̂_ij)²

We can think of this unsupervised objective as an additional constraint on the optimization problem

Supervised objective:

L(θ) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²

Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space

Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
Some other experiments have also
shown that pre-training is more ro-
bust to random initializations
One accepted hypothesis is that pre-
training leads to better weight ini-
tializations (so that the layers cap-
ture the internal characteristics of the
data)
1 The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
So what has happened since 2006-2009?
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Module 9.3 : Better activation functions
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
Before we look at activation functions, let’s try to answer the following question:
“What makes Deep Neural Networks powerful ?”
[Figure: a deep network with one neuron per layer: h0 = x, pre-activations a1, a2, a3, hidden outputs h1, h2 and output y]

Consider this deep neural network

Imagine if we replace the sigmoid in each layer by a simple linear transformation

y = (w4 ∗ (w3 ∗ (w2 ∗ (w1 x))))

Then we will just learn y as a linear transformation of x

In other words we will be constrained to learning linear decision boundaries

We cannot learn arbitrary decision boundaries
In particular, a deep linear neural network cannot learn such (non-linear) boundaries

But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
Now let’s look at some non-linear activation functions that are typically used in
deep neural networks (Much of this material is taken from Andrej Karpathy’s
lecture notes 1 )
1 http://cs231n.github.io
Sigmoid

σ(x) = 1/(1 + e^(−x))

[Figure: plot of the sigmoid function]

As is obvious, the sigmoid function compresses all its inputs to the range [0,1]

Since we are always interested in gradients, let us find the gradient of this function:

∂σ(x)/∂x = σ(x)(1 − σ(x))

(you can easily derive it)

Let us see what happens if we use sigmoid in a deep network
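A quick NumPy check of this formula (illustrative, not from the lecture) also previews the problem discussed next: the gradient is at most 0.25 (at x = 0) and is essentially 0 once |x| is large.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)                      # sigma(x) * (1 - sigma(x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(x, sigmoid(x), d_sigmoid(x))      # gradient is 0.25 at x = 0 and ~0 for large |x|
```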
[Figure: a deep network with one neuron per layer: h0 = x, pre-activations a1, a2, a3, a4 and hidden outputs h1, h2, h3, h4]

While calculating ∇w2, at some point in the chain rule we will encounter

∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))    [where a3 = w3 h2 and h3 = σ(a3)]

What is the consequence of this?

To answer this question let us first understand the concept of a saturated neuron
[Figure: plot of y = σ(Σ_{i=1}^{4} w_i x_i), which flattens out towards 0 and 1]

A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0

What would the gradient be at saturation?

Well, it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons cause the gradient to vanish

Sigmoids are not zero centered

Why is this a problem?
Saturated neurons cause the gradient to vanish

Sigmoids are not zero centered (since the outputs hi of a sigmoid layer are always positive, the gradients ∇w1 and ∇w2 of the next layer always share the same sign)

And lastly, sigmoids are computationally expensive (because of exp(x))

[Figure: the (∇w1, ∇w2) plane; starting from this initial position, the only way to reach the optimum is by taking a zigzag path]
tanh(x)

[Figure: plot of f(x) = tanh(x)]

Compresses all its inputs to the range [-1,1]

Zero centered

What is the derivative of this function?

∂tanh(x)/∂x = (1 − tanh²(x))

The gradient still vanishes at saturation

Also computationally expensive
ReLU
f(x) = max(0, x)
f(x) = max(0, x + 1) − max(0, x − 1)
ReLU
Advantages of ReLU
Does not saturate in the positive re-
gion
Computationally efficient
In practice converges much faster
than sigmoid/tanh1
f(x) = max(0, x)
1 ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
In practice there is a caveat

Let's see what is the derivative of ReLU(x):

∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0

[Figure: a small network with inputs x1, x2, bias b, pre-activation a1, ReLU output h1, weight w3 and output y]
Now consider what happens if the bias b takes a large negative value, so that

w1 x1 + w2 x2 + b < 0    [if b << 0]

The output h1 = ReLU(a1) will be 0, and so will the gradient ∂h1/∂a1

The weights w1, w2 and b will not get updated [∵ there will be a zero term in the chain rule]

The neuron is now dead
[Figure: the same small ReLU network with inputs x1, x2, bias b and output y]

In practice a large fraction of ReLU units can die if the learning rate is set too high

It is advised to initialize the bias to a positive value (0.01)

Use other variants of ReLU (as we will soon see)
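The following NumPy sketch (all numbers made up) shows the dead-unit behaviour described above: once w1 x1 + w2 x2 + b < 0 for every input, the unit outputs 0 and the gradients of w1, w2 and b are 0, so they never recover.

```python
import numpy as np

relu = lambda a: np.maximum(0.0, a)
d_relu = lambda a: (a > 0).astype(float)

X = np.abs(np.random.randn(100, 2))       # positive inputs (illustrative)
w = np.array([0.5, 0.5])
b = -100.0                                # bias pushed to a large negative value

a1 = X @ w + b                            # a1 < 0 for every input
h1 = relu(a1)                             # all outputs are 0 -> the unit is dead

upstream = np.ones_like(a1)               # stand-in for the rest of the chain rule
grad_w = X.T @ (upstream * d_relu(a1))    # = [0, 0]
grad_b = np.sum(upstream * d_relu(a1))    # = 0
print(h1.sum(), grad_w, grad_b)           # nothing flows, nothing gets updated
```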
Leaky ReLU

f(x) = max(0.01x, x)

[Figure: plot of Leaky ReLU]

No saturation

Will not die (0.01x ensures that at least a small gradient will flow through)

Computationally efficient

Close to zero centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model

α will get updated during backpropagation
Exponential Linear Unit

f(x) = x            if x > 0
     = a(e^x − 1)   if x ≤ 0
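For reference, here are illustrative NumPy versions of these variants (the 0.01 slope and the particular α and a values are conventional defaults assumed here, not prescribed by the lecture).

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)                    # f(x) = max(0.01x, x)

def parametric_relu(x, alpha):
    return np.maximum(alpha * x, x)                    # f(x) = max(alpha*x, x); alpha is learned

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))   # f(x) = x if x > 0, else a(e^x - 1)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))
print(parametric_relu(x, alpha=0.1))
print(elu(x))
```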
Maxout Neuron
Things to Remember
Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU/Maxout/ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
Module 9.4 : Better initialization strategies
Deep Learning has evolved
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
[Figure: a network with inputs x1, x2, a hidden layer with neurons h11, h12, h13 (pre-activations a11, a12, a13), a second hidden layer neuron h21 (pre-activation a21) and output y]

What happens if we initialize all weights to 0?

a11 = w11 x1 + w12 x2
a12 = w21 x1 + w22 x2
∴ a11 = a12 = 0
∴ h11 = h12

All neurons in layer 1 will get the same activation

Now what will happen during back propagation?

∇w11 = ∂L(w)/∂y · ∂y/∂h11 · ∂h11/∂a11 · x1
∇w21 = ∂L(w)/∂y · ∂y/∂h12 · ∂h12/∂a12 · x1

but h11 = h12 and a11 = a12

∴ ∇w11 = ∇w21

Hence both weights will get the same update and will remain equal
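A small PyTorch check of this symmetry argument (the architecture and data below are illustrative): with everything initialized to 0, the rows of the first-layer weight matrix receive identical gradients and therefore stay identical to each other no matter how long we train.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1), nn.Sigmoid())
for p in net.parameters():
    nn.init.zeros_(p)                        # initialize all weights and biases to 0

x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
opt = torch.optim.SGD(net.parameters(), lr=0.5)

for _ in range(100):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(net[0].weight)                         # the three rows are still identical to each other
```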
We will now consider a feedforward network with:

input: 1000 points, each ∈ R^500

input data is drawn from a unit Gaussian

[Figure: the standard normal density from which the inputs are drawn]

Suppose we first initialize the weights to small random numbers
What will happen during back
propagation?
Recall that ∇w1 is proportional to
the activation passing through it
If all the activations in a layer are
very close to 0, what will happen to
the gradient of the weights connect-
ing this layer to the next layer?
They will all be close to 0 (vanishing
gradient problem)
Let us try to initialize the weights to
large random numbers
Most activations have saturated
What happens to the gradients at sat-
uration?
They will all be close to 0 (vanishing
gradient problem)
Let us try to arrive at a more principled way of initializing weights

[Figure: one layer with inputs x1, x2, x3, ..., xn and pre-activations s11, ..., s1n]

s11 = Σ_{i=1}^{n} w1i xi

Var(s11) = Var(Σ_{i=1}^{n} w1i xi) = Σ_{i=1}^{n} Var(w1i xi)

         = Σ_{i=1}^{n} [ (E[w1i])² Var(xi) + (E[xi])² Var(w1i) + Var(xi) Var(w1i) ]

         = Σ_{i=1}^{n} Var(xi) Var(w1i)        [Assuming 0 mean inputs and weights]

         = (n Var(w)) (Var(x))                 [Assuming Var(xi) = Var(x) ∀i and Var(w1i) = Var(w) ∀i]
In general,

Var(s1i) = (n Var(w)) (Var(x))
Let us see what happens if we add one more layer

[Figure: two layers: inputs x1, x2, x3, ..., xn, first-layer pre-activations s11, ..., s1n and a second-layer pre-activation s21]

Using the same procedure as above we will arrive at

Var(s21) = Σ_{i=1}^{n} Var(s1i) Var(w2i)
         = n Var(s1i) Var(w2)

and since Var(s1i) = n Var(w1) Var(x),

Var(s21) ∝ [n Var(w2)][n Var(w1)] Var(x)

To stop the variance from growing (or shrinking) as we go deeper we would like n Var(w) = 1, which we get by initializing each weight as w = z/√n with z drawn from a unit Gaussian:

n Var(w) = n Var(z/√n) = n ∗ (1/n) Var(z) = 1    ← (Unit Gaussian)
Let's see what happens if we use this initialization

[Figure: histograms of the activations at each layer, with sigmoid activations]
However this does not work for ReLU neurons

Why?

Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons

Intuitively this happens because the range of ReLU neurons is restricted only to the positive half of the space
Indeed when we account for this
factor of 2 we see better performance
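A short NumPy sketch of the two rules discussed here, in their commonly used forms (assumed, not quoted from the lecture): the 1/√n scaling derived above, and the ReLU version with the extra factor of 2 argued for by He et al.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)   # n Var(w) = 1

def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # factor of 2 for ReLU

W = he_init(500, 500)
print(W.var() * 500)   # ~2: the extra factor compensates for ReLU zeroing out half the space
```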
Module 9.5 : Batch Normalization
We will now see a method called batch normalization which allows us to be less
careful about initialization
[Figure: a deep network with inputs x1, x2, x3 and layers h0, h1, h2, h3, h4]

To understand the intuition behind Batch Normalization let us consider a deep network

Let us focus on the learning process for the weights between layers h3 and h4

Typically we use mini-batch algorithms

What would happen if there is a constant change in the distribution of h3?

In other words, what would happen if across mini-batches the distribution of h3 keeps changing?

Would the learning process be easy or hard?
It would help if the pre-activations at each layer were unit Gaussians

Why not explicitly ensure this by standardizing the pre-activation?

ŝ_ik = (s_ik − E[s_ik]) / √(Var[s_ik])

But how do we compute E[s_ik] and Var[s_ik]?

We compute it from a mini-batch

Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches
This is what the deep network will look like with
Batch Normalization
Is this legal ?
Yes, it is because just as the tanh layer is dif-
ferentiable, the Batch Normalization layer is also
differentiable
Hence we can backpropagate through this layer
Catch: Do we necessarily want to force a unit Gaussian input to the tanh layer?

Why not let the network learn what is best for it?

After the Batch Normalization step add the following step:

y(k) = γ(k) ŝ_ik + β(k)

(γ(k) and β(k) are additional parameters of the network)

What happens if the network learns:

γ(k) = √(Var[x(k)])
β(k) = E[x(k)]

We will recover s_ik

In other words, by adjusting these additional parameters the network can learn to recover s_ik if that is more favourable
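To make the two-step computation concrete, here is a minimal NumPy sketch of the batch normalization forward pass for one layer's pre-activations (the shapes, ε and test values are illustrative assumptions, not the paper's reference code).

```python
import numpy as np

def batch_norm_forward(s, gamma, beta, eps=1e-5):
    mean = s.mean(axis=0)                     # E[s_k], estimated from the mini-batch
    var = s.var(axis=0)                       # Var[s_k], estimated from the mini-batch
    s_hat = (s - mean) / np.sqrt(var + eps)   # standardize each unit
    return gamma * s_hat + beta               # y(k) = gamma(k) * s_hat(k) + beta(k)

s = np.random.randn(32, 100) * 3.0 + 5.0      # a mini-batch of pre-activations (illustrative)
gamma = np.ones(100)                          # if gamma = sqrt(Var[s]) and beta = E[s], we recover s
beta = np.zeros(100)
y = batch_norm_forward(s, gamma, beta)
print(y.mean(), y.std())                      # ~0 and ~1
```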
We will now compare the performance with and without batch normalization on
MNIST data using 2 layers....
2016-17: Still exciting times
Even better optimization methods
Data driven initialization methods
Beyond batch normalization