Vanishing Gradients in Deep Networks
Purdue University
Preamble
Solving difficult image recognition and detection problems requires deep networks.
But training deep networks can be difficult because of the vanishing gradients
problem. Vanishing gradients means that the gradients of the loss become more
and more muted in the beginning layers of a network as the network becomes
increasingly deeper.
The main reason for this is the multiplicative effect that goes into the calculation
of the gradients: the gradient calculated for each layer is a product of the
partial derivatives contributed by all the higher-indexed layers in the network.
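As a toy numeric illustration of this multiplicative effect (the numbers below are made up purely for illustration): the derivative of a sigmoid never exceeds 0.25, so a chain of such factors shrinks geometrically with the number of layers the gradient has to pass through.

# Toy illustration: a product of per-layer factors below 1 shrinks geometrically.
per_layer_factor = 0.25          # the largest value a sigmoid's derivative can take
for depth in [5, 10, 20, 50]:
    print(depth, per_layer_factor ** depth)
# The deeper the chain, the more muted the gradient that reaches the early layers.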
Modern deep networks use multiple strategies to cope with the problem of
vanishing gradients. These are listed on the next slide.
Preamble (contd.)
Here are the strategies commonly used for addressing the problem of
vanishing gradients (a minimal code sketch illustrating the first two follows the list):
normalized initializations for the learnable parameters
batch normalization
using skip connections
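As a minimal sketch of what the first two strategies look like in PyTorch (the layer sizes below are arbitrary, chosen only for illustration; skip connections are illustrated later through the SkipBlock class):

import torch.nn as nn

# Normalized ("Xavier"/Glorot) initialization of a layer's learnable parameters:
layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

# Batch normalization inserted after a convolutional layer:
conv_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU()
)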
A Demonstration of the Power of Skip Connections
Make sure that you have Version 1.0.6 or higher of the DLStudio module.
The in-class demo is based on the inner class SkipConnections of the
module. This is just a convenience wrapper class for the actual
network classes used in the demo: BMEnet and SkipBlock.
I have used “BME” in the name BMEnet in honor of the School
of Biomedical Engineering, Purdue University. Without their
sponsorship, you would not be taking BME695DL/ECE695DL this
semester.
We will build a network, BMEnet, whose layers are
built from instances of SkipBlock.
When the input and the output channels are unequal for a
convolutional layer in a SkipBlock, the number of output channels is exactly twice the
number of input channels.
SkipBlock’s Definition
Can you see the skipping action in the definition of forward()? The input saved
at the beginning of forward() is combined with the output at the end. Only the
constructor of SkipBlock is reproduced below; a sketch of forward() follows it.
class SkipBlock(nn.Module):
    def __init__(self, in_ch, out_ch, downsample=False, skip_connections=True):
        super(DLStudio.SkipConnections.SkipBlock, self).__init__()
        self.downsample = downsample
        self.skip_connections = skip_connections
        self.in_ch = in_ch
        self.out_ch = out_ch
        self.convo1 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        self.convo2 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        norm_layer1 = nn.BatchNorm2d
        norm_layer2 = nn.BatchNorm2d
        self.bn1 = norm_layer1(out_ch)
        self.bn2 = norm_layer2(out_ch)
        if downsample:
            self.downsampler = nn.Conv2d(in_ch, out_ch, 1, stride=2)
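The forward() of SkipBlock is not reproduced above. The sketch below shows what such a forward() can look like, consistent with the constructor shown and with the skip action described earlier (the actual DLStudio implementation may differ in its details); it assumes the usual import torch and import torch.nn as nn at the top of the file.

    def forward(self, x):
        identity = x                                  # save the input for the skip connection
        out = self.convo1(x)
        out = self.bn1(out)
        out = nn.functional.relu(out)
        if self.in_ch == self.out_ch:                 # second convolution only when channels match
            out = self.convo2(out)
            out = self.bn2(out)
            out = nn.functional.relu(out)
        if self.downsample:                           # halves the image size (assumes in_ch == out_ch here)
            out = self.downsampler(out)
            identity = self.downsampler(identity)
        if self.skip_connections:
            if self.in_ch == self.out_ch:
                out = out + identity                  # the skip: add the saved input to the output
            else:                                     # out_ch == 2 * in_ch: add the input to both channel halves
                out = out + torch.cat((identity, identity), dim=1)
        return out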
BMEnet's Definition
class BMEnet(nn.Module):
I have used specific naming conventions for the different types of SkipBlock
instances used in BMEnet.
The next slide talks about the forward() of BMEnet and its four for loops.
As you can tell from the code in the forward() of BMEnet, I have
divided the network definition into four sections, each section created
with a for loop.
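Since the full definition of BMEnet is not reproduced in these slides, here is a sketch of what a BMEnet-like network can look like: four groups of SkipBlock instances, each group consumed by one for loop in forward(). The channel sizes, the depth, and the assumed 32x32 RGB input are illustrative choices of mine; the actual DLStudio definition differs in its details.

class BMEnet(nn.Module):
    def __init__(self, skip_connections=True, depth=8):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Four groups of SkipBlock instances, one group per for loop in forward():
        self.skip64_a = nn.ModuleList([SkipBlock(64, 64, skip_connections=skip_connections)
                                       for _ in range(depth)])
        self.skip64ds = SkipBlock(64, 64, downsample=True, skip_connections=skip_connections)
        self.skip64_b = nn.ModuleList([SkipBlock(64, 64, skip_connections=skip_connections)
                                       for _ in range(depth)])
        self.skip64to128 = SkipBlock(64, 128, skip_connections=skip_connections)
        self.skip128_a = nn.ModuleList([SkipBlock(128, 128, skip_connections=skip_connections)
                                        for _ in range(depth)])
        self.skip128ds = SkipBlock(128, 128, downsample=True, skip_connections=skip_connections)
        self.skip128_b = nn.ModuleList([SkipBlock(128, 128, skip_connections=skip_connections)
                                        for _ in range(depth)])
        self.fc1 = nn.Linear(128 * 4 * 4, 1000)
        self.fc2 = nn.Linear(1000, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv(x)))    # 3x32x32  ->  64x16x16
        for blk in self.skip64_a:                           # section 1: 64-channel blocks
            x = blk(x)
        x = self.skip64ds(x)                                # 64x16x16 ->  64x8x8
        for blk in self.skip64_b:                           # section 2: more 64-channel blocks
            x = blk(x)
        x = self.skip64to128(x)                             # 64 channels -> 128 channels
        for blk in self.skip128_a:                          # section 3: 128-channel blocks
            x = blk(x)
        x = self.skip128ds(x)                               # 128x8x8  ->  128x4x4
        for blk in self.skip128_b:                          # section 4: more 128-channel blocks
            x = blk(x)
        x = x.view(x.size(0), -1)
        x = nn.functional.relu(self.fc1(x))
        return self.fc2(x)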
Comparing the Classification Performance with and without Skip Connections
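The slides in this section compare training and classification runs of BMEnet with and without skip connections. As a rough structural sketch of such a comparison, using the BMEnet sketch shown earlier (this is not DLStudio's own training code, and the random tensors below merely stand in for a real image dataset, so the printed numbers are not meaningful; only the setup is):

import torch
import torch.nn as nn

def train(model, steps=200, lr=1e-3):
    # Generic training loop; random tensors stand in for a real dataset.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(steps):
        images = torch.randn(8, 3, 32, 32)          # stand-in for a batch of images
        labels = torch.randint(0, 10, (8,))         # stand-in for the class labels
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return loss.item()

# Identical architectures, differing only in whether the skips are active:
print("with skip connections:   ", train(BMEnet(skip_connections=True)))
print("without skip connections:", train(BMEnet(skip_connections=False)))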
What Causes Vanishing Gradients?
There are many publications in the literature that talk about the
problem of vanishing gradients in deep networks. In my opinion, the
2010 paper “Understanding the Difficulty of Training Deep
Feedforward Neural Networks” by Glorot and Bengio is the best. You
can access it here:
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
To convey the key arguments in the paper, I’ll use the notation
described on the next slide.
The Notation
$x$: represents the input to the neural network
$L$: represents the loss at the output of the final layer
$z^i$: represents the input to the $i^{th}$ layer
$z^i_k$: represents the $k^{th}$ element of $z^i$
$s^i$: represents the preactivation output of the $i^{th}$ layer
$s^i_k$: represents the $k^{th}$ element of $s^i$
$W^i$: represents the link weights between the input to the $i^{th}$ layer and its
preactivation output
$w^i_{l,k}$: represents the value at the index pair $(l,k)$ of the weight matrix $W^i$
$b^i$: represents the bias value needed at the output of the $i^{th}$ layer
$f(\cdot)$: represents the activation function for the $i^{th}$ layer
$f'(\cdot)$: represents the derivative of the activation function for the
$i^{th}$ layer with respect to its argument
In this notation, the forward propagation of the input through the $i^{th}$ layer is
described by

$s^i \;=\; z^i W^i + b^i \;\;\;\;\;(1)$

$z^{i+1} \;=\; f(s^i) \;\;\;\;\;(2)$

The post-activation output of the $i^{th}$ layer is the input to the
$(i+1)^{th}$ layer, hence the notation $z^{i+1}$ in the second equation.
Let's now take the partial derivative of $L$ (which, in principle, could
be any scalar) with respect to the preactivation values $s^i_k$ and the
learnable weights $w^i_{l,k}$:

$\frac{\partial L}{\partial s^i_k} \;=\; f'(s^i_k)\, W^{i+1}_{k,\bullet}\, \frac{\partial L}{\partial s^{i+1}} \;\;\;\;\;(3)$

$\frac{\partial L}{\partial w^i_{l,k}} \;=\; z^i_l\, \frac{\partial L}{\partial s^i_k} \;\;\;\;\;(4)$
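For readers who want to confirm Eq. (4) concretely, here is a small numeric check with PyTorch's autograd (this check is my own, not from the paper; the layer, the loss, and the tensor sizes are arbitrary): for a single layer $s = zW + b$, the gradient of a scalar $L$ with respect to $w_{l,k}$ should equal $z_l\,\partial L/\partial s_k$.

import torch

# Numeric check of Eq. (4) for a single layer s = z W + b (arbitrary sizes):
torch.manual_seed(0)
z = torch.randn(5)                          # input to the layer
W = torch.randn(5, 3, requires_grad=True)   # link weights
b = torch.randn(3, requires_grad=True)      # bias
s = z @ W + b                               # preactivation output
s.retain_grad()                             # so that dL/ds is available after backward()
L = (torch.tanh(s) ** 2).sum()              # any scalar computed from s
L.backward()
# Eq. (4) says dL/dW[l,k] = z[l] * dL/ds[k], i.e., the outer product of z and dL/ds:
print(torch.allclose(W.grad, torch.outer(z, s.grad)))    # prints True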
Glorot and Bengio have argued that if the weights are properly
initialized, the variances of the layer outputs can be made to remain
approximately the same as the input signal propagates forward through the
network. [BTW, how to best initialize the weights is an issue unto
itself in DL.] This assumption plays an important role in the examination that follows.
The notation $\mathrm{Var}[x]$ stands for the variance in the individual elements
of the input $x$, $\mathrm{Var}[z^i]$ for the variance associated with each element
of the $i^{th}$-layer input $z^i$, and $\mathrm{Var}[w^i]$ for the variance in each
element of the weight matrix $W^i$.

With $n_{i'}$ denoting the number of neurons in layer $i'$, the variances propagate
in the forward direction approximately as

$\mathrm{Var}[z^i] \;=\; \mathrm{Var}[x] \cdot \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[w^{i'}] \;\;\;\;\;(5)$

Think of the variances as "signal energy" in a network.
The rationale that goes into writing the approximate formula shown
on the previous slide for how the variances propagate in the forward
direction also dictates the following relationships for a network with d
layers:
$\mathrm{Var}\!\left[\frac{\partial L}{\partial s^i}\right] \;=\; \mathrm{Var}\!\left[\frac{\partial L}{\partial s^d}\right] \cdot \prod_{i'=i}^{d} n_{i'+1}\, \mathrm{Var}[w^{i'}] \;\;\;\;\;(6)$

$\mathrm{Var}\!\left[\frac{\partial L}{\partial w^i}\right] \;=\; \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[w^{i'}] \cdot \prod_{i'=i}^{d-1} n_{i'+1}\, \mathrm{Var}[w^{i'}] \cdot \mathrm{Var}[x] \cdot \mathrm{Var}\!\left[\frac{\partial L}{\partial s^d}\right] \;\;\;\;\;(7)$
Eq. (6) says that, in the layered structure of a neural network, while
the signal variances travel forward in the manner shown on the previous
slide, the variance in the gradient of the output scalar with respect to
the preactivation values of an intermediate layer propagates backwards
in the manner dictated by that equation.
Eq. (7) shows how the variance in the gradient of the output scalar with
respect to the weight elements of a layer depends on the variance of the
gradient of the same scalar at the output.
The two equations on the last slide say that whereas the "energy" in
the gradient of the loss with respect to the weight elements is
independent of the layer index, the energy in the gradient of the loss
with respect to the preactivation output of a layer becomes
more and more muted as $d - i$ becomes larger and larger (assuming that
each per-layer factor $n_{i'+1}\,\mathrm{Var}[w^{i'}]$ is less than 1).
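To get a feel for what Eq. (6) implies numerically, here is a tiny sketch (the layer width and the weight variance below are made-up numbers for illustration): when all layers have the same width and weight variance, each backward step through a layer multiplies the gradient variance by $n\,\mathrm{Var}[w]$, so the variance at layer $i$ is the output-layer variance scaled by $(n\,\mathrm{Var}[w])^{d-i}$.

# Illustrative numbers only: geometric decay of gradient variance per Eq. (6)
n = 100                    # assumed width of every layer
var_w = 0.005              # assumed variance of the weights, so n * var_w = 0.5 < 1
var_grad_at_output = 1.0   # Var[dL/ds^d], taken as 1 for reference
for gap in [1, 5, 10, 20]:                      # gap = d - i
    print(gap, var_grad_at_output * (n * var_w) ** gap)
# The variance shrinks geometrically with d - i, which is the vanishing-gradients
# effect; with n * var_w > 1 it would instead grow (exploding gradients).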
A Beautiful Explanation for Why Skip Connections Help
I’ll now present what has got to be the most beautiful explanation for
why using skip connections helps mitigate the problem of vanishing
gradients. This explanation was first presented in a 2016 paper
“Residual Networks Behave Like Ensembles of Relatively Shallow
Networks” by Veit, Wilber, and Belongie that you can download from:
http://papers.nips.cc/paper/6556-residual-networks-behave-like-ensembles-of-relatively-shallow-networks
The main argument made in this paper is that using skip connections
turns a deep network into an ensemble of relatively shallow networks.
What that means is illustrated by the figure on the next slide.
The figure shown below, from the previously mentioned paper by Veit,
Wilber, and Belongie, nicely explains the main point of that paper: a residual
network can be "unraveled" into a collection of paths, each of which passes
through only a subset of the skip blocks.
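In equation form (a small worked example of my own, not taken from the paper): write the $i^{th}$ skip block as the map $y_i = y_{i-1} + F_i(y_{i-1})$, where $F_i$ is the block's residual branch. Unrolling a stack of three such blocks gives

$y_3 \;=\; y_2 + F_3(y_2)$
$\;\;\;\;=\; y_1 + F_2(y_1) + F_3\big(y_1 + F_2(y_1)\big)$
$\;\;\;\;=\; y_0 + F_1(y_0) + F_2\big(y_0 + F_1(y_0)\big) + F_3\Big(y_0 + F_1(y_0) + F_2\big(y_0 + F_1(y_0)\big)\Big)$

Expanding the nested arguments further yields one term for every subset of the blocks: with $n$ blocks there are $2^n$ such paths from the input to the output, and most of them pass through far fewer than $n$ blocks. That is the sense in which a network with skip connections behaves like an ensemble of relatively shallow networks.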
Visualizing the Loss Function for a Network with Skip Connections
The work I'll describe in this section is presented in the paper "Visualizing the
Loss Landscape of Neural Nets" by Li et al. that you can download from:
https://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf
Their visualization tool also shows that when a deep network uses
skip connections, the chaotic loss function becomes significantly
smoother in the vicinity of the global minimum.
The authors claim that, since the loss surface is bound to be nearly convex
in the vicinity of any local optimum, the optimization paths must lie in an
extremely low-dimensional space.
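To make the idea of "visualizing the loss function" concrete, here is a simplified sketch of the kind of two-dimensional loss-surface slice used for such plots: perturb the trained parameters along two random directions and record the loss on a grid. The function below is my own simplification; in particular, it omits the filter-wise normalization of the directions that the paper's tool uses.

import torch

def loss_plane(model, loss_fn, data, targets, steps=11, span=1.0):
    # Evaluate the loss on a 2-D grid of perturbations of the trained parameters.
    originals = [p.detach().clone() for p in model.parameters()]
    d1 = [torch.randn_like(p) for p in originals]       # first random direction
    d2 = [torch.randn_like(p) for p in originals]       # second random direction
    alphas = torch.linspace(-span, span, steps)
    betas = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                for p, p0, u, v in zip(model.parameters(), originals, d1, d2):
                    p.copy_(p0 + a * u + b * v)          # move to the grid point
                surface[i, j] = loss_fn(model(data), targets).item()
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0)                                  # restore the trained weights
    return surface                                       # plot this as a contour or surface map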