Vanishing Gradients in Deep Networks
Purdue University
Preamble
Solving difficult image recognition and detection problems requires deep networks.
But training deep networks can be difficult because of the vanishing gradients
problem. Vanishing gradients means that the gradients of the loss become more
and more muted in the beginning layers of a network as the network becomes
increasingly deeper.
The main reason for this is the multiplicative effect that goes into the calculation
of the gradients: the gradient calculated for each layer is a product of the
partial derivatives contributed by all the higher-indexed layers in the network.
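As a toy numeric illustration of this multiplicative effect (the numbers below are made up purely for illustration): the derivative of a sigmoid never exceeds 0.25, so a chain of such factors shrinks geometrically with the number of layers the gradient has to pass through.

# Toy illustration: a product of per-layer factors below 1 shrinks geometrically.
per_layer_factor = 0.25          # the largest value a sigmoid's derivative can take
for depth in [5, 10, 20, 50]:
    print(depth, per_layer_factor ** depth)
# The deeper the chain, the more muted the gradient that reaches the early layers.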
Modern deep networks use multiple strategies to cope with the problem of
vanishing gradients. These are listed on the next slide.
Preamble (contd.)
Here are the strategies commonly used for addressing the problem of
vanishing gradients (a minimal code sketch illustrating the first two follows the list):
normalized initializations for the learnable parameters
batch normalization
using skip connections
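As a minimal sketch of what the first two strategies look like in PyTorch (the layer sizes below are arbitrary, chosen only for illustration; skip connections are illustrated later through the SkipBlock class):

import torch.nn as nn

# Normalized ("Xavier"/Glorot) initialization of a layer's learnable parameters:
layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

# Batch normalization inserted after a convolutional layer:
conv_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU()
)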
A Demonstration of the Power of Skip Connections
Make sure that you have Version 1.0.6 or higher of the DLStudio module.
The in-class demo is based on the inner class SkipConnections of the
module. This is just a convenience wrapper class for the actual
network classes used in the demo: BMEnet and SkipBlock.
I have used “BME” in the name BMEnet in honor of the School
of Biomedical Engineering, Purdue University. Without their
sponsorship, you would not be taking BME695DL/ECE695DL this
semester.
We will build a network, BMEnet, whose layers are
built from instances of SkipBlock.
When the input and the output channels are unequal for a
convolutional layer in a SkipBlock, the number of output channels is exactly twice the
number of input channels.
SkipBlock’s Definition
Can you see the skipping action in the definition of forward()? The input saved
at the beginning of forward() is combined with the output at the end. Only the
constructor of SkipBlock is reproduced below; a sketch of forward() follows it.
class SkipBlock(nn.Module):
    def __init__(self, in_ch, out_ch, downsample=False, skip_connections=True):
        super(DLStudio.SkipConnections.SkipBlock, self).__init__()
        self.downsample = downsample
        self.skip_connections = skip_connections
        self.in_ch = in_ch
        self.out_ch = out_ch
        self.convo1 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        self.convo2 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        norm_layer1 = nn.BatchNorm2d
        norm_layer2 = nn.BatchNorm2d
        self.bn1 = norm_layer1(out_ch)
        self.bn2 = norm_layer2(out_ch)
        if downsample:
            self.downsampler = nn.Conv2d(in_ch, out_ch, 1, stride=2)
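The forward() of SkipBlock is not reproduced above. The sketch below shows what such a forward() can look like, consistent with the constructor shown and with the skip action described earlier (the actual DLStudio implementation may differ in its details); it assumes the usual import torch and import torch.nn as nn at the top of the file.

    def forward(self, x):
        identity = x                                  # save the input for the skip connection
        out = self.convo1(x)
        out = self.bn1(out)
        out = nn.functional.relu(out)
        if self.in_ch == self.out_ch:                 # second convolution only when channels match
            out = self.convo2(out)
            out = self.bn2(out)
            out = nn.functional.relu(out)
        if self.downsample:                           # halves the image size (assumes in_ch == out_ch here)
            out = self.downsampler(out)
            identity = self.downsampler(identity)
        if self.skip_connections:
            if self.in_ch == self.out_ch:
                out = out + identity                  # the skip: add the saved input to the output
            else:                                     # out_ch == 2 * in_ch: add the input to both channel halves
                out = out + torch.cat((identity, identity), dim=1)
        return out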
BMEnet's Definition
class BMEnet(nn.Module):
I have used specific naming conventions for the different types of SkipBlock
instances used in BMEnet.
The next slide talks about the forward() of BMEnet and its four for loops.
As you can tell from the code in the forward() of BMEnet, I have
divided the network definition into four sections, each section created
with a for loop.
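Since the full definition of BMEnet is not reproduced in these slides, here is a sketch of what a BMEnet-like network can look like: four groups of SkipBlock instances, each group consumed by one for loop in forward(). The channel sizes, the depth, and the assumed 32x32 RGB input are illustrative choices of mine; the actual DLStudio definition differs in its details.

class BMEnet(nn.Module):
    def __init__(self, skip_connections=True, depth=8):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Four groups of SkipBlock instances, one group per for loop in forward():
        self.skip64_a = nn.ModuleList([SkipBlock(64, 64, skip_connections=skip_connections)
                                       for _ in range(depth)])
        self.skip64ds = SkipBlock(64, 64, downsample=True, skip_connections=skip_connections)
        self.skip64_b = nn.ModuleList([SkipBlock(64, 64, skip_connections=skip_connections)
                                       for _ in range(depth)])
        self.skip64to128 = SkipBlock(64, 128, skip_connections=skip_connections)
        self.skip128_a = nn.ModuleList([SkipBlock(128, 128, skip_connections=skip_connections)
                                        for _ in range(depth)])
        self.skip128ds = SkipBlock(128, 128, downsample=True, skip_connections=skip_connections)
        self.skip128_b = nn.ModuleList([SkipBlock(128, 128, skip_connections=skip_connections)
                                        for _ in range(depth)])
        self.fc1 = nn.Linear(128 * 4 * 4, 1000)
        self.fc2 = nn.Linear(1000, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv(x)))    # 3x32x32  ->  64x16x16
        for blk in self.skip64_a:                           # section 1: 64-channel blocks
            x = blk(x)
        x = self.skip64ds(x)                                # 64x16x16 ->  64x8x8
        for blk in self.skip64_b:                           # section 2: more 64-channel blocks
            x = blk(x)
        x = self.skip64to128(x)                             # 64 channels -> 128 channels
        for blk in self.skip128_a:                          # section 3: 128-channel blocks
            x = blk(x)
        x = self.skip128ds(x)                               # 128x8x8  ->  128x4x4
        for blk in self.skip128_b:                          # section 4: more 128-channel blocks
            x = blk(x)
        x = x.view(x.size(0), -1)
        x = nn.functional.relu(self.fc1(x))
        return self.fc2(x)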
Comparing the Classification Performance with and without Skip Connections
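The slides in this section compare training and classification runs of BMEnet with and without skip connections. As a rough structural sketch of such a comparison, using the BMEnet sketch shown earlier (this is not DLStudio's own training code, and the random tensors below merely stand in for a real image dataset, so the printed numbers are not meaningful; only the setup is):

import torch
import torch.nn as nn

def train(model, steps=200, lr=1e-3):
    # Generic training loop; random tensors stand in for a real dataset.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(steps):
        images = torch.randn(8, 3, 32, 32)          # stand-in for a batch of images
        labels = torch.randint(0, 10, (8,))         # stand-in for the class labels
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return loss.item()

# Identical architectures, differing only in whether the skips are active:
print("with skip connections:   ", train(BMEnet(skip_connections=True)))
print("without skip connections:", train(BMEnet(skip_connections=False)))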
What Causes Vanishing Gradients?
There are many publications in the literature that talk about the
problem of vanishing gradients in deep networks. In my opinion, the
2010 paper “Understanding the Difficulty of Training Deep
Feedforward Neural Networks” by Glorot and Bengio is the best. You
can access it here:
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
To convey the key arguments in the paper, I’ll use the notation
described on the next slide.
The Notation
$x$: represents the input to the neural network
$L$: represents the loss at the output of the final layer
$z^i$: represents the input to the $i^{th}$ layer
$z^i_k$: represents the $k^{th}$ element of $z^i$
$s^i$: represents the preactivation output of the $i^{th}$ layer
$s^i_k$: represents the $k^{th}$ element of $s^i$
$W^i$: represents the link weights between the input to the $i^{th}$ layer and its
preactivation output
$w^i_{l,k}$: represents the value at the index pair $(l,k)$ of the weight matrix $W^i$
$b^i$: represents the bias value needed at the output of the $i^{th}$ layer
$f(\cdot)$: represents the activation function for the $i^{th}$ layer
$f'(\cdot)$: represents the derivative of the activation function for the
$i^{th}$ layer with respect to its argument
In this notation, the forward propagation of the input through the $i^{th}$ layer is
described by

$s^i \;=\; z^i W^i + b^i \;\;\;\;\;(1)$

$z^{i+1} \;=\; f(s^i) \;\;\;\;\;(2)$

The post-activation output of the $i^{th}$ layer is the input to the
$(i+1)^{th}$ layer, hence the notation $z^{i+1}$ in the second equation.
Let's now take the partial derivative of $L$ (which, in principle, could
be any scalar) with respect to the preactivation values $s^i_k$ and the
learnable weights $w^i_{l,k}$:

$\frac{\partial L}{\partial s^i_k} \;=\; f'(s^i_k)\, W^{i+1}_{k,\bullet}\, \frac{\partial L}{\partial s^{i+1}} \;\;\;\;\;(3)$

$\frac{\partial L}{\partial w^i_{l,k}} \;=\; z^i_l\, \frac{\partial L}{\partial s^i_k} \;\;\;\;\;(4)$
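For readers who want to confirm Eq. (4) concretely, here is a small numeric check with PyTorch's autograd (this check is my own, not from the paper; the layer, the loss, and the tensor sizes are arbitrary): for a single layer $s = zW + b$, the gradient of a scalar $L$ with respect to $w_{l,k}$ should equal $z_l\,\partial L/\partial s_k$.

import torch

# Numeric check of Eq. (4) for a single layer s = z W + b (arbitrary sizes):
torch.manual_seed(0)
z = torch.randn(5)                          # input to the layer
W = torch.randn(5, 3, requires_grad=True)   # link weights
b = torch.randn(3, requires_grad=True)      # bias
s = z @ W + b                               # preactivation output
s.retain_grad()                             # so that dL/ds is available after backward()
L = (torch.tanh(s) ** 2).sum()              # any scalar computed from s
L.backward()
# Eq. (4) says dL/dW[l,k] = z[l] * dL/ds[k], i.e., the outer product of z and dL/ds:
print(torch.allclose(W.grad, torch.outer(z, s.grad)))    # prints True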
Glorot and Bengio have argued that if the weights are properly
initialized, the variances of the layer outputs can be made to remain
approximately the same as the input signal propagates forward through the
network. [BTW, how to best initialize the weights is an issue unto
itself in DL.] This assumption plays an important role in the examination that follows.
The notation $\mathrm{Var}[x]$ stands for the variance in the individual elements
of the input $x$, $\mathrm{Var}[z^i]$ for the variance associated with each element
of the $i^{th}$-layer input $z^i$, and $\mathrm{Var}[w^i]$ for the variance in each
element of the weight matrix $W^i$.

With $n_{i'}$ denoting the number of neurons in layer $i'$, the variances propagate
in the forward direction approximately as

$\mathrm{Var}[z^i] \;=\; \mathrm{Var}[x] \cdot \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[w^{i'}] \;\;\;\;\;(5)$

Think of the variances as "signal energy" in a network.
The rationale that goes into writing the approximate formula shown
on the previous slide for how the variances propagate in the forward
direction also dictates the following relationships for a network with d
layers:
$\mathrm{Var}\!\left[\frac{\partial L}{\partial s^i}\right] \;=\; \mathrm{Var}\!\left[\frac{\partial L}{\partial s^d}\right] \cdot \prod_{i'=i}^{d} n_{i'+1}\, \mathrm{Var}[w^{i'}] \;\;\;\;\;(6)$

$\mathrm{Var}\!\left[\frac{\partial L}{\partial w^i}\right] \;=\; \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[w^{i'}] \cdot \prod_{i'=i}^{d-1} n_{i'+1}\, \mathrm{Var}[w^{i'}] \cdot \mathrm{Var}[x] \cdot \mathrm{Var}\!\left[\frac{\partial L}{\partial s^d}\right] \;\;\;\;\;(7)$
Eq. (6) says that, in the layered structure of a neural network, while
the signal variances travel forward in the manner shown on the previous
slide, the variance in the gradient of the output scalar with respect to
the preactivation values of an intermediate layer propagates backwards
in the manner dictated by that equation.
Eq. (7) shows how the variance in the gradient of the output scalar with
respect to the weight elements of a layer depends on the variance of the
gradient of the same scalar at the output.
The two equations on the last slide say that whereas the "energy" in
the gradient of the loss with respect to the weight elements is
independent of the layer index, the energy in the gradient of the loss
with respect to the preactivation output of a layer becomes
more and more muted as $d - i$ becomes larger and larger (assuming that
each per-layer factor $n_{i'+1}\,\mathrm{Var}[w^{i'}]$ is less than 1).
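To get a feel for what Eq. (6) implies numerically, here is a tiny sketch (the layer width and the weight variance below are made-up numbers for illustration): when all layers have the same width and weight variance, each backward step through a layer multiplies the gradient variance by $n\,\mathrm{Var}[w]$, so the variance at layer $i$ is the output-layer variance scaled by $(n\,\mathrm{Var}[w])^{d-i}$.

# Illustrative numbers only: geometric decay of gradient variance per Eq. (6)
n = 100                    # assumed width of every layer
var_w = 0.005              # assumed variance of the weights, so n * var_w = 0.5 < 1
var_grad_at_output = 1.0   # Var[dL/ds^d], taken as 1 for reference
for gap in [1, 5, 10, 20]:                      # gap = d - i
    print(gap, var_grad_at_output * (n * var_w) ** gap)
# The variance shrinks geometrically with d - i, which is the vanishing-gradients
# effect; with n * var_w > 1 it would instead grow (exploding gradients).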
A Beautiful Explanation for Why Skip Connections Help
I’ll now present what has got to be the most beautiful explanation for
why using skip connections helps mitigate the problem of vanishing
gradients. This explanation was first presented in a 2016 paper
“Residual Networks Behave Like Ensembles of Relatively Shallow
Networks” by Veit, Wilber, and Belongie that you can download from:
http://papers.nips.cc/paper/6556-residual-networks-behave-like-ensembles-of-relatively-shallow-networks
The main argument made in this paper is that using skip connections
turns a deep network into an ensemble of relatively shallow networks.
What that means is illustrated by the figure on the next slide.
The figure shown below, from the previously mentioned paper by Veit,
Wilber, and Belongie, nicely explains the main point of that paper: a residual
network can be "unraveled" into a collection of paths, each of which passes
through only a subset of the skip blocks.
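In equation form (a small worked example of my own, not taken from the paper): write the $i^{th}$ skip block as the map $y_i = y_{i-1} + F_i(y_{i-1})$, where $F_i$ is the block's residual branch. Unrolling a stack of three such blocks gives

$y_3 \;=\; y_2 + F_3(y_2)$
$\;\;\;\;=\; y_1 + F_2(y_1) + F_3\big(y_1 + F_2(y_1)\big)$
$\;\;\;\;=\; y_0 + F_1(y_0) + F_2\big(y_0 + F_1(y_0)\big) + F_3\Big(y_0 + F_1(y_0) + F_2\big(y_0 + F_1(y_0)\big)\Big)$

Expanding the nested arguments further yields one term for every subset of the blocks: with $n$ blocks there are $2^n$ such paths from the input to the output, and most of them pass through far fewer than $n$ blocks. That is the sense in which a network with skip connections behaves like an ensemble of relatively shallow networks.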
Visualizing the Loss Function for a Network with Skip Connections
The work I'll describe in this section is presented in the paper "Visualizing the
Loss Landscape of Neural Nets" by Li et al. that you can download from:
https://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf
Their visualization tool also shows that when a deep network uses
skip connections, the chaotic loss function becomes significantly
smoother in the vicinity of the global minimum.
The authors claim that, since the loss surface is bound to be nearly convex
in the vicinity of any local optimum, the optimization paths must lie in an
extremely low-dimensional space.
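To make the idea of "visualizing the loss function" concrete, here is a simplified sketch of the kind of two-dimensional loss-surface slice used for such plots: perturb the trained parameters along two random directions and record the loss on a grid. The function below is my own simplification; in particular, it omits the filter-wise normalization of the directions that the paper's tool uses.

import torch

def loss_plane(model, loss_fn, data, targets, steps=11, span=1.0):
    # Evaluate the loss on a 2-D grid of perturbations of the trained parameters.
    originals = [p.detach().clone() for p in model.parameters()]
    d1 = [torch.randn_like(p) for p in originals]       # first random direction
    d2 = [torch.randn_like(p) for p in originals]       # second random direction
    alphas = torch.linspace(-span, span, steps)
    betas = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                for p, p0, u, v in zip(model.parameters(), originals, d1, d2):
                    p.copy_(p0 + a * u + b * v)          # move to the grid point
                surface[i, j] = loss_fn(model(data), targets).item()
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0)                                  # restore the trained weights
    return surface                                       # plot this as a contour or surface map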