
Convolutional Neural Networks

Introduction

 A convolutional neural network (or ConvNet) is a type of feed-forward artificial neural network.
 The architecture of a ConvNet is designed to take advantage of the 2D structure of an input image.
 A ConvNet consists of one or more convolutional layers (often with a pooling step), followed by one or more fully connected layers as in a standard multilayer neural network.

Motivation behind ConvNets

 Consider an image of size 200x200x3 (200 wide, 200 high, 3 color channels):
– a single fully-connected neuron in the first hidden layer of a regular neural network would have 200*200*3 = 120,000 weights;
– with several such neurons, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
 In a ConvNet, however, the neurons in a layer are connected only to a small region of the layer before it, instead of to all of its neurons in a fully-connected manner.
– the final output layer has dimensions 1x1xN, because by the end of the ConvNet architecture the full image is reduced to a single vector of class scores (for N classes), arranged along the depth dimension.
 The vanishing gradient problem in deep MLPs is a further motivation.

MLP VS ConvNet

[Diagram: a multilayer perceptron (input, hidden, output) with all fully connected layers, next to a convolutional neural network with partially connected convolution layers.]

MLP vs ConvNet

 A regular 3-layer neural network.
 A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers.

How ConvNet Works

 For example, a ConvNet can take an image as input and classify it as 'X' or 'O'.

[Figure: a two-dimensional array of pixels is fed to the CNN, which outputs "X" or "O".]

 In a simple case, 'X' would look like the 9x9 pixel grid shown on the following slides.

How ConvNet Works

 What about trickier cases, where the 'X' or 'O' is shifted or distorted? The CNN should still output the correct class.

[Figure: a distorted 'X' and a distorted 'O', each fed through the CNN and still classified correctly.]

How ConvNet Works – What Computer Sees

[Figure: the 9x9 grid of -1/+1 pixel values for the ideal 'X' next to the grid for a new 'X' image; to the computer each is just a 2-D array of numbers, and the two arrays are not identical.]

How ConvNet Works

[Figure: a naive pixel-by-pixel comparison of the two 9x9 grids; many positions do not match.]

How ConvNet Works – What Computer Sees

 Since the pixel patterns do not match exactly, a literal pixel-by-pixel comparison cannot classify this image as 'X'.

[Figure: the 9x9 grid with the mismatched pixel positions highlighted.]

ConvNet Layers (At a Glance)

 CONV layer will compute the output of neurons that are connected to local regions in the input,
each computing a dot product between their weights and a small region they are connected to in
the input volume.

 RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at
zero. This leaves the size of the volume unchanged.

 POOL layer will perform a downsampling operation along the spatial dimensions (width, height).

 FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to the score of one of the N categories.

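To make the four layer types concrete, here is a minimal sketch in PyTorch (the slides do not name a framework, so the library choice and the 32x32 input size are illustrative assumptions), showing how each stage changes the shape of the volume:

```python
# Minimal PyTorch sketch of the four layer types (framework choice and the 32x32
# input size are illustrative assumptions; the slides are framework-agnostic).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                 # one RGB image, 32x32

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2, stride=2)
fc   = nn.Linear(16 * 16 * 16, 10)            # flattened pooled volume -> N = 10 class scores

h = conv(x)                 # [1, 16, 32, 32]  local dot products with learnable filters
h = relu(h)                 # [1, 16, 32, 32]  elementwise max(0, x), size unchanged
h = pool(h)                 # [1, 16, 16, 16]  spatial downsampling
scores = fc(h.flatten(1))   # [1, 10]          class scores, i.e. the 1x1xN output
print(scores.shape)
```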
Recall – What Computer Sees

 Since the pattern does not match exactly, the computer cannot classify this image as 'X' by direct comparison.

[Figure: the 9x9 grid with the changed pixel positions highlighted.]

 What got changed?

Convolutional Layer

 The convolution layer works to identify patterns (features) instead of individual pixels.

Convolutional Layer - Filters

 The CONV layer's parameters consist of a set of learnable filters.
 Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
 During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at each position.

 The three 3x3 filters used in the 'X' example:

Filter 1          Filter 2          Filter 3
 1 -1 -1           1 -1  1          -1 -1  1
-1  1 -1          -1  1 -1          -1  1 -1
-1 -1  1           1 -1  1           1 -1 -1

Multiple Filters

[Figure-only slide: the same input image convolved with several different filters, each producing its own feature map.]

Convolutional Layer – Filters – Computation Example

 Input Size (W): 9x9
 Filter Size (F): 3x3
 Stride (S): 1
 Number of Filters: 1
 Padding (P): 0

[Figure: the 9x9 input convolved with one 3x3 filter produces a 7x7 feature map; each entry is the normalized dot product of the filter with the corresponding 3x3 patch, e.g. 1.00 where the filter matches the patch exactly.]

 Feature Map Size = 1 + (W – F + 2P)/S = 1 + (9 – 3 + 2×0)/1 = 7

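The sliding dot product can be sketched in a few lines of NumPy (an illustrative implementation, not taken from the slides). Dividing each dot product by the 9 filter entries reproduces values such as 1.00 = 9/9 and 0.77 = 7/9 seen in the feature map, which is how the slide's numbers appear to be normalized:

```python
# NumPy sketch of the sliding dot product (a "valid" convolution, i.e. no padding).
# With W = 9, F = 3, P = 0, S = 1 the output is 1 + (9 - 3 + 2*0)/1 = 7, a 7x7 map.
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    W = image.shape[0]
    F = kernel.shape[0]
    out = (W - F) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride + F, j*stride:j*stride + F]
            fmap[i, j] = np.sum(patch * kernel) / kernel.size   # normalize by the 9 entries
    return fmap

diag_filter = np.array([[ 1, -1, -1],
                        [-1,  1, -1],
                        [-1, -1,  1]])
image = -np.ones((9, 9))      # stand-in; use the -1/+1 'X' grid from the slides instead
print(conv2d_valid(image, diag_filter).shape)   # (7, 7)
```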
Convolutional Layer – Filters – Output Feature Map

 Output feature maps of one complete convolution:
– Filters: 3
– Filter Size: 3x3
– Stride: 1
 Conclusion:
– Input Image: 9x9
– Output of Convolution: 7x7x3 (one 7x7 feature map per filter, stacked along the depth dimension)

[Figure: the 9x9 'X' image convolved with each of the three 3x3 filters, giving three 7x7 feature maps with values between -1 and 1.]

Convolutional Layer – Output

[Figure: the full output of the convolution stage — the 9x9 input image and the stack of three 7x7 feature maps produced by the three filters.]

Rectified Linear Units (ReLUs)

 The ReLU stage applies max(0, x) to every value in the feature map: negative values become 0, positive values are unchanged, and the size of the map stays the same.
 For example, the first row of the 7x7 feature map

0.77  -0.11  0.11  0.33  0.55  -0.11  0.33

becomes

0.77  0  0.11  0.33  0.55  0  0.33

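A minimal NumPy sketch of the ReLU stage applied to the first feature map from the slides:

```python
# NumPy sketch: ReLU applied elementwise to the first 7x7 feature map from the slides.
import numpy as np

fmap = np.array([
    [ 0.77, -0.11,  0.11,  0.33,  0.55, -0.11,  0.33],
    [-0.11,  1.00, -0.11,  0.33, -0.11,  0.11, -0.11],
    [ 0.11, -0.11,  1.00, -0.33,  0.11, -0.11,  0.55],
    [ 0.33,  0.33, -0.33,  0.55, -0.33,  0.33,  0.33],
    [ 0.55, -0.11,  0.11, -0.33,  1.00, -0.11,  0.11],
    [-0.11,  0.11, -0.11,  0.33, -0.11,  1.00, -0.11],
    [ 0.33, -0.11,  0.55,  0.33,  0.11, -0.11,  0.77],
])
relu_out = np.maximum(0.0, fmap)   # max(0, x): negatives -> 0, size unchanged
print(relu_out[0])                 # [0.77 0.   0.11 0.33 0.55 0.   0.33]
```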
Pooling Layer

 The pooling layer down-samples the previous layer's feature map.
 Its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network.
 The pooling layer often uses the max operation to perform the downsampling.

Pooling

 Pooling Filter Size: 2x2, Stride: 2
 Sliding a 2x2 window over the 7x7 rectified feature map and keeping the maximum value in each window produces a 4x4 pooled map:

1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77

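A NumPy sketch of the same 2x2, stride-2 max pooling (illustrative code, not from the slides); because 7 does not divide evenly by 2, the last window in each row and column is smaller, which is how the 7x7 map shrinks to 4x4:

```python
# NumPy sketch of 2x2 max pooling with stride 2. Applying it to the rectified 7x7
# map above yields exactly the 4x4 result shown on the slide.
import numpy as np

def max_pool(fmap, size=2, stride=2):
    H, W = fmap.shape
    out_h = (H + stride - 1) // stride
    out_w = (W + stride - 1) // stride
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride + size, j*stride:j*stride + size]
            pooled[i, j] = window.max()   # keep the largest value in each window
    return pooled

print(max_pool(np.random.rand(7, 7)).shape)   # (4, 4)
```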
Pooling Layer: Average Pooling

 Average pooling is an alternative to max pooling: each 2x2 window is replaced by the average of its values instead of the maximum.

Pooling

 Applying 2x2 max pooling (stride 2) to each of the three rectified 7x7 feature maps gives three 4x4 pooled maps:

1.00 0.33 0.55 0.33    0.55 0.33 0.55 0.33    0.33 0.55 1.00 0.77
0.33 1.00 0.33 0.55    0.33 1.00 0.55 0.11    0.55 0.55 1.00 0.33
0.55 0.33 1.00 0.11    0.55 0.55 0.55 0.11    1.00 1.00 0.11 0.55
0.33 0.55 0.11 0.77    0.33 0.11 0.11 0.33    0.77 0.33 0.55 0.33

Layers get stacked

 A complete pass takes the original 9x9 image through convolution, ReLU and pooling, turning it into a stack of three 4x4 feature maps.

[Figure: the 9x9 'X' image on the left and the three 4x4 pooled feature maps it produces on the right.]

Layers Get Stacked - Example

224x224x3 input  →  convolution with 64 filters  →  224x224x64  →  pooling (downsampling)  →  112x112x64

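A quick shape check of this example in PyTorch; the 3x3 kernel with 'same' padding is an assumption, since the slide only specifies 64 filters and 2x2 downsampling:

```python
# Shape check for the 224x224x3 -> 224x224x64 -> 112x112x64 example.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                    # 224 x 224 x 3 input
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 64 filters, spatial size preserved
pool = nn.MaxPool2d(kernel_size=2, stride=2)

h = torch.relu(conv(x))
print(h.shape)         # torch.Size([1, 64, 224, 224])
print(pool(h).shape)   # torch.Size([1, 64, 112, 112])
```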
Deep stacking

 Convolution, ReLU and pooling layers can be repeated ("deep stacking"), shrinking the maps further; in the 'X' example the three 4x4 maps reduce to three 2x2 maps:

1.00 0.55    1.00 0.55    0.55 1.00
0.55 1.00    0.55 0.55    1.00 0.55

Fully connected layer

 Fully connected layers are the normal flat feed-forward neural network layers.
 These layers may have a non-linear activation function or a softmax activation in order to predict classes.
 To compute the output, we simply re-arrange the final output matrices (the three 2x2 maps) into a 1-D array of 12 values.

Fully connected layer

 A summation of products of the inputs and weights at each output node determines the final prediction.

[Figure: the 12-element feature vector fully connected to the two output nodes 'X' and 'O'; the node with the larger weighted sum gives the predicted class.]

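A NumPy sketch of this step (the weights below are random stand-ins, not the trained values from the slides): the three 2x2 pooled maps are flattened into a 12-element vector, and each output node computes a weighted sum of it.

```python
# Fully connected step: flatten the pooled maps, then weighted sums per class.
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.random((3, 2, 2))     # 3 filters x 2 x 2 after deep stacking
features = pooled.reshape(-1)      # 1-D array of 12 values

W = rng.standard_normal((2, 12))   # one weight row per class ('X', 'O')
b = np.zeros(2)
scores = W @ features + b          # summation of products at each output node
print(dict(zip(["X", "O"], scores)))
```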
Putting it all together

[Figure: the full pipeline — the 9x9 input image passes through convolution, ReLU and pooling layers (twice), is flattened into a 1-D vector, and the fully connected layer outputs scores for 'X' and 'O'.]

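Putting the same pipeline into code, here is a hedged end-to-end sketch in PyTorch. The layer sizes follow the slides (three 3x3 filters, 2x2 max pooling, two classes 'X'/'O'); the framework, the padding on the second convolution and the untrained random weights are illustrative assumptions.

```python
# End-to-end sketch of the X/O pipeline.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 3, kernel_size=3),               # 1x9x9  -> 3x7x7 (three 3x3 filters)
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2, ceil_mode=True),    # 3x7x7  -> 3x4x4
    nn.Conv2d(3, 3, kernel_size=3, padding=1),    # 3x4x4  -> 3x4x4 (deep stacking)
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                    # 3x4x4  -> 3x2x2
    nn.Flatten(),                                 # -> 12-element vector
    nn.Linear(12, 2),                             # class scores for 'X' and 'O'
)

x = torch.randn(1, 1, 9, 9)      # one single-channel 9x9 image of -1/+1 values
print(model(x).shape)            # torch.Size([1, 2])
```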
Example of CNN: ReLU Function, Pooling Layer, Stacking Up the Layers

[Figure-only slides.]

Example of ConvNet

Output size for a 37x37 input with a 5x5 filter, padding P = 0 and stride S = 2 (see the helper function below):
(W – F + 2P)/S + 1 = (37 – 5 + 2×0)/2 + 1 = 16 + 1 = 17
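The same feature-map size formula as a small helper function (illustrative):

```python
# Feature-map size: O = (W - F + 2P)/S + 1
def conv_out_size(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

print(conv_out_size(37, 5, P=0, S=2))   # (37 - 5 + 0)/2 + 1 = 16 + 1 = 17
print(conv_out_size(9, 3, P=0, S=1))    # 7, the earlier 9x9 / 3x3 example
```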
Different CNN Architectures

LeNet-5
AlexNet
VGG-19
ResNet
Sample Exercises:

CNN architecture for image classification: The input images are RGB images with
dimensions 128x128 pixels.

Design a CNN architecture with the following components:


• Two convolutional layers with 3x3 filters, ReLU activation, and 32 filters each.
• Max pooling (2x2) after each convolutional layer.
• A fully connected layer with 128 neurons and ReLU activation.
• Output layer with 10 neurons and softmax activation for multiclass classification.
• Calculate the total number of parameters in the convolutional layers, the fully
connected layer, and the entire network. Draw the architecture and show your
calculations step by step (a sketch of the calculation follows below).
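One possible way to organize the parameter-count part of this exercise (a sketch, not an official solution; it assumes 'same' padding so that only the pooling layers change the spatial size, 128 → 64 → 32):

```python
# Parameter counts: conv = (F*F*C_in + 1)*C_out, fully connected = (n_in + 1)*n_out.
def conv_params(f, c_in, c_out):
    return (f * f * c_in + 1) * c_out     # +1 bias per filter

def fc_params(n_in, n_out):
    return (n_in + 1) * n_out             # +1 bias per output neuron

conv1 = conv_params(3, 3, 32)             # 3x3x3 filters, 32 of them   ->    896
conv2 = conv_params(3, 32, 32)            # 3x3x32 filters, 32 of them  ->  9,248
fc1   = fc_params(32 * 32 * 32, 128)      # flattened 32x32x32 volume -> 128 neurons
out   = fc_params(128, 10)                # 128 -> 10 class scores
print(conv1, conv2, fc1, out, conv1 + conv2 + fc1 + out)
```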
Hyperparameters (knobs)

 Convolution
– Filter Size
– Number of Filters
– Padding
– Stride
 Pooling
– Window Size
– Stride
 Fully Connected
– Number of neurons

Case Studies

 LeNet – 1998
 AlexNet – 2012
 ZFNet – 2013
 VGG – 2014
 GoogLeNet – 2014
 ResNet – 2015

 The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Deep vs Shallow Networks
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?

[Plot: training error and test error vs. iterations for a 20-layer and a 56-layer plain network; the 56-layer curves sit above the 20-layer curves in both plots.]

The 56-layer model performs worse on both training and test error.
-> The deeper model performs worse, but this is not caused by overfitting!
Deeper models are harder to optimize.
The deeper model should be able to perform at least as well as the shallower model: a solution by construction is copying the learned layers from the shallower model and setting the additional layers to identity mappings.
Challenges
• Deeper neural networks start to degrade in performance.
• Vanishing/exploding gradients – may require extremely careful parameter initialization to make training work, and can still occur even with the best initialization.
• Long training times – due to the very large number of training parameters.
Partial Solutions for Vanishing/Exploding Gradients
• Batch normalization – rescales the activations over each mini-batch.
• Smart initialization of weights – for example, Xavier initialization.
• Training portions of the network individually.
Related Prior Work - Highway networks

• Adding features from previous layers (or time steps) has been used in various tasks.
• Most notable of these are Highway networks, proposed by Srivastava et al.
• Highway networks feature gated skip connections. Residual networks have the form
  y = f(x) + x
• Highway networks have the form
  y = f(x) · sigmoid(Wx + b) + x · (1 − sigmoid(Wx + b))
Highway networks – cont.
Highway networks enable information flow from earlier layers, but the gating function can attenuate or close that shortcut, unlike the always-open identity shortcut used in residual networks (compare the two forms in the sketch below).
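A tiny NumPy comparison of the two forms above (W, b and the tanh stand-in for f(x) are arbitrary illustrative choices):

```python
# Residual vs. highway connection: the residual shortcut always passes x through,
# while the highway gate scales it by 1 - sigmoid(Wx + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
f = np.tanh(x)                            # stand-in for the layer's transformation f(x)

y_residual = f + x                        # y = f(x) + x
gate = sigmoid(W @ x + b)                 # T(x) = sigmoid(Wx + b)
y_highway = f * gate + x * (1 - gate)     # y = f(x)*T(x) + x*(1 - T(x))
print(y_residual)
print(y_highway)
```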
ResNet

• A specialized network introduced by Microsoft.
• Connects the input of a layer to a later part of the network, creating "shortcuts".
• A simple idea that yields great improvements in both performance and training time.
Plain Network

a[l]  →  a[l+1]  →  a[l+2]

z[l+1] = W[l+1] a[l] + b[l+1]        ("linear")
a[l+1] = g(z[l+1])                   ("relu")
z[l+2] = W[l+2] a[l+1] + b[l+2]      ("linear")
a[l+2] = g(z[l+2])                   ("relu on output")
Residual Blocks

a[l]  →  a[l+1]  →  a[l+2]   (with a shortcut carrying a[l] to the second activation)

z[l+1] = W[l+1] a[l] + b[l+1]        ("linear")
a[l+1] = g(z[l+1])                   ("relu")
z[l+2] = W[l+2] a[l+1] + b[l+2]      ("linear")
a[l+2] = g(z[l+2] + a[l])            ("relu on output plus input")
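The same residual block written as a PyTorch module (an illustrative sketch: the slides use generic linear layers W·a + b, while actual ResNets use convolutions; an identity shortcut is assumed, so input and output channel counts match):

```python
# Minimal residual block following the equations above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, a):
        z1 = self.conv1(a)           # z[l+1] = W[l+1] a[l] + b[l+1]
        a1 = self.relu(z1)           # a[l+1] = g(z[l+1])
        z2 = self.conv2(a1)          # z[l+2] = W[l+2] a[l+1] + b[l+2]
        return self.relu(z2 + a)     # a[l+2] = g(z[l+2] + a[l])

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```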
Skip Connections “shortcuts”

• Such connections are referred to as skip connections or shortcuts; in general, similar models can skip over several layers.
• The residual part of the network is treated as a unit with an input and an output.
• The unit's input is added directly to its output – the dimensions are usually the same.
• Another option is to use a projection to match the output space.
• With identity shortcuts, no additional training parameters are used.
Residual Blocks (skip connections)
Deeper Bottleneck Architecture (Cont.)

• Addresses the high training time of very deep networks.
• Keeps the time complexity the same as the two-layer convolution block.
• Allows the number of layers to be increased.
• Allows the model to converge much faster.
• The 152-layer ResNet has 11.3 billion FLOPs, while the VGG-16/19 nets have 15.3/19.6 billion FLOPs.
Why Do ResNets Work Well?

• Having a "regular" network that is very deep might actually hurt performance because of vanishing and exploding gradients.
• In most cases, ResNets will simply stop improving rather than degrade in performance.
• a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])
• If the layer is not "useful", L2 regularization will bring its parameters very close to zero, resulting in a[l+2] = g(a[l]) = a[l] (when using ReLU).
Why Do ResNets Work Well? (Cont)

• In theory a ResNet has the same expressive power as the corresponding plain network, but in practice, because of the above, convergence is much faster.
• No additional training parameters are introduced.
• No additional complexity is introduced.
Training ResNet in practice

• Batch Normalization after every CONV layer.


• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation
error plateaus.
• Mini-batch size 256.
• Weight decay of 1e-5.
• No dropout used.
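As a sketch, this recipe maps onto PyTorch roughly as follows (the framework and the ReduceLROnPlateau scheduler are assumptions; the slides only state the hyperparameter values):

```python
# Rough translation of the training recipe above.
import torch

model = torch.nn.Linear(10, 10)    # placeholder; use the actual ResNet here
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,              # initial learning rate
                            momentum=0.9,        # SGD + momentum
                            weight_decay=1e-5)   # weight decay from the slide
# divide the learning rate by 10 when validation error plateaus:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
# per epoch, after validation: scheduler.step(val_error)
```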
Loss Function

• For measuring the loss of the model, a combination of softmax and cross-entropy was used.
• The network's outputs are normalized with the softmax function, and the cross-entropy loss is computed on the result.
Reduce Learning Time with Random Layer Drops

• Layers are dropped during training, and the full network is used at test time.
• Residual blocks are used as the network's building blocks.
• During training, the input flows through both the shortcut and the weighted path.
• Training: each block has a "survival probability" and is randomly dropped.
• Testing: all blocks are kept active, and each block's output is re-calibrated according to its survival probability from training.
Thank you!!
