Module 3 Notes
CNN architecture
A Convolutional Neural Network (CNN) consists of multiple layers: an input layer, convolutional
layers, pooling layers, and fully connected layers.
The convolutional layer applies filters to the input image to extract features, the pooling
layer downsamples the resulting feature maps to reduce computation, and the fully connected
layer makes the final prediction. The network learns the filter weights through backpropagation
and gradient descent.
Convolutional Neural Networks, or convnets, are neural networks that share their parameters
across spatial positions. Imagine you have an image. It can be represented as a cuboid with a
length and width (the dimensions of the image) and a height (i.e., the channels, as images
generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called
a filter or kernel, on it, with, say, K outputs, and representing them vertically. Now slide
that neural network across the whole image. As a result, we get another image with a different
width, height, and depth. Instead of just R, G, and B channels, we now have more channels
but a smaller width and height. This operation is called convolution. If the patch size is the
same as that of the image, it is a regular neural network. Because the patch is small, we
have fewer weights.
Image source: Deep Learning Udacity
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
• Convolution layers consist of a set of learnable filters (or kernels) having small
widths and heights and the same depth as that of input volume (3 if the input layer
is image input).
• For example, suppose we run a convolution on an image with dimensions
34×34×3. The possible filter sizes are a×a×3, where 'a' can be something like
3, 5, or 7, but small compared to the image dimensions.
• During the forward pass, we slide each filter across the whole input volume step
by step. The step size is called the stride (commonly 1; values of 2, 3, or even
4 are used for high-dimensional images). At each step we compute the dot product
between the kernel weights and the patch from the input volume.
• As we slide our filters, we get a 2-D output for each filter. Stacking these
outputs together gives an output volume with a depth equal to the number
of filters. The network will learn all the filters.
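The forward pass described in the bullets above can be sketched in NumPy as follows. This is a minimal, unoptimised illustration, not a reference implementation: the function name conv2d_forward, the choice of 8 filters of size 5×5×3, and the random input values are assumptions made for the example, and no padding is applied.

```python
import numpy as np

def conv2d_forward(x, filters, stride=1):
    """Slide each filter over x and take dot products (no padding).

    x:       input volume, shape (H, W, C)
    filters: filter bank, shape (K, a, a, C); depth C must match the input
    returns: output volume, shape (H_out, W_out, K)
    """
    H, W, C = x.shape
    K, a, _, filter_depth = filters.shape
    assert C == filter_depth, "filter depth must equal input depth"
    H_out = (H - a) // stride + 1
    W_out = (W - a) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for i in range(H_out):
        for j in range(W_out):
            # Patch of the input currently covered by the filters.
            patch = x[i * stride:i * stride + a, j * stride:j * stride + a, :]
            for k in range(K):
                # Dot product between kernel weights and the patch.
                out[i, j, k] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((34, 34, 3))   # hypothetical 34x34 RGB input
bank = rng.random((8, 5, 5, 3))   # hypothetical bank of 8 filters, a = 5
volume = conv2d_forward(image, bank, stride=1)
print(volume.shape)  # (30, 30, 8): depth equals the number of filters
```

With a stride of 2 the same call would give a 15×15×8 volume, since the filter visits only every second position.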
What is convolution?
In purely mathematical terms, convolution is a function derived from two given functions by
integration which expresses how the shape of one is modified by the other.
Here are the three elements that enter into the convolution operation:
• Input image
• Feature detector
• Feature map
Sometimes a 5×5 or a 7×7 matrix is used as a feature detector, but the more conventional one,
and that is the one that we will be working with, is a 3×3 matrix. The feature detector is often
referred to as a “kernel” or a “filter,” which you might come across as you dig into other
material on the topic.
You can think of the feature detector as a window consisting of 9 (3×3) cells. Here is what you
do with it:
• You place it over the input image beginning from the top-left corner within the
borders you see demarcated above, and then you count the number of cells in which
the feature detector matches the input image.
• The number of matching cells is then inserted in the top-left cell of the feature map.
• You then move the feature detector one cell to the right and do the same thing. This
movement is called a stride, and since we are moving the feature detector one cell at a time,
that would be called a stride of one pixel.
• What you will find in this example is that the feature detector's middle-left cell with
the number 1 inside it matches the cell that it is standing over inside the input image.
That's the only matching cell, and so you write “1” in the next cell in the feature map,
and so on and so forth.
• After you have gone through the whole first row, you can then move it over to the
next row and go through the same process.
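The procedure in the steps above can be sketched as follows. The original figure is not reproduced here, so the 5×5 binary image and the 3×3 feature detector are hypothetical values chosen for illustration. The sketch also assumes that a "matching cell" means a cell where both the detector and the image contain a 1, which for binary inputs is the sum of the element-wise product.

```python
import numpy as np

def feature_map(image, detector, stride=1):
    """Slide the detector over the image and count matching cells."""
    h, w = detector.shape
    rows = (image.shape[0] - h) // stride + 1
    cols = (image.shape[1] - w) // stride + 1
    out = np.zeros((rows, cols), dtype=int)
    for r in range(rows):          # move down one row at a time
        for c in range(cols):      # move right one cell at a time
            window = image[r * stride:r * stride + h,
                           c * stride:c * stride + w]
            # Cells where detector and image are both 1.
            out[r, c] = int(np.sum(window * detector))
    return out

# Hypothetical binary input image and feature detector.
image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0]])
detector = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]])

print(feature_map(image, detector))
# [[2 0 2]
#  [0 5 0]
#  [2 0 2]]
```

The centre cell of the result is 5 because the detector lines up exactly with the pattern in the middle of the image; the cells of a feature map are not restricted to 0 and 1.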
It's important not to confuse the feature map with the other two elements. The cells of the
feature map can contain any digit, not only 1's and 0's. After going over every pixel in the input
image in the example above, we would end up with these results:
By the way, just as a feature detector can also be referred to as a kernel or a filter, a feature
map is also known as an activation map, and the two terms are interchangeable.
• A digital image is a 2-D grid of pixels. Since a neural network expects a vector as input,
one idea for dealing with images is to flatten the image and feed the flattened
vector to the neural network. This works to some extent, but the flattened
vector is not the same for a translated image.
The neural network would have to learn very different parameters in order to classify the
objects, which is a difficult job since natural images vary widely (lighting, translation,
viewing angle, ...).
It is also worth mentioning that the input vector would be relatively large, 64*64*3 values
for a 64×64 RGB image, which can cause memory problems: a first layer with just 10 neurons
alone would have 64*64*3*10 = 122,880 weights to train.
• Sparse Connectivity : when processing an image, the input image might have
thousands or millions of pixels, but we can detect small, meaningful features such
as edges with kernels that occupy only tens or hundreds of pixels. This means that
we need to store fewer parameters, which both reduces the memory requirements
of the model and improves its statistical efficiency. It also means that computing
the output requires fewer operations. These improvements in efficiency are usually
quite large. If there are m inputs and n outputs, then matrix multiplication requires
m×n parameters and the algorithms used in practice have O(m × n) runtime (per
example). If we limit the number of connections each output may have to k, then
the sparsely connected approach requires only k × n parameters and O(k × n)
runtime.
• Parameter sharing : In a convolutional neural net, each member of the kernel is
used at every position of the input (except perhaps some of the boundary pixels,
depending on the design decisions regarding the boundary). The parameter sharing
used by the convolution operation means that rather than learning a separate set of
parameters for every location, we learn only one set. This does not affect the
runtime of forward propagation (it is still O(k × n)), but it does further reduce the
storage requirements of the model to k parameters.
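The parameter counts discussed above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names and the choice of m = n = 64*64 with a 3×3 kernel (k = 9) are assumptions for the example, not values from the notes.

```python
def dense_parameters(m, n):
    """Fully connected: every output is connected to every input."""
    return m * n

def sparse_parameters(k, n):
    """Sparse connectivity: every output sees only k inputs."""
    return k * n

def shared_parameters(k):
    """Parameter sharing: the same k weights are reused at every position."""
    return k

m = 64 * 64   # hypothetical number of inputs (one 64x64 channel)
n = 64 * 64   # hypothetical number of outputs
k = 9         # hypothetical 3x3 kernel

print(dense_parameters(m, n))    # 16777216
print(sparse_parameters(k, n))   # 36864
print(shared_parameters(k))      # 9

# The flattened-image example: 64*64*3 inputs into 10 neurons.
print(dense_parameters(64 * 64 * 3, 10))  # 122880
```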
Pooling Layers
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, they
reduce the number of parameters to learn and the amount of computation
performed in the network.
• The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on
summarised features instead of precisely positioned features generated by the
convolution layer. This makes the model more robust to variations in the
position of the features in the input image.
Max Pooling
1. Max pooling is a pooling operation that selects the maximum element from the
region of the feature map covered by the filter. Thus, the output of a
max-pooling layer is a feature map containing the most prominent features of
the previous feature map.
Average Pooling
1. Average pooling computes the average of the elements present in the region of
the feature map covered by the filter. Thus, while max pooling gives the most
prominent feature in a particular patch of the feature map, average pooling gives
the average of the features present in a patch.
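Both pooling operations can be sketched with one small NumPy function. The function name pool2d, the 2×2 window with stride 2, and the 4×4 feature map below are assumptions made for the example.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Summarise each size x size region of a 2-D feature map."""
    if mode == "max":
        reduce = np.max
    elif mode == "average":
        reduce = np.mean
    else:
        raise ValueError("mode must be 'max' or 'average'")
    rows = (fmap.shape[0] - size) // stride + 1
    cols = (fmap.shape[1] - size) // stride + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            region = fmap[r * stride:r * stride + size,
                          c * stride:c * stride + size]
            out[r, c] = reduce(region)
    return out

# Hypothetical 4x4 feature map.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 1, 8]])

print(pool2d(fmap, mode="max"))
# [[6. 4.]
#  [7. 9.]]
print(pool2d(fmap, mode="average"))
# [[3.75 2.25]
#  [4.   4.75]]
```

In both cases the 4×4 map is reduced to 2×2; max pooling keeps the strongest response in each region, while average pooling blends all four values.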
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that
is typically added after convolutional layers. The pooling layer is used to reduce the spatial
dimensions (i.e., the width and height) of the feature maps, while preserving the depth (i.e.,
the number of channels).
1. The pooling layer works by dividing the input feature map into a set of
regions, called pooling regions, which are typically non-overlapping. Each pooling
region is then transformed into a single output value, which summarises the presence
of a particular feature in that region. The most common types of pooling operations
are max pooling and average pooling.
2. In max pooling, the output value for each pooling region is simply the maximum
value of the input values within that region. This has the effect of preserving the
most salient features in each pooling region, while discarding less relevant
information. Max pooling is often used in CNNs for object recognition tasks, as
it helps to identify the most distinctive features of an object, such as its edges
and corners.
3. In average pooling, the output value for each pooling region is the average of the
input values within that region. This has the effect of preserving more
information than max pooling, but may also dilute the most salient features.
Average pooling is often used in CNNs for tasks such as image segmentation
and object detection, where a more fine-grained representation of the input is
required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with
each pooling layer reducing the spatial dimensions of the feature maps, while the
convolutional layers extract increasingly complex features from the input. The resulting
feature maps are then passed to a fully connected layer, which performs the final
classification or regression task.
Advantages of Pooling Layer:
1. Dimensionality reduction: The main advantage of pooling layers is that they
help in reducing the spatial dimensions of the feature maps. This reduces the
computational cost and also helps in avoiding overfitting by reducing the
number of parameters in the layers that follow.
2. Translation invariance: Pooling layers are also useful in achieving approximate
invariance to small translations in the feature maps. This means that a small shift
in the position of an object in the image has little effect on the classification
result, as the same features are detected regardless of their exact position.
3. Feature selection: Pooling layers can also help in selecting the most important
features from the input, as max pooling selects the most salient features and
average pooling preserves more information.
Disadvantages of Pooling Layer:
1. Information loss: One of the main disadvantages of pooling layers is that they
discard some information from the input feature maps, which can be important
for the final classification or regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature
maps, which can result in the loss of some fine-grained details that are important
for the final classification or regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as
the size of the pooling regions and the stride, which need to be tuned in order to
achieve optimal performance. This can be time-consuming and requires some
expertise in model building.
We can imagine a convolutional net as being similar to a fully connected net, but with an
infinitely strong prior over its weights: the weights for one hidden unit must be identical to
the weights of its neighbour, but shifted in space.
Like any other prior, convolution and pooling are only useful when the assumptions made by the
prior are reasonably accurate. If they are not, they can cause underfitting. When a task involves
incorporating information from very distant locations in the input, the prior imposed by
convolution may be inappropriate.
Full Convolution
Zero padding, stride 1
Convolution with a stride greater than 1 pixel is equivalent to convolution with a stride of 1
followed by downsampling:
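The equivalence stated above can be checked numerically. The sketch below is an illustration under assumptions: the helper correlate2d_valid, the 9×9 random input, the 3×3 kernel, and the stride of 2 are all chosen for the example, and no padding is used.

```python
import numpy as np

def correlate2d_valid(x, kernel, stride=1):
    """Slide a 2-D kernel over x without padding."""
    a, b = kernel.shape
    rows = (x.shape[0] - a) // stride + 1
    cols = (x.shape[1] - b) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = x[i * stride:i * stride + a, j * stride:j * stride + b]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.random((9, 9))        # hypothetical input
kernel = rng.random((3, 3))   # hypothetical kernel
s = 2                         # stride

# Direct strided convolution.
strided = correlate2d_valid(x, kernel, stride=s)
# Stride-1 convolution, then keep every s-th row and column.
downsampled = correlate2d_valid(x, kernel, stride=1)[::s, ::s]

print(strided.shape, downsampled.shape)   # (4, 4) (4, 4)
print(np.allclose(strided, downsampled))  # True
```

The two-step version computes many values that are then thrown away, which is why a strided convolution is normally implemented directly.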
Some zero padding, stride 1
Without zero padding, the width of the representation shrinks by one pixel less than the kernel
width at each layer. We are forced to choose between shrinking the spatial extent of the network
rapidly and using small kernels. Zero padding allows us to control the kernel width and the size
of the output independently.
Usually the optimal amount of zero padding lies somewhere between 'valid' (no padding) and
'same' (just enough padding to keep the output the same size as the input).
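The effect of padding on the output width can be sketched with the usual output-size formula. The function name and the example sizes (width 34, kernel width 5) are assumptions for illustration, and the 'same' case assumes an odd kernel width so that the padding is symmetric.

```python
def conv_output_width(n, k, padding=0, stride=1):
    """Output width for input width n, kernel width k, and
    `padding` zeros added on each side."""
    return (n + 2 * padding - k) // stride + 1

n, k = 34, 5  # hypothetical input and kernel widths

valid = conv_output_width(n, k, padding=0)             # no padding
same = conv_output_width(n, k, padding=(k - 1) // 2)   # output width == n
full = conv_output_width(n, k, padding=k - 1)          # every pixel visited k times
print(valid, same, full)  # 30 34 38

# Without padding the width shrinks by k - 1 at every layer.
widths = [n]
for _ in range(3):
    widths.append(conv_output_width(widths[-1], k))
print(widths)  # [34, 30, 26, 22]
```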
Unshared Convolution
In some cases we do not want to use convolution but rather a locally connected layer, also
called unshared convolution: the connectivity is the same as in a convolution, but every output
location has its own weights. The weight tensor W is therefore indexed by output position as
well as by channel and kernel offset.
This is useful when we know that each feature should be a function of a small part of space, but
there is no reason to think that the same feature should occur across all of space, e.g. looking
for a mouth only in the bottom half of the image.
It can also be useful to make versions of convolution or locally connected layers in which the
connectivity is further restricted, e.g. constraining each output channel i to be a function of
only a subset of the input channels.
Advantages: reduced memory consumption, increased statistical efficiency, and reduced
computation for both forward and backward propagation.
Tiled Convolution
Learn a set of kernels that we rotate through as we move through space. Immediately
neighboring locations will have different filters, but the memory requirement for storing the
parameters will increase only by a factor of the size of this set of kernels, rather than by
the size of the whole output feature map as in a locally connected layer. Comparison of locally
connected layers, tiled convolution and standard convolution:
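The three schemes can be compared in one dimension with a single helper, since they differ only in how many kernels are cycled through. This is a sketch under assumptions: the function tiled_conv_1d, the input length of 10, the kernel width of 3, and the random weights are all hypothetical choices for illustration.

```python
import numpy as np

def tiled_conv_1d(x, kernels):
    """1-D tiled convolution without padding.

    kernels has shape (t, k); output position j uses kernels[j % t].
    t = 1 gives standard convolution, and t equal to the number of
    output positions gives a locally connected (unshared) layer.
    """
    x = np.asarray(x, dtype=float)
    kernels = np.asarray(kernels, dtype=float)
    t, k = kernels.shape
    n_out = len(x) - k + 1
    return np.array([x[j:j + k] @ kernels[j % t] for j in range(n_out)])

rng = np.random.default_rng(0)
x = rng.random(10)       # hypothetical input
k = 3                    # kernel width
n_out = len(x) - k + 1   # 8 output positions

standard = tiled_conv_1d(x, rng.random((1, k)))      # one shared kernel
tiled = tiled_conv_1d(x, rng.random((2, k)))         # two kernels, alternating
local = tiled_conv_1d(x, rng.random((n_out, k)))     # one kernel per position

for name, t in [("standard", 1), ("tiled", 2), ("locally connected", n_out)]:
    print(f"{name}: {t * k} stored weights")
# standard: 3 stored weights
# tiled: 6 stored weights
# locally connected: 24 stored weights
```

All three produce outputs of the same length; only the number of stored weights changes.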
Locally connected layers and tiled convolutional layers with max pooling: the detector units of
these layers are driven by different filters. If the filters learn to detect different
transformed versions of the same underlying features, then the max-pooled units become
invariant to the learned transformation.
Structured Output
Convolutional networks can be used to output a high-dimensional structured object, rather than
just predicting a class label for a classification task or a real value for a regression task.
E.g., the model might emit a tensor S where S_{i,j,k} is the probability that pixel (j, k) of
the input belongs to class i.
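A tensor of this kind can be sketched as follows. The example assumes the network's last layer produces one score map per class and that a softmax over the class axis turns the scores into probabilities; the function name, the 3 classes, the 4×5 image size, and the random scores are all hypothetical.

```python
import numpy as np

def pixelwise_class_probs(scores):
    """Turn per-class score maps of shape (classes, H, W) into
    probabilities S, where S[i, j, k] is the probability that
    pixel (j, k) belongs to class i."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4, 5))   # hypothetical network output
S = pixelwise_class_probs(scores)
labels = S.argmax(axis=0)             # most likely class for each pixel

print(S.shape)       # (3, 4, 5)
print(labels.shape)  # (4, 5)
```

At every pixel the probabilities over the classes sum to 1, so taking the argmax along the class axis gives a label map with the same height and width as the score maps.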
One issue is that the output plane can be smaller than the input plane. Review:
Repeating this refinement step several times corresponds to using the same convolution at each
stage, sharing weights between the last layers of the deep net. Recurrent Convolutional Network:
Data Type
A convolutional net can also process inputs with varying spatial extents.
• Makes sense: the inputs are the same kind of observation in different amounts, such as
images of different sizes, recordings of different lengths, observations of different
widths over space, and so forth.
• Does not make sense: the input optionally includes different kinds of observations, such
as convolving the same weights over features corresponding to grades as well as
features corresponding to test scores.
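The first case can be illustrated with a 1-D kernel applied to recordings of different lengths. The kernel values and the two recordings are hypothetical; the point of the sketch is only that the same weights apply unchanged and the output length follows the input length.

```python
import numpy as np

kernel = np.array([0.25, 0.5, 0.25])          # hypothetical shared weights

short_recording = np.arange(8, dtype=float)   # 8 samples
long_recording = np.arange(20, dtype=float)   # 20 samples

# The same kernel slides over both inputs; no weights change.
short_out = np.correlate(short_recording, kernel, mode="valid")
long_out = np.correlate(long_recording, kernel, mode="valid")

print(short_out.shape, long_out.shape)  # (6,) (18,)
```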
Efficient Convolution Algorithms
How to speed up convolution?
Even techniques that improve the efficiency of only forward propagation are useful because, in
commercial settings, it is typical to devote more resources to the deployment of a network than
to its training.