Module 3 Notes

A Convolutional Neural Network (CNN) is a deep learning architecture primarily used in computer vision to interpret visual data by extracting features from images through various layers including convolutional, pooling, and fully connected layers. The convolutional layer applies filters to the input image to create feature maps, while pooling layers reduce dimensionality to enhance computational efficiency and robustness. CNNs leverage properties like sparse connectivity and parameter sharing to effectively learn from high-dimensional data while maintaining translation invariance.


Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of deep learning neural network architecture commonly used in computer vision. Computer vision is a field of Artificial Intelligence that enables a computer to understand and interpret images or other visual data. A CNN is an extended version of the artificial neural network (ANN), used predominantly to extract features from grid-like data, for example visual datasets like images or videos, where spatial patterns play an extensive role.

CNN architecture

A Convolutional Neural Network consists of multiple layers: the input layer, convolutional layers, pooling layers, and fully connected layers.
The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.

How Convolutional Layers Work

Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a width and height (the dimensions of the image) and a depth (the channels, as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
Image source: Deep Learning Udacity

Now let’s talk about a bit of the mathematics involved in the whole convolution process. A minimal code sketch of this computation follows the list.
• Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as that of the input volume (3 if the input is an RGB image).
• For example, if we have to run convolution on an image with dimensions 34×34×3, the possible size of the filters is a×a×3, where ‘a’ can be anything like 3, 5, or 7, but smaller than the image dimensions.
• During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and compute the dot product between the kernel weights and the patch from the input volume.
• As we slide our filters, we get a 2-D output for each filter, and we stack them together; as a result, we get an output volume with a depth equal to the number of filters. The network will learn all the filters.
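
Here is a naive NumPy sketch of this sliding dot product, written for clarity rather than speed; the function name, random data, and shapes are our own illustrative assumptions:

```python
import numpy as np

def conv_forward(x, filters, stride=1):
    """Naive forward pass of a convolutional layer (no zero padding).

    x:       input volume of shape (H, W, C)
    filters: kernel bank of shape (F, k, k, C)
    returns: output volume of shape (H_out, W_out, F)
    """
    H, W, C = x.shape
    F, k, _, _ = filters.shape
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    out = np.zeros((H_out, W_out, F))
    for f in range(F):                      # one 2-D output map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+k, j*stride:j*stride+k, :]
                out[i, j, f] = np.sum(patch * filters[f])  # dot product
    return out

# The 34x34x3 example above, with twelve 5x5x3 filters and stride 1:
x = np.random.randn(34, 34, 3)
w = np.random.randn(12, 5, 5, 3)
print(conv_forward(x, w).shape)  # (30, 30, 12)
```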

Layers used to build ConvNets

A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.

Let’s take an example by running a convnet on an image of dimension 32 × 32 × 3.


• Input Layer: This is the layer in which we give input to our model. In a CNN, the input will generally be an image or a sequence of images. This layer holds the raw input image with width 32, height 32, and depth 3.
• Convolutional Layers: This layer is used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are small matrices, usually of 2×2, 3×3, or 5×5 shape. Each filter slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we get an output volume of dimension 32 × 32 × 12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. This layer applies an element-wise activation function to the output of the convolutional layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 × 32 × 12.
• Pooling Layer: This layer is periodically inserted in the convnet. Its main function is to reduce the size of the volume, which makes the computation fast, reduces memory use, and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 × 2 filters and stride 2, the resultant volume will be of dimension 16 × 16 × 12.

Image source: cs231n.stanford.edu

• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
• Fully Connected Layers: This layer takes the input from the previous layer and computes the final classification or regression task.
• Output Layer: The output from the fully connected layers is then fed into a classification function such as sigmoid or softmax, which converts the output for each class into a probability score for that class. The whole stack for this 32 × 32 × 3 example is sketched in code below.
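
To tie the running example together, here is a minimal Keras sketch of this layer stack (assuming TensorFlow/Keras is available; the ten output classes are an illustrative choice, not something fixed by these notes):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),              # input layer: 32 x 32 x 3
    layers.Conv2D(12, 3, padding='same'),         # 12 filters -> 32 x 32 x 12
    layers.Activation('relu'),                    # element-wise, shape unchanged
    layers.MaxPooling2D(pool_size=2, strides=2),  # downsample -> 16 x 16 x 12
    layers.Flatten(),                             # 16 * 16 * 12 = 3072 values
    layers.Dense(10, activation='softmax'),       # per-class probability scores
])
model.summary()
```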

What is convolution?

In purely mathematical terms, convolution is a function derived from two given functions by integration, expressing how the shape of one is modified by the other.
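
In symbols, the convolution of two functions f and g is

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau,$$

and the discrete two-dimensional analogue for an image I and kernel K is

$$S(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n).$$

(In practice, most deep learning libraries implement the closely related cross-correlation, which skips flipping the kernel, and still call it convolution.)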

The Convolution Operation

Here are the three elements that enter into the convolution operation:

• Input image
• Feature detector
• Feature map

Sometimes a 5×5 or a 7×7 matrix is used as a feature detector, but the more conventional one,
and that is the one that we will be working with, is a 3×3 matrix. The feature detector is often
referred to as a “kernel” or a “filter,” which you might come across as you dig into other
material on the topic.

How exactly does the Convolution Operation work?

You can think of the feature detector as a window consisting of 9 (3×3) cells. Here is what you
do with it:

• You place it over the input image, beginning from the top-left corner within the borders you see demarcated above, and then you count the number of cells in which the feature detector matches the input image.
• The number of matching cells is then inserted in the top-left cell of the feature map.
• You then move the feature detector one cell to the right and do the same thing. This movement is called a stride, and since we are moving the feature detector one cell at a time, that would be called a stride of one pixel.
• What you will find in this example is that the feature detector's middle-left cell with the number 1 inside it matches the cell that it is standing over inside the input image. That's the only matching cell, and so you write “1” in the next cell in the feature map, and so on and so forth.
• After you have gone through the whole first row, you can then move down to the next row and go through the same process.

It's important not to confuse the feature map with the other two elements. The cells of the feature map can contain any number, not only 1's and 0's. After going over every pixel in the input image in the example above, the feature map is complete; a toy version of this procedure follows in code.
By the way, just as the feature detector can also be referred to as a kernel or a filter, a feature map is also known as an activation map, and the terms are interchangeable.
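
A small NumPy sketch of this match-counting procedure (the binary input image and detector below are made-up examples, not the ones from the original figures):

```python
import numpy as np

image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0]])   # hypothetical binary input image
detector = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])      # hypothetical 3x3 feature detector

h, w = image.shape
k = detector.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1), dtype=int)
for i in range(h - k + 1):            # stride of one pixel, row by row
    for j in range(w - k + 1):
        patch = image[i:i+k, j:j+k]
        # count cells where both the detector and the image contain a 1
        feature_map[i, j] = np.sum(patch * detector)
print(feature_map)
```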

Motivation for CNN

• A digital image is a 2-D grid. Since a plain neural network expects a vector as input, one idea for dealing with images would be to flatten the image and feed the output of the flattening operation to the neural network, and this would work to some extent.

But that flattened vector won't be the same for a translated image. The neural network would then have to learn very different parameters in order to classify the same objects, which is a difficult job since natural images vary greatly (lighting, translation, viewing angle, and so on).

It is also worth mentioning that the input vector would be relatively big: 64 × 64 × 3 = 12,288 values for a small RGB image. This can cause memory problems, since even a first layer with just 10 neurons would already have 64 × 64 × 3 × 10 = 122,880 weights to train.

Natural images have two main characteristics:

• Locality : nearby pixels are more strongly correlated

• Translation invariance: meaningful patterns can occur anywhere in the image

How do Convolutional Neural Networks solve the problem for images?

The answer lies in three characteristics of the CNN (a numerical sketch of the parameter savings follows this list):

• Sparse Connectivity: When processing an image, the input might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime.
• Parameter Sharing: In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation, which is still O(k × n), but it does further reduce the storage requirements of the model to k parameters.

• Equivariance: In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation: if the input shifts, the output shifts in the same way, so we can use the same network parameters to detect local patterns at many locations in the image.
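
As a quick back-of-the-envelope check of these claims, the sketch below compares the three parameter counts; the sizes m, n, and k are our own illustrative assumptions:

```python
# Parameter counts for a layer mapping a small RGB image to an
# output of the same spatial size (illustrative numbers).
m = 64 * 64 * 3      # number of inputs
n = 64 * 64 * 3      # number of outputs
k = 3 * 3 * 3        # connections per output for a 3x3x3 kernel

print(f"fully connected (m x n):    {m * n:,}")  # 150,994,944 parameters
print(f"sparsely connected (k x n): {k * n:,}")  # 331,776 parameters
print(f"shared convolution kernel:  {k:,}")      # 27 parameters
```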

Pooling Layers
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it
reduces the number of parameters to learn and the amount of computation
performed in the network.
• The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on
summarised features instead of precisely positioned features generated by the
convolution layer. This makes the model more robust to variations in the
position of the features in the input image.

Types of Pooling Layers:

Max Pooling
1. Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
Average Pooling
1. Average pooling computes the average of the elements present in the region of
feature map covered by the filter. Thus, while max pooling gives the most
prominent feature in a particular patch of the feature map, average pooling gives
the average of features present in a patch.

In convolutional neural networks (CNNs), the pooling layer is a common type of layer that
is typically added after convolutional layers. The pooling layer is used to reduce the spatial
dimensions (i.e., the width and height) of the feature maps, while preserving the depth (i.e.,
the number of channels).
1. The pooling layer works by dividing the input feature map into a set of non-
overlapping regions, called pooling regions. Each pooling region is then
transformed into a single output value, which represents the presence of a
particular feature in that region. The most common types of pooling operations
are max pooling and average pooling.
2. In max pooling, the output value for each pooling region is simply the maximum
value of the input values within that region. This has the effect of preserving the
most salient features in each pooling region, while discarding less relevant
information. Max pooling is often used in CNNs for object recognition tasks, as
it helps to identify the most distinctive features of an object, such as its edges
and corners.
3. In average pooling, the output value for each pooling region is the average of the
input values within that region. This has the effect of preserving more
information than max pooling, but may also dilute the most salient features.
Average pooling is often used in CNNs for tasks such as image segmentation
and object detection, where a more fine-grained representation of the input is
required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with
each pooling layer reducing the spatial dimensions of the feature maps, while the
convolutional layers extract increasingly complex features from the input. The resulting
feature maps are then passed to a fully connected layer, which performs the final
classification or regression task.
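
A minimal NumPy sketch of both pooling operations on a single-channel feature map (a toy example of our own):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2 pooling with stride 2 over non-overlapping regions."""
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            region = fmap[i:i+2, j:j+2]
            out[i // 2, j // 2] = region.max() if mode == "max" else region.mean()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 2., 0., 1.],
                 [1., 1., 5., 2.],
                 [0., 2., 3., 4.]])
print(pool2x2(fmap, "max"))   # [[4. 2.] [2. 5.]] -- most prominent values
print(pool2x2(fmap, "avg"))   # [[2.5 1. ] [1.  3.5]] -- region averages
```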
Advantages of Pooling Layer:
1. Dimensionality reduction: The main advantage of pooling layers is that they
help in reducing the spatial dimensions of the feature maps. This reduces the
computational cost and also helps in avoiding overfitting by reducing the
number of parameters in the model.
2. Translation invariance: Pooling layers are also useful in achieving translation
invariance in the feature maps. This means that the position of an object in the
image does not affect the classification result, as the same features are detected
regardless of the position of the object.
3. Feature selection: Pooling layers can also help in selecting the most important
features from the input, as max pooling selects the most salient features and
average pooling preserves more information.
Disadvantages of Pooling Layer:
1. Information loss: One of the main disadvantages of pooling layers is that they
discard some information from the input feature maps, which can be important
for the final classification or regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature
maps, which can result in the loss of some fine-grained details that are important
for the final classification or regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as
the size of the pooling regions and the stride, which need to be tuned in order to
achieve optimal performance. This can be time-consuming and requires some
expertise in model building.

Convolution and Pooling as an infinitely strong prior


• A weak prior has high entropy, e.g. a Gaussian with high variance. Such a prior allows the data to move the parameters more or less freely.
• A strong prior has low entropy, e.g. a Gaussian with low variance. Such a prior plays a more active role in determining where the parameters end up.

We can imagine a convnet as being similar to a fully connected net, but with an infinitely strong prior over its weights: the weights for one hidden unit must be identical to the weights of its neighbour, but shifted in space (and zero outside that unit's small, spatially contiguous receptive field).

Overall, the infinitely strong prior says:

• Conv layer: the function the layer should learn contains only local interactions and is equivariant to translation.
• Pooling layer: each unit should be invariant to small translations.

Like any other prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate; if not, they cause underfitting. When a task involves incorporating information from very distant locations in the input, the prior imposed by convolution may be inappropriate.

Variants of convolution functions

Strides and zero padding

Convolution with a stride greater than one pixel is equivalent to convolution with unit stride followed by downsampling.

Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are then forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.

Special cases of zero padding:

• Valid: no zero padding is used, which limits the number of layers the network can contain.
• Same: enough zero padding is added to keep the size of the output equal to the size of the input, allowing an unlimited number of layers. However, pixels near the border influence fewer output pixels than pixels near the center.
• Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that performs well at all positions in such a convolutional feature map.

Usually the optimal amount of zero padding lies somewhere between 'valid' and 'same'. The three output widths are compared in the sketch below.
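
For a concrete comparison, here is a small sketch of the output widths under the three schemes, for unit stride (the input width m and kernel width k are assumed values):

```python
m, k = 32, 5          # input width and kernel width (assumptions)
valid = m - k + 1     # no padding: shrinks by k - 1            -> 28
same  = m             # pad to keep output = input size         -> 32
full  = m + k - 1     # pad so every pixel is visited k times   -> 36
print(valid, same, full)
```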
Unshared Convolution

In some cases we do not want to use convolution, but rather a locally connected layer; this is known as unshared convolution. The indices into the weight tensor W are:

• i: the output channel
• j: the output row
• k: the output column
• l: the input channel
• m: the row offset within the input
• n: the column offset within the input

(Figure: comparison of local connections, convolution, and full connection.)

Unshared convolution is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g. when looking for a mouth only in the bottom half of an image.
It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, e.g. constraining each output channel i to be a function of only a subset of the input channels.

Advantages: reduced memory consumption, increased statistical efficiency, and reduced computation for both forward and backward propagation.

Tiled Convolution

Tiled convolution learns a set of kernels that we rotate through as we move through space. Immediately neighbouring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels.

(Figure: comparison of locally connected layers, tiled convolution, and standard convolution.)

In locally connected layers and tiled convolutional layers with max pooling, the detector units are driven by different filters. If the filters learn to detect different transformed versions of the same underlying feature, then the max-pooled units become invariant to the learned transformation.
Structured Output
Convolutional networks can be used to output a high-dimensional structured object, rather than just predicting a class label for a classification task or a real value for a regression task. For example, the model might emit a tensor S where S_{i,j,k} is the probability that pixel (j, k) of the input belongs to class i.
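
A tiny NumPy sketch of such an output tensor (shapes and values are our own toy example): raw scores of shape (classes, H, W) are turned into per-pixel class probabilities with a softmax over the class axis.

```python
import numpy as np

scores = np.random.randn(3, 4, 4)        # 3 classes over a 4x4 output plane
e = np.exp(scores - scores.max(axis=0, keepdims=True))
S = e / e.sum(axis=0, keepdims=True)     # softmax over the class axis
print(S.sum(axis=0))                     # all ones: a distribution per pixel
print(S[1, 2, 3])                        # P(pixel (2, 3) belongs to class 1)
```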
One issue is that the output plane can be smaller than the input plane.

Strategies for the size reduction issue:

• avoid pooling altogether
• emit a lower-resolution grid of labels
• use a pooling operator with unit stride
One strategy for pixel-wise labeling of images is to:

1. Produce an initial guess of the image labels.
2. Refine this initial guess using the interactions between neighboring pixels.

Repeating this refinement step several times corresponds to using the same convolution at each stage, sharing weights between the last layers of the deep net; this yields a recurrent convolutional network.
Data Types

Convolutional networks can also process inputs with varying spatial extents.

Images of different sizes:

• No further design is needed when the task is to label each pixel.
• Further design, such as adding pooling layers whose pooling regions scale in size proportionally to the size of the input, is needed when the task is to produce one label per image.

Convolution over inputs of different sizes:

• Makes sense when the inputs are varying amounts of the same kind of observation: images of different sizes, recordings of different lengths, observations of different widths over space, and so forth.
• Does not make sense when the inputs optionally have different kinds of observations, such as convolving the same weights over features corresponding to grades as well as features corresponding to test scores.
Efficient Convolution Algorithms
How to speed up convolution?

1. Parallel computation resources
2. Selecting appropriate algorithms:

• Fourier transform: convert the input and the kernel into frequency space, perform pointwise multiplication, then convert the result back to the original domain using an inverse Fourier transform.
• When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. Composing d one-dimensional convolutions, one with each of these vectors, is significantly faster than performing one d-dimensional convolution with their outer product. The naive approach requires O(w^d) runtime and parameter storage, while the separable approach requires O(w × d) runtime and storage. Both ideas are checked numerically in the sketch after this list.
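
The sketch below is a toy numerical verification of our own, not production code:

```python
import numpy as np

# 1) Fourier transform: circular convolution equals pointwise
#    multiplication in frequency space.
n = 16
x = np.random.randn(n)
k1 = np.random.randn(n)
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k1)))
direct = np.array([sum(x[m] * k1[(i - m) % n] for m in range(n))
                   for i in range(n)])       # direct circular convolution
print(np.allclose(via_fft, direct))          # True

# 2) Separability: a 2-D kernel that is an outer product of two 1-D
#    vectors can be applied as two cheap 1-D passes.
def conv2d_valid(img, kern):
    kh, kw = kern.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kern)
    return out

v = np.array([1., 2., 1.])                   # vertical 1-D kernel
h = np.array([1., 0., -1.])                  # horizontal 1-D kernel
K = np.outer(v, h)                           # separable 3x3 kernel
img = np.random.randn(8, 8)
one_pass = conv2d_valid(img, K)
two_pass = conv2d_valid(conv2d_valid(img, v.reshape(3, 1)), h.reshape(1, 3))
print(np.allclose(one_pass, two_pass))       # True
```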

Even techniques that improve the efficiency of only forward propagation are useful, because in commercial settings it is typical to devote more resources to the deployment of a network than to its training.
