Deep Learning & Image Processing [Notes]
Now that you have been introduced to the basics of machine learning and
how it works, let’s see the different types of machine learning methods.
1. Supervised Learning
In supervised learning, the data is already labelled, which means you
know the target variable. Using this method of learning, systems can
predict future outcomes based on past data. It requires that at least an
input and output variable be given to the model for it to be trained.
Below is an example of a supervised learning method. The algorithm is
trained using labelled data of dogs and cats. The trained model predicts
whether the new image is that of a cat or a dog.
2. Unsupervised Learning
In unsupervised learning, the data is not labelled, which means there is no target variable. The model discovers patterns or groupings in the data (for example, clusters of similar images) on its own.
3. Reinforcement Learning
The goal of reinforcement learning is to train an agent to complete a task
within an uncertain environment. The agent receives observations and a
reward from the environment and sends actions to the environment. The
reward measures how successful an action is with respect to completing the
task goal.
Below is an example that shows how a machine is trained to identify
shapes.
Artificial Neuron
Neural networks are a collection of artificial neurons arranged in a
particular structure. In this segment, you will understand how a single
artificial neuron works, i.e., how it converts inputs into outputs. You will
also understand the topology or structure of large neural networks. Let’s
get started by understanding the basic structure of an artificial neuron.
Here, a represents the inputs, w represents the weights associated with the
inputs, and b represents the bias of the neuron.
Multiple artificial neurons in a neural network are arranged in different
layers. The first layer is known as the input layer, and the last layer is
called the output layer. The layers in between these two are the hidden
layers.
The number of neurons in the input layer is equal to the number of
attributes in the data set, and the number of neurons in the output layer is
determined by the number of classes of the target variable (for a
classification problem).
For a regression problem, the number of neurons in the output layer
would be 1 (a numeric variable). Take a look at the image given below to
understand the topology of neural networks in the case of classification
and regression problems.
Note that the number of hidden layers, the number of neurons in each
hidden layer and the activation functions used in the neural network
change according to the problem, and these details determine the
topology or structure of the neural network.
The most important thing to note is that inputs can only be numeric. For
different types of input data, you need to use different ways to convert
the inputs into a numeric form. The most commonly used inputs for
ANNs are as follows:
Structured data: The type of data used in standard machine learning
algorithms, with multiple features arranged in two dimensions so that it
can be represented in a tabular format, can be used as input for training
ANNs. Such data can be stored in CSV files, MAT files, Excel files, etc.
This is highly convenient because the input to an ANN is usually given as
a numeric feature vector, and structured data eases the process of feeding
the input into the ANN.
Text data: For text data, you can use a one-hot vector or a word
embedding corresponding to a given word. For example, in one-hot
encoding, if the vocabulary size is |V|, you can represent the word wn as a
one-hot vector of size |V| with '1' at the nth element and all other
elements set to zero. The problem with the one-hot representation is that
the vocabulary size |V| is usually huge, at least in the tens of thousands;
hence, it is often better to use word embeddings, which are a
lower-dimensional representation of each word. The one-hot encoded
array of the digits 0–9 will look as shown below.
import numpy as np

# one_hot is not a NumPy built-in; a minimal implementation using np.eye
def one_hot(labels):
    return np.eye(labels.max() + 1)[labels]

data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(data.shape)
one_hot(data)

(10,)
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
Softmax Function
A softmax output is similar to what we get from a multiclass logistic
function and is commonly used to compute the probability of an output
belonging to one of multiple classes. It is given by the following formula:
pi = e^(wi.x) / Σj e^(wj.x)
Suppose the output layer of a network has 3 neurons and all of them
receive the same input x (coming from the previous layers in the
network). The weights associated with them are represented as w0, w1
and w2. In such a case, the probabilities of the input belonging to each of
the classes are expressed as follows:
p0 = e^(w0.x) / (e^(w0.x) + e^(w1.x) + e^(w2.x)), and similarly for p1 and p2.
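A minimal NumPy sketch of this calculation (the input x and the weight vectors w0, w1 and w2 below are illustrative values, not taken from the text):

import numpy as np

x = np.array([1.0, 2.0])              # input from the previous layer (illustrative)
W = np.array([[0.1, 0.4],             # rows are w0, w1 and w2 (illustrative)
              [0.3, 0.2],
              [0.5, 0.6]])

z = W.dot(x)                          # one score per output neuron
p = np.exp(z) / np.sum(np.exp(z))     # softmax: the three probabilities sum to 1
print(p, p.sum())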
Now that you have understood how the output is obtained from the
softmax function and how different types of inputs are fed into the ANN,
let's learn how to define inputs and outputs for image recognition on the
famous MNIST data set for multiclass classification.
There are various problems you will face while trying to recognise
handwritten text using an algorithm, including:
Noise in the image
The orientation of the text
Non-uniformity in the spacing of text
Non-uniformity in handwriting
The MNIST data set takes care of some of these problems, as the digits are
written in a box. Now the only problem the network needs to handle is
the non-uniformity in handwriting. Since the images in the MNIST data
set are 28×28 pixels, the input layer has 784 neurons (each neuron takes
1 pixel as an input) and the output layer has 10 neurons (each giving the
probability of the input image belonging to any of the 10 classes). The
image is classified into the class with the highest probability in the output
layer.
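For instance, each 28×28 image is flattened into a 784-dimensional vector before being fed to the input layer (a small sketch, with a random array standing in for a real MNIST digit):

import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # stand-in for one MNIST digit
x = image.reshape(784)                             # one input neuron per pixel
print(x.shape)                                     # (784,)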
In the image above, you can see that x1, x2 and x3 are the inputs, and
their weighted sum along with bias is fed into the neuron to give the
calculated result as the output.
In the image above, z is the cumulative input. You can see how the
weights scale the inputs depending on their magnitudes. Also, z is the
dot product of the weights and inputs plus the bias.
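In code, the cumulative input of a single neuron is just this dot product plus the bias (the values below are illustrative):

import numpy as np

x = np.array([0.5, -1.2, 2.0])   # inputs x1, x2, x3 (illustrative)
w = np.array([0.4, 0.1, -0.3])   # weights (illustrative)
b = 0.05                         # bias (illustrative)

z = np.dot(w, x) + b             # cumulative input to the neuron
print(z)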
For example, as we can see in the image below, we sometimes have data
in non-linear shapes such as circular or elliptical. If you want to classify
the two circles into two groups, a linear model will not be able to do this,
but a neural network with multiple neurons and non-linear activation
functions can help you achieve this.
For positive inputs, the function simply returns the input value as the
output; for negative inputs, a small parameter scales the value, and we
can control this parameter to decide how much 'learning emphasis'
should be given to negative values (as in the Leaky ReLU).
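A small sketch of the common activation functions described here, assuming the parameterised behaviour refers to the Leaky ReLU family (the parameter alpha is the knob mentioned above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha controls how much 'learning emphasis' negative values receive
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(z), relu(z), leaky_relu(z))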
As you learnt in the previous session, the weight matrix between layer 0
(input layer) and layer 1 (the first hidden layer) is denoted by W. The dot
product between the matrix W and the input vector xi along with the bias
vector b, i.e., W.xi+b, acts as the cumulative input z to layer 1. The
activation function is applied to this cumulative input z to compute the
output h of layer 1.
The dimensions of the input are (3,1). There are two neurons in the first
hidden layer. Hence, the cumulative input z^1 will be given as:
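A small NumPy sketch of this step, with a (3, 1) input and two neurons in the first hidden layer (the weights and biases below are illustrative):

import numpy as np

x = np.array([[0.2], [0.8], [-0.5]])    # input vector, shape (3, 1)
W = np.array([[0.1, -0.4, 0.3],         # layer-1 weight matrix, shape (2, 3)
              [0.6,  0.2, -0.1]])
b = np.array([[0.05], [-0.02]])         # layer-1 bias vector, shape (2, 1)

z1 = W.dot(x) + b                       # cumulative input to layer 1, shape (2, 1)
h1 = 1.0 / (1.0 + np.exp(-z1))          # sigmoid activation gives the layer output
print(z1.shape, h1.shape)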
The housing data used in the upcoming example is shown below, first in its original form and then after standardisation.

Original data:
Number of Rooms    House Size (sq. ft.)    Price ($)
3                  1,340                   313,000
5                  3,650                   2,384,000
3                  1,930                   342,000
3                  2,000                   420,000
4                  1,940                   550,000
2                  880                     490,000

Standardised data:
Std. Number of Rooms    Std. House Size    Std. Price
-0.32                   -0.66              -0.54
1.61                    1.80               2.03
-0.32                   -0.03              -0.51
-0.32                   -0.03              -0.41
0.65                    -0.02              -0.25
-1.29                   -1.15              -0.32
We want to build a neural network that will predict the price of a house,
given two input attributes: number of rooms and house size. Let’s start
with the structure of the neural network that we will consider for this
case. We have an input layer with two input nodes, x1and x2, one hidden
layer with two nodes, a sigmoid activation function and finally an output
layer with a linear activation function (since this is a regression problem),
as shown below.
Let’s start by initialising the weights and biases to the following values:
Remember, the superscript denotes the layer to which it belongs and the
subscript denotes the node in that particular layer.
Hence, performing the forward pass through the neural network using
the input as [-0.32, -0.66] gives us the output as 0.63. The prediction is
very different from the actual value of -0.54, but this is to be expected
because we initialised the neural network with random weights and
biases. As we train the neural network, we will update these parameters
and get better predictions through multiple iterations. In the upcoming
session, we will cover this process in depth.
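A minimal sketch of this forward pass (the weights and biases below are illustrative initial values rather than the ones used in the text, so the exact output will differ from 0.63):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-0.32, -0.66])      # standardised number of rooms and house size

W1 = np.array([[0.2, -0.5],       # hidden-layer weights, shape (2, 2) (illustrative)
               [0.7,  0.1]])
b1 = np.array([0.1, -0.3])        # hidden-layer biases (illustrative)
W2 = np.array([0.4, -0.6])        # output-layer weights (illustrative)
b2 = 0.05                         # output-layer bias (illustrative)

h1 = sigmoid(W1.dot(x) + b1)      # hidden layer with sigmoid activation
y_hat = W2.dot(h1) + b2           # linear output, since this is regression
print(y_hat)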
There are some important things to notice here. In both the regression
and classification problems, the same algorithm is used till the last step.
In the final step, in the classification problem, p defines the probability
vector, which gives the probability of the data point belonging to a
particular class among different possible classes or categories. In the
regression problem, p represents the predicted output obtained, which
we will normally refer to as h^L. Let’s discuss the classification problem.
We use the softmax output, which we had defined in an earlier session,
which gives us the probability vector pi of an input belonging to one of
the multiple output classes (c):
For the regression problem, we can skip the third and fourth steps, i.e.,
computing the probability and normalising the ‘predicted output vector’
p, because in a regression problem, the output is h^L ,i.e., the value we
obtain from the single output node, and we usually compare the output
obtained from the ANN directly with the ground truth. We do not need
to perform any further operations on the predicted output to get
probabilities in a regression problem.
Note that W^o (the weights of the output layer) can also be written as W
^L+1
We have the last weight matrix W^3 as W^O. The output layer classifies
the input into one of these three labels: 1, 2 or 3. The first neuron outputs
the probability for label 1, the second neuron outputs the probability for
label 2 and the third neuron outputs the probability for label 3.
Loss Function
Now that we know how to calculate the predicted output from a neural
network when given an input, we want to check if the neural network
predicted it correctly. We will revisit the calculations we had done in the
previous segment on the housing price prediction problem.
As you can see in the table above, the predicted price is not the same or
even close to the actual price. So, we want to know how wrong the
prediction of the neural network is and want to quantify this error in the
prediction. A loss function or cost function will help us quantify such
errors.
A loss function, or cost function, maps an event or the values of one or
more variables onto a real number that intuitively represents some 'cost'
associated with that 'event', as shown below:
We will see how this adjustment is done in subsequent sessions. For now,
we will concentrate on how to compute the loss.
Let's consider the regression problem where we predict the house price,
given the number of rooms and the size of the house. Here, we will use
the residual sum of squares (RSS) to calculate the loss:
RSS = Σi (yi − ŷi)^2, where yi is the actual value and ŷi is the predicted
value for the ith sample. The MSE is the mean squared error, i.e., the RSS
divided by the number of samples. This gives us a quantified method of
measuring how well the neural network is predicting the output.
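A small sketch of both quantities (using made-up actual and predicted values):

import numpy as np

y_true = np.array([-0.54, 2.03, -0.51])   # actual standardised prices (illustrative)
y_pred = np.array([0.63, 1.10, -0.20])    # network predictions (illustrative)

rss = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
mse = rss / len(y_true)                   # mean squared error
print(rss, mse)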
Now, let's take a look at the loss function for the classification problem.
For classification, the commonly used loss is the cross-entropy loss,
L = −Σc yc log(pc), where yc is 1 for the correct class and 0 otherwise, and
pc is the predicted probability for class c.
Now that we have learnt about the forward pass and the loss function for
regression and classification problems, we know that given any input
and its actual output, we can assess the behaviour of the neural network.
An important point to note is that if the data is large (which is often the
case), the loss calculation itself can get pretty messy. For example, if you
have a million data points, they will be fed into the network (in batches),
the output will be calculated using feed forward, and the loss/cost Li (for
the ith data point) will be calculated. The total loss is the sum of the losses
of all the individual data points. Hence, L = Σi Li.
The total loss L is a function of w's and b's. Once the total loss is
computed, the weights and biases are updated (in the direction of
decreasing loss). In other words, L is minimised with respect to the w's
and b’s.
One important point to note here is that we minimise the average of the
total loss and not the total loss itself, as you will see shortly. Minimising
the average loss implies that the total loss is also minimised.
The same can be written for biases. Note that weights and biases are
often collectively represented by one matrix called W. Going forward, W
will, by default, refer to the matrix of all weights and biases.
The main challenge is that W is a huge matrix, and thus, the total loss L
as a function of W is a complex function.
We took the following steps when passing an input through the network:
UNDERSTANDING CONVOLUTIONS – I
INTRODUCTION
Convolutional Neural Networks, or CNNs, are neural networks
specialised to work with visual data, i.e., images and videos (though not
restricted to them). They are very similar to the vanilla neural networks
or the multilayer perceptrons (MLPs), where every neuron in one layer is
connected to every neuron in the next layer. They also follow the same
general principles of feed forward, back propagation, weights, biases, etc.
However, there are certain features of CNNs that make them perform
extremely well on image processing tasks.
The ANN architecture can solve any problem, but there are two main
limitations associated with the simple MLP architectures:
1. The architecture offers a wide range of variations in the network
through depth (number of layers), width (size of each layer),
activation functions, etc., which makes it practically impossible to
find the best architecture for a given problem.
2. The MLP architecture does not have the capability to preserve any
spatial information from the underlying image. As you have seen
while working with the MNIST data set in the previous module, the
information stored in a 2-D or 3-D format is flattened into a 1-D
array to map one neuron to each pixel.
EVOLUTION OF CNNs
Although the vanilla neural networks (MLPs) can learn extremely
complex functions, their architecture does not exploit what we know
about how the brain reads and processes images. For this reason,
although MLPs are successful in solving many complex problems, they
have not been able to achieve any major breakthroughs in the image
processing domain.
On the other hand, the architecture of CNNs uses many of the working
principles of the animal visual system and, thus, they have been able to
achieve extraordinary results in image-related learning tasks.
You can see the impressive results of CNNs in the ILSVRC and how they
have transformed image processing since the 2010s. Over this period,
there have been many variants of CNNs, such as AlexNet and VGGNet,
which have improved over time. This can be seen from the error rate of
these architectures on the ImageNet data set: it started at nearly 30% in
2010 and has now come down to nearly 4% with the ResNet architecture,
one of the more recent variants of CNN. This is significant because the
error rate is even lower than that of a human (around 5%).
The entire process has taken a lot of time to deliver fruitful results.
3. Scale variation: Different sizes of the object with respect to the image
size.
Each neuron in the retina focuses on one part of the image and that
part of the image is called the receptive field of that neuron. The
following figure shows a certain region of the receptive field of a cat.
The receptive fields of all neurons are almost identical in shape and
size.
The receptive field has excitatory and inhibitory regions. The excitatory
region (denoted by the triangular marks) forms the key feature or the
element that the neuron is trained to look for and the inhibitory region
is like the background (marked by the crosses).
The neurons only ‘fire’ when there is a contrast between the excitatory
and the inhibitory regions. If we splash light over the excitatory and
inhibitory regions together, the neurons do not ‘fire’ (respond) because
of no contrast between them. If we splash light just over the excitatory
region, neurons respond because of the contrast. This helps us identify
a given feature in the receptive field of a particular neuron.
Every neuron is responsible for finding a ‘feature’ but how does this
information translate ahead in the system?
The units or the neurons at the initial level do very basic and specific
tasks such as picking raw features (for example, horizontal edges) in the
image. The subsequent units try to build on top of this to extract more
abstract features such as identifying textures and detecting movement.
The layers 'higher' in the hierarchy typically aggregate the features in the
lower ones.
The following image illustrates the hierarchy in units – the first level
extracts low-level features (such as vertical edges) from the image, while
the second level calculates the statistical aggregate of the first layer to
extract higher-level features (such as texture and colour schemes).
Apart from explaining the visual system, the paper also suggested that
similar phenomena have been observed in the auditory system and in
touch and pressure in the somatosensory system. This suggests that
CNN-like architectures can be used for speech processing and analysing
signals coming from touch sensors or pressure sensors as well.
We have already discussed most of the key ideas of the CNN architecture
through this paper. Let's have a look at some of the conclusions:
Each unit, or neuron, is dedicated to its own receptive field. Thus,
every unit is meant to ignore everything other than what is found in its
own receptive field.
The receptive field of each neuron is almost identical in shape and size.
The subsequent layers compute the statistical aggregate of the previous
layers of units. This is analogous to the 'pooling layer' in a typical
CNN.
You already know that the input to any neural network should be
numeric. Fortunately, images are naturally represented as arrays (or
matrices) of numbers.
Grayscale images: This is the most basic input, as it has only one channel
that stores the shades of grey. As an example, let's consider this sample
image of a 'zero' from the MNIST data set:
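As a sketch (assuming TensorFlow/Keras and its bundled MNIST data set are available), such a digit is simply a 28×28 array of integers between 0 and 255:

from tensorflow.keras.datasets import mnist

(train_images, train_labels), _ = mnist.load_data()
zero = train_images[train_labels == 0][0]   # first training image labelled '0'
print(zero.shape)                           # (28, 28) -- a single grayscale channel
print(zero.min(), zero.max())               # pixel intensities lie between 0 and 255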
One important point to note is that an image with more pixels holds more
information and, hence, has better clarity as compared to an image with a
lower pixel count. You would have come across this terminology while
purchasing smartphones. When we say that the phone has a 10-
megapixel camera, it means each image captured by that camera will
have 10 million pixels. A 20-megapixel camera will have twice as many
pixels and, hence, the captured image will have better clarity because of
the extra details stored in it.
For a video classification task, here's what we can do. Suppose the videos
are of length one minute each. If we extract frames from each video at the
rate of two frames per second (FPS), we will have 120 frames (or images)
per video. These images are then pushed into a convolutional neural
network to extract different features from the image. All this information
is stored in terms of feature vectors. Thus, we have 120 feature vectors
representing each video at the end of the CNN framework.
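A rough sketch of this frame-extraction step using OpenCV (the file name clip.mp4 is hypothetical):

import cv2

cap = cv2.VideoCapture("clip.mp4")       # hypothetical one-minute video file
fps = cap.get(cv2.CAP_PROP_FPS)
step = max(1, int(fps // 2))             # keep roughly two frames per second

frames = []
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if count % step == 0:
        frames.append(frame)             # each frame is an H x W x 3 array
    count += 1
cap.release()
print(len(frames))                       # roughly 120 frames for a one-minute clip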
The CNN architecture has been closely derived from the visual system in
mammals. The VGGNet architecture that we discussed was specially
designed for the ImageNet challenge, a classification task with 1,000
categories. Thus, it takes coloured images as the input and the softmax
layer at the end has 1,000 categories. The blue layers are the
convolutional layers, while the yellow ones are pooling layers.
Lastly, you also have the fully connected (FC) layers that help in the
final task of classification or regression.
The most important point to notice here is that the initial part of the
network (convolution and pooling layers) acts as a feature extractor for
images. For example, the VGGNet discussed earlier can extract a 4096-
dimensional feature vector representing each input image. This feature
vector is fed to the later part of the network that consists of a softmax
layer for classification. However, you can use the feature vector to
perform other tasks (such as video analysis, object detection and image
segmentation).
APPLICATION OF CNNs
Object localisation: Identifying the local region of the objects (as a
rectangular area) and classifying them.
Now that you have a broad sense of the different applications of CNNs,
let's see some more examples of image processing applications in
different industries.
You learnt about the basics of convolutional neural networks and their
common applications in computer vision such as image classification and
object detection. You also learnt that CNNs are not limited to images but
can be extended to videos, text, audio, etc.
The design of CNNs uses many observations from the animal visual
system, such as each retinal neuron looking at its own (identical)
receptive field, some neurons responding proportionally to the
summation over excitatory regions (pooling), and images being
perceived in a hierarchical manner.
You learnt that images are naturally represented in the form of arrays of
numbers. Grayscale images have a single channel, while colour images
have three channels (RGB). The number of channels or the 'depth' of the
image can vary depending on how we represent the image. Each channel
value of a pixel, usually an integer between 0 and 255, indicates the
'intensity' of a certain colour.
Coming to the CNN architecture, a typical CNN unit (or layer) in a large
CNN-based network comprises convolution layers, pooling layers and
fully connected layers.
UNDERSTANDING CONVOLUTIONS - I
You have a fair idea of the three main terms related to the CNN
architecture:
Convolutions
Feature maps
Pooling
These components are closely derived from how the visual system works
in mammals. They perform three broad tasks as listed below:
1. Scanning the image
2. Identifying different features
3. Aggregating the learning from individual neurons in succeeding
layers
The blue-coloured 4×4 matrix is the input image. The 3×3 box that moves
over the blue map is the filter, and the resultant output is the green-
coloured 2×2 matrix.
Let’s take a look at another example for more clarity.
Convolution Example
Consider the image shown below and convolve it with a 3×3 filter to
produce a 3×3 array.
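A minimal NumPy sketch of this operation (stride 1, no padding), using an illustrative 5×5 image and a 3×3 averaging filter:

import numpy as np

def convolve2d(image, kernel):
    # 'Valid' convolution with stride 1: slide the kernel over every patch
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # illustrative 5x5 image
kernel = np.ones((3, 3)) / 9.0                     # simple averaging filter
print(convolve2d(image, kernel))                   # 3x3 output array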
UNDERSTANDING CONVOLUTIONS - II
You saw an example of how the convolution operation (using an
appropriate filter) detects certain features in images, such as edges.
Note that each column of the 4×4 output matrix looks at exactly three
columns of the input image. The idea behind using this filter is that it
captures the amount of change (or gradient) in the intensity of the
corresponding columns in the input image along the horizontal direction.
Other Filters
Based on the filter for vertical edge detection, you can design a filter for
detecting horizontal edges. As shown below, the objective now is to
capture the change in intensity along the vertical direction.
Although only simple filters have been discussed here, you can design
arbitrarily complex filters for detecting edges and other patterns. For
example, the following image shows the Sobel filter, which can detect
both horizontal and vertical edges in complex images.
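For reference, the standard Sobel kernels look as follows; the sketch below (assuming SciPy is available) applies the vertical-edge kernel to a toy image containing a sharp vertical edge:

import numpy as np
from scipy.signal import correlate2d

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])   # responds to horizontal intensity change (vertical edges)
sobel_y = sobel_x.T                # responds to vertical intensity change (horizontal edges)

image = np.zeros((5, 5))
image[:, 3:] = 255.0               # a sharp vertical edge in a toy image

print(correlate2d(image, sobel_x, mode="valid"))   # strong response along the edge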
STRIDE
In the previous segment, the filter was moved by exactly one pixel (both
horizontally and vertically) while performing convolutions. However, it
is not the only way to do convolutions. You can move the filter by an
arbitrary number of pixels based on the task requirement. This concept is
known as stride. There is nothing sacrosanct about the stride length 1.
You can alter the stride value based on your underlying objective.
If you increase the stride value, some of the information will be lost in
translation from one layer to another. If you think that you do not need
many fine-grained features for your task, you can use a higher stride
length (2 or more).
As you can see, a higher stride value still produces a similar (though
smaller) result. The benefit of increasing the stride length is faster
processing, since less information is passed between the layers. Let's
understand this concept through the image provided below.
Case 1:
In the first case, suppose you are building a CNN model to identify
the presence of any water body in a given satellite image. This task
can be done smartly by not exploring each and every pixel of the
image. The output will be ‘yes’ if the object is found in any part of
the image. Hence, you can skip a few pixels and save both time and
computational resources in the process.
Case 2:
Suppose you are building a model on the same image to extract the
region covered by the water body in the given area. Here, you will
be expected to closely map the entire structure of the water body,
and hence, you would need to extract all the granular elements from
the image. Therefore, the stride value should not be kept high.
PADDING
So far, you have gained an understanding of the basic purpose of
convolution. The convolution process helps identify features such as
edges, shapes and objects in an image. You can control the level or the
granularity of these features using the stride value of the filter. Moreover,
the level of abstraction can be controlled using the number of layers
stacked in the architecture.
However, this task gets challenging with an increase in the filter size or
the stride value. Both these practices result in a reduction in size from
input to output. If you remember, the 7×7 matrix shrunk to 5×5 when
convolved with a 3×3-sized filter and a stride of 1. Owing to the reduction
in size, it becomes difficult for the latter layers to perform the task of
convolution.
As mentioned in this video, padding helps you manage the size of the
output of convolution. Padding of ‘x’ means that ‘x units’ of rows and
columns are added all around the image. As shown in the image given
below, padding of 1 has been introduced around a 5×5 image.
When you convolve an image without padding (using any filter size), the
output size is smaller than the image, i.e., the output ‘shrinks’. However,
with the introduction of padding layers, the size of the output increases.
It is important to note that only the width and the height decrease (not
the depth) when you convolve without padding. The depth of the output
depends on the number of filters used, which will be discussed in a later
segment.
Large CNNs have tens (or even hundreds) of such convolutional layers
(recall VGGNet). So, you might incur massive ‘information loss’ as you
build deeper networks. This is one of the main reasons why padding is
important: It helps maintain the size of the output arrays and avoid
information loss. Of course, in many layers, you want to shrink the
output (as shown below), but in many others, you maintain the size of the
output.
CHOOSING VALUES
You learnt that there are multiple parameters associated with the
convolution process, which are listed below:
Filter size
Stride
Padding
However, you cannot convolve the images with just any combination of
these attributes. You need to alter them based on your input image and
the required output from the architecture. Let’s understand this aspect in
more detail.
You cannot convolve a 6×6 image with a 3×3 filter using a stride of 2.
Let's revisit the formula used to calculate the output size from the input
size, filter size, padding and stride length:
output size = (n + 2p − f)/s + 1, which must be a whole number.
Here, (6 + 2×0 − 3)/2 + 1 = 2.5, which is not an integer, so this
combination is invalid.
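A quick helper that applies this formula and flags invalid combinations (a sketch):

def conv_output_size(n, f, p=0, s=1):
    # Output size = (n + 2p - f) / s + 1; it must be a whole number to be valid
    size = (n + 2 * p - f) / s + 1
    if not size.is_integer():
        raise ValueError(f"Invalid combination: output size would be {size}")
    return int(size)

print(conv_output_size(7, 3, p=0, s=1))      # 7x7 image, 3x3 filter, stride 1 -> 5
try:
    print(conv_output_size(6, 3, p=0, s=2))  # 6x6 image, 3x3 filter, stride 2
except ValueError as e:
    print(e)                                 # invalid: 2.5 is not a whole number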
The only difference is that now, the filters will be 3D, such as 3×3×3 or
5×5×3. In these examples, the last ‘3’ indicates that the filter has as many
channels as the image. For example, if you have an image of size
224×224×3, you can use filters of sizes 3×3×3, 5×5×3, 7×7×3, etc. (with
appropriate padding).
Instead of 9 values (3×3), the filter will now hold 27 values (3×3×3). In the
next segment, you will understand how to determine these values in
order to extract a feature from the image.
In some cases, you are only interested in one particular feature of the
image, such as a vertical edge, a colour or a texture. Here, you are not
asking the network to learn anything; you use the architecture simply to
extract a basic feature. For such tasks, you would be expected to design
the filter yourself, as it will only check for the required feature across
all the patches of the image.
Components:
I: Image tensor (5×5×3)
W: Weight tensor (3×3×3)
P: Patch from the image to be convolved using the filter (3×3×3)
So, the output of the convolution process will be a matrix that holds the
dot product of the two tensors (W and P).
This equation is similar to the feed forward equation, as the inputs from
the image are multiplied by the filter values to obtain the output for that
particular layer. As in the ANN architecture, the filters are learnt during
training, i.e., during back propagation. Hence, the individual values of
the filters are often called the weights of a CNN.
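A sketch of one such step using the components listed above: a 3×3×3 patch P of a 5×5×3 image I is convolved with a 3×3×3 filter W by unrolling both into vectors and taking their dot product (the values are random, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
I = rng.random((5, 5, 3))                 # image tensor (height x width x channels)
W = rng.random((3, 3, 3))                 # filter with the same depth as the image

P = I[0:3, 0:3, :]                        # top-left 3x3x3 patch of the image
print(np.dot(W.ravel(), P.ravel()))       # one scalar per patch

out = np.zeros((3, 3))                    # sliding over all patches gives a 3x3 map
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(I[i:i + 3, j:j + 3, :] * W)
print(out.shape)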
In the discussion so far, you have only learnt about weights; however,
convolutional layers (that is, filters) also have biases. Let’s take a look at
an example to understand this concept better. Suppose you have an RGB
image and a 2×2×3 filter, as shown below. The filter has three channels,
each of which convolves the corresponding channel of the image. Thus,
each step in the convolution involves the element-wise multiplication of
the filter and the corresponding patch of the image, followed by a sum
over all the products. The following image depicts the convolution
operation. Note that in each step, a single scalar number is generated, and
at the end of the convolution, a 2D array is generated.
You can express the convolution operation as a dot product between the
weights and the input image. If you treat the 2×2×3 filter as a vector w of
length 12 and the 12 corresponding elements of the input image as the
vector p (that is, both are unrolled from a 3D tensor to a 1D vector), each
step of the convolution is simply the dot product of w^T and p. The dot
product is computed at every patch to obtain a 3×3 output array, as
shown in the image above.
Apart from the weights, each filter can also have a bias. In this case, the
output of the convolutional operation is a 3×3 array (or a vector of length
9). So, the bias will be a vector of length 9. However, a common practice
in CNNs is that all the individual elements in the bias vector have the
same value (called tied biases). For example, a tied bias for the filter
shown in the image given above can be represented as shown below:
FEATURE MAPS
As you learnt in the previous segment, the values of the filters, or the
weights, are learnt during training. Let's now understand how multiple
filters are used to detect various features in images. In this segment, you
will learn the concepts of neurons and feature maps.
The following figure shows feature maps derived from the input image
using three different filters.
You can have multiple such neurons convolving an image, each having a
different set of weights and each producing a feature map.
Convolution in VGGNet
Before moving on to the next component of CNNs, let’s summarise the
entire process of convolution using the VGGNet architecture.
The first convolutional layer takes the input image of size 224×224×3 and
convolves it with 64 filters of size 3×3×3 (with some padding), each
producing a 224×224 output. Each 224×224 output is then fed to a ReLU to
generate a 224×224 feature map. Note that the term 'feature map' refers to
the (non-linear) output of the activation function, not to its input (that is,
not to the raw output of the convolution).
The 64 feature maps, or the 224×224×64 tensor, are then fed to a pooling
layer, which you will explore in the next segment. However, as you
proceed ahead, the network extracts more features and finally ends up
with the 1×1×4096 tensor. This suggests that there are 4096 different
feature maps at the end of the convolution blocks. Now, the architecture
can leverage these 4096 features to classify the image accurately.
POOLING
The convolution layer takes care of the first two steps by extracting
different features from the image at different levels. After extracting the
features (in the form of feature maps), CNNs typically aggregate these
features using the pooling layer. Let’s take a look at how the pooling
layer works and how it is useful in extracting higher-level features.
The two most popular aggregate functions used in pooling are ‘max’ and
‘average’. The intuition behind these functions is as follows:
Max pooling: If any one of the patches says something strongly about
the presence of a certain feature, then the pooling layer counts that
feature as ‘detected’.
Average pooling: If one patch says something very firmly about the
presence of a certain feature but the other ones disagree, the pooling
layer takes the average of all the patches in the window to decide.
On the other hand, the pooling process also has a disadvantage: it loses a
lot of information. Having said that, pooling has empirically been shown
to improve the performance of most deep CNNs.
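A small NumPy sketch of max and average pooling with a 2×2 window and stride 2, applied to an illustrative 4×4 feature map:

import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 8, 5],
                 [0, 1, 3, 4]], dtype=float)       # illustrative 4x4 feature map

blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)   # a 2x2 grid of 2x2 patches
max_pooled = blocks.max(axis=(2, 3))               # max pooling
avg_pooled = blocks.mean(axis=(2, 3))              # average pooling
print(max_pooled)                                  # [[6. 2.] [7. 8.]]
print(avg_pooled)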
Pooling in VGGNet
As mentioned in the video, the pooling layer of VGGNet is defined as
follows:
Window size: 2×2 (the four values in each 2×2 patch are aggregated into one value)
Stride: 2
The convolution and pooling layers together form a CNN unit or a CNN
layer. This unit is responsible for extracting the features from the image.
You start with an original image and do convolutions using different
filters to get multiple feature maps at different levels. The pooling layer
calculates the statistical aggregate of the feature maps to reduce the
spatial size and to make the entire process robust against the local
transformations.
Once all the features are extracted, the output from the convolution layers
is flattened in the FC layers. As shown in the above image, the size of the
last pooling layer (7×7×512) is reduced to (1×1×4096). These 4096 features
are fed to the softmax layer for the final task of classification.
SUMMARY
This session focused on the key components of convolution neural
networks (CNNs), which are as follows:
Convolution layer
Feature maps
Pooling layer
The formula to calculate the output shape after convolution is given by:
output size = (n + 2p − f)/s + 1, where n is the input size, f is the filter
size, p is the padding and s is the stride.
The filters are learned during training (back propagation). Each filter
(consisting of weights and biases) is called a neuron, which covers a small
patch of the entire image. Multiple neurons are used to convolve an
image (or feature maps from the previous layers) to generate new feature
maps. The feature maps contain the output of convolution + non-linear
activation operations on the input.
The architecture is given in the table provided below (taken from the
original paper). Each column in the table (from A-E) denotes an
architecture that the team had experimented with. In this discussion, we
will refer to only column D, which refers to VGG-16 (column E is VGG-
19).
In all the convolutional layers, the same stride length of 1 pixel is used
with a padding of 1 pixel on each side, thereby preserving the spatial
dimensions (height and width) of the output.
After every set of convolutional layers, there is a max pooling layer. All
the pooling layers in the network use a window of 2×2 pixels with stride
2. Finally, the output of the last pooling layer is flattened and fed to a
fully connected (FC) layer with 4,096 neurons, followed by another FC
layer of 4,096 neurons, and finally to a 1000-softmax output. The softmax
layer uses the usual cross-entropy loss. All layers apart from the softmax
use the ReLU activation function.
The number of parameters and the output size from any layer can be
calculated as demonstrated in the MNIST Notebook on the previous
page. For example, the first convolutional layer takes a (224, 224, 3) image
as the input and has 64 filters of size (3, 3, 3). Note that the depth of a
filter is always equal to the number of channels in the input that it
convolves. Thus, the first convolutional layer has 64 x 3 x 3 x 3 (weights) +
64 (biases) = 1,792 trainable parameters. Since stride and padding of 1
pixel are used, the output spatial size is preserved, and the output will be
(224, 224, and 64).
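A quick check of this calculation, along with the output size formula from earlier:

# Parameters of the first VGG-16 convolutional layer
filters, k, in_channels = 64, 3, 3
params = filters * k * k * in_channels + filters   # weights + biases
print(params)                                      # 1792

# Output spatial size with stride 1 and padding 1: (224 + 2*1 - 3)/1 + 1 = 224
n, p, s = 224, 1, 1
out = (n + 2 * p - k) // s + 1
print(out, out, filters)                           # 224 224 64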