
FDP on Image Processing and Deep Learning using Python

DEEP LEARNING & IMAGE PROCESSING [NOTES-1]

What is Artificial Intelligence?

AI is a broad term that describes the capability of a machine to learn
and solve problems just like humans. In other words, AI refers to the
replication of human intelligence: how humans think, work and function.
Artificial Intelligence is the concept of creating smart, intelligent
machines.
Machine Learning is a subset of artificial intelligence that helps you
build AI-driven applications.
Deep Learning is a subset of machine learning that uses vast volumes of
data and complex algorithms to train a model.

How Does Machine Learning Work?


Machine learning accesses vast amounts of data (both structured and
unstructured) and learns from it to predict the future. It learns from the
data by using multiple algorithms and techniques. Below is a diagram
that shows how a machine learns from data.

Now that you have been introduced to the basics of machine learning and
how it works, let’s see the different types of machine learning methods.

Types of Machine Learning


Machine learning algorithms are classified into three main categories:

1. Supervised Learning
In supervised learning, the data is already labelled, which means you
know the target variable. Using this method of learning, systems can
predict future outcomes based on past data. It requires that at least an
input and output variable be given to the model for it to be trained.
Below is an example of a supervised learning method. The algorithm is
trained using labelled data of dogs and cats. The trained model predicts
whether the new image is that of a cat or a dog.

Some examples of supervised learning include linear regression, logistic
regression, support vector machines, Naive Bayes, and decision trees.

2. Unsupervised Learning


Unsupervised learning algorithms employ unlabelled data to discover


patterns from the data on their own. The systems are able to identify
hidden features from the input data provided. Once the data is more
readable, the patterns and similarities become more evident.
Below is an example of an unsupervised learning method that trains a
model using unlabelled data. In this case, the data consists of different
vehicles. The purpose of the model is to classify each kind of vehicle.

Some examples of unsupervised learning include k-means clustering,


hierarchical clustering, and anomaly detection.

3. Reinforcement Learning
The goal of reinforcement learning is to train an agent to complete a task
within an uncertain environment. The agent receives observations and a
reward from the environment and sends actions to the environment. The
reward measures how successful an action is with respect to completing the
task goal.
Below is an example that shows how a machine is trained to identify
shapes.


Examples of reinforcement learning algorithms include Q-learning and


Deep Q-learning Neural Networks.
Machine Learning Processes
Machine Learning involves seven steps:

Machine Learning Applications

 Sales forecasting for different products


 Fraud analysis in banking
 Product recommendations
 Stock price prediction

What Is Deep Learning?


Deep learning can be considered a subset of machine learning. It is a
field in which systems learn and improve on their own using computer
algorithms. While machine learning uses simpler concepts,
deep learning works with artificial neural networks, which are designed
to imitate how humans think and learn. Until recently, neural networks
were limited by computing power and thus were limited in complexity.


However, advancements in Big Data analytics have permitted larger,
sophisticated neural networks, allowing computers to observe, learn, and
react to complex situations faster than humans. Deep learning has aided
image classification, language translation and speech recognition. It can be
used to solve any pattern recognition problem without human
intervention.
Artificial neural networks, comprising many layers, drive deep learning.
Deep Neural Networks (DNNs) are such types of networks where each
layer can perform complex operations such as representation and
abstraction that make sense of images, sound, and text. Considered the
fastest-growing field in machine learning, deep learning represents a
truly disruptive digital technology, and it is being used by increasingly
more companies to create new business models.

Deep Learning vs. Machine Learning


Aspect | Machine Learning | Deep Learning
Data Dependency | Requires less data to train effectively. | Needs large amounts of data to train effectively.
Hardware Requirements | Generally less demanding; can work on low-end machines. | Requires high-end hardware (especially GPUs) due to its computational complexity.
Interpretability | Often more interpretable due to simpler models. | Less interpretable because of complex model architectures.
Feature Engineering | Requires manual intervention for feature extraction and selection. | Learns features automatically, minimizing the need for manual feature engineering.
Training Time | Typically faster to train than deep learning models. | Requires longer training times due to more complex architectures.
Model Complexity | Utilizes simpler algorithms like linear regression, decision trees, etc. | Uses complex neural networks with multiple layers.
Application Scope | Well-suited for small to medium-sized data sets and simpler problems. | Excels in areas with substantial data and complex problems like image and speech recognition.
Output Interpretation | Outputs are generally in the form of numerical values, labels, or simple categories. | Outputs can be more complex, like entire new images or sequences of text.
Real-time Learning | More feasible, with models that require less computational power. | Less feasible due to the heavy computational requirements.
Algorithm Variability | Involves a variety of algorithms that can be applied depending on the type and structure of the data. | Primarily revolves around different architectures of deep neural networks.
Human Intervention | More dependent on human expertise for setting up models and choosing the right algorithms. | Less human intervention in processing raw data, but requires careful network architecture design.
Software Libraries | Libraries like Scikit-learn and WEKA are commonly used. | Libraries like TensorFlow, Keras, and PyTorch are more tailored to deep learning.
Problem-Solving Approach | Approaches problems with traditional algorithms that may or may not involve iterative learning. | Approaches problems through layers of abstraction, learning from vast amounts of data.
Success with Unstructured Data | Less effective with unstructured data unless it is carefully pre-processed. | Highly effective with unstructured data like text, images, and audio.
Update and Re-training | Easier and quicker to update and retrain with new data. | More complex and time-consuming to update and retrain models with new data.

Artificial Neuron
Neural networks are a collection of artificial neurons arranged in a
particular structure. In this segment, you will understand how a single
artificial neuron works, i.e., how it converts inputs into outputs. You will
also understand the topology or structure of large neural networks. Let’s
get started by understanding the basic structure of an artificial neuron.

However, in perceptrons, the commonly used activation/output function is the
step function, whereas in the case of ANNs, the activation functions are
non-linear functions.


Here, a represents the inputs, w represents the weights associated with the
inputs, and b represents the bias of the neuron.
Multiple artificial neurons in a neural network are arranged in different
layers. The first layer is known as the input layer, and the last layer is
called the output layer. The layers in between these two are the hidden
layers.
The number of neurons in the input layer is equal to the number of
attributes in the data set, and the number of neurons in the output layer is
determined by the number of classes of the target variable (for a
classification problem).
For a regression problem, the number of neurons in the output layer
would be 1 (a numeric variable). Take a look at the image given below to
understand the topology of neural networks in the case of classification
and regression problems.


Note that the number of hidden layers or the number of neurons in each
hidden layer or the activation functions used in the neural network
changes according to the problem, and these details determine the
topology or structure of the neural network.

So far, you have understood the basic structure of artificial neural


networks. To summarise, there are six main elements that must be
specified for any neural network. They are as follows:
 Input layer
 Output layer
 Hidden layers
 Network topology or structure
 Weights and biases
 Activation functions

Inputs and Outputs of a Neural Network



The number of neurons in the input layer is determined by the input


given to the network, and the number of neurons in the output layer is
equal to the number of classes (for a classification task) or is one (for a
regression task).

The most important thing to note is that inputs can only be numeric. For
different types of input data, you need to use different ways to convert
the inputs into a numeric form. The most commonly used inputs for
ANNs are as follows:
 Structured data: The type of data that we use in standard machine
learning algorithms with multiple features and available in two
dimensions, such that the data can be represented in a tabular format,
can be used as input for training ANNs. Such data can be stored
in CSV files, MAT files, Excel files, etc. This is highly convenient
because the input to an ANN is usually given as a numeric feature
vector. Such structured data eases the process of feeding the input into
the ANN.
 Text data: For text data, you can use a one-hot vector or word
embedding corresponding to a certain word. For example, in one hot
vector encoding, if the vocabulary size is |V|, then you can represent
the word wn as a one-hot vector of size |V| with '1' at the nth element
with all other elements being zero. The problem with one-hot
representation is that, usually, the vocabulary size |V| is huge, in tens
of thousands at least; hence, it is often better to use word embeddings
that are a lower-dimensional representation of each word. The one-hot
encoded array of the digits 0–9 will look as shown below.

import numpy as np

def one_hot(a):
    # Row i has a single 1 at the column given by a[i], all other entries are zero
    return np.eye(a.max() + 1)[a]

data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(data.shape)
one_hot(data)

(10,)
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

 Images: Images are naturally represented as arrays of numbers and


can thus be fed into the network directly. These numbers are the raw
pixels of an image. ‘Pixel’ is short for ‘picture element’. In images,
pixels are arranged in rows and columns (an array of pixel elements).
The figure given below shows the image of a handwritten 'zero' in the
MNIST data set (black and white) and its corresponding
representation in NumPy as an array of numbers. The pixel values are
high where the intensity is high, i.e., the colour is bright, while the
values are low in the black regions, as shown below.

 Images (cont.): In a neural network, each pixel of the input image is


a feature. For example, the image provided above is an 18 x 18 array.
Hence, it will be fed as a vector of size 324 into the network. Note that
the image given above is black and white (also called a grayscale
image), and thus, each pixel has only one ‘channel’. If it were
a coloured image called an RGB (Red, Green and Blue) image, each
pixel would have three channels, one each for red, blue, and green, as
shown below. Hence, the number of neurons in the input layer would
be 18 x 18 x 3 = 972. The three channels of an RGB image are shown


below.

 Speech: In the case of a speech/voice input, the basic input unit is in


the form of phonemes. These are the distinct units of speech in any
language. The speech signal is in the form of waves, and to convert
these waves into numeric inputs, you need to use Fourier Transforms
(you do not need to worry about this, as it involves areas of
specialised mathematics that will not be covered in this course). Note
that the input after conversion should be numeric, so you are able to
feed it into a neural network.
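To make the image inputs described above concrete, here is a minimal NumPy sketch (the pixel values are randomly generated for illustration) showing how a grayscale 18 x 18 image flattens into a vector of 324 features, and how an 18 x 18 x 3 RGB image flattens into 972 features:

import numpy as np

# A made-up grayscale image: 18 x 18 pixels, intensities in [0, 255]
gray = np.random.randint(0, 256, size=(18, 18))
print(gray.flatten().shape)      # (324,) -> 324 input neurons

# A made-up RGB image: 18 x 18 pixels with 3 channels
rgb = np.random.randint(0, 256, size=(18, 18, 3))
print(rgb.flatten().shape)       # (972,) -> 972 input neurons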
Now that you have learnt how to feed input vectors into neural networks,
let’s understand how the output layers are specified.
Depending on the nature of the given task, the outputs of neural
networks can either be in the form of classes (if it is a classification
problem) or numeric (if it is a regression problem). One of the commonly
used output functions is the softmax function for classification. Take a
look at the graphical representation of the softmax function shown below.


Softmax Function
A softmax output is similar to what we get from a multiclass logistic
function commonly used to compute the probability of an output
belonging to one of the multiple classes. It is given by the following
formula:
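The formula itself is not reproduced in these notes; the standard softmax expression, consistent with the definitions given below, is:

p_i = \frac{e^{w_i \cdot x'}}{\sum_{j=0}^{c-1} e^{w_j \cdot x'}}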

where c is the number of classes or neurons in the output layer, x′ is the
input to the network, and wi are the weights associated with the inputs.

Suppose the output layer of a data set has 3 neurons and all of them have
the same input x (coming from the previous layers in the network). The
weights associated with them are represented as w0, w1and w2. In such a
case, the probabilities of the input belonging to each of the classes are
expressed as follows:
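The expressions from the original figure are not reproduced here; for three output neurons with weights w0, w1 and w2 and a common input x, the standard softmax probabilities are:

p_0 = \frac{e^{w_0 \cdot x}}{e^{w_0 \cdot x} + e^{w_1 \cdot x} + e^{w_2 \cdot x}}, \quad
p_1 = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x} + e^{w_1 \cdot x} + e^{w_2 \cdot x}}, \quad
p_2 = \frac{e^{w_2 \cdot x}}{e^{w_0 \cdot x} + e^{w_1 \cdot x} + e^{w_2 \cdot x}}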


So, we have seen the softmax function as a commonly used output
function in multiclass classification. Now, let's understand how the
softmax function translates to the sigmoid function in the special case
of binary classification.
In the case of a sigmoid output, there is only one neuron in the output
layer because if there are two classes with probabilities p0 and p1, we
know that p0 + p1 = 1. Hence, we need to compute the value of
either p0 or p1. In other words, the sigmoid function is just a special case
of the softmax function (since binary classification is a special case of
multiclass classification). In fact, we can derive the sigmoid function from
the softmax function, as shown below. Let's assume that the softmax
function has two neurons with the following outputs:
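The derivation from the original figure is not reproduced here; a standard version, with two output neurons whose cumulative inputs are w0·x and w1·x, is:

p_1 = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x} + e^{w_1 \cdot x}}
    = \frac{1}{1 + e^{-(w_1 - w_0) \cdot x}}
    = \sigma(w \cdot x), \quad \text{where } w = w_1 - w_0

and p_0 = 1 - p_1, so a single sigmoid neuron is sufficient for binary classification.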

Now that you have understood how the output is obtained from the
softmax function and how different types of inputs are fed into the ANN,
let's learn how to define inputs and outputs for image recognition on the
famous MNIST data set for multiclass classification.
There are various problems you will face while trying to recognise
handwritten text using an algorithm, including:
 Noise in the image
 The orientation of the text
 Non-uniformity in the spacing of text
 Non-uniformity in handwriting
The MNIST data set takes care of some of these problems, as the digits are
written in a box. Now the only problem the network needs to handle is
the non-uniformity in handwriting. Since the images in the MNIST data
set are 28 X 28 pixels, the input layer has 784 neurons (each neuron takes
1 pixel as an input) and the output layer has 10 neurons (each giving the
probability of the input image belonging to any of the 10 classes). The
image is classified into the class with the highest probability in the output
layer.

Workings of a Single Neuron


Now that you have seen how inputs are fed into a neuron and how
outputs are obtained using activation functions, let’s reiterate the
concepts with a short summary.


In the image above, you can see that x1, x2 and x3 are the inputs, and
their weighted sum along with bias is fed into the neuron to give the
calculated result as the output.

To summarise, the weights are applied to the inputs respectively, and


along with the bias, the cumulative input is fed into the neuron. An
activation function is then applied on the cumulative input to obtain the
output of the neuron. We have seen some of the activation functions such
as softmax and sigmoid in the previous segment. We will explore other
types of activation functions in the next segment. These functions apply
non-linearity to the cumulative input to enable the neural network to
identify complex non-linear patterns present in the data set.
An in-depth representation of the cumulative input as the output is given
below.


In the image above, z is the cumulative input. You can see how the
weights affect the inputs depending on their magnitudes. Also, z is the
dot product of the weights and inputs plus the bias.
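As a minimal illustration of this computation (the inputs, weights and bias below are made-up values), a single neuron with a sigmoid activation can be written as:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])      # inputs x1, x2, x3 (made up)
w = np.array([0.8, 0.1, -0.4])      # weights (made up)
b = 0.2                             # bias (made up)

z = np.dot(w, x) + b                # cumulative input
output = sigmoid(z)                 # activation applied to the cumulative input
print(z, output)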

The image provided below shows the graphical representation of a linear


function and one of the possible representations of a non-linear function.

The activation functions introduce non-linearity in the network, thereby


enabling the network to solve highly complex problems. Problems that
take the help of neural networks require the ANN to recognise complex
patterns and trends in the given data set. If we do not introduce non-
linearity, the output will be a linear function of the input vector. This will
not help us in understanding more complex patterns present in the data
set.


For example, as we can see in the image below, we sometimes have data
in non-linear shapes such as circular or elliptical. If you want to classify
the two circles into two groups, a linear model will not be able to do this,
but a neural network with multiple neurons and non-linear activation
functions can help you achieve this.

Non-Linear Activation Functions


While choosing activation functions, you need to ensure that they are:
 Non-linear,
 Continuous, and
 Monotonically increasing.
The different commonly used activation functions are represented below.


The features of these activation functions are as follows:


 Sigmoid: When this type of function is applied, the output from the
activation function is bound between 0 and 1 and is not centred on
zero. A sigmoid activation function is usually used when we want to
regularise the magnitude of the outputs we get from a neural network
and ensure that this magnitude does not blow up.
 Tanh (Hyperbolic Tangent): When this type of function is applied, the
output is bound between -1 and 1, unlike the sigmoid function, which is
centred around 0.5 and gives only positive outputs. Hence, the output is
centred around zero for tanh.
 ReLU (Rectified Linear Unit): The output of this activation function is
linear in nature when the input is positive and the output is zero when
the input is negative. This activation function allows the network to
converge very quickly, and hence, its usage is computationally
efficient. However, its use in neural networks does not help the
network to learn when the values are negative.
 Leaky ReLU (Leaky Rectified Linear Unit): This activation function is
similar to ReLU. However, it enables the neural network to learn even
when the values are negative. When the input to the function is
negative, it dampens the magnitude, i.e., the input is multiplied with
an epsilon factor that is usually a number less than one. On the other
hand, when the input is positive, the function is linear and gives the
input value as the output. We can control this parameter to decide how
much ‘learning emphasis’ should be given to negative values.
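A minimal NumPy sketch of the four activation functions described above (the leaky-ReLU epsilon of 0.01 is just an assumed value for the dampening factor):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # output bound between 0 and 1

def tanh(z):
    return np.tanh(z)                         # output bound between -1 and 1, zero-centred

def relu(z):
    return np.maximum(0, z)                   # zero for negative inputs, linear otherwise

def leaky_relu(z, eps=0.01):
    return np.where(z > 0, z, eps * z)        # dampened (not zero) output for negative inputs

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))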


DEEP LEARNING & IMAGE PROCESSING [NOTES-2]

Flow of Information between Layers


In artificial neural networks, the output from one layer is used as input to
the next layer. Such networks are called feed forward neural networks.
This means that there are no loops in the network, i.e., information is
always fed forward, never fed backwards. Let’s start by understanding
the feed forward mechanism between the two layers.
An image of a subset of the neural network is shown below:

As you learnt in the previous session, the weight matrix between layer 0
(input layer) and layer 1 (the first hidden layer) is denoted by W. The dot
product between the matrix W and the input vector xi along with the bias
vector b, i.e., W.xi+b, acts as the cumulative input z to layer 1. The
activation function is applied to this cumulative input z to compute the
output h of layer 1.


Let’s take the above-mentioned example and perform matrix


multiplication to get a vectorised method to compute the output of layer
1 from the inputs of layer 0.
Here, the following input is given:

The dimensions of the input are (3,1). There are two neurons in the first
hidden layer. Hence, the cumulative input z^1 will be given as:
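The matrix expressions from the original figure are not reproduced in these notes; in terms of shapes, a minimal restatement is:

z^1 = W^1 \cdot x_i + b^1

where W^1 is the (2, 3) weight matrix between the input layer and the first hidden layer, x_i is the (3, 1) input vector and b^1 is the (2, 1) bias vector, so z^1 (and the layer output h^1 = σ(z^1)) has dimensions (2, 1).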


To summarise, the steps involved in computing the output of the ith
neuron in layer l are as follows:
 Multiply each row of the weight matrix with the output from the
previous layer to obtain the weighted sum of inputs from the
previous layer.
 Convert the weighted sum into the cumulative input by adding the
bias vector.
 Apply the activation function σ(x) to the cumulative input to obtain
the output vector h.
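A minimal NumPy sketch of these three steps for one layer (the shapes and the choice of a sigmoid activation are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(h_prev, W, b):
    # Steps 1 and 2: weighted sum of previous-layer outputs plus the bias vector
    z = W @ h_prev + b
    # Step 3: apply the activation function to the cumulative input
    return sigmoid(z)

x = np.array([[0.5], [-1.0], [2.0]])   # (3, 1) input vector (made up)
W1 = np.random.randn(2, 3)             # weights between layer 0 and layer 1
b1 = np.random.randn(2, 1)             # biases of layer 1
h1 = layer_forward(x, W1, b1)
print(h1.shape)                        # (2, 1)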

Forward Pass - Demonstration


To reiterate, the problem statement is to predict the price of houses, given
the size of the houses and the number of rooms available.

Number of Rooms | House Size (sq. ft.) | Price ($)
3 | 1,340 | 313,000
5 | 3,650 | 2,384,000
3 | 1,930 | 342,000
3 | 2,000 | 420,000
4 | 1,940 | 550,000
2 | 880 | 490,000

After standardisation, the same data looks as follows:

Std. Number of Rooms | Std. House Size (sq. ft.) | Std. Price
-0.32 | -0.66 | -0.54
1.61 | 1.8 | 2.03
-0.32 | -0.03 | -0.51
-0.32 | -0.03 | -0.41
0.65 | -0.02 | -0.25
-1.29 | -1.15 | -0.32

We want to build a neural network that will predict the price of a house,
given two input attributes: number of rooms and house size. Let’s start
with the structure of the neural network that we will consider for this
case. We have an input layer with two input nodes, x1and x2, one hidden
layer with two nodes, a sigmoid activation function and finally an output
layer with a linear activation function (since this is a regression problem),
as shown below.


Now, to understand how the data moves forward in the network to


enable the neural network to make predictions, we will initialise the
weights and biases with random values. We recommend that you keep a
pen and paper handy for practising the computations that will be
performed further. The intention is that as this network gets trained, the
weights and biases will be updated as per the data such that the predicted
output will eventually be the same or at least similar to the actual output.

Let’s start by initialising the weights and biases to the following values:

Remember, the superscript denotes the layer to which it belongs and the
subscript denotes the node in that particular layer.

To showcase the step-by-step computation of the output, let’s take the


first example as the input vector:


Hence, performing the forward pass through the neural network using
the input as [-0.32, -0.66] gives us the output as 0.63. The prediction is
very different from the actual value of -0.54, but this is to be expected
because we initialised the neural network with random weights and
biases. As we train the neural network, we will update these parameters
and get better predictions through multiple iterations. In the upcoming
session, we will cover this process in depth.

This was a demonstration of how information flows forward in a neural


network from the input to the output, i.e., the forward pass to make a
prediction.
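A compact sketch of this forward pass for the first example, [-0.32, -0.66] (the weights and biases below are arbitrary initial values, not the ones used in the original demonstration, so the exact output will differ from 0.63):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[-0.32], [-0.66]])       # std. number of rooms, std. house size

W1 = np.array([[0.2, 0.5],             # hidden-layer weights (assumed values)
               [0.4, 0.1]])
b1 = np.array([[0.1], [0.3]])          # hidden-layer biases (assumed values)
W2 = np.array([[0.3, 0.7]])            # output-layer weights (assumed values)
b2 = np.array([[0.2]])                 # output-layer bias (assumed value)

h1 = sigmoid(W1 @ x + b1)              # hidden layer: sigmoid activation
y_hat = W2 @ h1 + b2                   # output layer: linear activation (regression)
print(y_hat)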


Feed forward Algorithm


Having understood how information flows in the network for a
regression problem, let’s write the pseudo code for a feed forward pass
through the network for a single data point xi.
The pseudo code for a feed forward pass is given below:
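The original pseudo code figure is not reproduced in these notes; a sketch consistent with the discussion that follows is:

1: Initialise with the input: h^0 = x_i.
2: For each layer l = 1, ..., L, compute the cumulative input z^l = W^l · h^(l-1) + b^l and apply the activation function to obtain the layer output h^l.
3: Compute the output-layer input using the output weights: z^o = W^o · h^L + b^o.
4: Normalise the result into the 'predicted output vector' p (for classification, p is the softmax probability vector; for regression, steps 3 and 4 are skipped and the prediction is simply h^L).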

There are some important things to notice here. In both the regression
and classification problems, the same algorithm is used till the last step.
In the final step, in the classification problem, p defines the probability
vector, which gives the probability of the data point belonging to a
particular class among different possible classes or categories. In the
regression problem, p represents the predicted output obtained, which
we will normally refer to as h^L. Let’s discuss the classification problem.
We use the softmax output, which we had defined in an earlier session,
which gives us the probability vector pi of an input belonging to one of
the multiple output classes (c):

The classification feed forward algorithm has been extensively used in


industries like finance, healthcare, travel etc. Considering the finance
industry, one of the applications of this algorithm is categorising
customer applications for credit cards as ‘Good’, ‘Bad’ or ‘Needing
further analysis’ by credit card companies. For this, credit card companies
consider different factors such as annual salary, any outstanding debts
and age. These can be the features in the input vector that is fed into a
neural network, which then predicts which category the customer


belongs to.

For the regression problem, we can skip the third and fourth steps, i.e.,
computing the probability and normalising the ‘predicted output vector’
p, because in a regression problem, the output is h^L, i.e., the value we
obtain from the single output node, and we usually compare the output
obtained from the ANN directly with the ground truth. We do not need
to perform any further operations on the predicted output to get
probabilities in a regression problem.
Note that W^o (the weights of the output layer) can also be written as
W^(L+1).

Comprehension based Questions


Let’s try to implement the same algorithm for a classification problem
and answer a few questions. Given below is the representation of an
ANN.

We have the last weight matrix W^3 as W^O. The output layer classifies
the input into one of these three labels: 1, 2 or 3. The first neuron outputs
the probability for label 1, the second neuron outputs the probability for
label 2 and the third neuron outputs the probability for label 3.


The primary goal in machine learning is to get the predicted output to be


the same or as close to the ground truth output as possible. We have seen
the feed forward algorithm and learnt how to compute each element in
an ANN. Now, we want to train the neural network to get the predicted
output as close as possible to the actual output. In order to do this, in the
next segment, we will discuss the Loss function, which quantifies the
difference between the predicted output and the actual output.

Loss Function
Now that we know how to calculate the predicted output from a neural
network when given an input, we want to check if the neural network
predicted it correctly. We will revisit the calculations we had done in the
previous segment on the housing price prediction problem.

Std. Number of Rooms | Std. House Size (sq. ft.) | Predicted Price | Actual Price
-0.32 | -0.66 | 0.63 | -0.54

As you can see in the table above, the predicted price is not the same or
even close to the actual price. So, we want to know how wrong the
prediction of the neural network is and want to quantify this error in the
prediction. A loss function or cost function will help us quantify such
errors.
A loss function or cost function is a function that maps an event or the
values of one or more variables onto a real number, intuitively representing
some ‘cost’ associated with the ‘event’, as shown below:

Neural networks minimise the error in the prediction by optimising the


loss function with respect to the parameters in the network. In other
words, this optimisation is done by adjusting the weights and biases. We
will see how this adjustment is done in subsequent sessions. For now, we
will concentrate on how to compute the loss.

In the case of regression, the most commonly used loss function is
MSE (Mean Squared Error)/RSS (Residual Sum of Squares). In the case of
classification, the most commonly used loss function is Cross-Entropy/Log Loss.

Let’s consider the regression problem where we predict the house price,
given the number of rooms and the size of the house. Here, we will use
the RSS method to calculate the loss.

Std. Number of Rooms | Std. House Size (sq. ft.) | Predicted Price | Actual Price
-0.32 | -0.66 | 0.63 | -0.54

In this example, we get a prediction of 0.63, but the expected output is
-0.54. Let’s calculate the loss using RSS:
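With a single training example, the residual sum of squares reduces to the squared difference between the actual and predicted values (and the MSE is this squared error averaged over all samples):

RSS = (y - \hat{y})^2 = (-0.54 - 0.63)^2 = (-1.17)^2 \approx 1.37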


As given above, the MSE is the mean square error of all the samples in
the given data. This gives us a quantified method of measuring how well
the neural network is predicting the output.
Now, let’s take a look at the loss function for the classification problem.
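The classification loss figure is not reproduced in these notes; the standard cross-entropy (log loss) for a single example with c classes, where y_i is 1 for the true class and 0 otherwise, and p_i is the predicted probability of class i, is:

L = -\sum_{i=0}^{c-1} y_i \log(p_i)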
Now that we have learnt about the forward pass and the loss function for
regression and classification problems, we know that given any input
and its actual output, we can assess the behaviour of the neural network.

What Is Learning in Neural Networks?


The task of training neural networks is similar to that of other ML models
such as linear regression and logistic regression. The difference between the
predicted output (the output from the last layer) and the actual output gives
the cost (or the loss), and we have to tune the parameters w and b such that
the total cost is minimised.

The loss function for a regression model can be given as follows:
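The formula referred to above is not reproduced here; the usual mean squared error over N data points, with actual output y_i and predicted output \hat{y}_i, is:

L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2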

To start training a neural network, we randomly initialise the weights at


the outset.

An important point to note is that if the data is large (which is often the
case), the loss calculation itself can get pretty messy. For example, if you
have a million data points, they will be fed into the network (in batches),
the output will be calculated using feed forward, and the loss/cost Li (for
the ith data point) will be calculated. The total loss is the sum of losses of all
the individual data points. Hence:
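The expression referred to above is not reproduced here; the total loss is simply the sum of the individual losses:

L = \sum_{i=1}^{N} L_i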


The total loss L is a function of w's and b's. Once the total loss is
computed, the weights and biases are updated (in the direction of
decreasing loss). In other words, L is minimised with respect to the w's
and b’s.

One important point to note here is that we minimise the average of the
total loss and not the total loss itself, as you will see shortly.
Minimising the average loss implies that the total loss is also getting
minimised.

This can be done using any optimisation routine such as gradient


descent.

The parameter being optimised is iterated in the direction of reducing


cost according to the following rule.
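The update rule referred to above is not reproduced here; the standard gradient descent step for a weight w with learning rate α is:

w_{new} = w_{old} - \alpha \frac{\partial L}{\partial w}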

The same can be written for biases. Note that weights and biases are
often collectively represented by one matrix called W. Going forward, W
will, by default, refer to the matrix of all weights and biases.

The main challenge is that W is a huge matrix, and thus, the total loss L
as a function of W is a complex function.

Back propagation Algorithm


In the previous segment, you learnt how gradient descent is used in
learning neural networks. The training is done using back propagation.
We also discussed a detailed numerical example of back propagation for
a single input in a simple neural network.

We took the following steps when passing an input through the network:


1. Forward propagation of the input through the network with random


initial values for weights and biases
2. Making a prediction and computing the overall loss
3. Updating model parameters using back propagation i.e., updating the
weights and biases in the network, using gradient descent
4. Forward propagation of the input through the network with updated
parameters leading to a decrease in the overall loss
5. Repeat the process until the optimum values of weights and biases are
obtained such that the model makes acceptable predictions

The pseudo code/pseudo-algorithm is given as follows:


1: Initialise with the input
Forward Propagation
2: For each layer, compute the cumulative input and apply the non-linear
activation function on the cumulative input of each neuron of each layer
to get the output.
3: For classification, get the probabilities of the observation belonging to a
class, and for regression, compute the numeric output.
4: Assess the performance of the neural network through a loss function,
for example, a cross-entropy loss function for classification and RMSE for
regression.
Back propagation
5: From the last layer to the first layer, for each layer, compute the
gradient of the loss function with respect to the weights at each layer and
all the intermediate gradients.
6: Once all the gradients of the loss with respect to the weights (and
biases) are obtained, use an optimisation technique like gradient descent
to update the values of the weights and biases.
Repeat this process until the model gives acceptable predictions:
7: Repeat the process for a specified number of iterations or until the
predictions made by the model are acceptable.
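A compact sketch of this training loop for a tiny regression network (the layer sizes, sigmoid activation, MSE loss and learning rate are illustrative assumptions, not the exact setup from these notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))             # 6 examples, 2 features (made up)
y = rng.normal(size=(6, 1))             # regression targets (made up)

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # output layer parameters
lr = 0.1

for epoch in range(1000):
    # Forward propagation
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Back propagation: gradients of the loss w.r.t. each parameter
    d_out = 2 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_h = d_out @ W2.T * h * (1 - h)          # derivative of sigmoid applied element-wise
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

    # Gradient descent update of weights and biases
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(round(loss, 4))   # the loss decreases over the iterations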


DEEP LEARNING & IMAGE PROCESSING [NOTES-3]

UNDERSTANDING CONVOLUTIONS – I
INTRODUCTION
Convolutional Neural Networks, or CNNs, are neural networks
specialised to work with visual data, i.e., images and videos (though not
restricted to them). They are very similar to the vanilla neural networks
or the multilayer perceptrons (MLPs), where every neuron in one layer is
connected to every neuron in the next layer. They also follow the same
general principles of feed forward, back propagation, weights, biases, etc.
However, there are certain features of CNNs that make them perform
extremely well on image processing tasks.

The ANN architecture can solve any problem, but there are two main
limitations associated with the simple MLP architectures:
1. The architecture offers a wide range of variations in the network
through depth (number of layers), width (size of the layer),
activation functions, etc., which makes it difficult to find the best
architecture for a given problem.
2. The MLP architecture does not have the capability to preserve any
spatial information from the underlying image. As you have seen
while working with the MNIST data set in the previous module, the
information stored in a 2-D or 3-D format is flattened into a 1-D
array to map one neuron to each pixel.


So, these reasons resulted in the development of the CNN architecture,


which has tried to solve the problems that existed with the simple MLP
architectures.

EVOLUTION OF CNNs
Although the vanilla neural networks (MLPs) can learn extremely
complex functions, their architecture does not exploit what we know
about how the brain reads and processes images. For this reason,
although MLPs are successful in solving many complex problems, they
have not been able to achieve any major breakthroughs in the image
processing domain.

On the other hand, the architecture of CNNs uses many of the working
principles of the animal visual system and, thus, they have been able to
achieve extraordinary results in image-related learning tasks.


As can be seen, the pre-2000s era showed progress from a theoretical
standpoint in processing visual data, but the lack of resources such as
computing power and GPUs, and the unavailability of visual data, resulted in
slower progress. However, from the late 1990s, developments in these
areas have resulted in rapid progress.

THE IMAGENET CHALLENGE


The ImageNet challenge has been the benchmark of testing image
processing frameworks for the past decade. CNNs had first demonstrated
their extraordinary performance in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). The ILSVRC uses a list of about 1,000
image categories or ‘classes’ and has about 1.2 million training images
(a number that has increased over the years). The original objective of this challenge is
image classification.

You can see the impressive results of CNNs in the ILSVRC and how they
have transformed image processing since the 2010s. Over the period,
there have been many variants of CNNs like AlexNet and VGGNet which
have gotten better with time. This can be seen from the error rate of these
architectures on the ImageNet data set. It started from nearly 30% in 2010
and has now reduced to nearly 4% with the ResNet architecture, which is
one of the recent variants of CNN. This is significant, as the error
rate is even lower than that of a human (about 5%).
The entire process has taken a lot of time to deliver fruitful results.

CHALLENGES IN IMAGE PROCESSING


CNNs are specialised architectures that work particularly well with
visual data, i.e., images and videos. They have been largely responsible
for revolutionising 'Deep Learning' by setting new benchmarks for many
image processing tasks that were very recently considered extremely
hard.

1. Translational variation: Objects being placed at different locations in


the image.

2. Viewpoint variation: Different orientations of the image with respect


to the camera.


3. Scale variation: Different sizes of the object with respect to the image
size.

4. Illumination conditions: Lighting and illumination in the image.

5. Background clutter: Inclusion of different elements in the


background.

6. Deformations: Variation in the inherent shape of the object.


The algorithms have to accommodate all the aforementioned variations


and still result in the correct output. Therefore, a common task of visual
recognition (like identifying a ‘cat’ or a ‘dog’) that is considered simple
for humans becomes a big challenge for algorithms.

VISUAL SYSTEM OF MAMMALS - I


We had mentioned that the architecture of CNNs is motivated by the
visual system of mammals. In this segment, we will discuss an influential
paper named ‘Receptive Fields of Single Neurones in the Cat’s Striate
Cortex’ published by Hubel and Wiesel. The reason for this is that a lot of
components of the CNN architecture were inspired by this research.

This was basically a series of experiments conducted to understand a


cat’s visual system. In the experiments, spots of light (of various shapes
and sizes) were made to fall on the retina of a cat and, using an
appropriate mechanism, the response of the neurons in the cat's retina
was recorded. This provided a way to observe which types of spots make
some particular neurons 'fire', how groups of neurons respond to spots of
certain shapes, etc.

 Each neuron in the retina focuses on one part of the image and that
part of the image is called the receptive field of that neuron. The
following figure shows a certain region of the receptive field of a cat.

 The receptive fields of all neurons are almost identical in shape and
size.
 The receptive field has excitatory and inhibitory regions. The excitatory
region (denoted by the triangular marks) forms the key feature or the
element that the neuron is trained to look for and the inhibitory region
is like the background (marked by the crosses).
 The neurons only ‘fire’ when there is a contrast between the excitatory
and the inhibitory regions. If we splash light over the excitatory and
inhibitory regions together, the neurons do not ‘fire’ (respond) because
of no contrast between them. If we splash light just over the excitatory
region, neurons respond because of the contrast. This helps us identify
a given feature in the receptive field of a particular neuron.

 The strength of the response is proportional to the summation over


only the excitatory region (not the inhibitory region). Instead of
looking at the response from individual neurons, the response from
multiple neurons is aggregated using some statistical measure (sum,
average, etc.). This concept is replicated as a pooling layer in CNNs
which corresponds to this observation.

VISUAL SYSTEM OF MAMMALS - II


You have already seen that every neuron is trained to look at a particular
patch in the retina called the receptive field of that neuron. This neuron is
designed to look for a particular feature in the image that gives rise to the
excitatory region. However, there are multiple questions that should
come to your mind:
 As different parts are scanned by different neurons, do all the neurons
'see' the same 'features', or are some neurons specialised to 'see' certain
features?


 Every neuron is responsible for finding a ‘feature’ but how does this
information translate ahead in the system?

The units or the neurons at the initial level do very basic and specific
tasks such as picking raw features (for example, horizontal edges) in the
image. The subsequent units try to build on top of this to extract more
abstract features such as identifying textures and detecting movement.
The layers 'higher' in the hierarchy typically aggregate the features in the
lower ones.

The following image illustrates the hierarchy in units – the first level
extracts low-level features (such as vertical edges) from the image, while
the second level calculates the statistical aggregate of the first layer to
extract higher-level features (such as texture and colour schemes).

Using this idea, if we design a complex network with multiple layers to


do image classification (for example), the layers in the network should do
something like this:
1. The first layer extracts raw features like vertical and horizontal
edges.
2. The second layer extracts more abstract features such as textures
(using the features extracted by the first layer).
3. The subsequent layers may identify certain parts of the image such
as skin, hair, nose and mouth based on the textures.
4. Layers further up may identify faces, limbs, etc.
5. Finally, the last layer may classify the image as 'human', 'cat', etc.


This divides the entire process into two parts:


1. Feature extraction
2. Final task: Classification/regression

Apart from explaining the visual system, the paper also suggested that
similar phenomena have been observed in the auditory system and in
touch and pressure in the somatosensory system. This suggests that
CNN-like architectures can be used for speech processing and analysing
signals coming from touch sensors or pressure sensors as well.

We have already discussed most of the key ideas of the CNN architecture
through this paper. Let's have a look at some of the conclusions:
 Each unit, or neuron, is dedicated to its own receptive field. Thus,
every unit is meant to ignore everything other than what is found in its
own receptive field.
 The receptive field of each neuron is almost identical in shape and size.
 The subsequent layers compute the statistical aggregate of the previous
layers of units. This is analogous to the 'pooling layer' in a typical
CNN.


 Inference or the perception of the image happens at various levels of


abstraction. The first layer pulls out raw features, and the subsequent
layers pull out higher-level features based on the previous features and
so on. Finally, the network gets an overall perception of an image in
the last layer.

INPUTS IN A CNN - DIGITAL IMAGES


Before we dig deep into the architecture of CNNs, let's try to understand
the inputs that are fed into them. CNNs generally deal with visual data in
the form of images and videos. This segment will help you understand
the elements associated with the images.

You already know that the input to any neural network should be
numeric. Fortunately, images are naturally represented as arrays (or
matrices) of numbers.

Grayscale images: This is the most basic input as it only has one channel
that stores the shades of grey. To summarise, let’s consider this sample
image of a 'zero' from the MNIST data set:

 Images are made up of pixels.


 The height and width of this image are 18 pixels, so it is stored as an 18
x 18 array.
 Each pixel's value lies between 0 and 255.
 The pixels having a value close to 255 appear white (since the pixels
represent the intensity of white), and those close to 0 appear black.

One important point to note is that an image with more pixels holds more
information and, hence, has better clarity as compared to an image with a
lower pixel count. You would have come across this terminology while
purchasing smartphones. When we say that the phone has a 10-
megapixel camera, it means each image captured by that camera will
have 10 million pixels. A 20-megapixel camera will have twice as many
pixels and, hence, the captured image will have better clarity because of
the extra details stored in it.

 Each pixel in a colour image is an array representing the intensities of


red, blue and green. The red, blue and green layers are called channels.
 Every pixel comprises the information of all three channels (RGB).
 The height and width of the given image is five pixels each. However,
the size of the matrix is 5 x 5 x 3.
 The reason for selecting these three colours (red, green and blue) is that
all the colours can be made by mixing red, blue and green at different
degrees of ‘saturation’ (0-100% intensity). For example, a pure red pixel
has 100% intensity of red and 0% intensity of blue and green. So, it is
represented as (255,0,0). White is the combination of 100% intensity of
red, green and blue. So, it is represented as (255,255,255).


INPUTS IN A CNN - VIDEOS


In this segment, you will understand the process of video analysis using
CNNs. A video is basically a sequence of frames where each frame is an
image. You already know that CNNs can be used to extract features from
an image. Let's now see how CNNs can be used to process a series of
images (i.e., videos).

Let's summarise the process of video analysis using a CNN + RNN, or


recurrent neural network, stack. At this point, you only need to
understand that RNNs are good at processing sequential information
such as videos (a sequence of images) and text (a sequence of words or
sentences). We will limit our studies to CNNs in this course.

For a video classification task, here's what we can do. Suppose the videos
are of length one minute each. If we extract frames from each video at the
rate of two frames per second (FPS), we will have 120 frames (or images)
per video. These images are then pushed into a convolutional neural
network to extract different features from the image. All this information
is stored in terms of feature vectors. Thus, we have 120 feature vectors
representing each video at the end of the CNN framework.

These 120 feature vectors, representing a video as a sequence of images,


can now be fed sequentially into an RNN which classifies the videos into
one of the categories. The main point here is that a CNN acts as a feature
extractor for images and, thus, can be used in a variety of ways to process
images.


CNN ARCHITECTURE - OVERVIEW


Now, you have a basic idea about the visual system of mammals and
have also understood the different inputs that can be fed to a CNN
framework. In this segment, we will analyse the architecture of a popular
CNN called VGGNet. This is one of the classic CNN architectures and,
hence, observing the VGGNet architecture will give you a high-level
overview of the common types of CNN layers before you study each one
of them in detail.

Let's dig a little deeper into CNN architectures now.

The CNN architecture has been closely derived from the visual system in
mammals. The VGGNet architecture that we discussed was specially
designed for the ImageNet challenge, a classification task with 1,000
categories. Thus, it takes colored images as the input and the softmax
layer at the end has 1,000 categories. The blue layers are the
convolutional layers, while the yellow ones are pooling layers.

There are three main concepts you will study in CNNs:


 Convolution layers: Used for scanning the image for different
features
 Feature maps: Store the information for different features extracted
from the image
 Pooling layers: For the aggregation of the information from different
neurons in the previous layer


 Lastly, you also have the fully connected (FC) layers that help in the
final task of classification or regression.
The most important point to notice here is that the initial part of the
network (convolution and pooling layers) acts as a feature extractor for
images. For example, the VGGNet discussed earlier can extract a 4096-
dimensional feature vector representing each input image. This feature
vector is fed to the later part of the network that consists of a softmax
layer for classification. However, you can use the feature vector to
perform other tasks (such as video analysis, object detection and image
segmentation).

APPLICATION OF CNNs
Object localisation: Identifying the local region of the objects (as a
rectangular area) and classifying them.

Semantic segmentation: Identifying the exact shapes of the objects


(pixel by pixel) and classifying them.


Optical character recognition (OCR): Recognising characters in an


image (text from the document or number plates).

Now that you have a broad sense of the different applications of CNNs,
let's see some more examples of image processing applications in
different industries.

CNNs have various other applications in different sectors such as


healthcare, insurance and surveillance.
 Many medical imaging applications used in radiology, cardiology,
gastroenterology, etc., involve classification, detection and
segmentation of objects which can be analysed using CNNs.
 The capability of the CNN framework to act as an OCR is leveraged
across different industries to generate digital documents from any
form of physical documents.
 Like the characters, the CNNs can also identify facial patterns and help
in different applications like facial recognition and image tagging.
 Autonomous driving heavily uses object detection and localisation to
detect the objects present around the car and help in deciding the next
actions accurately.
 Moreover, CNNs can also be applied while working with non-visual
data like speech signals or time series by identifying the different
patterns present in them.


You learnt about the basics of convolutional neural networks and their
common applications in computer vision such as image classification and
object detection. You also learnt that CNNs are not limited to images but
can be extended to videos, text, audio, etc.

The design of CNNs uses many observations from the animal visual
system, such as each retinal neuron looking at its own (identical)
receptive field, some neurons responding proportionally to the
summation over excitatory regions (pooling), and images being
perceived in a hierarchical manner.

You learnt that images are naturally represented in the form of arrays of
numbers. Grayscale images have a single channel, while colour images
have three channels (RGB). The number of channels or the 'depth' of the
image can vary depending on how we represent the image. Each channel
of a pixel, usually between 0 and 255, indicates the 'intensity' of a certain
colour.
Coming to the CNN architecture, a typical CNN unit (or layer) in a large
CNN-based network comprises convolution layers, pooling layers and
fully connected layers.

UNDERSTANDING CONVOLUTIONS - I
You have a fair idea of the three main terms related to the CNN
architecture:
 Convolutions
 Feature maps
 Pooling
These components are closely derived from how the visual system works
in mammals. They perform three broad tasks as listed below:
1. Scanning the image
2. Identifying different features
3. Aggregating the learning from individual neurons in succeeding
layers

The convolution operation is the summation of the element-wise product of two matrices. In CNNs, the image and the filter are translated into arrays of numbers, which are then convolved together to produce the final result. Take a look at the graphic given below to visualise the entire process.

The blue-coloured 4×4 matrix is the input image. The 3×3 box that moves
over the blue map is the filter, and the resultant output is the green-
coloured 2×2 matrix.
Let’s take a look at another example for more clarity.

Convolution Example
Consider the image shown below and convolve it with a 3×3 filter to
produce a 3×3 array.
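The figure itself is not reproduced in these notes, but the following minimal NumPy sketch (with illustrative values) carries out the same operation: a 5×5 input convolved with a 3×3 filter produces a 3×3 output.

```python
# A minimal sketch of the convolution operation described above: slide the
# filter over the image, take the element-wise product and sum it up.
# The input and filter values are illustrative, not the ones from the figure.
import numpy as np

def convolve2d(image, kernel):
    """Convolve a 2D image with a 2D kernel using stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.array([[1, 2, 3, 0, 1],
                  [0, 1, 2, 3, 1],
                  [1, 0, 1, 2, 2],
                  [2, 1, 0, 1, 0],
                  [1, 2, 1, 0, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

print(convolve2d(image, kernel))   # a 3x3 output, as stated above
```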


UNDERSTANDING CONVOLUTIONS - II
You saw an example of how the convolution operation (using an
appropriate filter) detects certain features in images, such as edges.


A filter is learnt by the network through the process of back propagation. However, depending on your use case, you can also use predefined filters to extract specific features from the image.

Another interesting observation can be seen in the convolution output provided below. Only the middle two columns (columns 2 and 3) of the output matrix are nonzero, while the two extreme columns (columns 1 and 4) are zero. This is an example of vertical edge detection.

Note that each column of the 4×4 output matrix looks at exactly three
columns of the input image. The idea behind using this filter is that it
captures the amount of change (or gradient) in the intensity of the
corresponding columns in the input image along the horizontal direction.

For example, the output in columns 1 and 4 is 0 (20 - 20 and 10 - 10, respectively), which implies that there is no change in intensity in the first three and the last three columns of the input image, respectively. On the other hand, the output in columns 2 and 3 is 30 (that is, 20 - (-10)), indicating that there is a gradient in the intensity of the corresponding columns of the input image.

Through similar processes, the CNN architecture extracts a given feature using filters. However, the task does not end here. Like the mammalian visual system, the convolution layers also operate in a hierarchical order. The CNN architecture has multiple convolutional layers stacked together to build on top of the basic or elementary features extracted by the initial layers. As the required level of abstraction increases, the depth of the model can be increased by adding more of these convolution layers.

Other Filters
Based on the filter for vertical edge detection, you can design a filter for detecting horizontal edges. As shown below, the objective now is to capture the change in intensity along the vertical direction.

Although only simple filters have been discussed here, you can design
arbitrarily complex filters for detecting edges and other patterns. For
example, the following image shows the Sobel filter, which can detect
both horizontal and vertical edges in complex images.
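As a small illustration (the image from the notes is not reproduced here), the sketch below applies the two Sobel kernels to a toy array containing a single vertical edge; only NumPy is assumed.

```python
# A short, self-contained sketch of the Sobel kernels for edge detection.
# The toy image values are illustrative; in practice you would load a real
# grayscale image.
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])   # responds to intensity changes along the horizontal direction (vertical edges)
sobel_y = sobel_x.T                # responds to changes along the vertical direction (horizontal edges)

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                      for j in range(ow)] for i in range(oh)])

# A toy image: dark on the left, bright on the right -> one vertical edge.
image = np.array([[10, 10, 10, 200, 200, 200]] * 6)

print(convolve2d(image, sobel_x))  # large values around the edge columns
print(convolve2d(image, sobel_y))  # zero everywhere (no horizontal edge)
```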


STRIDE
In the previous segment, the filter was moved by exactly one pixel (both
horizontally and vertically) while performing convolutions. However, it
is not the only way to do convolutions. You can move the filter by an
arbitrary number of pixels based on the task requirement. This concept is
known as stride. There is nothing sacrosanct about the stride length 1.
You can alter the stride value based on your underlying objective.

If you increase the stride value, some of the information will be lost in
translation from one layer to another. If you think that you do not need
many fine-grained features for your task, you can use a higher stride
length (2 or more).

As you can see, convolution with a higher stride value still detects the same features but produces a smaller output. The benefit of increasing the stride length is that it results in faster processing, as less information is passed between the layers. Let’s understand this concept through the image provided below.


Case 1:
In the first case, suppose you are building a CNN model to identify
the presence of any water body in a given satellite image. This task
can be done smartly by not exploring each and every pixel of the
image. The output will be ‘yes’ if the object is found in any part of
the image. Hence, you can skip a few pixels and save both time and
computational resources in the process.
Case 2:
Suppose you are building a model on the same image to extract the
region covered by the water body in the given area. Here, you will
be expected to closely map the entire structure of the water body,
and hence, you would need to extract all the granular elements from
the image. Therefore, the stride value should not be kept high.

PADDING
So far, you have gained an understanding of the basic purpose of
convolution. The convolution process helps identify features such as
edges, shapes and objects in an image. You can control the level or the
granularity of these features using the stride value of the filter. Moreover,
the level of abstraction can be controlled using the number of layers
stacked in the architecture.

However, this task gets challenging with an increase in the filter size or
the stride value. Both these practices result in a reduction in size from
input to output. If you remember, the 7×7 matrix shrunk to 5×5 when
convolved with a 3×3-sized filter and a stride of 1. Owing to the reduction
in size, it becomes difficult for the later layers to perform the task of convolution.

Padding helps you manage the size of the output of convolution. A padding of ‘x’ means that ‘x’ rows and columns are added all around the image. As shown in the image given below, a padding of 1 has been introduced around a 5×5 image.

When you convolve an image without padding (using any filter size), the
output size is smaller than the image, i.e., the output ‘shrinks’. However,
with the introduction of padding layers, the size of the output increases.
It is important to note that only the width and the height decrease (not
the depth) when you convolve without padding. The depth of the output
depends on the number of filters used, which will be discussed in a later
segment.

Padding helps you overcome both the challenges discussed at the beginning of the segment. However, you still have not learnt how to fill these additional rows and columns.

The most common way to do padding is to populate the dummy rows/columns with zeros, which is termed zero-padding. The padding value can also be user-defined; however, instead of an arbitrary value, it is generally zero or the pixel values at the edges.


Preserving the size during convolution


If you want to maintain the same size, you can still use the technique of
padding. Here, the padding layers are added in a manner such that the
input and the output have the same size. The convolution process shrinks
the input. For example, convolving a 6×6 image with a 3×3 filter and a
stride of 1 gives a 4×4 output. Further, convolving the 4×4 output with a
3×3 filter will give a 2×2 output. The size has reduced from 6×6 to 2×2 in
just two convolutions.

Large CNNs have tens (or even hundreds) of such convolutional layers
(recall VGGNet). So, you might incur massive ‘information loss’ as you
build deeper networks. This is one of the main reasons why padding is
important: It helps maintain the size of the output arrays and avoid
information loss. Of course, in many layers, you want to shrink the
output (as shown below), but in many others, you maintain the size of the
output.
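A minimal sketch of this behaviour, assuming TensorFlow/Keras is available (the shapes are illustrative), is shown below: ‘valid’ convolution shrinks a 6×6 input to 4×4, while ‘same’ padding preserves the size.

```python
# A minimal sketch comparing 'valid' convolution (no padding, output shrinks)
# with 'same' convolution (zero-padding added so the output keeps the input's
# height and width). Assumes TensorFlow/Keras is installed.
import tensorflow as tf

x = tf.random.normal((1, 6, 6, 1))   # one 6x6 single-channel image

valid = tf.keras.layers.Conv2D(filters=1, kernel_size=3, strides=1, padding="valid")(x)
same = tf.keras.layers.Conv2D(filters=1, kernel_size=3, strides=1, padding="same")(x)

print(valid.shape)   # (1, 4, 4, 1)  -> 6x6 shrinks to 4x4
print(same.shape)    # (1, 6, 6, 1)  -> size preserved
```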


CHOOSING VALUES
You learnt that there are multiple parameters associated with the convolution process, which are listed below:
 Filter size
 Stride
 Padding

However, you cannot convolve the images with just any combination of these attributes. You need to alter them based on your input image and the required output from the architecture. Let’s understand this aspect in more detail.

You cannot convolve a 6×6 image with a 3×3 filter using a stride of 2. Let’s
revisit the formula that is used to calculate the output size using the input
size, filter size, padding and stride length.

We are given the following sizes:
 Image - n × n
 Filter - k × k
 Padding - p
 Stride - s

The output size of the convolution is then ((n + 2p − k)/s) + 1. For the convolution to be valid, this value must be a whole number; a 6×6 image with a 3×3 filter, no padding and a stride of 2 gives (6 − 3)/2 + 1 = 2.5, which is why that combination does not work.
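A small helper function (illustrative, not part of the original notes) makes this check explicit:

```python
# A small helper that applies the output-size formula ((n + 2p - k) / s) + 1
# and checks whether the combination of sizes is valid, i.e., whether the
# filter placements line up exactly with the image.
def conv_output_size(n, k, p=0, s=1):
    span = n + 2 * p - k
    if span % s != 0:
        raise ValueError(f"invalid combination: ({n} + 2*{p} - {k}) is not divisible by stride {s}")
    return span // s + 1

print(conv_output_size(7, 3, p=0, s=1))   # 5  (the 7x7 -> 5x5 example above)
print(conv_output_size(6, 3, p=1, s=1))   # 6  ('same' padding preserves the size)
# conv_output_size(6, 3, p=0, s=2)        # raises: a 6x6 image cannot be convolved
                                          # with a 3x3 filter using a stride of 2
```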


Convolution over 3D images


So far, you have learnt how to apply convolution only on 2D arrays.
However, most images are coloured, and thus, comprise multiple
channels (such as RGB). They are generally represented as a 3D matrix of
size ‘height x width x channels’. However, the convolution process does
not change with the inclusion of the third dimension. To convolve such
images, you simply use 3D filters. The basic process of convolution is still
the same: You take the element-wise products and sum up their values.

The only difference is that now, the filters will be 3D, such as 3×3×3 or
5×5×3. In these examples, the last ‘3’ indicates that the filter has as many
channels as the image. For example, if you have an image of size
224×224×3, you can use filters of sizes 3×3×3, 5×5×3, 7×7×3, etc. (with
appropriate padding).

Instead of 9 values (a 3×3 filter), the filter will now hold 27 values (a 3×3×3 filter). In the next segment, you will understand how to determine these values in order to extract a feature from the image.
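The following NumPy sketch (with randomly generated, illustrative values) shows a single step of such a 3D convolution: one 3×3×3 patch multiplied element-wise with a 3×3×3 filter and summed into a scalar.

```python
# A minimal sketch of a single step of convolution over a 3D (RGB) input:
# a 3x3x3 patch of the image is multiplied element-wise with a 3x3x3 filter
# (27 values) and the products are summed into one scalar. Values are random.
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3))   # height x width x channels
filt = rng.standard_normal((3, 3, 3))              # filter depth matches the 3 channels

patch = image[0:3, 0:3, :]                          # top-left 3x3x3 patch
step_output = np.sum(patch * filt)                  # element-wise product, then sum
print(step_output)                                  # a single scalar for this patch
```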

WEIGHTS AND BIASES IN A CNN


Filters or kernels are the most crucial components of the convolution process, as they help extract the features from the image that later help achieve the final target. You have already learnt how to use a predefined filter to detect a vertical edge in an image. Similarly, different filters can be used to identify a variety of features from the image. The difference between all these filters (assuming that the size is the same) lies in the values that are used to convolve the underlying image.

In some cases, you are only interested in one particular feature from the image, such as a vertical edge, a colour or a texture. Here, you are not asking the network to learn anything; you use the architecture just to extract a basic feature. For such tasks, you would be expected to design a filter for the network, as it will only check for the required feature across all the patches of the image.

However, when it comes to neural networks, you do not design a filter; instead, you want the network to learn the values at each step. At each
layer of CNN, a feature is extracted that will later help in the final task.
As the task is different across use cases, the filter needs to be altered for
every use case. For example, you cannot use the filters that were used to
identify the scenery in the image to detect humans in the image. This is
similar to the weights used in the ANN architecture. For different use
cases, the network learns weights from scratch so that it is able to perform
the required task accurately. In the case of CNNs, filters act as weight
vectors and must be learnt using the technique of back propagation. The
image given below summarises the entire process.


Components:
I: Image tensor (5×5×3)
W: Weight tensor (3×3×3)
P: Patch from the image to be convolved using the filter (3×3×3)

So, the output of the convolution process will be a matrix in which each entry holds the dot product of the weight tensor W and the corresponding patch P:

Output: I ∗ W → sum(P • W), where • denotes the element-wise product

This equation is similar to the feed forward equation, as the inputs from
the image are multiplied by the filter values to obtain the output for that
particular layer. As in the ANN architecture, the filters are learnt during
training, i.e., during back propagation. Hence, the individual values of
the filters are often called the weights of a CNN.

In the discussion so far, you have only learnt about weights; however,
convolutional layers (that is, filters) also have biases. Let’s take a look at
an example to understand this concept better. Suppose you have an RGB
image and a 2×2×3 filter as shown below. The filter has three channels,
each of which convolves the corresponding channel of the image. Thus,
each step in the convolution involves the element-wise multiplication of 12 pairs of numbers and the addition of the resultant products to obtain a single scalar output.

The following image depicts the convolution operation. Note that in each
step, a single scalar number is generated, and at the end of the
convolution, a 2D array is generated.


You can express the convolution operation as a dot product between the
weights and the input image. If you treat the 2×2×3 filter as a vector w of
length 12 and the 12 corresponding elements of the input image as the
vector p (that is, both are unrolled from a 3D tensor to a 1D vector), each
step of the convolution is simply the dot product of w^T and p. The dot product is computed at every patch to obtain a (3×3) output array, as shown in the image provided above.

Apart from the weights, each filter can also have a bias. In this case, the
output of the convolutional operation is a 3×3 array (or a vector of length
9). So, the bias will be a vector of length 9. However, a common practice
in CNNs is that all the individual elements in the bias vector have the
same value (called tied biases). For example, a tied bias for the filter shown in the image above can be represented as shown below:
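The representation from the original material is not reproduced here; the sketch below (with illustrative, randomly generated values) shows one convolution step written as the dot product of the unrolled weights and patch, plus a single tied bias added at every output position.

```python
# A sketch of one convolution step written as a dot product plus a tied bias.
# The 2x2x3 filter is unrolled into a vector w of length 12, the matching
# image patch into a vector p, and the tied bias b is one value added to
# every position of the output. All values here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(float)   # a small RGB image
w = rng.standard_normal(12)                                  # unrolled 2x2x3 filter
b = 0.5                                                      # tied bias: same value everywhere

out = np.zeros((3, 3))                                       # (4-2+1) x (4-2+1) output
for i in range(3):
    for j in range(3):
        p = image[i:i + 2, j:j + 2, :].reshape(-1)           # unrolled 2x2x3 patch (length 12)
        out[i, j] = w @ p + b                                # w^T p + bias

print(out)
```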

FEATURE MAPS
As you learnt in the previous segment, the values of the filters, or the
weights, are learnt during training. Let's now understand how multiple
filters are used to detect various features in images. In this segment, you
will learn the concepts of neurons and feature maps.


Let's summarise the important concepts and terms discussed:
 A neuron is a filter whose weights are learnt during training. For
example, a 3×3×3 filter (or neuron) has 27 weights. Each neuron relates
to a particular region in the input (that is, its ‘receptive field’).
 A feature map is a mapping of where a certain feature is found in the
image. It is derived using a collection of multiple neurons each of
which relates to a different region of the input with the same weights.
All neurons associated with a particular feature map extract the same
feature (but from different regions of the input).

The following figure shows feature maps derived from the input image
using three different filters.

You can have multiple such neurons convolving an image, each having a
different set of weights and each producing a feature map.

Convolution in VGGNet
Before moving on to the next component of CNNs, let’s summarise the
entire process of convolution using the VGGNet architecture.

The VGGNet architecture was designed with the following specifications:


 Input size: 224×224×3


 Filter size: 3×3×3
 Stride: 1
 Padding: Same padding (the input and output sizes from one convolution layer to the next are the same)
 13 convolution layers
 Features: a 4096-dimensional feature vector at the end of the feature-extraction part of the network

The first convolutional layer takes the input image of size 224×224×3, uses
a 3×3×3 filter (with some padding) and produces an output of 224×224.
This 224×224 output is then fed to a ReLU to generate a 224×224 feature
map. Note that the term ‘feature map’ refers to the (non-linear) output of
the activation function, instead of the input to the activation function
(that is, the output of the convolution).

Similarly, multiple other 224×224 feature maps are generated using different 3×3×3 filters. In the case of VGGNet, 64 feature maps of size
224×224 are generated, which are referred to in the figure as the tensor
224×224×64. Each of the 64 feature maps tries to identify certain features
(such as edges, textures, etc.) in the 224×224×3 input image. The
224×224×64 tensor is the output of the first convolutional layer. In other
words, the first convolutional layer consists of 64 3×3×3 filters, and hence,
they contain 64×27 trainable weights (assuming that there are no biases).


The 64 feature maps, or the 224×224×64 tensor, are then fed to a pooling layer, which you will explore in the next segment. However, as you proceed ahead, the network extracts more features and finally ends up with a 1×1×4096 tensor, i.e., 4096 features at the end of the feature-extraction part of the network. Now, the architecture can leverage these 4096 features to classify the image accurately.

POOLING
The convolution layer takes care of the first two steps by extracting
different features from the image at different levels. After extracting the
features (in the form of feature maps), CNNs typically aggregate these
features using the pooling layer. Let’s take a look at how the pooling
layer works and how it is useful in extracting higher-level features.

The pooling layer helps to determine whether a particular region in the image has the required feature or not. It essentially looks at larger regions
(having multiple patches) of the image and captures an aggregate statistic
(max, average, etc.) of each region. In other words, it makes the network
invariant to local transformations.

The two most popular aggregate functions used in pooling are ‘max’ and
‘average’. The intuition behind these functions is as follows:


 Max pooling: If any one of the patches says something strongly about
the presence of a certain feature, then the pooling layer counts that
feature as ‘detected’.
 Average pooling: If one patch says something very firmly about the presence of a certain feature but the other patches disagree, the pooling layer takes the average of all of them to decide.

Max pooling is effective when you want to extract the most prominent, distinguishing features.


However, both max and average pooling techniques are effective when
you want to make the network invariant to local transformations. You can
also observe that pooling operates on each feature map independently. It
reduces the size (width and height) of each feature map, but the number
of feature maps remains constant.
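As a small illustration (the values are arbitrary), the NumPy sketch below applies 2×2 max pooling with a stride of 2 to one feature map, halving its height and width:

```python
# A minimal sketch of 2x2 max pooling (stride 2) on a single feature map.
# Each 2x2 patch is replaced by its maximum, halving the height and width
# while leaving the number of feature maps unchanged.
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 1, 5, 7],
                        [2, 3, 8, 4]])

# Reshape into 2x2 blocks and take the maximum of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [3 8]]
```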

Pooling offers the advantage of making the representation more compact by reducing the spatial size (height and width) of the feature maps. This
attribute is really helpful when you want to make the entire process
faster, since the decrease in the size of the layers results in the reduction
of the number of convolution operations to be performed at each layer.
This aspect is depicted in the following image:


On the other hand, the pooling process also has a disadvantage: it loses a lot of information. Having said that, pooling has empirically been shown to improve the performance of most deep CNNs.

Pooling in VGGNet
The pooling layer of VGGNet is defined as follows:
 Window size: 2×2 (aggregation of the four values in each 2×2 patch into one value)
 Stride: 2


The VGGNet architecture performs pooling at multiple points in the network. This helps reduce the calculations in the entire process, which, in turn, helps in faster result generation when the model is deployed.

PUTTING THE COMPONENTS TOGETHER


You have now studied all the main components of a typical CNN,
including convolutions, feature maps and pooling layers. Let’s now
quickly summarise and put them together to get an overall picture of
CNN architecture.

To summarise, a typical CNN architecture comprises three types of layers:
1. Convolution layers
2. Pooling layers
3. Fully-connected (FC) layers

The convolution and pooling layers together form a CNN unit or a CNN
layer. This unit is responsible for extracting the features from the image.
You start with an original image and do convolutions using different
filters to get multiple feature maps at different levels. The pooling layer
calculates the statistical aggregate of the feature maps to reduce the
spatial size and to make the entire process robust against the local
transformations.


The above image is a simple representation of a CNN unit in the VGGNet architecture. Each feature map, of size c×c, is pooled to generate a c/2 × c/2 output (for a standard 2×2 pooling with stride 2). Moreover, the pooling process only reduces the height and the width of a feature map, and not its depth (that is, the number of channels).

Once all the features are extracted, the output from the convolution layers
is flattened in the FC layers. As shown in the above image, the size of the
last pooling layer (7×7×512) is reduced to (1×1×4096). These 4096 features
are fed to the softmax layer for the final task of classification.
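Putting all of this together, the following sketch (assuming TensorFlow/Keras; the input shape, filter counts and number of classes are illustrative, not taken from VGGNet) assembles a small CNN with convolution, pooling, flatten and fully connected layers ending in a softmax:

```python
# A minimal sketch that puts the components together: convolution + ReLU,
# pooling, a flatten step and fully connected layers ending in a softmax.
# The input shape and the number of classes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                             # height x width x channels
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # feature extraction
    layers.MaxPooling2D((2, 2)),                                   # aggregate / downsample
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                              # unroll feature maps into a vector
    layers.Dense(128, activation="relu"),                          # fully connected layer
    layers.Dense(10, activation="softmax"),                        # final classification layer
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()   # shows how the spatial size shrinks while the depth grows
```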

SUMMARY
This session focused on the key components of convolution neural
networks (CNNs), which are as follows:
 Convolution layer
 Feature maps
 Pooling layer

You learnt that specialised filters, or kernels, can be designed to extract specific features from an image (such as vertical edges). A filter convolves an image and extracts features from each ‘patch’. Multiple filters are used to extract different features from the image. Convolutions can be done using various combinations of strides and paddings.

The formula to calculate the output shape after convolution is given by:

output size = ((n + 2p − k) / s) + 1

where n is the input size, k is the filter size, p is the padding and s is the stride.
The filters are learned during training (back propagation). Each filter
(consisting of weights and biases) is called a neuron, which covers a small
patch of the entire image. Multiple neurons are used to convolve an
image (or feature maps from the previous layers) to generate new feature
maps. The feature maps contain the output of convolution + non-linear
activation operations on the input.

A typical CNN unit (or layer) in a large CNN-based network comprises multiple filters (or neurons), followed by non-linear activations and then
a pooling layer. The pooling layer computes a statistical aggregate (max,
sum, etc.) over various regions of the input and reduces the sensitivity to
minor, local variations in the image. Multiple such CNN units are then
stacked together, followed by some fully connected layers, to form deep
convolutional networks.


COMPREHENSION - VGG16 ARCHITECTURE


The VGG-16 was trained on the ImageNet challenge (ILSVRC) 1000-class classification task. The network takes a (224, 224, 3) RGB image as the input. The '16' in its name comes from the fact that the network has 16 layers with trainable weights: 13 convolutional layers and 3 fully connected ones (the VGG team had tried many other configurations, such as the VGG-19, which is also quite popular).

The architecture is given in the table provided below (taken from the
original paper). Each column in the table (from A-E) denotes an
architecture that the team had experimented with. In this discussion, we
will refer to only column D, which refers to VGG-16 (column E is VGG-
19).


The convolutional layers are denoted in the table as conv<size of filter>-<number of filters>. Thus, conv3-64 means 64 (3, 3) square filters. Note that all the conv layers in VGG-16 use (3, 3) filters and the number of filters increases in powers of two (64, 128, 256, 512).

In all the convolutional layers, the same stride length of 1 pixel is used
with a padding of 1 pixel on each side, thereby preserving the spatial
dimensions (height and width) of the output.

After every set of convolutional layers, there is a max pooling layer. All
the pooling layers in the network use a window of 2 x 2 pixels with stride
2. Finally, the output of the last pooling layer is flattened and fed to a
fully connected (FC) layer with 4,096 neurons, followed by another FC
layer of 4,096 neurons, and finally to a 1000-softmax output. The softmax
layer uses the usual cross-entropy loss. All layers apart from the softmax
use the ReLU activation function.

The number of parameters and the output size from any layer can be
calculated as demonstrated in the MNIST Notebook on the previous
page. For example, the first convolutional layer takes a (224, 224, 3) image
as the input and has 64 filters of size (3, 3, 3). Note that the depth of a
filter is always equal to the number of channels in the input that it
convolves. Thus, the first convolutional layer has 64 x 3 x 3 x 3 (weights) +
64 (biases) = 1,792 trainable parameters. Since stride and padding of 1
pixel are used, the output spatial size is preserved, and the output will be
(224, 224, 64).
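If TensorFlow/Keras is available, this calculation can be cross-checked by loading the architecture and inspecting the first convolutional layer (the layer name shown in the comment follows the Keras implementation):

```python
# A quick sketch to cross-check the calculation above: load VGG-16 and
# inspect its first convolutional layer, which should report 1,792 trainable
# parameters and a (224, 224, 64) output. Assumes TensorFlow/Keras.
from tensorflow.keras.applications import VGG16

model = VGG16(weights=None, input_shape=(224, 224, 3))  # weights=None: architecture only
first_conv = model.layers[1]                            # layer 0 is the input layer
print(first_conv.name, first_conv.count_params())       # e.g. block1_conv1 1792
print(first_conv.output.shape)                          # (None, 224, 224, 64)
```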
