Convolutional Neural Networks
Deep Learning has proved to be a very powerful tool because of its ability to handle large amounts of data. Interest in models with many hidden layers has surpassed traditional techniques, especially in pattern recognition. One of the most popular deep neural network architectures is the Convolutional Neural Network (also known as CNN or ConvNet), especially when it comes to Computer Vision applications.
Since the 1950s, the early days of AI, researchers have struggled to build systems that can understand visual data. In the following years, this field came to be known as Computer Vision. In 2012, computer vision took a quantum leap when a group of researchers from the University of Toronto developed an AI model that surpassed the best image recognition algorithms by a large margin. The AI system, which became known as AlexNet (named after its main creator, Alex Krizhevsky), won the 2012 ImageNet computer vision contest with an impressive 85 percent accuracy; the runner-up scored a modest 74 percent on the test. At the heart of AlexNet was the convolutional neural network, a special type of neural network that roughly imitates human vision.
Background of CNNs
CNNs were first developed and used around the 1980s. The most a CNN could do at that time was recognize handwritten digits. It was mostly used in the postal sector to read ZIP codes, PIN codes, etc. The important thing to remember about any deep learning model is that it requires a large amount of data to train and a lot of computing resources.
What Is a CNN?
When we think of a neural network, we usually think of matrix multiplications, but that is not the whole story with a ConvNet. It uses a special technique called convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other.
The bottom line is that the ConvNet's role is to reduce images into a form that is easier to process, without losing the features that are crucial for a good prediction.
An RGB image is a matrix of pixel values with three planes (channels), whereas a grayscale image has a single plane.
In a convolution, we take a filter/kernel (for example, a 3×3 matrix) and slide it over the input image to compute the convolved feature. This convolved feature is passed on to the next layer. In the case of an RGB image, the filter has a matching depth of three, and the results from the three planes are summed into a single output value.
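To make the sliding-window arithmetic concrete, here is a minimal sketch of a single-channel convolution in plain NumPy (the 5×5 toy image and the vertical-edge kernel are illustrative):

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and
    # sum the element-wise products at each position.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # a toy 5x5 "grayscale image"
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])              # a vertical-edge detector
print(convolve2d(image, kernel))                   # the 3x3 convolved feature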
Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons, a rough imitation of their biological counterparts, are mathematical functions that calculate the weighted sum of multiple inputs and output an activation value. When you input an image into a ConvNet, each layer generates several activation maps that are passed on to the next layer.
The first layer usually extracts basic features such as horizontal or diagonal edges. This output is passed on to the next layer, which detects more complex features such as corners or combinations of edges. As we move deeper into the network, it can identify even more complex features such as objects, faces, etc.
Based on the activation map of the final convolution layer, the classification layer outputs a set of confidence scores (values between 0 and 1) that specify how likely the image is to belong to each “class.” For instance, if you have a ConvNet that detects cats, dogs, and horses, the output of the final layer is the probability that the input image contains each of those animals.
What Is a Pooling Layer?
Similar to the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data by reducing the dimensions. There are two types of pooling: average pooling and max pooling.
In max pooling, the maximum pixel value in the portion of the image covered by the kernel is taken. Max pooling also acts as a noise suppressant: it discards noisy activations altogether, performing de-noising along with dimensionality reduction. Average pooling, on the other hand, returns the average of all the values in the portion of the image covered by the kernel; it performs dimensionality reduction without this noise-suppressing effect. Hence, max pooling often performs better than average pooling in practice.
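A minimal NumPy sketch of both pooling types over a toy 4×4 feature map (the 2×2 window with stride 2 and the values are illustrative):

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # Non-overlapping pooling: the window size equals the stride.
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 2, 8]], dtype=float)
print(pool2d(fm, mode="max"))   # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="avg"))   # [[3.75 2.25] [4.   5.  ]]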
What are Convolutional Neural Networks (CNNs)?
A Convolutional Neural Network (CNN) is a type of deep learning algorithm specifically designed for
image processing and recognition tasks. Compared to alternative classification models, CNNs require
less preprocessing as they can automatically learn hierarchical feature representations from raw
input images. They excel at assigning importance to various objects and features within the images
through convolutional layers, which apply filters to detect local patterns. The connectivity pattern in
CNNs is inspired by the visual cortex in the human brain, where neurons respond to specific regions
or receptive fields in the visual space. This architecture enables CNNs to effectively capture spatial
relationships and patterns in images. By stacking multiple convolutional and pooling layers, CNNs can
learn increasingly complex features, leading to high accuracy in tasks like image classification, object
detection, and segmentation.
The CNN architecture comprises three main types of layer: convolutional layers, pooling layers, and a fully connected (FC) layer. There can be multiple convolutional and pooling layers. The more layers in the network, the greater the complexity and (theoretically) the accuracy of the machine learning model. Each additional layer that processes the input data increases the model's ability to recognize objects and patterns in the data.
The Convolutional Layer: Convolutional layers are the key building block of the network, where most of the computations are carried out. A convolutional layer works by applying a filter to the input data to identify features. This filter, known as a feature detector, checks the image input's receptive fields for a given feature. This operation is referred to as convolution. The filter is a two-dimensional array of weights, typically a 3×3 matrix, although other sizes are possible. The filter is applied to a region within the input image and a dot product is calculated between the filter weights and the pixels, which is fed to an output array. The filter then shifts and repeats the process until it has covered the whole image. The final output of all the filter passes is called the feature map. A convolutional layer is typically followed by a pooling layer; together, the convolutional and pooling layers make up a convolutional block.
Additional convolution blocks will follow the first block, creating a hierarchical structure with later
layers learning from the earlier layers.
This layer is the first layer used to extract the various features from the input images. In this layer, the mathematical operation of convolution is performed between the input image and a filter of a particular size M×M. By sliding the filter over the input image, the dot product is taken between the filter and the parts of the input image matching the size of the filter (M×M). The output is termed the feature map, which gives us information about the image such as corners and edges. Later, this feature map is fed to other layers to learn several other features of the input image.
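A standard sizing rule ties these quantities together (a textbook formula, not stated above): for an N×N input, an M×M filter, padding P and stride S, each side of the output feature map has size O = (N − M + 2P) / S + 1. For example, a 32×32 input convolved with a 3×3 filter at P = 0 and S = 1 yields a 30×30 feature map.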
The Pooling Layers: A pooling or downsampling layer reduces the dimensionality of the input. Like a convolutional operation, pooling operations use a filter that sweeps the whole input image, but this filter has no weights. Instead, it applies an aggregation function to the receptive field's values to populate the output array. There are two key types of pooling:
● Average pooling: The filter calculates the receptive field’s average value when it scans the
input.
● Max pooling: The filter sends the pixel with the maximum value to populate the output
array. This approach is more common than average pooling.
The Fully Connected (FC) Layer: The FC layer performs classification based on the features that the previous layers and filters extracted. Instead of a ReLU function, the FC layer typically uses a softmax function, which classifies inputs and produces a probability score between 0 and 1 for each class.
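For reference, the softmax function converts the final layer's raw scores z_1, …, z_K into class probabilities: softmax(z_i) = e^(z_i) / (e^(z_1) + … + e^(z_K)), so each output lies between 0 and 1 and all outputs sum to 1.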
Dropout: Usually, when all the features are connected to the FC layer, the model can overfit the training dataset. Overfitting occurs when a model works so well on the training data that it performs poorly on new data. To overcome this problem, a dropout layer is utilised, wherein a few neurons are dropped from the neural network during the training process, resulting in a smaller effective model. With a dropout rate of 0.3, 30% of the nodes are dropped out randomly from the neural network. Dropout improves the performance of a machine learning model because it prevents overfitting by making the network simpler while it trains.
Activation Functions: Finally, one of the most important components of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between the variables of the network. In simple words, an activation function decides which information should be propagated forward through the network and which should not, and it adds non-linearity to the network. There are several commonly used activation functions, such as ReLU, softmax, tanh and sigmoid, each with a specific usage: for a binary classification CNN model, sigmoid and softmax functions are preferred, while for multi-class classification, softmax is generally used. In simple terms, activation functions in a CNN model determine whether a neuron should be activated or not, i.e., whether its input is important to the prediction.
They allow the stacking of multiple layers of neurons, since the output is then a non-linear combination of inputs passed through multiple layers. Any output can be represented as a functional computation in a neural network. Below are different non-linear activation functions and their characteristics:
i. Sigmoid / Logistic Activation Function: This function takes any real value as input and outputs values in the range of 0 to 1. The larger the input (more positive), the closer the output will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0, as shown below:
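σ(x) = 1 / (1 + e^(−x))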
It is commonly used for models where we have to predict a probability as an output. Since probabilities exist only in the range of 0 to 1, sigmoid is the right choice because of its range.
The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is reflected in the S-shape of the sigmoid activation function.
ii. Tanh Function (Hyperbolic Tangent): The tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape, but with an output range of −1 to 1. In tanh, the larger the input (more positive), the closer the output will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to −1.0.
The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
Tanh is usually used in the hidden layers of a neural network, as its values lie between −1 and 1; the mean of a hidden layer's outputs therefore comes out to be 0 or very close to it. This helps centre the data and makes learning much easier for the next layer.
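For reference, the standard definition: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), which equals 2σ(2x) − 1 in terms of the sigmoid above.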
iii. ReLU Function: ReLU stands for Rectified Linear Unit. Although it gives the impression of a linear function, ReLU has a derivative and allows for backpropagation while being computationally efficient. The main catch is that the ReLU function does not activate all the neurons at the same time: a neuron is deactivated only if the output of the linear transformation is less than 0.
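Formally: f(x) = max(0, x), i.e., the output is x for positive inputs and 0 otherwise.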
Since only a certain number of neurons are activated, the ReLU function is far more computationally
efficient when compared to the sigmoid and tanh functions.
ReLU accelerates the convergence of gradient descent towards a minimum of the loss function due to its linear, non-saturating behaviour for positive inputs.
The Dying ReLU problem: The negative side of the graph makes the gradient value zero. Due to this
reason, during the backpropagation process, the weights and biases for some neurons are not
updated. This can create dead neurons which never get activated. All the negative input values
become zero immediately, which decreases the model’s ability to fit or train from the data properly.
iv. Leaky ReLU Function: Leaky ReLU is an improved version of the ReLU function that addresses the Dying ReLU problem by having a small positive slope in the negative region.
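Formally: f(x) = x for x > 0, and f(x) = αx for x ≤ 0, where α is a small constant (commonly 0.01).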
The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values. With this minor modification, the gradient on the left side of the graph becomes non-zero, so we no longer encounter dead neurons in that region.
Its limitations include: the predictions may not be consistent for negative input values, and the gradient for negative values is so small that it makes learning the model parameters time-consuming.
There are different CNN architectures: LeNet, AlexNet, VGG-16, ResNet and Inception Net.
Deep CNNs (Convolutional Neural Networks) like LeNet and AlexNet are widely used architectures in deep learning for image classification and recognition.
1. LeNet-5
This is also known as the classic neural network, designed by Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character recognition in the 1990s, which they called LeNet-5. The architecture was designed to identify handwritten digits in the MNIST dataset, and it is pretty straightforward and simple to understand. The input images were grayscale with dimensions of 32×32×1, followed by two pairs of a convolution layer (stride 1) and an average pooling layer (stride 2), and finally fully connected layers with softmax activation in the output layer. Traditionally, this network had around 60,000 parameters in total.
Architecture Overview
LeNet-5 consists of 7 layers, including convolutional, pooling, and fully connected layers: C1 (convolution, 6 feature maps of 28×28), S2 (average pooling, 6@14×14), C3 (convolution, 16@10×10), S4 (average pooling, 16@5×5), C5 (convolution, 120 units), F6 (fully connected, 84 units), and a 10-unit output layer.
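A minimal Keras sketch following the layer list above (tanh is used here as a stand-in for the original network's scaled nonlinearities):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def create_lenet5():
    return keras.Sequential([
        layers.Conv2D(6, 5, activation='tanh', input_shape=(32, 32, 1)),  # C1: 6@28x28
        layers.AveragePooling2D(2),                                       # S2: 6@14x14
        layers.Conv2D(16, 5, activation='tanh'),                          # C3: 16@10x10
        layers.AveragePooling2D(2),                                       # S4: 16@5x5
        layers.Conv2D(120, 5, activation='tanh'),                         # C5: 120@1x1
        layers.Flatten(),
        layers.Dense(84, activation='tanh'),                              # F6
        layers.Dense(10, activation='softmax'),                           # output
    ])

model = create_lenet5()
model.summary()   # roughly 60,000 trainable parameters, as noted above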
2. AlexNet
This network was very similar to LeNet-5 but deeper, with 8 layers, more filters, stacked convolutional layers, max pooling, dropout, data augmentation, ReLU and SGD. AlexNet was the winner of the ImageNet ILSVRC-2012 competition, designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. It was trained on two Nvidia GeForce GTX 580 GPUs, so the network was split into two pipelines. AlexNet has 5 convolution layers and 3 fully connected layers, and consists of approximately 60 million parameters. A major drawback of this network was that it comprised too many hyper-parameters.
Architecture Overview
AlexNet's layers, in order: Conv1 (96 filters, 11×11, stride 4) with max pooling, Conv2 (256 filters, 5×5) with max pooling, Conv3 (384 filters, 3×3), Conv4 (384 filters, 3×3), Conv5 (256 filters, 3×3) with max pooling, then FC (4096), FC (4096) and a 1000-way softmax output.
3. VGG-16 Net
The major shortcoming of AlexNet's many hyper-parameters was addressed by VGG Net, which replaced the large kernel-sized filters (11×11 and 5×5 in the first and second convolution layers, respectively) with multiple 3×3 kernel-sized filters one after another. The architecture, developed by Simonyan and Zisserman, was the first runner-up of the ImageNet Visual Recognition Challenge of 2014. It consists of 3×3 convolutional filters with 'same' padding to preserve the spatial dimensions, and 2×2 max pooling layers with a stride of 2. In total, there are 16 weight layers in the network. The input image is in RGB format with dimensions of 224×224×3, followed by 5 blocks of convolution layers (filters: 64, 128, 256, 512, 512), each followed by max pooling. The output of these layers is fed into three fully connected layers and a softmax function in the output layer. In total there are 138 million parameters in VGG Net.
Training CNNs
Training a Convolutional Neural Network (CNN) involves multiple steps, including forward
propagation, loss computation, backpropagation, and weight updates.
o Input images are passed through convolutional layers, activation functions, pooling
layers, and fully connected layers.
o The loss function (e.g., cross-entropy loss) measures the difference between
predicted and actual labels.
o Weights are updated using gradient descent (or an optimizer like Adam, SGD,
RMSprop).
o The process is repeated over multiple epochs until convergence (when loss stops
decreasing significantly).
Weights Initialization
Weights Initialization refers to setting the initial values of the CNN's weights before training begins.
Proper initialization prevents problems like vanishing/exploding gradients and speeds up
convergence.
o Initializing all weights to 0 makes every neuron learn the same thing, so no learning happens; weights are instead drawn randomly from a carefully scaled distribution.
o Xavier/Glorot initialization (suited to sigmoid/tanh): Var(W) = 2 / (n_in + n_out), where n_in and n_out are the number of inputs and outputs of the layer.
o He initialization (suited to ReLU): Var(W) = 2 / n_in.
Batch Normalization (BatchNorm) helps stabilize and speed up CNN training by normalizing
activations during training.
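Concretely, for each mini-batch B, BatchNorm computes (standard formulation) x̂ = (x − μ_B) / √(σ_B² + ε) and outputs y = γ·x̂ + β, where μ_B and σ_B² are the batch mean and variance, ε is a small constant for numerical stability, and γ, β are learned scale and shift parameters.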
Benefits of BatchNorm
● Allows higher learning rates and speeds up convergence.
● Reduces sensitivity to weight initialization.
● Has a mild regularizing effect, which can reduce the need for dropout.
Hyperparameter tuning is the process of finding the best set of hyperparameters, such as the learning rate, batch size, number and size of filters, and dropout rate, to optimize CNN performance; a simple search is sketched below.
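A minimal sketch of a manual grid search over two hyperparameters (the tiny model, the grids and the omitted dataset are illustrative placeholders):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(filters, dropout_rate):
    # A tiny CNN whose hyperparameters we vary
    return keras.Sequential([
        layers.Conv2D(filters, 3, activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(dropout_rate),
        layers.Dense(10, activation='softmax'),
    ])

for lr in [1e-2, 1e-3]:                  # learning-rate grid
    for filters in [16, 32]:             # filter-count grid
        model = build_model(filters, dropout_rate=0.3)
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        # model.fit(...) on the training data, then keep the configuration
        # with the best validation accuracy (data loading omitted here).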
A pretrained ConvNet is a CNN model that has already been trained on a large dataset such as ImageNet (e.g., VGG-16 or AlexNet) and can be reused for new tasks like image classification, object detection, or feature extraction (a minimal sketch follows the list below).
✅ Saves Time & Compute Power – Training a deep CNN from scratch requires a huge dataset and days of GPU computation.
✅ Improves Accuracy – Pretrained models have already learned useful features, leading to better performance.
✅ Works Well with Small Datasets – If you don’t have much labeled data, pretrained models can still be fine-tuned effectively.
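As a concrete illustration, a minimal transfer-learning sketch using the VGG-16 weights shipped with Keras (the 5-class head is a hypothetical placeholder):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load VGG-16 pretrained on ImageNet, without its 1000-way classification head
base = keras.applications.VGG16(weights='imagenet', include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                      # freeze the pretrained features

# Attach a small classifier for a hypothetical 5-class task
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])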
Practical Question: Classify the MNIST dataset using a pretrained-style model such as AlexNet or LeNet.
Solution:
A minimal working sketch (the convolutional stack below is simplified to fit MNIST's 28×28 grayscale inputs; the original AlexNet expects 224×224 RGB images):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load the MNIST dataset and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

def create_alexnet():
    # Simplified AlexNet-style network adapted to 28x28x1 inputs
    model = keras.Sequential([
        layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax'),  # 10 digit classes
    ])
    return model

model = create_alexnet()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_data=(x_test, y_test))