
Convolutional Neural networks.docx

Convolutional Neural Networks (CNNs) are a class of deep learning algorithms primarily used for image processing and recognition, leveraging convolutional layers to automatically learn hierarchical feature representations. They consist of multiple layers, including convolutional, pooling, and fully connected layers, which work together to extract features and classify images. CNNs have revolutionized computer vision since their inception, particularly highlighted by the success of AlexNet in 2012, achieving significant advancements in image recognition accuracy.

Uploaded by

niharikajain1604

Convolutional Neural Networks

Deep learning has proved to be a very powerful tool because of its ability to handle large amounts of
data. Interest in networks with hidden layers has overtaken traditional techniques, especially in pattern
recognition. One of the most popular kinds of deep neural network is the Convolutional Neural Network
(also known as CNN or ConvNet), especially when it comes to Computer Vision
applications.

Since the 1950s, the early days of AI, researchers have struggled to build a system that can
understand visual data. In the following years, this field came to be known as Computer Vision. In
2012, computer vision took a quantum leap when a group of researchers from the University of
Toronto developed an AI model that surpassed the best image recognition algorithms by a large
margin. The AI system, which became known as AlexNet (named after its main creator,
Alex Krizhevsky), won the 2012 ImageNet computer vision contest with a top-5 accuracy of about 85
percent. The runner-up scored a modest 74 percent on the test. At the heart of AlexNet was a
Convolutional Neural Network, a special type of neural network that roughly imitates human vision.

Background of CNNs

CNNs were first developed and used around the 1980s. The most that a CNN could do at that time
was recognize handwritten digits. They were mostly used in the postal sector to read zip codes,
pin codes, etc. The important thing to remember about any deep learning model is that it requires a
large amount of data to train and also requires a lot of computing resources.

What Is a CNN?

In deep learning, a Convolutional Neural Network (CNN/ConvNet) is a class of deep neural networks
most commonly applied to analyze visual imagery.

When we think of a neural network we usually think of matrix multiplications, but a ConvNet replaces
the general matrix multiplication in at least one of its layers with a special operation called
convolution. In mathematics, convolution is an operation on two functions that produces a third
function expressing how the shape of one is modified by the other.
The bottom line is that the role of the ConvNet is to reduce images into a form that is easier to process
without losing the features that are crucial for a good prediction.

How does it work?

Before we go into the working of CNNs, let us cover a basic question:

what is an image and how is it represented?

An RGB image is nothing but a matrix of pixel values with three planes (one each for red, green, and
blue), whereas a grayscale image is the same but has a single plane.

In a convolution, we take a filter/kernel (here a 3×3 matrix) and apply it to the
input image to get the convolved feature. This convolved feature is passed on to the next layer.
For an RGB image, the kernel has one 3×3 slice per color plane, and the three per-plane results are
summed into a single convolved feature.

Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial
neurons, a rough imitation of their biological counterparts, are mathematical functions that calculate
the weighted sum of multiple inputs and output an activation value. When you input an image into a
ConvNet, each layer generates several activation maps that are passed on to the next layer.

The first layer usually extracts basic features such as horizontal or diagonal edges. This output is
passed on to the next layer which detects more complex features such as corners or combinational
edges. As we move deeper into the network it can identify even more complex features such as
objects, faces, etc.
Based on the activation map of the final convolution layer, the classification layer outputs a set of
confidence scores (values between 0 and 1) that specify how likely the image is to belong to a “class.”
For instance, if you have a ConvNet that detects cats, dogs, and horses, the output of the final layer is
the probability that the input image contains each of those animals.
What Is a Pooling Layer?

Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the
Convolved Feature. This decreases the computational power required to process the data by
reducing the dimensions. There are two types of pooling: average pooling and max pooling.

In Max Pooling, we take the maximum value from the portion of the image covered by the kernel.
Max Pooling also acts as a noise suppressant: it discards the weaker, noisy activations
altogether, so it performs de-noising along with dimensionality reduction. On the other hand,
Average Pooling returns the average of all the values from the portion of the image covered by the
kernel. Average Pooling performs dimensionality reduction but only smooths noise rather than
discarding it. For this reason, Max Pooling is often found to work better than Average Pooling in practice.
What are Convolutional Neural Networks (CNNs)?

A Convolutional Neural Network (CNN) is a type of deep learning algorithm specifically designed for
image processing and recognition tasks. Compared to alternative classification models, CNNs require
less preprocessing as they can automatically learn hierarchical feature representations from raw
input images. They excel at assigning importance to various objects and features within the images
through convolutional layers, which apply filters to detect local patterns. The connectivity pattern in
CNNs is inspired by the visual cortex in the human brain, where neurons respond to specific regions
or receptive fields in the visual space. This architecture enables CNNs to effectively capture spatial
relationships and patterns in images. By stacking multiple convolutional and pooling layers, CNNs can
learn increasingly complex features, leading to high accuracy in tasks like image classification, object
detection, and segmentation.

Convolutional Neural Network Architecture Model

The CNN architecture comprises three main types of layers: convolutional layers, pooling layers, and
fully connected (FC) layers. There can be multiple convolutional and pooling layers. The more layers in the
network, the greater the complexity and (theoretically) the accuracy of the machine learning
model, since each additional layer that processes the input data can recognize more complex
objects and patterns in the data.

The Convolutional Layer: Convolutional layers are the key building block of the network, where
most of the computations are carried out. A layer works by applying a filter to the input data to identify
features. This filter, known as a feature detector, checks the image input’s receptive fields for a given
feature. This operation is referred to as convolution. The filter is a two-dimensional array of weights
that represents part of a 2-dimensional image. A filter is typically a 3×3 matrix, although there are
other possible sizes. The filter is applied to a region within the input image and calculates a dot
product between its weights and the pixels of that region, which is fed to an output array. The filter
then shifts and repeats the process until it has covered the whole image. The final output of all the
filter processes is called the feature map. A convolutional layer is typically followed by a pooling
layer. Together, the convolutional and pooling layers make up a convolutional block.

Additional convolution blocks will follow the first block, creating a hierarchical structure with later
layers learning from the earlier layers.

This layer is the first layer that is used to extract the various features from the input images. In this
layer, the mathematical operation of convolution is performed between the input image and a filter
of a particular size M x M. By sliding the filter over the input image, the dot product is taken between
the filter and the parts of the input image with respect to the size of the filter (MxM). The output is
termed as the Feature map which gives us information about the image such as the corners and
edges. Later, this feature map is fed to other layers to learn several other features of the input image.
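
The sliding dot product described above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming a single-channel image, stride 1 and no padding; the helper name `conv2d_valid`, the 5×5 image and the vertical-edge filter are made up for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    h, w = image.shape
    m, n = kernel.shape
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + m, j:j + n]      # region currently under the filter
            out[i, j] = np.sum(patch * kernel)   # dot product of filter and region
    return out

# Hypothetical 5x5 image: dark on the left (0), bright on the right (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hypothetical 3x3 filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # each row is [3. 3. 0.]: strong response where the edge is
```

A 5×5 input and a 3×3 filter give a 3×3 feature map, which matches the general rule that an M x M filter shrinks each side by M - 1 when no padding is used.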

The Pooling Layers: A pooling or down sampling layer reduces the dimensionality of the input. Like a
convolutional operation, pooling operations use a filter to sweep the whole input image, but it
doesn’t use weights. The filter instead uses an aggregation function to populate the output array
based on the receptive field’s values. There are two key types of pooling:

●​ Average pooling: The filter calculates the receptive field’s average value when it scans the
input.
●​ Max pooling: The filter sends the pixel with the maximum value to populate the output
array. This approach is more common than average pooling.
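
The two pooling types can be compared on a small feature map. The sketch below assumes non-overlapping 2×2 windows (stride equal to the window size); `pool2d` and the sample values are hypothetical and only meant to show the aggregation step.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Trim any leftover rows/columns, then view the map as a grid of blocks.
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 4]], dtype=float)

max_pooled = pool2d(fmap, mode="max")      # [[6. 4.] [8. 9.]]
avg_pooled = pool2d(fmap, mode="average")  # [[3.75 2.25] [4.5 4.]]
print(max_pooled)
print(avg_pooled)
```

Both variants halve each spatial dimension; they differ only in whether the strongest activation or the mean of the window is kept.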

The Fully Connected (FC) Layer: The FC layer performs classification tasks using the features that the
previous layers and filters extracted. Instead of ReLU, the final FC layer typically uses a softmax
function, which turns the layer's raw scores into a probability between 0 and 1 for each class, with
the probabilities summing to 1.
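
As a small illustration of how softmax produces scores between 0 and 1, here is a NumPy sketch. The three raw scores are invented for the example, and subtracting the maximum is a common numerical-stability trick rather than something the text above prescribes.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp for large scores
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores from the last FC layer for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```

The largest raw score always receives the largest probability, so the predicted class is simply the index of the maximum.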

Dropout: Usually, when all the features are connected to the FC layer, it can cause overfitting on the
training dataset. Overfitting occurs when a model fits the training data so closely that
its performance suffers on new data. To overcome this
problem, a dropout layer is used, in which a random subset of neurons is dropped from the neural
network at each training step. With a dropout rate of 0.3, 30% of
the nodes are dropped out randomly. Dropout improves the
performance of a machine learning model because it prevents the network from relying too heavily on
any single neuron. Neurons are dropped only during training; at inference time the full network is used.
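
A rough sketch of what a dropout layer does during training is shown below. It assumes the common "inverted dropout" variant, where the surviving activations are scaled up so that their expected value is unchanged; the function name and the all-ones activations are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations              # nothing is dropped at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones(10_000)                  # hypothetical layer output

dropped = dropout(acts, rate=0.3, rng=rng)
zero_fraction = np.mean(dropped == 0.0)
print(f"fraction dropped: {zero_fraction:.3f}")   # close to 0.3
print(f"mean activation:  {dropped.mean():.3f}")  # close to 1.0 thanks to rescaling
```

Because a different random mask is drawn at every step, no single neuron can be relied on, which is what gives dropout its regularizing effect.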

Activation Functions: Finally, one of the most important components of the CNN model is the
activation function. Activation functions are used to learn and approximate any kind of continuous and
complex relationship between variables of the network. In simple words, an activation function decides
which information should be passed forward through the network and which should not. It
adds non-linearity to the network. There are several commonly used activation functions such as the
ReLU, Softmax, tanh and Sigmoid functions. Each of these functions has a specific usage. For a
binary classification CNN model, sigmoid or softmax is preferred in the output layer; for multi-class
classification, softmax is generally used. In simple terms, activation functions in a CNN model
determine whether a neuron should be activated or not, that is, whether the input to the neuron is
important for the prediction, using simple mathematical operations.

The popular activation functions are:


a) Binary Step Function: Binary step function depends on a threshold value that decides
whether a neuron should be activated or not. The input fed to the activation function is
compared to a certain threshold; if the input is greater than it, then the neuron is
activated, else it is deactivated, meaning that its output is not passed on to the next
layer.

The limitations of binary step function are as follows:


●​ It can not provide multi-value outputs—for example, it can not be used for
multi-class classification problems.
●​ The gradient of the step function is zero, which causes a hindrance in the
backpropagation process.
b)​ Linear Activation Function: The linear activation function, also known as "no
activation," or "identity function" (multiplied x 1.0), is where the activation is
proportional to the input. The function doesn't do anything to the weighted sum of the
input, it simply spits out the value it was given.

However, a linear activation function has two major problems:


● Backpropagation gives no useful signal, as the derivative of the function is a
constant and has no relation to the input x.
● All layers of the neural network will collapse into one if a linear activation
function is used. No matter the number of layers in the neural network, the last
layer will still be a linear function of the first layer. So, essentially, a linear
activation function turns the neural network into just one layer.
c) Non-Linear Activation Functions: A network that uses only the linear activation function shown
above is simply a linear regression model. Because of its limited power, it does not allow the model to
create complex mappings between the network’s inputs and outputs.
Non-linear activation functions solve the following limitations of linear activation
functions:
● They allow backpropagation because the derivative is now related to
the input, so it is possible to go back and understand which weights in the input
neurons can provide a better prediction.
● They allow the stacking of multiple layers of neurons, as the output is now a non-linear
combination of the input passed through multiple layers.
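
The claim that stacked linear layers collapse into one can be checked numerically. The tiny weight matrices below are made up for the illustration, and biases are left out to keep it short.

```python
import numpy as np

x = np.array([[1.0, -1.0]])            # one hypothetical input sample
W1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])            # weights of "layer 1"
W2 = np.array([[1.0],
               [1.0]])                 # weights of "layer 2"

# Two layers with a linear (identity) activation ...
two_linear_layers = (x @ W1) @ W2
# ... give exactly the same result as one layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapsed = np.allclose(two_linear_layers, one_linear_layer)

# With a non-linear activation (ReLU here) in between, the equivalence breaks.
with_relu = np.maximum(x @ W1, 0.0) @ W2
still_equal = np.allclose(with_relu, one_linear_layer)

print(two_linear_layers, one_linear_layer, collapsed)  # [[-4.]] [[-4.]] True
print(with_relu, still_equal)                          # [[0.]] False
```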

Any output can be represented as a functional computation in a neural network. Below are different
non-linear neural networks activation functions and their characteristics:

i. Sigmoid / Logistic Activation Function: This function takes any real value as input and outputs
values in the range of 0 to 1. The larger the input (more positive), the closer the output value will
be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.

It is commonly used for models where we have to predict the probability as an output. Since
probability of anything exists only between the range of 0 and 1, sigmoid is the right choice because
of its range.

The function is differentiable and provides a smooth gradient, i.e., preventing jumps in output values.
This is represented by an S-shape of the sigmoid activation function.

ii.​ Tanh Function (Hyperbolic Tangent): Tanh function is very similar to the sigmoid/ logistic
activation function, and even has the same S-shape with the difference in output range of -1 to 1.
In Tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas
the smaller the input (more negative), the closer the output will be to -1.0.

The output of the tanh activation function is Zero centered; hence we can easily map the output
values as strongly negative, neutral, or strongly positive.

It is usually used in hidden layers of a neural network because its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering
the data and makes learning for the next layer much easier.
iii.​ ReLU Function: ReLU stands for Rectified Linear Unit. Although it gives an impression of a linear
function, ReLU has a derivative function and allows for backpropagation while simultaneously
making it computationally efficient. The main catch here is that the ReLU function does not
activate all the neurons at the same time. The neurons will only be deactivated if the output of
the linear transformation is less than 0.

Since only a certain number of neurons are activated, the ReLU function is far more computationally
efficient when compared to the sigmoid and tanh functions.

ReLU accelerates the convergence of gradient descent towards a minimum of the loss
function due to its linear, non-saturating property.

The Dying ReLU problem : The negative side of the graph makes the gradient value zero. Due to this
reason, during the backpropagation process, the weights and biases for some neurons are not
updated. This can create dead neurons which never get activated. All the negative input values
become zero immediately, which decreases the model’s ability to fit or train from the data properly.

iv.​ Leaky ReLU Function: Leaky ReLU is an improved version of ReLU function to solve the Dying
ReLU problem as it has a small positive slope in the negative area.

The advantages of Leaky ReLU are same as that of ReLU, in addition to the fact that it does enable
back propagation, even for negative input values. By making this minor modification for negative
input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore,
we would no longer encounter dead neurons in that region.
The limitations that this function faces include: The predictions may not be consistent for negative
input values. The gradient for negative values is a small value that makes the learning of model
parameters time-consuming.
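
The four non-linear activation functions discussed above are short enough to write out directly. The NumPy sketch below also includes the gradients of ReLU and Leaky ReLU, to show where the dying ReLU problem comes from; the slope of 0.01 is a commonly used default and is assumed here, not taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # negative inputs become 0

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # small slope for negative inputs

def relu_grad(x):
    return (x > 0).astype(float)           # gradient is 0 for negative inputs

def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)     # gradient never becomes exactly 0

x = np.array([-2.0, 0.0, 2.0])
print("sigmoid:        ", sigmoid(x))
print("tanh:           ", np.tanh(x))      # output in (-1, 1), zero centered
print("relu:           ", relu(x))
print("leaky relu:     ", leaky_relu(x))
print("relu grad:      ", relu_grad(x))
print("leaky relu grad:", leaky_relu_grad(x))
```

For the input -2.0 the ReLU gradient is 0, so no update flows back through that neuron, while Leaky ReLU still passes a gradient of 0.01.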

Types of Convolutional Neural Networks

There are different CNN architectures, including LeNet, AlexNet, VGG-16 Net, ResNet and Inception Net.

Deep CNNs (Convolutional Neural Networks) like LeNet and AlexNet are widely used architectures in
deep learning for image classification and recognition.

1. LeNet-5

LeNet-5, also known as the classic neural network, was designed by Yann LeCun, Léon Bottou,
Yoshua Bengio and Patrick Haffner in the 1990s for handwritten and machine-printed character
recognition. The architecture was designed to identify handwritten digits in the
MNIST dataset. The architecture is pretty straightforward and simple to understand. The input
images are grayscale with dimensions 32×32×1, followed by two pairs of a convolution layer with
stride 1 and an average pooling layer with stride 2. Finally, there are fully connected layers with a
Softmax activation in the output layer. This network has about 60,000 parameters in total.

Architecture Overview

LeNet-5 consists of 7 layers (not counting the input), including convolutional, pooling, and fully connected layers:

1.​ Input Layer – 32×32 grayscale images (e.g., MNIST digits).

2.​ Conv Layer 1 (C1) – 6 filters (5×5) → Feature extraction.

3.​ Pooling Layer 1 (S2) – 2×2 average pooling → Downsampling.

4.​ Conv Layer 2 (C3) – 16 filters (5×5) → Deeper feature extraction.

5.​ Pooling Layer 2 (S4) – 2×2 average pooling → Further downsampling.

6.​ Fully Connected Layer 1 (F5) – 120 neurons.


7.​ Fully Connected Layer 2 (F6) – 84 neurons.

8.​ Output Layer – 10 neurons (for digit classification, 0-9).
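
The layer list above can be checked with a little arithmetic. The sketch below traces the feature-map size through the network and counts the trainable parameters, assuming no padding, stride-1 convolutions, stride-2 pooling without trainable weights, and full connectivity between S2 and C3 (the original paper used a sparser connection scheme, so its exact count differs slightly). The helper names are made up for the example.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of a convolution or pooling window without padding."""
    return (size - kernel) // stride + 1

def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per filter."""
    return out_channels * (kernel * kernel * in_channels + 1)

def dense_params(n_in, n_out):
    """Weights plus one bias per neuron."""
    return n_out * (n_in + 1)

size = 32                                # input: 32x32 grayscale
size = conv_out(size, 5)                 # C1: 5x5 conv      -> 28x28x6
size = conv_out(size, 2, stride=2)       # S2: 2x2 avg pool  -> 14x14x6
size = conv_out(size, 5)                 # C3: 5x5 conv      -> 10x10x16
size = conv_out(size, 2, stride=2)       # S4: 2x2 avg pool  -> 5x5x16
flat = size * size * 16                  # 400 values go into the FC layers

total_params = (
    conv_params(1, 6, 5)                 # C1:     156
    + conv_params(6, 16, 5)              # C3:   2,416
    + dense_params(flat, 120)            # F5:  48,120
    + dense_params(120, 84)              # F6:  10,164
    + dense_params(84, 10)               # out:    850
)
print(size, flat, total_params)          # 5 400 61706
```

Under these assumptions the total comes to 61,706, in line with the figure of roughly 60,000 parameters quoted above.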

2. AlexNet

This network was very similar to LeNet-5 but was deeper, with 8 layers, more filters, stacked
convolutional layers, max pooling, dropout, data augmentation, ReLU and SGD. AlexNet was the
winner of the ImageNet ILSVRC-2012 competition and was designed by Alex Krizhevsky, Ilya Sutskever and
Geoffrey E. Hinton. It was trained on two Nvidia GeForce GTX 580 GPUs; therefore, the network was
split into two pipelines. AlexNet has 5 convolution layers and 3 fully connected layers and
consists of approximately 60 M parameters. A major drawback of this network was that it has
too many hyper-parameters.

Architecture Overview

AlexNet is a deeper CNN than LeNet, consisting of 8 layers:

1.​ Conv Layer 1 – 96 filters (11×11) + ReLU + Max Pooling.

2.​ Conv Layer 2 – 256 filters (5×5) + ReLU + Max Pooling.

3.​ Conv Layer 3 – 384 filters (3×3) + ReLU.

4.​ Conv Layer 4 – 384 filters (3×3) + ReLU.

5.​ Conv Layer 5 – 256 filters (3×3) + ReLU + Max Pooling.

6.​ Fully Connected Layer 1 – 4096 neurons + ReLU + Dropout.

7.​ Fully Connected Layer 2 – 4096 neurons + ReLU + Dropout.


8.​ Output Layer – 1000 neurons (for 1000 ImageNet categories) + Softmax.

3. VGG-16 Net

VGG Net addressed AlexNet's major shortcoming of too many hyper-parameters by
replacing large kernel-sized filters (11 and 5 in the first and second convolution layers, respectively)
with multiple 3×3 kernel-sized filters one after another. The architecture, developed by Simonyan and
Zisserman, was the runner-up in the classification task of the ImageNet Large Scale Visual
Recognition Challenge of 2014. The architecture consists of 3×3 convolutional filters with a stride of
1, using same padding to preserve the spatial dimensions, and 2×2 max pooling layers with a stride
of 2. In total, there are 16 weight layers in the network. The input image is
in RGB format with dimensions 224×224×3, followed by 5 blocks of convolution layers (filters: 64, 128,
256, 512, 512), each ending in max pooling. The output of these layers is fed into three fully
connected layers and a softmax function in the output layer. In total there are 138 million parameters
in VGG Net.
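
One way to see why stacking 3×3 filters is attractive is to compare weights and receptive fields. The short calculation below is a sketch under simple assumptions (stride 1, the same number of input and output channels, biases ignored); the channel count of 64 is picked only for illustration.

```python
def conv_weights(kernel, channels):
    """Weights (biases ignored) of one conv layer with `channels` in and out."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernel, n_layers):
    """Receptive field of `n_layers` stacked stride-1 convs of the same kernel size."""
    return 1 + n_layers * (kernel - 1)

channels = 64  # hypothetical channel count

one_5x5 = conv_weights(5, channels)        # a single 5x5 layer
two_3x3 = 2 * conv_weights(3, channels)    # two stacked 3x3 layers
rf_two_3x3 = stacked_receptive_field(3, 2)

print(one_5x5, two_3x3, rf_two_3x3)        # 102400 73728 5
```

Two stacked 3×3 layers see the same 5×5 region as one 5×5 layer while using fewer weights, and they apply two non-linearities instead of one.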

Training CNNs

Training a Convolutional Neural Network (CNN) involves multiple steps, including forward
propagation, loss computation, backpropagation, and weight updates.

Steps in Training a CNN

1.​ Forward Propagation

o​ Input images are passed through convolutional layers, activation functions, pooling
layers, and fully connected layers.

o​ The output is the predicted class probabilities.


2.​ Loss Computation

o​ The loss function (e.g., cross-entropy loss) measures the difference between
predicted and actual labels.

3.​ Backpropagation & Weight Updates

o​ Gradients are calculated using backpropagation.

o​ Weights are updated using gradient descent (or an optimizer like Adam, SGD,
RMSprop).

4.​ Iterate Over Multiple Epochs

o​ The process is repeated over multiple epochs until convergence (when loss stops
decreasing significantly).
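
The four steps can be written out as a bare-bones training loop. To keep the sketch short and runnable without a deep learning framework, the CNN is replaced here by a single linear layer followed by softmax, trained on made-up toy data with plain gradient descent; the structure of the loop (forward pass, loss, gradients, update, repeat) is the same for a real CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 4 features, 3 classes.
n, d, k = 200, 4, 3
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, k)), axis=1)
Y = np.eye(k)[y]                      # one-hot labels

W = np.zeros((d, k))                  # weights
b = np.zeros(k)                       # biases
learning_rate = 0.5
losses = []

for epoch in range(100):              # 4. iterate over multiple epochs
    # 1. Forward propagation: scores -> class probabilities (softmax).
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # 2. Loss computation: cross-entropy between predictions and labels.
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))
    losses.append(loss)

    # 3. Backpropagation: gradient of the loss w.r.t. W and b ...
    grad_logits = (probs - Y) / n
    grad_W = X.T @ grad_logits
    grad_b = grad_logits.sum(axis=0)

    # ... and weight update with plain gradient descent.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With all weights at zero the first loss equals ln(3), the loss of a uniform guess over three classes, and it falls as training proceeds.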

Weights Initialization

Weights Initialization refers to setting the initial values of the CNN's weights before training begins.
Proper initialization prevents problems like vanishing/exploding gradients and speeds up
convergence.

Types of Weight Initialization:

1.​ Zero Initialization (Not recommended)

o​ All weights = 0 → All neurons learn the same thing → No learning happens.

2.​ Random Initialization

o​ Weights are assigned random values.

o​ Can lead to problems like large/small gradients.

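3. Xavier (Glorot) Initialization (Best for sigmoid/tanh activations)

o Keeps the variance of activations and gradients roughly constant from layer to layer.

o Formula: $\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$, where $n_{in}$ and $n_{out}$ are the numbers of inputs and outputs of the layer.

o Suitable for shallow networks.

4. He Initialization (Best for ReLU activations)

o Helps with ReLU networks (including the dying ReLU problem) by compensating for the activations that ReLU sets to zero.

o Formula: $\mathrm{Var}(W) = \frac{2}{n_{in}}$

o Recommended for deep CNNs.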
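
As a sketch of how these two schemes are usually implemented, the snippet below draws weights from a normal distribution with variance 2 / (n_in + n_out) for Xavier and 2 / n_in for He. These are the commonly cited normal-distribution variants; the layer sizes of 512 and 256 and the function names are assumptions for the example.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He: Var(W) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(512, 256, rng)   # hypothetical layer: 512 inputs, 256 outputs
W_he = he_normal(512, 256, rng)

print(f"Xavier std: {W_xavier.std():.4f} (target {np.sqrt(2 / 768):.4f})")
print(f"He std:     {W_he.std():.4f} (target {np.sqrt(2 / 512):.4f})")
```

He initialization uses a larger spread than Xavier for the same layer, which offsets the fact that ReLU zeroes out the negative half of its inputs.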


Batch Normalization (BatchNorm)

Batch Normalization (BatchNorm) helps stabilize and speed up CNN training by normalizing
activations during training.

How BatchNorm Works

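● In each mini-batch, it normalizes activations (zero mean, unit variance): $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance.

● It then applies learnable parameters (γ, β) to restore representational power: $y = \gamma \hat{x} + \beta$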
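
The two steps can be sketched in NumPy as follows. This is a training-time forward pass only, assuming a 2-D batch of shape (samples, features); in a convolutional layer the statistics are normally computed per channel, and at inference time running averages are used instead of batch statistics. The batch values are made up.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale and shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical activations

out = batchnorm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))   # approximately 0 for every feature
print(out.std(axis=0).round(3))    # approximately 1 for every feature
```

With γ = 1 and β = 0 the output is just the normalized activations; during training the network learns other values of γ and β where that helps.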

Benefits of BatchNorm

✅ Reduces internal covariate shift (shifts in distributions during training).​


✅ Allows higher learning rates.​
✅ Reduces dependence on careful weight initialization.​
✅ Acts as a regularizer (similar to dropout).
Hyperparameter Optimization

Hyperparameter tuning is the process of finding the best set of hyperparameters to optimize CNN
performance.

Key Hyperparameters in CNNs

1.​ Learning Rate (α)

o​ Controls how much weights are updated.

o​ Too high → Model diverges.

o​ Too low → Training is slow.

o​ Solution: Use learning rate scheduling.

2. Batch Size

o Small batch = Noisy training but often generalizes well.

o Large batch = Stable gradients but may generalize less well.

o Typical values: 32, 64, 128.

3.​ Number of Filters in Conv Layers

o​ More filters → Learn complex features but increase computation.

4.​ Dropout Rate

o​ Prevents overfitting by randomly deactivating neurons.


5.​ Optimization Algorithm

o​ SGD: Works well but slow.

o​ Adam: Adaptive learning rate (faster convergence).

o​ RMSprop: Good for non-stationary problems.

Methods for Hyperparameter Optimization

●​ Grid Search: Tries all possible combinations (slow).

●​ Random Search: Selects random combinations (faster).

●​ Bayesian Optimization: Uses probability to find best hyperparameters.

●​ Hyperband: Efficiently allocates resources to promising configurations.
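
The difference between grid search and random search can be sketched with the standard library alone. In the example below the search space is invented, and `validation_score` is a hypothetical stand-in for "train a CNN with this configuration and return its validation accuracy", which would be far too slow to run here.

```python
import itertools
import random

# Hypothetical search space.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "dropout": [0.3, 0.5],
}

def validation_score(config):
    """Stand-in for training a model; peaks at lr=1e-3, batch=64, dropout=0.5."""
    return (1.0
            - abs(config["learning_rate"] - 1e-3) * 10
            - abs(config["batch_size"] - 64) / 1000
            - abs(config["dropout"] - 0.5) * 0.1)

def grid_search(space):
    """Evaluate every combination on the grid."""
    keys = list(space)
    configs = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return max(configs, key=validation_score), len(configs)

def random_search(space, n_trials, seed=0):
    """Evaluate only `n_trials` randomly sampled combinations."""
    rng = random.Random(seed)
    configs = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]
    return max(configs, key=validation_score), len(configs)

best_grid, n_grid = grid_search(space)
best_random, n_random = random_search(space, n_trials=6)
print(n_grid, best_grid)
print(n_random, best_random)
```

Grid search needs 18 evaluations here and is guaranteed to find the best point on the grid; random search uses a third of the budget but may settle for a slightly worse configuration.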

Using a Pretrained ConvNet (Convolutional Neural Network)

A pretrained ConvNet is a CNN model (e.g., VGG-16 or AlexNet) that has already been trained on a
large dataset such as ImageNet and can be used for new tasks like image classification, object
detection, or feature extraction.

Why Use a Pretrained ConvNet?

✅ Saves Time & Compute Power – Training a deep CNN from scratch requires a huge dataset and
days of GPU computation.

✅ Improves Accuracy – Pretrained models have already learned useful features, leading to better
performance.

✅ Works Well with Small Datasets – If you don’t have much labeled data, pretrained models can
still work well.

Practical Question: Classify the MNIST dataset using any pretrained model like AlexNet or LeNet

● Load and preprocess the MNIST dataset.
● Convert grayscale MNIST images to 3 channels (RGB) since models pretrained on ImageNet expect
3-channel input.
● Use the AlexNet architecture as the base model. Keras does not ship pretrained AlexNet weights, so
the solution below trains the network from scratch.
● Use a 10-unit last layer to classify the 10 digits.
● Train and evaluate the model.

Implementing AlexNet for MNIST Classification

Solution:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Convert grayscale (1 channel) to RGB (3 channels) and normalize pixel values to [0, 1].
# float32 halves the memory needed compared with NumPy's default float64.
x_train = np.stack((x_train,) * 3, axis=-1).astype("float32") / 255.0
x_test = np.stack((x_test,) * 3, axis=-1).astype("float32") / 255.0

# Convert labels to categorical format (one-hot encoding)
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

def preprocess(image, label):
    # Resize each 28x28 image to the 224x224 input size expected by the model
    image = tf.image.resize(image, (224, 224))
    return image, label

# Create tf.data datasets (images are resized on the fly, batch by batch)
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).map(preprocess).batch(64)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).map(preprocess).batch(64)

# Define an AlexNet-style model (no pretrained weights are loaded; it is trained from scratch)
def create_alexnet():
    model = keras.Sequential([
        layers.Conv2D(96, (11, 11), strides=4, activation='relu', input_shape=(224, 224, 3)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((3, 3), strides=2),

        layers.Conv2D(256, (5, 5), activation='relu', padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((3, 3), strides=2),

        layers.Conv2D(384, (3, 3), activation='relu', padding="same"),
        layers.Conv2D(384, (3, 3), activation='relu', padding="same"),
        layers.Conv2D(256, (3, 3), activation='relu', padding="same"),
        layers.MaxPooling2D((3, 3), strides=2),

        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')  # 10 output classes for MNIST
    ])
    return model

# Compile the model
model = create_alexnet()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model (using dynamic resizing)
history = model.fit(train_ds, validation_data=test_ds, epochs=5)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_ds, verbose=2)
print(f'\nTest accuracy: {test_acc}')

# Predict on the first 5 test images; they must be resized to the model's input size first
sample = tf.image.resize(x_test[:5], (224, 224))
predictions = model.predict(sample)
predicted_classes = np.argmax(predictions, axis=1)  # Convert probabilities to class labels
print("Predicted Classes:", predicted_classes)
