Introduction to Convolutional Neural Networks
Input Layer: This is the layer through which we feed input to the model. The number of neurons in this layer
is equal to the total number of features in the data (the number of pixels in the case of an image).
Hidden Layers: The output of the input layer is fed into the hidden layers. There can be many
hidden layers, depending on the model and the size of the data. Each hidden layer can have a different number
of neurons, generally greater than the number of features. The output of each layer is
computed by matrix multiplication of the previous layer's output with that layer's learnable weights,
followed by the addition of learnable biases and then an activation function, which makes the network
nonlinear.
Output Layer: The output of the last hidden layer is then fed into a logistic function such as sigmoid or
softmax, which converts the score for each class into that class's probability.
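As a concrete sketch of these layers, the toy example below (using NumPy; the layer sizes, random seed, and values are illustrative choices, not taken from the text) pushes one input through a hidden layer and a softmax output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: one sample with 4 features (think of them as 4 pixel values)
x = rng.standard_normal((1, 4))

# Hidden layer with 8 neurons: learnable weights and biases
W_hidden = rng.standard_normal((4, 8))
b_hidden = np.zeros(8)

# Matrix multiplication, plus bias, then a nonlinear activation (ReLU)
hidden = np.maximum(0.0, x @ W_hidden + b_hidden)

# Output layer with 3 classes; softmax turns the scores into probabilities
W_out = rng.standard_normal((8, 3))
b_out = np.zeros(3)
scores = hidden @ W_out + b_out
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

print(probs)  # three non-negative probabilities that sum to 1
```

In a real network the weights and biases would be learned during training rather than drawn at random.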
Passing the data through the model and computing the output of each layer as described above is called feed-
forward. We then calculate the error using an error function; common error functions include cross-entropy and
squared error. The error function measures how well the network is performing. After that, we propagate the
error backward through the model by calculating derivatives. This step is called backpropagation, and it is
what allows the loss to be minimized.
CNNs were first developed and deployed around the 1980s. At the time, a CNN could only detect
handwritten digits, so it was used mainly to read zip codes, PIN codes, and the like. The
most common requirement of any AI model is a massive amount of training data, and this was one of
the biggest problems CNNs faced at the time; as a result, they saw use mainly in the postal
industry. Yann LeCun was the first to introduce convolutional neural networks.
Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural network
architecture that is designed to process data with a grid-like topology. This makes them particularly well-
suited for dealing with spatial and temporal data, like images and videos that maintain a high degree of
correlation between adjacent elements.
CNNs are similar to other neural networks, but they have an added layer of complexity because
they use a series of convolutional layers. Convolutional layers perform a mathematical operation called
convolution, a sort of specialized matrix multiplication, on the input data. The convolution operation helps
to preserve the spatial relationship between pixels by learning image features over small squares of input
data. The picture below represents a typical CNN architecture.
Convolutional layers
Convolutional layers operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each filter is
designed to detect a specific feature or pattern, such as edges, corners, or more complex shapes in the case
of deeper layers. As these filters move across the image, they generate a map that signifies the areas where
those features were found. The output of the convolutional layer is a feature map, which is a
representation of the input image with the filters applied. Convolutional layers can be stacked to create
more complex models, which can learn more intricate features from images. Simply put,
convolutional layers are responsible for extracting features from the input images. These features might
include edges, corners, textures, or more complex patterns.
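To illustrate a single filter sliding over an image, the sketch below (using PyTorch) applies a hand-written vertical-edge kernel to a toy 5×5 image. The kernel values are a stand-in for weights that a real CNN would learn during training:

```python
import torch
import torch.nn.functional as F

# A toy 1-channel 5x5 "image": a bright vertical stripe on a dark background
img = torch.zeros(1, 1, 5, 5)
img[0, 0, :, 2] = 1.0

# One hand-crafted 3x3 filter that responds to vertical edges.
# In a real CNN these weights are learned, not written by hand.
kernel = torch.tensor([[[[-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0],
                         [-1.0, 0.0, 1.0]]]])

# Sliding the filter across the image yields a feature map
feature_map = F.conv2d(img, kernel, padding=1)

print(feature_map[0, 0])  # strong responses flank the stripe's edges
```

The resulting feature map is large and positive just left of the stripe and negative just right of it, which is exactly the "map of where this feature was found" that the text describes.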
Pooling layers
Pooling layers follow the convolutional layers and are used to reduce the spatial dimension of the input,
making it easier to process and requiring less memory. In the context of images, “spatial dimensions” refer
to the width and height of the image. An image is made up of pixels, and you can think of it like a grid,
with rows and columns of tiny squares (pixels). By reducing the spatial dimensions, pooling layers help
reduce the number of parameters, or weights, in the network. This helps combat overfitting and
speeds up training. Max pooling also reduces computational complexity, owing to the
reduced size of the feature map, and makes the model invariant to small translations. Without max
pooling, the network would not gain the ability to recognize features irrespective of small shifts.
This would make the model less robust to variations in object positioning within the image,
possibly affecting accuracy.
There are two main types of pooling: max pooling and average pooling. Max pooling takes the maximum
value within each pooling window of the feature map. For example, if the pooling window size is 2×2, it
picks the pixel with the highest value in that 2×2 region, effectively capturing the most prominent
feature within the window. Average pooling calculates the average of all values within the
pooling window, providing a smoother, averaged feature representation.
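The two pooling types can be compared directly on a small feature map. This sketch uses PyTorch's functional pooling operations, and the input values are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# A 4x4 feature map (with batch and channel dimensions of 1)
fmap = torch.tensor([[[[1.0, 3.0, 2.0, 0.0],
                       [5.0, 6.0, 1.0, 2.0],
                       [0.0, 1.0, 9.0, 4.0],
                       [2.0, 3.0, 1.0, 8.0]]]])

# 2x2 max pooling: keep only the largest value in each 2x2 window
pooled_max = F.max_pool2d(fmap, kernel_size=2)
# 2x2 average pooling: take the mean of each 2x2 window
pooled_avg = F.avg_pool2d(fmap, kernel_size=2)

print(pooled_max[0, 0])  # [[6., 2.], [3., 9.]]
print(pooled_avg[0, 0])  # [[3.75, 1.25], [1.50, 5.50]]
```

Both outputs are 2×2, halving each spatial dimension: max pooling kept the most prominent value per window, while average pooling smoothed each window into its mean.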
Stacking a convolution layer followed by a max-pooling layer, and then repeating similar sets, creates a
hierarchy of features. The first layer detects simple patterns, and subsequent layers build on those to detect
more complex patterns.
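A minimal sketch of such a hierarchy, assuming PyTorch and illustrative sizes (a 28×28 grayscale input and 10 output classes, reminiscent of digit classification), might look like:

```python
import torch
import torch.nn as nn

# Repeated conv -> ReLU -> max-pool blocks build the feature hierarchy,
# then a linear layer classifies. All sizes here are illustrative.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # early layer: simple patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # deeper layer: builds on earlier features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # scores for 10 classes
)

x = torch.randn(1, 1, 28, 28)  # one grayscale 28x28 image
logits = model(x)
print(logits.shape)  # torch.Size([1, 10])
```

Each conv/pool pair halves the spatial dimensions while increasing the number of channels, so later layers see a coarser grid of richer features.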
CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to
identify objects in an image or to classify an image as a cat or a dog. CNNs can also be used for more
complex tasks, such as generating descriptions of an image or identifying the points of interest in an image.
Beyond image data, CNNs can also handle time-series data, such as audio data or even text data, although
other types of networks like Recurrent Neural Networks (RNNs) or transformers are often preferred for
these scenarios. CNNs are a powerful tool for deep learning, and they have been used to achieve state-of-
the-art results in many different applications.
Fig. 3 Functions of CNN Layers