Object Detection with Deep
Learning Models
Principles and Applications
Edited by
S. Poonkuntran
Rajesh Kumar Dhanraj
Balamurugan Balusamy
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 2023 selection and editorial matter, [S Poonkuntran, Rajesh Kumar Dhanraj, Balamurugan Balusamy];
individual chapters, the contributors
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has not
been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works
that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003206736
Typeset in Palatino
by SPi Technologies India Pvt Ltd (Straive)
Contents
Editors
List of Contributors
3. Real-Time Tracing and Alerting System for Vehicles and Children to Ensure Safety and Security, Using LabVIEW
R. Deepalakshmi and R. Vijayalakshmi
Index
Editors
List of Contributors
R. Deepalakshmi
Velammal College of Engineering and Technology, Viraganoor
Tamil Nadu, India
V. Palanisamy
Alagappa University
Tamil Nadu, India
A.S. Renugadevi
Kongu Engineering College, Tamil Nadu, India
CONTENTS
1.1 Introduction to Deep Learning
1.1.1 Deep Learning
1.1.2 Machine Learning and Deep Learning
1.1.3 Types of Networks in Deep Learning
1.1.3.1 Connection Type of Networks
1.1.3.2 Topology-based Neural Networks
1.1.3.3 Learning Methods
1.2 Convolutional Neural Networks
1.2.1 Description of Five Layers of General CNN Architecture
1.2.1.1 Input Layer
1.2.1.2 Convolutional Layer
1.2.1.3 Pooling Layer
1.2.1.4 Fully Connected Layers
1.2.1.5 Output Layer
1.2.2 Types of Architecture in CNN
1.2.2.1 LeNet-5
1.2.2.2 AlexNet
1.2.2.3 ZFNet
1.2.2.4 GoogLeNet/Inception
1.2.2.5 VGGNet
1.2.2.6 ResNet
1.2.3 Applications of Deep Learning
1.3 Image Classification, Object Detection and Face Recognition
1.3.1 Dataset Creation
1.3.2 Data Preprocessing
1.3.3 Image Classification
1.3.4 Object Detection
1.3.5 Face Recognition
References
DOI: 10.1201/9781003206736-1
FIGURE 1.1
Emergence of deep learning.
Deep Learning and Computer Vision 3
FIGURE 1.2
Illustration of an artificial neuron.
In an artificial neural network, the neuron plays a major role. The structure of an artificial
neuron consists of inputs from x0 through xn and weights w1 through wn. Each input
value is multiplied by its weight and passed to the summation function. The summed value
is then passed to the activation function, which generates the output y. The structure of the
neuron is given in Figure 1.2.
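The computation performed by a single neuron can be sketched in a few lines of Python. This is only an illustration: the sigmoid activation and the example input and weight values are assumptions, since the text does not fix a particular activation function.

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by an activation function."""
    # Summation function: multiply each input by its weight and add the products.
    total = sum(x * w for x, w in zip(inputs, weights))
    # Activation function (sigmoid is an assumed choice): maps the sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical inputs and weights, chosen only for illustration.
y = neuron_output(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1])
print(round(y, 4))
```

Any other activation function could be substituted in the last line of `neuron_output` without changing the summation step.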
TABLE 1.1
Differences between Machine Learning and Deep Learning
Machine Learning | Deep Learning
Small amount of data is needed to provide accuracy. | Large amount of data is needed for training.
It requires low system specifications. | It requires high system specifications.
The given problem is divided into multiple tasks, and each task is solved independently. Finally, the results are combined. | The given problem is solved fully as a node-to-node problem.
The time needed for training the model is low. But for testing the data with the model, the time required is high. | The time needed for training the model is high. Here, less time is needed to test the data with the model.
Types of connection
Static feedforward networks
Dynamic feedback networks
Topology of networks
Single-layer neural networks
Multilayer neural networks
Recurrent neural networks
Learning methods
Supervised Learning
Unsupervised Learning
Reinforcement Learning
FIGURE 1.3
Feedforward networks.
Applications:
• Classification
• Speech recognition
• Face recognition
• Computer vision
FIGURE 1.4
Feedback neural networks.
• Word Processing
• Speech recognition
• Tagging an image
• Process of detecting sentiments
• Translation
FIGURE 1.5
Single-layer neural network.
FIGURE 1.6
Multilayer neural network.
networks. The network adjusts itself based on the difference between the outputs it
predicts and the outputs expected for the training inputs. The activation functions used are
nonlinear, and their outputs are then sent to the softmax function. Figure 1.6 depicts the
multilayer neural network.
• Machine translation
• Recognition of speech
• Classification of complex images
• Word Processing
• Speech recognition
• Tagging an image
• Process of detecting sentiments
• Translation
FIGURE 1.7
Recurrent neural network.
1.1.3.3 Learning Methods
1.1.3.3.1 Supervised Learning
The most common form of deep learning is supervised learning. A set of images or other
data is taken as the training set and given as input to the network in order to train it.
Every input has a corresponding labeled output, so that the network can process the input
and learn to reach the desired output. As an example, suppose images are to be classified
into X different classes. This requires a training set of images and a validation set of
images. The training set can be written as {(r1,s1), (r2,s2), …, (rx,sx)}, where the
input is ri and the output is si [1]. The network is then trained by minimizing a cost
function that measures how far the predicted output is from the correct output for each
input. Once trained, the model is given images and predicts their outputs. Figure 1.8
shows the method of supervised learning.
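The idea of training on labeled pairs (ri, si) by minimizing a cost function can be sketched as follows. This is a deliberately tiny illustration, not the chapter's method: the one-weight linear model, the mean squared error cost, the learning rate and the toy training pairs are all assumptions.

```python
# Toy training set of (input r_i, labeled output s_i) pairs; values are made up.
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def cost(w):
    """Mean squared error between predictions w * r and the labels s."""
    return sum((w * r - s) ** 2 for r, s in training_set) / len(training_set)

w = 0.0                 # single model weight, initialised arbitrarily
learning_rate = 0.01    # assumed step size
for _ in range(500):
    # Gradient of the cost with respect to w.
    grad = sum(2 * (w * r - s) * r for r, s in training_set) / len(training_set)
    w -= learning_rate * grad   # move against the gradient to reduce the cost

prediction = w * 5.0    # the trained model predicts the output for a new input
print(round(w, 3), round(prediction, 3))
```

A deep network follows the same loop, with many weights updated at once instead of one.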
1.1.3.3.2 Unsupervised Learning
Unlike supervised learning, in unsupervised learning the training data or image set is not
labeled with classes. Instead, the network model finds the common characteristics among
the data or images and groups the data based on what the model has learned. The method
of unsupervised learning is illustrated in Figure 1.9.
FIGURE 1.8
Supervised learning.
FIGURE 1.9
Unsupervised learning.
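Grouping unlabeled data by common characteristics can be illustrated with a small clustering sketch. The text does not name an algorithm, so k-means with two clusters is used here as an assumed stand-in, and the one-dimensional data values are made up.

```python
def kmeans_1d(points, iterations=10):
    """Group unlabeled 1-D points into two clusters with k-means."""
    centers = [min(points), max(points)]          # simple deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the nearest center.
        labels = [0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]             # no labels are provided
labels, centers = kmeans_1d(data)
print(labels, [round(c, 2) for c in centers])
```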
1.1.3.3.3 Reinforcement Learning
In reinforcement learning, there is no training dataset; the agent takes a suitable decision
on its own with the help of its experience. A good decision earns a reward in a given
situation. Reinforcement learning is used by many kinds of machines and software, and the
goal in every case is to find the best path or behavior. Reinforcement learning differs from
supervised and unsupervised learning in that those two types of learning rely on a training
dataset (together with the correct solutions, in the supervised case), whereas no training
data are available in reinforcement learning. So the reinforcement agent has to decide
what to do to perform the allocated work [2]. The diagram in Figure 1.10 gives the idea of
reinforcement learning.
FIGURE 1.10
Reinforcement learning.
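Learning from experience and reward, without a training dataset, can be sketched with tabular Q-learning. The text does not name an algorithm, so Q-learning is an assumed choice here, and the five-state corridor environment, reward, learning rate and discount factor are all hypothetical.

```python
import random

N_STATES, GOAL = 5, 4      # corridor of states 0..4; reaching state 4 ends an episode
ACTIONS = (-1, +1)         # action index 0 = step left, index 1 = step right

def step(state, action_index):
    """Move along the corridor; the reward is 1 only when the goal is reached."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]     # one value per (state, action)
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.randrange(2)             # gather experience by exploring
            nxt, reward = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best next value.
            q[state][action] += alpha * (reward + gamma * max(q[nxt]) - q[state][action])
            state = nxt
    return q

q_table = train()
policy = ["right" if q[1] > q[0] else "left" for q in q_table[:GOAL]]
print(policy)
```

The agent is never told the correct action; the preference for moving toward the goal emerges only from the rewards it experiences.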
FIGURE 1.11
Convolutional neural networks.
the convolutional neural networks. Visual images are analyzed by the convolutional
neural network. Because of their shared-weight architecture, in which the same filters scan
across the input, these networks are also called shift invariant or space invariant artificial
neural networks (SIANN). The responses the filters produce have translation invariance
characteristics and are called feature maps. A convolutional neural network consists of one
or more convolutional layers along with pooling layers and one or more fully connected
layers. The architecture of the convolutional neural network is shown in Figure 1.11.
CNN is a specific version of the neural network designed to operate with one-dimensional,
two-dimensional, and three-dimensional data and images [6].
1.2.1.1 Input Layer
The input layer receives the input to the whole CNN. Images are represented in the neural
network as matrices of pixel values.
1.2.1.2 Convolutional Layer
Convolutional neural networks are named after the convolutional layers in the network.
The convolution operation is performed in the convolutional layer.
In the convolutional neural network, the convolution operation multiplies the input by a
set of weights, as in traditional neural networks. When the input is two-dimensional, a
two-dimensional array of weights, called a kernel or filter, is multiplied with the
two-dimensional input [6].
When the kernel is smaller than the input data, it is applied to a kernel-sized patch of the
input as a dot product: the patch and the kernel are multiplied element by element, and the
products are added to give a single value. Because the result is a single value, the
operation is also called a scalar product.
The filter should be smaller than the original input; only then can the same filter be
multiplied repeatedly with the input array at multiple points in the input. The filter is
applied to each filter-sized patch of the input, moving from left to right and from top to
bottom.
FIGURE 1.12
Extraction of feature map.
The repeated application of the filter to small patches of the image is a very useful
technique for identifying a particular feature in the input images. If the same filter is
applied in the same manner across the entire image, then the feature can be identified
wherever it occurs in the image. This concept is called translation invariance.
A single value is obtained as a result of multiplying the filter with one small patch of the
input. If the filter is applied all over the array of inputs, a two-dimensional array of
values is obtained. These values represent a filtered version of the input. The output
obtained by multiplying the filter with the input array is known as a feature map. Each
value in the feature map is then passed through the nonlinearity function ReLU. The
feature map extraction is shown in Figure 1.12.
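The extraction of a feature map can be sketched with NumPy as below. The 4*4 image and 2*2 filter values are made up for illustration, and the sketch assumes a stride of 1 with no padding.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide the kernel over the image and take one dot product per patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):              # top to bottom
        for j in range(out_w):          # left to right
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise multiply, then add
    return out

def relu(x):
    """ReLU nonlinearity: negative values become zero."""
    return np.maximum(x, 0)

image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)

fmap = relu(feature_map(image, kernel))
print(fmap)
```

A 4*4 input and a 2*2 filter give a 3*3 feature map, because the filter fits at three positions along each dimension.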
Technically, the operation described here is a cross-correlation. A true convolution flips
(rotates) the kernel before applying it to the input. In deep learning, however, the
cross-correlation is conventionally called the convolution operation.
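The difference between the two operations can be checked numerically. In this sketch, which uses made-up values, the convolution is obtained by flipping the kernel along both axes and then cross-correlating; applied to an image containing a single 1, the true convolution reproduces the kernel, while the cross-correlation reproduces its flipped copy.

```python
import numpy as np

def cross_correlate(image, kernel):
    """What deep learning libraries usually call 'convolution' (no kernel flip)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve(image, kernel):
    """Mathematical convolution: flip the kernel on both axes, then cross-correlate."""
    return cross_correlate(image, kernel[::-1, ::-1])

impulse = np.zeros((3, 3))
impulse[1, 1] = 1.0                       # image with a single non-zero pixel
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

corr = cross_correlate(impulse, kernel)
conv = convolve(impulse, kernel)
print(corr)
print(conv)
```

Because the filter weights are learned during training, the network works equally well with either operation, which is why the flip is usually omitted.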
1.2.1.3 Pooling Layer
In convolutional neural networks, a pooling layer is added after the convolutional layer.
The output of the convolutional layer (i.e., the feature maps) is first passed to the ReLU
function, which applies a nonlinearity to it. The ReLU function therefore sits between the
convolutional layer and the pooling layer [6, 7].
A pooling layer may be repeated after each convolutional layer in the neural network;
whether to use one is decided based on the application. The pooling layer is applied to
each feature map of the convolutional layer separately, so it produces the same number of
pooled feature maps.
The pooling layer performs a pooling operation, which is applied to the feature maps much
as a filter is. Just as the filter is normally smaller than the input it is applied to in
order to create feature maps, the pooling window is small compared to the feature maps.
Typically, the pooling window is 2*2 pixels and is applied with a stride of 2 pixels.
With this setting, the pooling layer reduces the size of each feature map by a factor of 2
in each dimension, so the number of pixels is reduced to 1/4 of the original. For instance,
if a feature map has 36 pixels (6*6 matrix), the pooled feature map has 9 pixels (3*3
matrix).
The pooling operation can be performed in two ways: average pooling and maximum
pooling.
Average pooling:
The average value of each patch of the feature map is calculated [6]. The average pooling
function is shown in Figure 1.13.
FIGURE 1.13
Average pooling function.
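Both pooling variants can be sketched with NumPy for the 2*2 window and stride of 2 described above. The 6*6 feature map values are made up, and the reshape approach assumes the feature map height and width are even.

```python
import numpy as np

def pool2x2(fmap, mode="average"):
    """2*2 pooling with a stride of 2 (height and width must be even)."""
    h, w = fmap.shape
    # Split the map into non-overlapping 2*2 blocks: shape (h/2, 2, w/2, 2).
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    if mode == "average":
        return blocks.mean(axis=(1, 3))   # average value of each patch
    return blocks.max(axis=(1, 3))        # maximum value of each patch

fmap = np.arange(36, dtype=float).reshape(6, 6)   # 6*6 feature map, 36 pixels
avg_pooled = pool2x2(fmap, "average")
max_pooled = pool2x2(fmap, "max")
print(avg_pooled.shape, avg_pooled[0, 0], max_pooled[0, 0])
```

The 36 input pixels become 9 output pixels in both cases; only the value kept for each patch differs.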
FIGURE 1.14
Max pooling function.
FIGURE 1.15
Flattening.
The flattened vector shown in Figure 1.15 is given as input to the fully connected
layer. The result of the fully connected layer is sent to the final layer, which uses the
softmax activation function to classify the results into various classes.
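The path from pooled feature maps to class probabilities can be sketched as follows. The number of feature maps, the four classes and the random weights are hypothetical; in a trained network the weights would be learned rather than drawn at random.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = rng.random((2, 3, 3))       # two pooled 3*3 feature maps (made-up values)

flat = pooled.reshape(-1)            # flattening: 2 * 3 * 3 = 18 values in one vector

n_classes = 4                        # hypothetical number of classes
W = rng.normal(size=(n_classes, flat.size)) * 0.1   # fully connected weights (untrained)
b = np.zeros(n_classes)

scores = W @ flat + b                # fully connected layer
probs = softmax(scores)              # final layer: one probability per class
predicted_class = int(np.argmax(probs))
print(flat.shape, probs.round(3), predicted_class)
```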
1.2.1.5 Output Layer
The output layer then generates the output, and error checking is also performed. As a
result, the loss function is computed and the gradient of the error is calculated.
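As an illustration of computing a loss and its gradient at the output layer, the sketch below assumes a softmax output with a cross-entropy loss, a common pairing that the text does not itself specify. The scores and the correct class are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw output-layer scores (made-up values)
target = 0                           # index of the correct class (assumed label)

probs = softmax(scores)
loss = -np.log(probs[target])        # cross-entropy loss for the correct class

# For softmax with cross-entropy, the gradient of the loss with respect to
# the scores is (predicted probabilities - one-hot target).
grad = probs.copy()
grad[target] -= 1.0
print(round(float(loss), 4), grad.round(4))
```

The gradient is negative for the correct class and positive for the others, so a gradient descent step raises the score of the correct class and lowers the rest.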
FIGURE 1.16
LeNet-5.
1.2.2.2 AlexNet
The AlexNet architecture was designed by Alex Krizhevsky et al. in 2012. AlexNet has a
similar architecture to LeNet, but the depth of the network is increased. The AlexNet
architecture consists of eight layers. Of these, five are convolutional layers combined
with max pooling layers, and the remaining three are fully connected layers. ReLU
activation functions are added in each layer except the output layer. Overfitting in the
network is reduced by adding dropout layers to the network [10]. The AlexNet architecture
diagram is shown in Figure 1.17.
1.2.2.3 ZFNet
ZFNet was designed in 2013 in order to optimize the performance of AlexNet. The depth of
the networks can be increased by adding the extra filters in the same structure as AlexNet.
Instead of increasing the filter size, the number of filters or kernels is increased to optimize
the performance [10]. The architecture diagram is given in Figure 1.18.
1.2.2.4 GoogLeNet/Inception
The architecture of GoogLeNet differs from the other architectures in that it uses
1*1 convolutions and global average pooling to create deeper networks. The 1*1
convolutions decrease the number of parameters used in the convolution, which allows the
depth of the network to be increased. The accuracy of the classification is increased by
means of the global average pooling. A fully connected layer with the ReLU activation
function is used, and a dropout layer is used for regularization [10]. The softmax
classifier is used for the classification of images or data. Figure 1.19 shows the block
diagram of GoogLeNet.
FIGURE 1.17
AlexNet.
FIGURE 1.18
ZFNet.
FIGURE 1.19
GoogLeNet.
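The parameter saving from 1*1 convolutions can be checked with simple arithmetic. The channel counts below are hypothetical and chosen only to show the effect; biases are ignored.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a convolutional layer (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size

# Hypothetical layer: 256 input channels, 64 output channels, 5*5 filters.
direct = conv_params(256, 64, 5)

# Same input and output sizes, but a 1*1 convolution first reduces
# the 256 channels to 32 before the 5*5 convolution is applied.
bottleneck = conv_params(256, 32, 1) + conv_params(32, 64, 5)

print(direct, bottleneck)   # 409600 59392
```

With these assumed sizes the 1*1 reduction cuts the weight count to roughly one seventh, which is what leaves room for additional layers.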
1.2.2.5 VGGNet
VGGNet was designed by Simonyan and Zisserman in 2014. The widely used VGG-16 variant
has a total of 16 weight layers: 13 convolutional layers and three fully connected layers.
The number of filters is increased as in AlexNet, and small 3*3 filters are stacked to
increase the depth of the network. The three fully connected layers are added at the end,
after the pooling layers [10, 12]. The VGGNet architecture is depicted in Figure 1.20.
1.2.2.6 ResNet
ResNet was designed by Kaiming He et al. in 2015. It was introduced to mitigate the
vanishing gradient problem. ResNet uses the skip connection technique: the input to a
block bypasses a few layers and is added directly to the output of those layers, so the
signal and its gradient have a short path through the network [10]. The architecture of
ResNet is shown in Figure 1.21.
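A skip connection can be sketched as a block whose output is the sum of its input and the result of a few layers. This is a simplified illustration using small dense layers with made-up sizes rather than the convolutional layers of the actual ResNet.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """Two layers whose result is added back to the block's input."""
    out = relu(w1 @ x)       # first layer with nonlinearity
    out = w2 @ out           # second layer
    return relu(out + x)     # skip connection: the input bypasses both layers

x = np.array([1.0, -2.0, 3.0])

# With all-zero weights the two layers contribute nothing,
# yet the input still reaches the output through the skip connection.
zero = np.zeros((3, 3))
y = residual_block(x, zero, zero)
print(y)
```

Because the block can fall back to passing its input through unchanged, adding more blocks does not have to make the network harder to train.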
FIGURE 1.20
VGGnet.
FIGURE 1.21
ResNet.
1. Automatic text generation – The model learns from existing text and generates
new text. It learns how to punctuate, spell and frame new sentences, and
sometimes it also captures the style.
2. Healthcare – Various diseases can be diagnosed and treated earlier.
3. Automatic machine translation – Text in one language is converted into another
language automatically. The text may be words or sentences.
4. Image recognition – Objects and people are recognized and identified with the
help of deep learning.
5. Predicting earthquakes – Deep learning models are trained to predict earthquakes
earlier.
6. Industrial applications – object detection and localization, sorting, robotics,
quality control and inspection, packaging.
7. Retail applications – analytics, warehouse management, theft prevention,
intelligent barcode scanners, monitoring and distribution control.
8. Entertainment/gaming – gesture recognition, user identification, emotional
feedback, experience monitoring, advanced analytics.
9. Smart homes – vacuum cleaners, automatic lawn mowers, intrusion and hazard
detection, smart lights, ovens, refrigerators.
10. Agriculture – weed control, fruit harvesting, autonomous tractors and combines.
11. Smart cities and infrastructure – parking, traffic monitoring, security monitoring,
road inspection.
12. Food industry – sorting, quality control.
• Dataset creation
• Preprocessing
• Image classification
• Object detection
• Face recognition
1.3.1 Dataset Creation
A dataset is a collection of data and its related values. A dataset has two parameters:
time and subject. Dataset creation is a challenging task in deep learning. Data collection
is a static process: the data are collected over a period of time, then labeled and used
to train the model, from which the results are obtained. There are different types of
datasets such as text data, image data, signal data, sound data, physical data,
anomaly data, biological data, multivariate data, question-answering data and other data
repositories.
The performance of deep learning is improved by improving the data. That is, adding more
data to train the model helps it classify the data.
Data acquisition is the process by which datasets are found for training the models. The
two methods of data acquisition are:
1. Data generation
2. Data augmentation
• Scikit-Image
• OpenCV
• Python Image Library (Pillow/PIL)
• Scipy
• SimpleITK
• Matplotlib
• Numpy
• Mahotas
1.3.2 Data Preprocessing
The preprocessing of the dataset involves both text dataset and image dataset
preprocessing. Text dataset preprocessing consists of steps such as:
• Removal of punctuation
• Lower casing
• Spelling correction
• Removal of frequent words
• Chat words conversion
• Removal of URLs
• Lemmatization
• Removal of rare words
• Stemming
• Removal of emoticons
• Conversion of emoji to words
• Removal of stopwords
• Removal of emoji
• Conversion of emoticons to words
• Removal of HTML tags
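A few of the text preprocessing steps listed above can be sketched with the Python standard library. The stopword list is a tiny made-up sample and the regular expressions are simple approximations; steps such as stemming, lemmatization and spelling correction need additional tooling and are not shown.

```python
import re
import string

STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}   # tiny illustrative list

def preprocess(text):
    """Apply a handful of the listed cleaning steps and return the tokens."""
    text = text.lower()                                    # lower casing
    text = re.sub(r"<[^>]+>", " ", text)                   # removal of HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # removal of URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # removal of punctuation
    return [t for t in text.split() if t not in STOPWORDS]            # removal of stopwords

tokens = preprocess("<p>The model is GREAT, see https://example.com/demo!</p>")
print(tokens)   # ['model', 'great', 'see']
```

The HTML tags are removed before the URLs so that a closing tag directly after a link is not swallowed by the URL pattern.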
The preprocessing of image datasets consists of image resizing, noise removal,
segmentation, and edge smoothing.
Image resizing changes the size of the image. Unwanted noise can be removed from
the images by using noise removal techniques. A particular part of an image can be
isolated using segmentation. The edges in the images can also be smoothed using edge
smoothing techniques.
1.3.3 Image Classification
The features extracted from the images, which capture patterns in the dataset, are used
for image classification [13]. If an ordinary artificial neural network is used for image
classification, the classification process is computationally very costly [14], so a CNN
is used for the classification instead. There are different types of classification
problems, such as single-label classification and multilabel classification in supervised
learning, unsupervised classification, video classification, and 3D classification.
Step 1: Choose a specific dataset, either one that is already available or one that you
create yourself.
Step 2: Import the libraries needed for the classification.
Step 3: Prepare the training dataset by assigning the path and creating the categories.
Also resize the images.
Step 4: Create the training data and shuffle the dataset. Assign the features and the
labels for every image.
Step 5: Normalize the X values and convert the labels into categorical data. Split the X
values and Y values for use in the CNN.
Step 6: Define the model, compile it and train the CNN model.
Step 7: Find the accuracy of the model in classifying the objects.
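The data preparation in Steps 4 and 5 can be sketched with NumPy alone. The random 8*8 arrays stand in for resized images, and the two categories and the 80/20 split are assumptions; defining and training the CNN (Step 6) depends on the chosen deep learning library and is not shown.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in data: ten 8*8 grayscale "images" with pixel values 0..255
# and two categories (labels 0 and 1). Real data would be loaded from disk.
X = rng.integers(0, 256, size=(10, 8, 8)).astype("float32")
y = np.array([0, 1] * 5)

# Step 4: shuffle features and labels together so pairs stay aligned.
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Step 5: normalize pixel values to [0, 1] and convert labels to categorical (one-hot).
X = X / 255.0
Y = np.eye(2, dtype="float32")[y]

# Split into training and test portions (80/20 is an assumed ratio).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
print(X_train.shape, Y_train.shape, X_test.shape)
```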
Binary classification
Multiclass classification
1.3.4 Object Detection
Object detection is also referred to as object recognition. It combines two
functionalities: drawing a bounding box around every object that needs to be identified in
the images, and assigning a label to each identified object [13]. Image classification is
a straightforward technique, whereas object detection also involves the localization of
the objects.
For addressing object localization, region-based convolutional neural networks are
used. R-CNNs are designed specifically for recognizing objects.
The YOLO model (you only look once) is also designed specifically for detecting objects
in images, with speed and real-time usage in mind. The difference between the
three tasks can be explained as follows:
Image classification: The type or class of the object in an image is identified [15].
Object localization: The objects present in an image are located, and a bounding box
indicates the exact location of each.
Object detection: The objects present in an image are located with bounding boxes that
indicate their exact locations, and the class label of each object is also given as
output [16, 17, 18].
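The difference between the three tasks shows up in the shape of their outputs, which can be sketched with plain Python data structures. The class names, pixel coordinates and the (x_min, y_min, x_max, y_max) box convention are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # class assigned to the object
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels, an assumed convention

# Image classification: a single class for the whole image.
classification_output = "kangaroo"

# Object localization: bounding boxes only, without class labels.
localization_output = [(34, 50, 210, 300), (220, 10, 400, 310)]

# Object detection: every bounding box comes with a class label.
detection_output = [
    Detection("kangaroo", (34, 50, 210, 300)),
    Detection("tree", (220, 10, 400, 310)),
]

labels = [d.label for d in detection_output]
print(labels)
```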
The steps carried out in the object detection process are as follows:
• Each object in a street scene should be identified by a bounding box, and the
object should also be labeled.
• Each object in an indoor photograph should be identified by a bounding box, and
the object should also be labeled.
• Each object in a landscape should be identified by a bounding box, and the object
should also be labeled.
• Object detection models can be used for locating and detecting the kangaroos in
photographs [19, 20].
1.3.5 Face Recognition
Face recognition is the computer vision task in which human faces are identified in
photographs. Humans recognize faces easily, but it remains a challenging, nontrivial
problem for computers to solve [21].
In face detection, the faces of different humans in the photograph should be located. The
coordinates of the faces in the images should be represented by using the bounding box.
The dynamic nature of the human face should be considered irrespective of the angle or
orientation. Also, other parameters such as hair color, clothing, light levels, accessories, age
and makeup should be considered.
There are two methods used for the recognition of faces. They are:
• Methods based on features – Detecting the faces with the help of handcrafted
filters
• Methods based on images – Extracting the faces using the holistic learning from
the entire image
The three models frequently used for face recognition are the multi-task cascaded
convolutional neural network (MTCNN), the VGGFace2 model, and the FaceNet model.
The MTCNN model is the most used model for detecting faces with expressions. It was
developed in 2016. As the name implies, three neural networks are connected in a
cascade, which helps detect faces and facial landmarks in the images.
Face identification and verification can be performed by using the VGGFace2 model.
VGG stands for Visual Geometry Group. Face embeddings can also be computed
using this model.
The FaceNet model is mainly used for feature extraction from the human face. It is also
used for face identification and verification purposes.
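Face verification with embeddings can be sketched as a distance comparison. The sketch assumes that a model has already converted each face into an embedding vector; the three-dimensional vectors and the threshold below are made up (real embeddings are much longer, and a suitable threshold depends on the model).

```python
import math

def same_person(embedding_a, embedding_b, threshold=1.0):
    """Verify two faces by comparing the distance between their embeddings."""
    # Embeddings of the same person are expected to lie close together.
    return math.dist(embedding_a, embedding_b) < threshold

# Hypothetical embeddings; a real model would produce these from face images.
anchor = [0.10, 0.80, 0.30]
same_face = [0.12, 0.79, 0.33]
other_face = [0.90, -0.40, 1.20]

print(same_person(anchor, same_face), same_person(anchor, other_face))
```

Identification works the same way, except that the new embedding is compared against a gallery of known faces and the closest match is chosen.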
References
1. https://towardsdatascience.com/derivative-of-the-sigmoid-function536880cf918e
2. https://www.medcalc.org/manual/tanh_function.php
3. Jie Wang and Zihao Li, “Research on Face Recognition Based on CNN,” IOP Conf. Series: Earth
and Environmental Science 170 (2018), 032110. DOI:10.1088/1755-1315/170/3/032110.
4. Keiron O’Shea and Ryan Nash, “An Introduction to Convolutional Neural Networks,”
arXiv:1511.08458v2 (2015).
5. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
6. Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios
Protopapadakis, “Deep Learning for Computer Vision: A Brief Review,” Recent Developments in
Deep Learning for Engineering Applications (2018). DOI:10.1155/2018/7068349.
7. https://learnopencv.com/image-classification-using-convolutional-neural-networks-in-keras/
8. https://www.tinymind.com/learn/terms/relu
22 Object Detection with Deep Learning Models
9. https://medium.com/geekculture/a-2021-guide-to-improving-cnns-network-architectures-
historical-network-architectures-d23f32afb1bd
10. A Ghosh, A Sufian, and F Sultana, “Fundamental Concepts of Convolutional Neural
Network,” Recent Trends and Advances in Artificial Intelligence and Internet of Things (2020).
DOI:10.1007/978-3-030-32644-9_36.
11. Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran
Al-Shamma, J. Santamaría, Mohammed A. Fadhel, Muthana Al-Amidie and Laith Farhan,
“Review of deep learning: concepts, CNN architectures, challenges, applications, future direc-
tions,” Journal of Big Data (2021), DOI:10.1186/s40537-021-00444-8.
12. https://towardsdatascience.com/step-by-step-vgg16-implementation-in-keras-for-beginners-
a833c686ae6c
13. Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco
Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, Clara I.
Sanchez, “A Survey on Deep Learning in Medical Image Analysis,” arXiv:1702.05747v2 (2017).
14. https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
15. https://www.analyticsvidhya.com/blog/2021/01/image-classification-using-convolutional-neural-networks-a-step-by-step-guide/
16. Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu,
“Object Detection with Deep Learning: A Review,” arXiv:1807.05511v2 (2019).
17. Zhixue Wang, Jianping Peng, Wenwei Song, Xiaorong Gao, Yu Zhang, Xiang Zhang, Longfei
Xiao, and Li Ma, “A Convolutional Neural Network-Based Classification and
Decision-Making Model for Visible Defect Identification of High-Speed Train Images,” Journal
of Sensors (2021), 5554920. DOI:10.1155/2021/5554920.
18. https://books.google.co.in/books?hl=en&lr=&id=10jpDwAAQBAJ&oi=fnd&pg=PP1&dq=deep+learning+and+computer+vision&ots=wHn2HtMBT2&sig=lNP7CXdDIy2Tk1BrcsTv6QJwXmM#v=onepage&q=deep%20learning%20and%20computer%20vision&f=false
19. Ajeet Ram Pathak, Manjusha Pandey, and Siddharth Rautaray, “Application of Deep Learning
for Object Detection,” Procedia Computer Science 132 (2018), 1706–1717.
DOI:10.1016/j.procs.2018.05.144.
20. https://www.upgrad.com/blog/ultimate-guide-to-object-detection-using-deep-learning/
21. KH Teoh, RC Ismail, SZM Naziri, R Hussin, MNM Isa and MSSM Basir, “Face Recognition and
Identification using Deep Learning Approach,” Journal of Physics: Conference Series 1755 (2021),
012006. DOI:10.1088/1742-6596/1755/1/012006.
2 Object Detection Frameworks and Services in Computer Vision
CONTENTS
2.1 Neural Networks (NNs) and Deep Neural Networks (DNNs)
2.1.1 Neural Networks
2.1.2 Single-Layer Perceptron (SLP)
2.1.3 Multilayer Perceptron (MLP)
2.2 Activation Functions
2.2.1 Identity Function
2.2.2 Sigmoid Function
2.2.3 Softmax Function
2.2.4 Tanh Function
2.2.5 ReLU (Rectified Linear Unit) Function
2.3 Loss Functions
2.4 Convolutional Neural Networks
2.4.1 CNN Architecture and its Components
2.5 Image Classification Using CNN
2.5.1 LeNet-5
2.5.2 AlexNet
2.5.3 VGGNet
2.5.4 Inception and GoogLeNet
2.5.4.1 Inception Module
2.5.5 ResNet
2.5.5.1 Residual Block
2.6 Transfer Learning
2.6.1 Need for Transfer Learning
2.6.2 Transfer Learning Approaches
2.6.2.1 Pre-trained Network as a Classifier
2.6.2.2 Pre-trained Network as a Feature Extractor
2.6.2.3 Fine Tuning
2.7 Object Detection
2.7.1 Object Localization
2.7.1.1 Sliding Window Detection
2.7.1.2 Bounding Box Prediction
2.7.2 Components of Object Detection Frameworks
2.8 Region-Based Convolutional Neural Networks (R-CNNs)
2.8.1 R-CNN
DOI: 10.1201/9781003206736-2
2.1.1 Neural Networks
The stackable computational graphs used in deep learning are called neural networks, a
term borrowed from neurobiology. Despite the name, they hardly resemble the workings of
the brain; they are a mathematical framework for learning representations from data. Any
graph can be broken into smaller pieces, which can in turn be broken down until they reach
their independent atomic components. In the case of neural networks, the smallest
independent unit is called a neuron [1,2].
$Wx + B = Z$ (2.1)
$f(Z) = Y$ (2.2)
A single neuron is a pair of mathematical equations, as shown in Figure 2.1. Equation (2.1)
is the basic linear equation, where x is the input and W and B are its coefficients. This
equation is responsible for learning the linear representations in the data. Learning only
linear relationships is not sufficient in most cases, as real-world data contains many
irregularities, noise, and nonlinearity. To learn nonlinear representations, equation (2.2)
is used in each neuron to wrap a function around the output of equation (2.1); these
functions are called activation functions and are discussed in more detail in Section 2.2.
FIGURE 2.1
An artificial neuron.
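The two equations of a neuron can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `neuron`, the input values, and the choice of `math.tanh` as the activation are assumptions made only for the example.

```python
import math

def neuron(x, w, b, activation):
    """One artificial neuron: Z = W.x + B (eq. 2.1), then Y = f(Z) (eq. 2.2)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part, eq. (2.1)
    return activation(z)                           # activation, eq. (2.2)

# Hypothetical inputs and coefficients, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.5

# Here Z = 0.5*1.0 - 0.25*2.0 + 0.5 = 0.5, so Y = tanh(0.5).
y = neuron(x, w, b, math.tanh)
print(y)
```

Passing the activation as an argument keeps the linear part and the nonlinear wrapper separate, mirroring the split between equations (2.1) and (2.2).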
$y = \sum_i x_i \cdot w_i + b$ (2.3)
$\hat{y} = \begin{cases} 1 & \text{if } y > 0 \\ 0 & \text{if } y \le 0 \end{cases}$ (2.4)
FIGURE 2.2
Single-layer perceptron (SLP).
FIGURE 2.3
Multilayer perceptron (MLP).
computation to the expected output [4]. Figure 2.3 shows a multilayer perceptron with two
hidden layers.
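A forward pass through a multilayer perceptron with two hidden layers can be sketched as follows. The layer sizes, weights, and the choice of ReLU for the hidden layers and tanh for the output are hypothetical values picked for illustration; they are not taken from Figure 2.3.

```python
import math

def relu(z):
    return max(0.0, z)

def dense(x, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """Pass the input through every layer in order."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Hypothetical network: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),       # hidden layer 1
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0], relu),       # hidden layer 2
    ([[1.0, -1.0]], [0.0], math.tanh),                   # output layer
]

out = mlp_forward([2.0, 1.0], layers)
print(out)
```

Each layer reuses the same neuron computation, so deepening the network only means appending another `(weights, biases, activation)` entry to the list.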
2.2 Activation Functions
As described for the neuron, activation functions are applied to the neurons in a layer
during prediction. They convert the linear output into a nonlinear form. An activation
function is applied after every perceptron, and it decides whether that neuron activates.
A function must satisfy several constraints to work as an activation function. Some of the
primary constraints are [5]:
Activation functions can be broadly divided into linear and nonlinear functions. Some of
the most popular activation functions used in deep learning are described below.
2.2.1 Identity Function
In the identity function, also known as the linear transfer function, the output is the same
as the input, as given by equation (2.5) (Figure 2.4).
$f(x) = x$ (2.5)
2.2.2 Sigmoid Function
The sigmoid, also known as the logistic activation function, is given by equation (2.6). It
is extremely popular in classification tasks because it smoothly squashes an unbounded
range of input values into an output between 0 and 1 (Figure 2.5).
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$ (2.6)
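A direct translation of the sigmoid formula overflows in floating point for large negative inputs, because the exponential of a large positive number cannot be represented. The sketch below is one common way around this, assuming plain Python floats; the branch on the sign of `z` is an implementation choice, not part of the definition.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # so the exponent passed to exp() is never large and positive.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z))
```

Both branches compute the same function; the output always stays between 0 and 1, and an input of 0 maps to 0.5.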
2.2.3 Softmax Function
The softmax function, equation (2.7), converts input values into probability values. It is
often used at the output layer of a classification model that must choose among more than
two classes.
FIGURE 2.4
Linear activation function.
FIGURE 2.5
Sigmoid activation function.
$\sigma(x_j) = \dfrac{e^{x_j}}{\sum_i e^{x_i}}$ (2.7)
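The softmax formula can be sketched as below. Subtracting the largest input before exponentiating is an added numerical-stability step, not part of equation (2.7); it leaves the result unchanged because the common factor cancels between numerator and denominator. The input scores are hypothetical.

```python
import math

def softmax(xs):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(xs)                               # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-class problem.
probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

The largest score receives the largest probability, and equal scores receive equal probabilities.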
2.2.4 Tanh Function
The tanh function (equation 2.8) is similar to the sigmoid, except that it squashes the
unbounded range of input values into the interval from −1 to 1, rather than from 0 to 1 (Figure 2.6).
$\tanh(x) = \dfrac{\sinh(x)}{\cosh(x)} = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (2.8)
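As a quick check of equation (2.8), the exponential form can be compared against the standard library's `math.tanh`. The helper name `tanh_from_exp` is an assumption for this sketch; in practice the library function is preferable because the naive form overflows for large inputs.

```python
import math

def tanh_from_exp(x):
    """tanh written out from its exponential definition, eq. (2.8)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # The two columns agree, and every value lies strictly between -1 and 1.
    print(x, tanh_from_exp(x), math.tanh(x))
```

Unlike the sigmoid, the output is centered on zero: negative inputs give negative outputs and an input of 0 gives 0.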
$z = \max(0, x)$ (2.9)
$\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$ (2.10)
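Equations (2.9) and (2.10) describe the same function in two ways. The sketch below writes out both and checks that they agree on a few sample inputs; the function names and the sample values are assumptions for the example.

```python
def relu_max(x):
    """ReLU in the max form of eq. (2.9)."""
    return max(0.0, x)

def relu_piecewise(x):
    """ReLU in the piecewise form of eq. (2.10)."""
    return x if x >= 0 else 0.0

samples = [-2.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu_max(x) for x in samples]
print(outputs)
```

Negative inputs are clipped to zero while positive inputs pass through unchanged, so the function is linear on each side of zero but nonlinear overall.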
FIGURE 2.6
Tanh activation function.
FIGURE 2.7
ReLU activation function.
2.3 Loss Functions
The loss function, also known as the error function, captures how well a model is
performing relative to how it should perform. It measures how far the predictions made by
the neural network deviate from the true values or classes. Training is therefore an
optimization problem: adjusting the parameters to minimize the loss yields a more accurate
model. Different loss functions give different results for the same prediction, and the
choice has significant consequences for the performance of the trained model. A full
explanation of the various loss functions is beyond the scope of this chapter; some
popular loss functions are explained below [6].
1. Mean absolute error (MAE): This is the average of the absolute differences between
the actual values and the values predicted by the model. Because the differences are
not squared, the error stays on the same scale as the values themselves.
2. Mean squared error (MSE): This is the average of the squared differences between the
target values and the predicted values. Squaring magnifies large errors, which makes
the model more sensitive to them.
3. Cross-entropy: Generally, this is used in classification problems, as it measures
the difference between two probability distributions. When a single training example
is classified with respect to all available classes, the class with the highest
predicted probability is taken as the prediction. Ideally, the model would assign a
probability of 100 percent to the correct class and 0 percent to the rest; training
drives the predicted scores toward this target.
Many other loss functions exist; which one to use depends on the type of problem and the
optimization approach.
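The three loss functions above can be sketched in plain Python as follows. The function names, the sample values, and the small `eps` used to avoid taking the logarithm of zero are assumptions for illustration; deep learning libraries provide their own, more general versions.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference; large errors are penalized more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross-entropy between a true and a predicted class distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, pred_dist))

# Hypothetical regression targets and predictions (errors of 1, 0 and 2).
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Hypothetical 3-class example: the true class is the second one.
ce = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
print(mae, mse, ce)
```

On the same predictions the MSE is larger than the MAE because the error of 2 is squared, which illustrates why different loss functions give different values for identical outputs. With a one-hot target, the cross-entropy reduces to the negative log of the probability assigned to the correct class.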
[Figure 2.8 diagram: Input (images/feature vector) → Pre-processing (transformation, standardization, more) → Feature Extraction → ML model]
FIGURE 2.8
Image classification using ML.
A CNN begins with a convolutional layer that learns basic features (lines, edges, etc.); the
next convolutional layer learns more complex features (circles, squares, and so on).
Further stacked convolutional layers, if any, learn even more complex features (such as
facial parts and complex contours) [6].
Figure 2.8 shows the steps of a classification model using machine learning techniques.
The image features must be manually extracted to be fed into a machine learning system
(e.g., SVM). The manual work of feature extraction and classification can be replaced by
MLP or CNN; see Figure 2.9.
FIGURE 2.9
General architecture of CNN.
A basic CNN architecture is built from the following series of layers:
• Convolutional layer (CONV): This layer acts as a feature detector window of fixed size
that slides over the image (pixel by pixel) with a fixed step, extracting significant
features for object identification in that image. In general, convolutional layers are
used for feature extraction and learning. While the process is intuitive and powerful,
repeatedly stacking convolutional layers increases network dimensionality and space-time
complexity. This is where pooling, or subsampling, comes to the rescue.
• Activation function (ACT): This converts the linear output into a nonlinear form. It is
applied after every perceptron and decides whether that neuron activates.
• Pooling layer (POOL): Pooling reduces the number of values passed to the next layer,
which reduces the network size. It downsamples its input using a summary statistic such
as the maximum or the average.
• Fully connected layer (FC): The FC layer is an ordinary dense layer, that is, a stack of
neurons. The 2D grid of features is flattened into a single 1D vector (a long tube) of
values. These layers learn and perform the classification task from the extracted
features.
• Batch normalization (BN): It is common practice to normalize the training data before
feeding it to the input layer; doing so benefits model training and results. The same
normalization can be applied to the inputs of each layer, or of a few selected layers, of
the neural network, which improves feature extraction and in turn increases training
speed and network flexibility. The process is called batch normalization, where batch
refers to the mini-batch of training examples over which the statistics of a specific
layer are computed [6].
• Dropout layer (DO): This is an additional layer used to avoid overfitting. Overfitting
occurs when the model fits the training data closely but fails to learn features that
generalize beyond it.
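The CONV, ACT, and POOL steps in the list above can be sketched on a tiny image with plain Python lists. The 5 × 5 image, the 2 × 2 edge-detecting kernel, and the function names are hypothetical; a real CNN learns its kernel values during training and uses an optimized library. As in most deep learning code, the kernel is applied without flipping (strictly a cross-correlation).

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride
            row.append(sum(image[r + a][c + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def relu_map(fmap):
    """Apply ReLU to every value of a feature map."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[r + a][c + b] for a in range(size) for b in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Hypothetical image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

features = relu_map(conv2d(image, kernel))  # 4 x 4 feature map
pooled = max_pool(features)                 # 2 x 2 after pooling
print(features)
print(pooled)
```

The feature map is nonzero only in the column where the edge sits, and pooling keeps that response while halving each spatial dimension, which is the size reduction the POOL bullet describes.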
2.5.1 LeNet-5
LeNet is a pioneering CNN proposed by Y. LeCun et al. [7] in 1998. The architecture was
developed for optical character recognition (OCR), that is, for recognizing text in images. The LeNet-5
FIGURE 2.10
The LeNet-5 architecture.
2.5.2 AlexNet
LeNet performs well on simple datasets such as the Modified National Institute of
Standards and Technology (MNIST) dataset, where the images are grayscale and the
number of classes is small (ten for MNIST). To build deeper networks, the AlexNet model
was proposed by A. Krizhevsky et al. [8]; it won the ILSVRC image classification
competition in 2012. The paper, “ImageNet Classification with Deep Convolutional Neural
Networks,” was republished in 2017. The model was trained on 1.2 million high-resolution
images from the ImageNet dataset, which are divided into 1,000 categories.
This pioneering study on “deep” convolutional networks for computer vision sparked
a storm of interest among researchers and practitioners alike. The architecture has five
convolutional layers and three fully connected layers, as depicted in Figure 2.11.
FIGURE 2.11
AlexNet architecture.
2.5.3 VGGNet
VGGNet was developed by the Visual Geometry Group at the University of Oxford in 2014,
hence the name VGG [9]. It is a deeper convolutional neural network with more
convolutional, pooling, and dense layers. Two variants are in common use: VGG16 and
VGG19.
[Figure 2.12 diagram: Conv Block 1 (Conv1+Conv2+POOL3) → Conv Block 2 (Conv4+Conv5+POOL6) → Conv Block 3 (Conv7+Conv8+Conv9+POOL10) → Conv Block 4 (Conv11+Conv12+Conv13+POOL14) → Conv Block 5 (Conv15+Conv16+POOL17) → FC → FC → SOFTMAX]
FIGURE 2.12
VGG16 architecture.
(ILSVRC14) [10]. By incorporating this configuration into the model, the researchers
increased the depth of the network while keeping the computational budget constant.
GoogLeNet, a 22-layer deep network, was the model used in the ILSVRC14 submission.
2.5.4.1 Inception Module
Inception modules are the small building blocks that are stacked on top of one another to
form the Inception network. A single Inception module combines multiple convolutional
layers arranged in parallel. See Figure 2.13 for its complete architecture.
The input to each module is the output of the previous module. It is more computationally
efficient to use Inception modules only at the higher layers and to keep the lower layers in
the standard convolutional style. The Inception modules use 1 × 1 convolutions to compute
reductions before the expensive 3 × 3 and 5 × 5 convolutions. Besides reducing the feature
dimensions, and therefore the computation, the 1 × 1 convolutions also apply rectified
linear activation, so they serve a dual purpose in the model.
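The saving from a 1 × 1 reduction can be made concrete by counting multiplications. The sketch below is a back-of-the-envelope estimate under stated assumptions: a 28 × 28 feature map with 192 input channels, 32 output filters of size 5 × 5, a reduction to 16 channels, stride 1, and padding that preserves the spatial size. These numbers are illustrative and are not taken from the chapter.

```python
def conv_multiplications(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return height * width * out_channels * kernel * kernel * in_channels

H, W, C_IN = 28, 28, 192   # assumed input feature map
C_OUT, C_REDUCED = 32, 16  # assumed output filters and reduced channels

# 5 x 5 convolution applied directly to all 192 input channels.
direct = conv_multiplications(H, W, C_IN, C_OUT, 5)

# 1 x 1 reduction to 16 channels, followed by the 5 x 5 convolution.
reduced = (conv_multiplications(H, W, C_IN, C_REDUCED, 1)
           + conv_multiplications(H, W, C_REDUCED, C_OUT, 5))

print(direct, reduced, direct / reduced)
```

Under these assumptions the reduced path needs roughly a tenth of the multiplications of the direct path while producing an output of the same shape.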
Figure 2.14 shows that GoogLeNet contains nine Inception modules in total, with a
max pooling layer appended after each block to reduce dimensions. GoogLeNet can be
divided into three parts:
1. Initial layers: multiple convolutional and pooling layers connected in series, as in
the LeNet and AlexNet models.
2. Inception modules: nine Inception modules (2 Inception modules + 1 pooling layer + 5
Inception modules + 1 pooling layer + 2 Inception modules).
3. Classifier: a fully connected output layer with a softmax layer.
FIGURE 2.13
Inception module.