Jonah Gamba
Deep Learning Models
A Practical Approach for Hands-On Professionals
Transactions on Computer Systems and
Networks
Series Editor
Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of
Information Technology, Kolkata, West Bengal, India
Editorial Board
Jürgen Becker, Institute for Information Processing–ITIV, Karlsruhe Institute of
Technology—KIT, Karlsruhe, Germany
Yu-Chen Hu, Department of Computer Science and Information Management,
Providence University, Taichung City, Taiwan
Anupam Chattopadhyay, School of Computer Science and Engineering,
Nanyang Technological University, Singapore, Singapore
Gaurav Tribedi, EEE Department, IIT Guwahati, Guwahati, India
Sriparna Saha, Computer Science and Engineering, Indian Institute of Technology
Patna, Patna, India
Saptarsi Goswami, A.K. Choudhury School of Information Technology, Kolkata,
India
Transactions on Computer Systems and Networks is a unique series that aims
to capture advances in the evolution of computer hardware and software systems
and progress in computer networks. Computing systems in the present world span
from miniature IoT nodes and embedded computing systems to large-scale
cloud infrastructures, which necessitates developing systems architecture, storage
infrastructure and process management to work at various scales. Present-day
networking technologies provide pervasive global coverage on a large scale
and enable a multitude of transformative technologies. The new landscape of
computing comprises self-aware autonomous systems, which are built upon a
software-hardware collaborative framework. These systems are designed to execute
critical and non-critical tasks involving a variety of processing resources like
multi-core CPUs, reconfigurable hardware, GPUs and TPUs, which are managed
through virtualisation, real-time process management and fault tolerance. While AI,
Machine Learning and Deep Learning tasks are predominantly increasing in the
application space, computing systems research aims at efficient means of
data processing, memory management, real-time task scheduling, and scalable,
secure and energy-aware computing. The paradigm of computer networks also extends its
support to this evolving application scenario through various advanced protocols,
architectures and services. This series aims to present leading works on advances
in theory, design, behaviour and applications in computing systems and networks.
The Series accepts research monographs, introductory and advanced textbooks,
professional books, reference works, and select conference proceedings.
Jonah Gamba
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is a result of realizing the need for a practical approach to understanding deep
learning models, since many existing books on the market tend to emphasize theoretical
aspects, leaving newcomers and professionals seeking new solutions scrambling
for effective guidelines to achieve their goals. Additionally, most available material
does not take into account the important factor of rapid prototyping, where the
goal is to quickly evaluate the performance of algorithms before going deep into
consideration of the final implementation platforms on which the algorithms will run.
The intention here is to address these problems by taking a different approach, one that
focuses on practicality while keeping theoretical concepts to a necessary minimum.
In this book, we first build the necessary foundation on deep learning models,
including their current status, and progressively go into actual examples of model
evaluation. A dedicated chapter is allocated to evaluating the performance of multiple
algorithms on specific datasets, highlighting techniques and strategies that can address
real-world challenges when deep learning is employed. By consolidating all necessary
information into a single resource, readers can bypass the hassle of scouring
scattered online sources, gaining a one-stop solution to dive into deep learning for
object detection and classification.
To facilitate understanding, the book employs a rich array of illustrations, figures,
tables, and code snippets. Comprehensive code examples are provided, empowering
readers to grasp concepts quickly and develop practical solutions. The book covers
essential methods and tools, ensuring a complete and comprehensive treatment that
enables professionals to implement deep learning algorithms swiftly and effectively.
The book is also designed to equip professionals with the necessary skills to
thrive in the active field of deep learning, where it has the potential to revolutionize
traditional problem-solving approaches. This book serves as a practical companion,
enabling readers to grasp concepts swiftly and embark on building practical solutions.
The content presented in this book is based on several years of experience in
research and development. The main idea is to give a quick start to those trying to find
answers within a short period of time, irrespective of background. The chapters are
organized as follows:
I would like to express my gratitude to all the people who have positively
contributed in various ways to the preparation of this book.
First and foremost, I would like to thank my family members, Megumi, Sekai, and
Mirai, for their invaluable patience during this very long process. Their understanding
and accommodation made it possible for me to spare some time for putting together
the material required to complete the manuscript.
I would like to thank my extended family in various places and situations for the
emotional and physical support that they have given during the period of writing this
book.
Special thanks go to Dr. Courage Kamusoko, Prof. Hiromi Murakami, formerly of
Seikei University, and Prof. Shuji Kawasaki of Iwate University for their continuous
encouragement and advice during the process of putting the book together. Let
me also take this opportunity to thank the LocaSense Research Systems team for their
assistance in the preparation of part of the evaluation data used in Chap. 5.
I would also like to express my sincere gratitude to the Honjo Scholarship Foundation
for always including old boys in their programs; it is through their kindness that it
was possible for me to pursue my interest in information systems, which is the subject
of this book.
Last but not least, many thanks also to Smith Chae, Sivananth S. Siva Chandran,
and Diya Ma of Springer for their very efficient and continuous support during the
process of creating this book.
Jonah Gamba
1.1 Introduction
concepts in a manner that will allow interested readers to start working on real-world
problems. Although theoretical concepts have a critical role to play in the performance
of the algorithms, we will leave much of this discussion to dedicated professionals
and instead concentrate on how to utilize the existing technology. The simple reason
is that technology consumers normally experiment with existing methods at a very
high level to confirm that they work as expected before diving deep into the theory
behind them in order to improve performance. In some cases, existing algorithms work well
without any modifications, thereby reducing the effort and money spent on further
research. Moreover, the approach presented here is natural and will in turn allow
a rapid entry into the realm of deep learning. There is an abundance of resources
in terms of data, code, and theoretical materials available on Internet platforms.
However, the existence of this enormous volume of information is a double-edged
sword. On one hand, it is much easier to find information on any topic of interest,
but on the other hand it is harder to figure out where to start and to filter out
the clutter. Once you visit a particular site from any search engine of your choice,
you are presented with multiple links that can be endlessly linked to other
sites. To summarize, this book offers the following advantages:
Deep Understanding: It provides a more immersive and focused reading expe-
rience that allows the reader to delve deeply into a subject, offering in-depth expla-
nations, comprehensive coverage, and a cohesive narrative that helps build a solid
foundation of understanding. This depth is often missing in fragmented online search
results.
Credibility and Quality: It goes through a rigorous review process, ensuring a
certain level of quality, accuracy, and credibility. Referenced authors are experts in
the field and have invested significant time and effort into research and writing.
Structure and Organization: It is organized in a structured manner, with chap-
ters, sections, and an index that allows for easy navigation and reference. This makes
it convenient to follow a logical progression of concepts, find specific information,
and revisit previous sections.
Limited Distractions: With the material presented in the subsequent chapters, one
can concentrate on the content without the distractions of ads, pop-ups, or hyperlinks.
This helps maintain focus and promotes a deeper level of engagement with the
material.
The above are among the reasons why you need this book: it keeps your eye on the
ball so that you can strike the target efficiently. Of course, it may be
necessary to occasionally check some material on the Internet, but it is best to quickly
come back to the main text to avoid getting lost in distractions.
So how do we begin? As mentioned above, the best way to start with deep
learning is to first limit the scope of the search area. The recommended
approach is to make a quick survey of the publicly available material on
the subject and then choose one that matches your final objective. For quick results,
it is also instructive to take a hands-on approach where one can work on example
code along the way. This makes it possible to visualize the output and make further
refinements for robustness. For example, Python code can be
executed to cement the ideas and also to get a deeper understanding of how the concepts
can be implemented. To this end, there are numerous stable, collaboratively debugged
open-source packages and tools available that make it unnecessary to code algorithms
from mathematical models, thereby reducing the learning curve and speeding progress
toward the intended goal, enabling rapid prototyping.
There are a variety of machine learning algorithms that have evolved over many
years of research and development. An overview of some of these algorithms can
be found in [6], and a concise presentation is given in this section. The
performance of these so-called shallow machine learning algorithms depends heavily
on the representation of the input data they are given [4].
All machine learning algorithms are constructed around mathematical concepts
that make it possible to transform input data into a form that simplifies the task of
classification. After transformation, it becomes a matter of applying a logical rule
to cluster the data into their respective classes. Normally, a series of affine and/or
nonlinear operations is applied to the input data to arrive at the final result.
One example is when data becomes linearly separable after conversion from Cartesian
coordinates to polar coordinates. It is important to recognize that the representation
of data can make a big difference to the classification or recognition
task.
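As a minimal sketch of this idea, the following example uses synthetic data (invented here purely for illustration) consisting of two concentric rings. The classes cannot be separated by a straight line in Cartesian coordinates, but after conversion to polar coordinates a single threshold on the radius separates them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concentric rings: class 0 near radius 1, class 1 near radius 3
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
r = np.concatenate([rng.normal(1.0, 0.1, n // 2),
                    rng.normal(3.0, 0.1, n // 2)])
labels = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

# Cartesian representation: no single line separates the classes
x, y = r * np.cos(theta), r * np.sin(theta)

# Polar representation: the radius alone separates them
radius = np.sqrt(x ** 2 + y ** 2)
predicted = (radius > 2.0).astype(float)  # simple threshold rule

print("Accuracy after polar transform:", np.mean(predicted == labels))
```

The threshold of 2.0 is an arbitrary value between the two ring radii; the point is that the transformed representation reduces the classification rule to a trivial comparison.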
In the following subsections, we briefly summarize some of these conventional
methods that have been found to be effective in some applications.
1.2 Conventional Methods of Object Detection and Machine Learning 5
Step 5: Classification
For each data point in the test set, calculate the distances between the test point and
all data points in the training set using the chosen distance metric. Then sort the
distances in ascending order to identify the K-nearest neighbors.
Step 6: Voting
The first option is to count the occurrences of each class label among the K-nearest
neighbors. Then assign the test point to the class that appears most frequently among
its neighbors (majority voting).
The second option is to assign different weights to neighbors based on their
distance from the test point. Closer neighbors have a higher influence on the
prediction (weighted voting).
In case of ties (equal occurrences of multiple class labels among the K neighbors),
tie-breaking mechanisms, such as selecting the class with the closest neighbor, can
be applied.
Step 7: Evaluation
After classifying all test points, evaluate the model’s performance using appropriate
metrics like accuracy, precision, recall, F1-score, or confusion matrix.
Step 8: Hyperparameter Tuning
If the performance is not satisfactory, it is recommended to adjust hyperparameters
like K or the distance metric and re-evaluate the model. This process can be
repeated with multiple combinations of hyperparameters until the desired performance
is achieved.
Step 9: Prediction
Once the model is tuned and evaluated, the final step is to make predictions on new,
unseen data by following the same steps of distance calculation, neighbor selection,
and voting. If predictions are erroneous, it may be necessary to re-examine Steps 2–8
above.
It’s important to keep in mind that KNN is sensitive to noisy data, outliers, and the
curse of dimensionality, just as with other techniques in the same category. Preprocessing
and data cleaning steps can help alleviate some of these challenges. In
summary, KNN is a straightforward algorithm that can be implemented relatively
easily. In this respect, and as additional information for the interested reader, in
Python the Scikit-learn API provides the sklearn.neighbors.KNeighborsClassifier
class for performing KNN [9]. However, its performance and accuracy heavily depend
on parameter tuning, distance metric selection, and data preprocessing. It’s also
important to balance computational efficiency with model accuracy, especially when
dealing with large datasets. Figure 1.6 is an example of applying the KNN algorithm
to classify a query point in the case of two classes.
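A minimal sketch of Steps 5–9 using the scikit-learn class mentioned above is shown below. The synthetic dataset and the parameter values (K = 5, Euclidean distance, distance-weighted voting) are illustrative choices, not recommendations from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class dataset for illustration
X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=3, n_redundant=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Steps 5-6: distance calculation and weighted voting with K = 5
knn = KNeighborsClassifier(n_neighbors=5, weights="distance",
                           metric="euclidean")
knn.fit(X_train, y_train)

# Step 7: classify the test set and evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Changing n_neighbors or metric and re-running corresponds to the hyperparameter tuning loop of Step 8.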
Merits of KNN
The KNN algorithm has several advantages that make it a popular and useful choice
for certain machine learning tasks. We outline some of the key advantages of the
KNN algorithm (Table 1.1).
Limitations of KNN
While KNN has several advantages, it’s essential to consider its limitations, such
as sensitivity to the choice of K, slow prediction times for large datasets, and the
impact of irrelevant or redundant features. Proper preprocessing, parameter tuning,
and validation are crucial for achieving optimal results with the KNN algorithm.
The KNN algorithm, despite its advantages, also has some limitations that need to
be considered when using it for machine learning tasks. Here are the main limitations
of the KNN algorithm (Table 1.2).
To mitigate these limitations, it’s important to preprocess the data, choose an
appropriate K, and consider using KNN in combination with other algorithms or
techniques, such as dimensionality reduction or ensemble methods. Additionally,
understanding the characteristics of the dataset and the problem domain can help
determine whether KNN is a suitable choice for a specific task.
lower weights. This can improve the accuracy of predictions, especially when some
neighbors are more relevant than others. This approach is similar to distance decay,
where a distance decay function reduces the influence of neighbors as their
distance from the query point increases. Distance metric learning can also be applied,
where the objective is to learn a customized distance metric that optimizes the neighbors’
relevance for classification. Metric learning can improve the algorithm’s performance
when the standard distance metrics are not well suited to the data. A related
approach is localized KNN, in which, instead of considering all data points equally,
only a subset of neighbors closer to the query point is used. This can reduce the
influence of irrelevant neighbors and improve computational efficiency.
Kernel density estimation techniques can be incorporated to smooth the contribution
of neighbors. This can result in more stable and robust
predictions, particularly for noisy or irregular data [11].
Feature Selection and Dimensionality Reduction can be achieved by applying
techniques like Principal Component Analysis (PCA) or feature selection to reduce
the dimensionality of the data before applying KNN. This can help mitigate the curse
of dimensionality and improve computational efficiency [12, 13].
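To make the PCA-plus-KNN idea concrete, the following hedged sketch chains the two steps in a scikit-learn pipeline; the dataset and the choice of 15 components are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 64-dimensional digit images reduced to 15 principal components
# before applying KNN
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = make_pipeline(PCA(n_components=15),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```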
Other possibilities that can be explored are ensemble approaches. This involves
combining predictions from multiple KNN models with different settings (e.g.,
different K values or distance metrics) to enhance accuracy and robustness. Ensemble
methods like Bagging or Boosting can be employed. This can be done in conjunc-
tion with approximate nearest neighbor search algorithms to accelerate the search
for neighbors in high-dimensional spaces. Techniques like k-d trees or ball trees can
significantly improve the algorithm’s efficiency.
Adaptive KNN can dynamically adjust the value of K based on the local density
of data points. In regions with high data density, a smaller K value can be used, while
a larger K value can be employed in sparse regions [14].
Other improvements try to detect and handle outliers before applying KNN.
Outliers can significantly impact the algorithm’s performance, so preprocessing steps
like outlier removal or outlier correction can be beneficial. Hybrid models can also be
applied, such as combinations of KNN with other algorithms like decision trees,
support vector machines, or neural networks, to leverage their strengths and mitigate
KNN’s weaknesses. Finally, localized classifiers and incremental learning have been
investigated. With localized classifiers, specialized classifiers are applied in specific
regions of the feature space, depending on the data characteristics. This approach can
Further details of how the above steps are accomplished can be found in [4].
Additionally, in Python the Scikit-learn API provides the sklearn.discriminant_
analysis.LinearDiscriminantAnalysis class for performing LDA.
The basic principle of LDA is shown in Fig. 1.8.
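As a hedged illustration of that class in use, the sketch below fits LDA to the three-class iris dataset (an assumed example dataset) and also shows the dimensionality-reducing projection onto at most (number of classes − 1) axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Classification accuracy estimated with 5-fold cross-validation
lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# Projection of the 4 original features onto 2 discriminant axes
X_projected = lda.fit(X, y).transform(X)
print("Projected shape:", X_projected.shape)  # (150, 2)
```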
Merits of LDA
Linear Discriminant Analysis (LDA) offers several advantages, making it a valu-
able technique in various machine learning and pattern recognition tasks. Some of
the key advantages of the Linear Discriminant Analysis algorithm are listed below
(Table 1.3).
might lead to a decision boundary that does not accurately represent the actual distribution
of the majority of the data. Similarly, outliers can lead to distorted within-class
scatter, inaccurate between-class scatter, and loss of linearity, all of which have a negative
impact on classification performance.
When the number of samples is small, LDA may overfit the data, especially if the
number of features is large. In such cases, LDA can perform poorly due to the limited
amount of data available for estimation, even when an attempt is made to use all
the available samples. Another limitation is linear separability:
LDA aims to find linear boundaries that separate classes, so it may struggle
when classes are not linearly separable, leading to reduced
classification accuracy.
There are also problems associated with reduced performance in high-dimensional
data. In high-dimensional feature spaces, the “curse of dimensionality” can affect
the performance of LDA. This is because the assumptions of LDA become harder to
meet as the number of features increases.
LDA’s primary focus lies in transforming data into a lower-dimensional space
that enhances class separability. However, it lacks an inherent mechanism for explicit
feature selection. This means that while it aims to improve classification accuracy
through this transformation, it does not automatically identify or eliminate irrelevant
or redundant features from the dataset. This leaves the algorithm implementer with the
burden of identifying the correct set of features that may be suitable for the problem
at hand. This shortcoming is especially severe when the data has no obvious patterns
that might give clues for feature selection.
Although the LDA can be extended to address multiclass classification prob-
lems, its fundamental formulation is rooted in binary classification. Consequently, in
complex scenarios involving multiple classes, the algorithm’s binary origin may
impact its behavior. Although techniques exist to expand its use to multiclass
scenarios, the algorithm’s core design remains influenced by its binary classification
heritage.
The determination of the decision boundary by LDA takes into account the prior
probabilities of the different classes. This reliance on prior probabilities can be a
double-edged sword. When prior probabilities are accurate and unbiased, LDA can
produce effective results. However, if these probabilities are skewed or inaccurate,
Innovative approaches combining deep learning and LDA such as Deep Linear
Discriminant Analysis (DLDA) have been proposed in the literature [21, 22]. Inte-
grating LDA into deep learning frameworks allows for learning more complex and
discriminative feature representations while retaining the benefits of LDA.
Other notable methods include combining multiple LDA models, or LDA with
other classification algorithms, in an ensemble to enhance classification performance
and robustness, and Sparse Discriminant Analysis, which incorporates sparsity-inducing
techniques to encourage feature selection and prioritize relevant features in
the discriminant analysis process. More accurate estimation of class priors can also
be employed to improve the performance of LDA, especially when prior probabilities
are imbalanced.
The above improvements and extensions address various limitations of the tradi-
tional LDA algorithm and offer more flexibility, accuracy, and robustness in various
scenarios.
about whether QDA is suitable for your data. If the features are highly correlated,
multicollinearity might impact the accuracy of covariance estimates in QDA. In this
case, one can consider addressing multicollinearity through feature transformation or
regularization.
Finally, real data doesn’t always come in the assumed distributions and desired sizes;
in fact, this happens more often than not. In that respect, handling non-Gaussian
distributions and small sample sizes should be considered. QDA assumes that the
feature distributions within each class are multivariate normal. If this assumption is
violated, consider transforming your data to approach normality. Additionally, if one
has a small dataset, QDA should be applied with caution, as it might lead
to overfitting. Techniques like regularized discriminant analysis or dimensionality
reduction can be used in such cases.
Step 2: Compute Class Statistics
Calculate the mean vector and covariance matrix for each class. These statistics
provide information about the distribution of data within each class.
Quadratic Discriminant Function:
For each class, QDA models the class distribution using a quadratic function.
The quadratic discriminant function $d_j(x)$ can be represented as

$$
d_j(x) = -\frac{1}{2}\log\left|\Sigma_j\right| - \frac{1}{2}\left(x - \mu_j\right)^{T}\Sigma_j^{-1}\left(x - \mu_j\right) + \log p_j
$$

with the class covariance estimated from the training samples as

$$
\Sigma_j = \frac{1}{n_j}\sum_{k \in C_j}\left(x_k - \mu_j\right)\left(x_k - \mu_j\right)^{T}
$$

where the index $j$ denotes the class and $\Sigma_j$, $\mu_j$, and $p_j$ are the covariance matrix, mean vector, and prior probability of class $j$, respectively, for a given data point $x$.
The objective is to calculate the quadratic discriminant function for each class
and assign the point to the class with the highest discriminant score.
Step 3: Model Training
QDA does not have an explicit training phase like some other algorithms. The model
parameters (mean vectors and covariance matrices) are estimated directly from the
training data.
Step 4: Regularization
If the covariance matrices are ill-conditioned or if the number of training samples is
small, regularization techniques can be optionally applied to stabilize the parameter
estimation.
Step 5: Model Evaluation
Split the dataset into training and testing sets. Train the QDA model on the training
data and evaluate its performance on the testing data using appropriate metrics
(accuracy, precision, recall, F1-score, etc.).
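A hedged end-to-end sketch of Steps 2–5, using scikit-learn's QuadraticDiscriminantAnalysis class on a synthetic dataset (both the dataset and the reg_param value are illustrative assumptions), follows.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Step 4: reg_param shrinks the per-class covariance estimates,
# stabilizing them when samples are scarce
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)

# Steps 2-3: class means and covariances are estimated by fit()
qda.fit(X_train, y_train)

# Step 5: evaluate on held-out data
print(classification_report(y_test, qda.predict(X_test)))
```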
Merits of QDA
The Quadratic Discriminant Analysis (QDA) algorithm offers several advantages that
make it a valuable tool for certain classification tasks. Here are the key advantages
of the QDA algorithm (Table 1.4).
QDA is particularly useful when data exhibits nonlinear relationships and varying
covariance structures among classes. It can provide accurate and flexible classifica-
tion in scenarios where linear classifiers might not be suitable. However, it’s important
to consider the assumptions of QDA, such as Gaussian distribution and class-specific
covariance matrices, to ensure its applicability to the given data.
Limitations of QDA
The Quadratic Discriminant Analysis (QDA) algorithm, while advantageous in
many aspects, also has limitations that need to be considered when applying it to
classification tasks. Here are the main limitations of the QDA algorithm (Table 1.5).
Despite these limitations, QDA can still be a powerful tool for classification tasks,
especially when data exhibits nonlinear relationships and varying covariance struc-
tures. It’s important to carefully assess whether QDA is appropriate for a specific
problem and ensure that the assumptions of the algorithm are met for accurate and
reliable results.
Improvements of QDA
While Linear Quadratic Discriminant Analysis (LQDA) is a simplified version of
Quadratic Discriminant Analysis (QDA) that assumes equal covariance matrices for
all classes, there are some possible improvements and variations that can enhance its
performance and address its limitations. Some of these improvements are common
across multiple object detection and classification methods. Here we summarize
some of the common approaches.
Introducing regularization techniques to mitigate the effects of ill-conditioned
covariance matrices or situations with limited data is almost standard for most classification
methods, and QDA is no exception. Regularized Linear Quadratic Discriminant
Analysis can stabilize parameter estimation and prevent overfitting. Another
approach is to modify the algorithm to allow for different covariance matrices in
localized regions of the feature space. This approach can improve accuracy by
accommodating varying data distributions [23].
Implementing feature selection or dimensionality reduction techniques before
applying LQDA is another common method. Reducing the number of features can
improve the algorithm’s performance, especially in high-dimensional spaces. It is
also worthwhile to consider ensemble methods: employing Bagging or Boosting
with LQDA as the base classifier can lead to enhanced robustness
and accuracy by combining multiple classifiers.
Hybrid models can also be investigated, for example, combining LQDA with
other classification algorithms such as logistic regression, naive Bayes, or support
vector machines; utilizing distance-based classifiers like KNN in conjunction
with LQDA to incorporate local information for classification; and exploring semi-supervised
or self-training techniques [24]. Hybrid models can leverage the strengths
of each algorithm to improve overall performance.
Other improvements worth mentioning include Kernel Linear Quadratic Discriminant
Analysis, which extends LQDA using kernel methods to allow for nonlinear
decision boundaries. Kernel LQDA can capture complex relationships between
features and classes. Finally, the development of interpretable variations of LQDA that
provide insights into the decision-making process, similar to linear models, while
still capturing more complex relationships, can be considered [25, 26].
As with LDA, it is important to note that while these improvements can enhance
LQDA’s performance, they might introduce additional complexity or computational
requirements. This trade-off between complexity and performance gains always
exists.
Due to the importance of the SVM algorithm, we first give a brief description of its
historical background. The foundation for SVMs was laid by the work of Vladimir
Vapnik and Alexey Chervonenkis in the late 1960s [27]. They introduced the concepts
of “structural risk minimization” and the “VC dimension,” which form the theoretical
basis for SVMs.
The concept of SVMs as we know them today was developed by Vapnik and his
team at AT&T Bell Laboratories in the 1990s. In 1992, Bernhard Boser, Isabelle
Guyon, and Vapnik introduced the first algorithm for training linear SVMs [17].
In 1995, Corinna Cortes and Vapnik introduced the Support Vector Classifi-
cation (SVC) algorithm. The “kernel trick,” a fundamental aspect of SVMs that
allows nonlinear classification by mapping data into higher-dimensional spaces, was
proposed by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik in 1992. This
enabled SVMs to tackle complex classification problems.
SVMs gained popularity in the early 2000s due to their strong theoretical founda-
tions and good generalization properties. The development of SVMs was intertwined
with the progress of kernel methods and machine learning in general. Researchers
SVM is known for its ability to handle complex decision boundaries and perform
well on various types of data. Its effectiveness in high-dimensional spaces and its
ability to handle nonlinear relationships through kernel functions make it a popular
choice in many machine learning applications. However, SVM’s training time and
complexity can increase with larger datasets, and the selection of appropriate kernels
and hyperparameters requires careful consideration to achieve optimal results.
radial basis function, etc.), and the associated kernel-specific parameters (e.g., degree
for polynomial kernel, gamma for RBF kernel).
The selection of hyperparameters can dramatically impact the SVM’s ability to
generalize well on new, unseen data. An incorrect choice of hyperparameters can lead
to overfitting (when the model fits the training data too closely but doesn’t perform
well on new data) or underfitting (when the model is too simplistic to capture the
underlying patterns).
To determine the best combination of hyperparameters for your SVM, you typi-
cally use a technique called cross-validation. Cross-validation involves splitting your
training data into multiple subsets or folds. You train the SVM on several combi-
nations of hyperparameters and evaluate its performance on different folds. This
helps you understand how well the SVM generalizes to unseen data under various
hyperparameter settings.
One common approach for hyperparameter tuning is grid search. In grid search,
you define a range of possible values for each hyperparameter, and the algorithm
tries every possible combination of these values. For each combination, you train
the SVM using cross-validation and measure its performance. The combination of
hyperparameters that yields the best validation performance is selected as the optimal
set.
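A hedged sketch of grid search as just described, using scikit-learn's GridSearchCV with an SVM classifier (the dataset and the candidate value grid are illustrative assumptions), is shown below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidate values for C and gamma; every combination is tried
# with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-4, 1e-3, 1e-2],
              "kernel": ["rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```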
Grid search can be computationally expensive, especially when dealing with
multiple hyperparameters or large datasets. Random search is an alternative where
you randomly sample from the hyperparameter space. Bayesian optimization is
another approach that uses probabilistic models to find the next set of hyperpa-
rameters to evaluate based on past performance.
The regularization parameter (C) controls the trade-off between maximizing the
margin and minimizing the classification error. Larger C values result in a smaller
margin and potentially more training data points within it. Kernel parameters like
gamma in the RBF kernel influence the flexibility of the decision boundary. These
parameters require careful tuning to prevent overfitting or underfitting.
Step 6: Training the SVM
Train the SVM model on the training data using the chosen hyperparameters and
kernel. During training, the SVM optimizer adjusts the weights and bias of the
hyperplane to create the optimal decision boundary that separates the classes.
Step 7: Model Evaluation
Evaluate the trained SVM model on a separate testing dataset to assess its perfor-
mance. Use appropriate evaluation metrics such as accuracy, precision, recall,
F1-score, or ROC curves to measure the model’s effectiveness.
Step 8: Fine-Tuning (Optional)
If the performance is not satisfactory, it is possible to optionally go back to hyperparameter
tuning, try different kernels, or consider adjusting the dataset to improve
results.
Step 9: Prediction
Once the model’s performance becomes satisfactory, one can use it to make predic-
tions on new, unseen data points. Apply the same preprocessing steps (scaling,
normalization) to the new data before making predictions.
Step 10: Model Interpretation
Depending on the kernel used, SVM might offer insights into feature importance,
allowing one to understand which features contribute most to the classification
decision. This step can be optionally performed.
Step 11: Deployment
Deploy the trained SVM model into production environments for making real-time
predictions on new data.
SVM is a versatile algorithm that can handle a variety of classification tasks,
from linear to nonlinear and binary to multiclass. Its effectiveness relies on proper
data preprocessing, kernel selection, and hyperparameter tuning to achieve optimal
performance.
The application of SVM for linear and nonlinear cases is illustrated in Figs. 1.12
and 1.13.
More details of how the above steps are accomplished can be found in [4]. Additionally,
in Python the Scikit-learn API provides the sklearn.svm module for handling
the multiple variations of the SVM algorithm.
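Tying Steps 6–9 together, the following hedged sketch trains an RBF-kernel SVM inside a pipeline so that the scaling applied during training is automatically reused at prediction time; the dataset and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scaling lives inside the pipeline, so the same preprocessing
# is applied to new data at prediction time (Step 9)
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)                     # Step 6: training
print("Test accuracy:", model.score(X_test, y_test))  # Step 7

new_point = X_test[:1]                          # stand-in for unseen data
print("Prediction:", model.predict(new_point))  # Step 9
```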
Kernel Trick for Dimensionality Reduction: The kernel trick can also be used for
dimensionality reduction, which can be useful when dealing with high-dimensional
data.
SVM’s combination of flexibility, generalization capability, and robustness makes
it a valuable tool in various domains, such as image classification, text categorization,
bioinformatics, and more. However, it’s important to fine-tune hyperparameters and
choose the appropriate kernel for each problem to achieve optimal results.
Limits of SVM
While the SVM algorithm offers numerous advantages, it also comes with some
limitations and challenges that need to be considered when applying it to different
machine learning tasks. Here are the main limitations of the SVM algorithm
(Table 1.7).
While SVM is a powerful algorithm with wide-ranging applications, it’s impor-
tant to be aware of its limitations and carefully consider whether it is the appropriate
choice for a specific problem. Addressing these limitations may involve using tech-
niques like feature engineering, kernel tuning, and model evaluation to ensure optimal
performance.
scalability and speed up training, especially for large datasets, is one possible approach
[40]. Recently, incremental learning has been a subject of investigation. The creation of
incremental or online SVM algorithms that can adapt to new data without retraining
the entire model could be effective. This is particularly useful for real-time or
streaming data scenarios [41].
Applying advanced regularization techniques to SVM to handle noisy data and
improve generalization can also lead to improved performance. Techniques like L1 regularization
or Elastic Net can help with feature selection and reduce model complexity
[42].
Another option is the SVM ensemble approach. This involves building
ensemble models using multiple SVM classifiers to enhance performance. Techniques
like Bagging or Boosting can combine multiple SVMs to achieve better
generalization [43]. Hybrid models, which combine SVM with other algorithms such
as Decision Trees or Neural Networks, leverage the strengths of each to achieve
improved performance. For multiclass SVM, the idea is to develop
specialized algorithms for multiclass classification that go beyond One-vs-One and
One-vs-Rest approaches. Hierarchical classification or direct optimization methods
could be explored [44].
For kernel learning, investigating methods for automatically learning the
optimal kernel from the data, potentially using unsupervised learning techniques
to uncover meaningful transformations, has a chance of improving performance. In the
case of imbalanced data, techniques to adapt SVM to imbalanced
class distributions, such as cost-sensitive learning, adjusting class weights, or
generating synthetic samples for the minority class, can be explored.
The design of interpretable kernels that provide insights into the decision boundary
and feature importance, making SVM results easier to understand, and the extension of
SVM to handle structured data like graphs or sequences, by incorporating domain-specific
similarity measures or defining custom kernel functions, are possible
avenues to follow.
Scalability improvements are also important for SVM. The development of parallel
and distributed versions of SVM algorithms to accelerate training and improve
scalability on distributed computing frameworks can be considered.
Other approaches include multilabel classification, which extends SVM to handle
multilabel classification problems, where instances can belong to multiple classes
Random forest is a decision tree approach used to successfully solve many shallow
machine learning tasks. It was the top algorithm of choice until the mid-2010s, when
another decision tree-based algorithm, gradient boosting machines, took over.
A random forest is composed of a large ensemble of decision trees that perform
prediction tasks individually. The results of these decision trees are then combined into
the final result by some form of voting, meaning that the class with the most votes
becomes the output. Figure 1.14 illustrates how the random forest algorithm works
in principle [33].
This approach works well because the individual decision trees are uncorrelated,
so prediction errors from some models can be compensated for by correct results
in the majority of the decision trees. Random forests offer several advantages over
decision trees:
Improved Accuracy: Random forests generally provide higher accuracy compared
to individual decision trees. By aggregating the predictions of multiple decision trees,
the ensemble approach reduces overfitting and variance, resulting in more robust and
accurate predictions.
Fig. 1.14 Illustration of the random forest algorithm, where the majority vote determines the
final predicted class
Reduced Overfitting: Decision trees tend to overfit the training data, capturing
noise and specific patterns that may not generalize well to unseen data. Random
forests mitigate overfitting by using random subsets of the data and features for each
tree, reducing the risk of memorizing noise.
Robustness: Random forests are less sensitive to outliers and noisy data points
compared to single decision trees. The averaging of multiple trees reduces the impact
of individual noisy predictions, leading to more robust models.
Feature Importance: Random forests can assess the importance of features in
the model, providing insights into which features are most influential in making
predictions. This information is valuable for feature selection and understanding the
underlying data patterns.
Parallelism: Random forests can be easily parallelized, making them efficient for
training on large datasets and taking advantage of multicore processors or distributed
computing.
No Need for Feature Scaling: Random forests are not sensitive to the scale of
features. Unlike some algorithms that require feature scaling, random forests can
handle features of different scales without impacting performance.
Handling Missing Data: Random forests can handle missing data without
requiring imputation. Missing values can be efficiently dealt with during the
tree-building process.
Versatility: Random forests can be used for both classification and regression
tasks, making them a versatile choice for various machine learning problems.
In short, the ensemble nature of random forests, where multiple decision trees are
combined, leads to more accurate, robust, and stable models compared to individual
decision trees.
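As a hedged illustration of these properties, the sketch below trains a scikit-learn random forest (the dataset and tree count are assumed for demonstration) and reads out the feature importances mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random
# subset of features, combined by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)
```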
As stated in the preceding subsection, the recent champion of decision tree algorithms
is the gradient boosting machine. The variant called XGBoost (extreme gradient
boosting) got a boost in popularity by winning Kaggle competitions. Gradient boosting relies on
boosting, where weak learners are converted into stronger learners. In this technique,
the gradient descent method is applied to the loss function to determine the model
parameters [34]. XGBoost is a powerful and widely used algorithm for both regres-
sion and classification tasks, known for its speed, scalability, and high predictive
performance. Gradient boosting is an ensemble learning technique that combines
multiple weak learners (usually decision trees) to create a strong predictive model. It
builds the models sequentially, with each new model attempting to correct the errors
made by the previous ones.
XGBoost extends the traditional gradient boosting algorithm by introducing
several enhancements, which contribute to its effectiveness and efficiency:
Regularization: Includes L1 (Lasso) and L2 (Ridge) regularization terms in the
objective function, which helps prevent overfitting and improve model generalization.
Tree Pruning: It uses a depth-first approach to build decision trees and prunes
branches that contribute little to the overall model’s performance. This helps reduce
the complexity of the model and enhance its efficiency.
Weighted Quantile Sketch: Employs an optimized data structure called the
“weighted quantile sketch” to efficiently handle data summary statistics during tree
construction, improving the speed of the algorithm.
Handling Missing Values: It automatically handles missing data during tree
construction, eliminating the need for explicit data imputation.
Cross-validation: Includes built-in cross-validation capabilities to assess model
performance and tune hyperparameters effectively.
Parallel Processing: It can be parallelized, taking advantage of multicore proces-
sors and distributed computing environments, making it highly efficient for large
datasets.
Due to these optimizations and improvements, XGBoost has gained significant
popularity in machine learning competitions, real-world applications, and academic
research. It is often regarded as one of the most powerful and versatile algorithms in
the gradient boosting family.
The early boosting variants include AdaBoost (Adaptive Boosting), which has been
extensively employed to solve classification problems [35]. For conventional
machine learning approaches where the underlying problem is not vision
related, gradient boosting can be considered the best choice at this point in time.
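A hedged sketch of XGBoost in use follows; it assumes the separately installed xgboost Python package, and the dataset and hyperparameter values are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # requires the xgboost package

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Trees are added sequentially, each correcting the errors of the
# previous ones; reg_lambda adds the L2 penalty discussed above
model = XGBClassifier(n_estimators=200, max_depth=3,
                      learning_rate=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```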
Here we give a concise background of deep learning and its origins. Although deep
learning has been intensively under the spotlight in recent years, it has been in the
literature for a long time under different terminology [36, 37, 46, 47]. In fact, all machine
learning algorithms can be broadly classified under the artificial intelligence umbrella. The
field of artificial intelligence is a superset of both machine learning and deep learning,
while deep learning is in turn a subset of machine learning. This can be visualized
as shown in Fig. 1.15.
Artificial intelligence has several definitions, but the general consensus is the
desire to make machines have some level of human intelligence. While the Britannica
encyclopedia defines human intelligence as the mental quality that consists of
the abilities to learn from experience, adapt to new situations, understand and handle
abstract concepts, and use knowledge to manipulate one’s environment, we can all
agree that artificial intelligence is still far from achieving this level of ability. The
biggest limitations remain adaptation to new situations and handling abstract
concepts. To some extent, machines are able to learn certain patterns and manipulate the
environment. Given the above outstanding hurdles, computer scientists and engineers
define artificial intelligence as the ability of computer systems to perform intelligent
tasks. Some notable examples of these tasks include computer vision, natural
language processing, text processing, and pattern recognition.
By definition, machine learning (ML) is concerned with the study of computer
algorithms and statistical models that can accomplish intelligent tasks. These algorithms
can be notably categorized into supervised learning, semi-supervised learning,
unsupervised learning, and reinforcement learning. Supervised and semi-supervised
learning can be combined into one category, thus effectively resulting in three cate-
gories. Reinforcement learning differs from supervised and unsupervised learning in
that it does not rely on labeled or unlabeled examples of correct behavior, but is inter-
active and tries to maximize a reward signal as opposed to finding hidden structures
which is the basis of unsupervised learning [48]. On the other hand, deep learning
has roots in artificial neural networks which in turn are modeled based on inspiration
from human neurons or perceptrons [49], although it will be an oversimplification
to say that neurons operate like artificial neural networks.
For object detection and classification, the field has come a long way in formulating
very useful algorithms, up to deep learning. Some of the popular deep learning algorithms
that have been successfully used in solving practical problems include region
proposals (Region-Based Convolutional Neural Networks (R-CNN), Fast R-CNN,
Faster R-CNN) [50], You Only Look Once (YOLO) [51], deformable convolutional
networks [52], the Refinement Neural Network for Object Detection (RefineDet) [53],
RetinaNet [54, 55], and many others. The number of algorithms keeps growing
rapidly, but the CNN has proven to be the most widely used network architecture so far.
The VGG16 architecture [56], which is built on CNNs, is one example.
There are three main competing frameworks for implementing and evaluating deep
learning algorithms, namely, Keras (https://keras.io/), TensorFlow (https://www.
tensorflow.org/) and PyTorch (https://pytorch.org/). Keras and TensorFlow can be
viewed as complementary frameworks, which then boils down to two frameworks in
reality. Each of the frameworks has its own pros and cons in terms of usability
and performance, so choices can be made on a need basis. While Keras offers a quick
start by hiding most of the programmatic details in TensorFlow, PyTorch takes one a
level deeper into Python’s strengths. So, for a quick start, Keras would be the way to go,
and then at some point one can venture into PyTorch. Therefore, in this book we will be
building all the examples on the Keras framework.
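For a feel of what this looks like, here is a minimal hedged sketch of defining and compiling a small Keras classifier; the architecture (a flattened 28 × 28 input and two dense layers) is an arbitrary illustration, not one of the book's models.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal fully connected classifier for 28x28 grayscale images
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints the layer shapes and parameter counts
```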
generic in nature and can be flexibly extended to general object detection for such
tasks as text recognition and object detection and recognition in autonomous driving
environments.
1. What is the difference between object detection and object classification? How
can deep learning be used to solve both of these tasks?
2. Explain the difference between support vector machines, random forests and
gradient boosting. What are some advantages and disadvantages of each
approach?
3. How can convolutional neural networks (CNNs) be used for object detection and
classification? Describe the architecture of a typical CNN-based object detection
system.
4. Investigate object detection methods and explain their strengths and weaknesses.
What are bounding boxes used for in object detection? How are they used
to improve the accuracy of object detection models?
5. What is transfer learning, and how can it be used for object detection and clas-
sification? Give an example of how a pretrained model could be fine-tuned for a
specific object detection task.
References
31. Bruzzone L, Persello C (2009) A novel context-sensitive semi-supervised SVM classifier robust
to mislabeled training samples. IEEE Trans Geosci Remote Sens 47(7)
32. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Kluwer
Academic Publishers, Boston, pp 1–43
33. Breiman L (2001) Random forests. Mach Learn 45:5–32
34. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: KDD‘16: proceedings
of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining
August, 2016, pp 785–794
35. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an
application to boosting. J Comput Syst Sci 55:119–139
36. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural
networks. Science 313(5786):504–507
37. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18
38. Kamusoko C, Gamba J (2014) Mapping woodland cover in the Miombo ecosystem: a
comparison of machine learning classifiers. Land 3:524–540
39. Schultheis E, Babbar R (2021) Speeding-up one-vs-all training for extreme classification via
smart initialization. https://arxiv.org/abs/2109.13122
40. Abeykoon VL, Fox GC, Kim M (2019) Performance optimization on model synchronization
in parallel stochastic gradient descent based SVM. In: 2019 19th IEEE/ACM international
symposium on cluster, cloud and grid computing (CCGRID), Larnaca, Cyprus, 2019, pp 508–
517. https://doi.org/10.1109/CCGRID.2019.00065
41. Pesala V, Kalakanti AK, Paul T, Ueno K, Kesarwani A, Bugata HGSP (2019) Incremental
learning of SVM using backward elimination and forward selection of support vectors. In:
2019 International conference on applied machine learning (ICAML), Bhubaneswar, India,
2019, pp 9–14. https://doi.org/10.1109/ICAML48257.2019.00010
42. Xie L, Luo Y, Su S-F, Wei H (2023) Graph regularized structured output SVM for early
expression detection with online extension. IEEE Trans Cybern 53(3):1419–1431. https://doi.
org/10.1109/TCYB.2021.3108143
43. Cao Y, Sun Y, Li P, Su S, Vibration-based fault diagnosis for railway point machines using
multi-domain features, ensemble feature selection and SVM. IEEE Trans Veh Technol. https://
doi.org/10.1109/TVT.2023.3305603
44. Liu H, Yu Z, Shum CK, Man Q, Wang B (2023) A new hierarchical multiplication and spectral
mixing method for quantification of forest coverage changes using Gaofen (GF)-1 imagery in
Zhejiang Province, China. IEEE Trans Geosci Remote Sens 61:1–10, Art no. 4407210. https://
doi.org/10.1109/TGRS.2023.3303078
45. Su Y, Li X, Yao J, Dong C, Wang Y (2023) A spectral–spatial feature rotation-based ensemble
method for imbalanced hyperspectral image classification. IEEE Trans Geosci Remote Sens,
61:1–18, Art no. 5515918. https://doi.org/10.1109/TGRS.2023.3282064
46. Furukawa H (2018) Deep learning for end-to-end automatic target recognition from synthetic
aperture radar imagery. IEICE Tech Rep 117(403):35–40, SANE 2017-92
47. Angelov A, Robertson A, Murray-Smith R, Fioranelli F (2018) Practical classification of
different moving targets using automotive radar and deep neural networks. IET Radar, Sonar
Navig 12(10):1082–1089
48. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. MIT Press
49. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc., New
York
50. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object
detection and semantic segmentation. In: 2014 IEEE Conference on computer vision and pattern
recognition, 2014, pp 580–587. https://doi.org/10.1109/CVPR.2014.81
51. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object
detection. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR), 2016,
pp 779–788. https://doi.org/10.1109/CVPR.2016.91
52. Dai J et al (2017) Deformable convolutional networks. In: 2017 IEEE International conference
on computer vision (ICCV), 2017, pp 764–773. https://doi.org/10.1109/ICCV.2017.89
53. Zhang S, Wen L, Lei Z, Li SZ (2021) RefineDet++: single-shot refinement neural network for
object detection. IEEE Trans Circuits Syst Video Technol 31(2):674–687. https://doi.org/10.
1109/TCSVT.2020.2986402
54. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In:
2017 IEEE international conference on computer vision (ICCV), 2017, pp 2999–3007. https://
doi.org/10.1109/ICCV.2017.324
55. Del Prete R, Graziano MD, Renga A (2021) RetinaNet: a deep learning architecture to achieve
a robust wake detector in SAR images. In: 2021 IEEE 6th International forum on research and
technology for society and industry (RTSI), 2021, pp 171–176. https://doi.org/10.1109/RTS
I50628.2021.9597297
56. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. https://arxiv.org/abs/1409.1556
Chapter 2
Requirements for Hands-On Approach
to Deep Learning
2.1 Introduction
In Python, vectors, matrices, arrays, and tensors are all data structures used to represent and manipulate multidimensional data. In the deep learning models presented later, we will be processing data in numerical format using Python's NumPy library. Therefore, for our purposes, we will treat tensors as multidimensional NumPy arrays [1].
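As a minimal sketch (an illustration of ours, not from the text), tensors of increasing rank can be created directly as NumPy arrays:

import numpy as np

scalar = np.array(5)                 # 0-D tensor (scalar)
vector = np.array([1, 2, 3])         # 1-D tensor (vector)
matrix = np.array([[1, 2], [3, 4]])  # 2-D tensor (matrix)
cube = np.zeros((2, 3, 4))           # 3-D tensor (data cube)

print(scalar.ndim, vector.ndim, matrix.ndim, cube.ndim)  # prints: 0 1 2 3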
Fig. 2.2 Visualization of 2-D array (2-D tensor, matrix) and 3-D array (3-D tensor, cube)
Fig. 2.3 Visualization of an example of a multidimensional array (4-D array, 4-D tensor)
Array Manipulation
Figure 2.4 is an example of reshaping an array from size (3, 5) to size (5, 3). The key point is that the total number of elements in the new shape must equal the total number of elements in the old shape.
Besides reshape, other array manipulation operations such as resize, transpose, squeeze, flatten, etc. can be performed on NumPy arrays.
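For instance, the reshaping of Fig. 2.4 and a few of these other operations can be sketched as follows (a minimal example of ours):

import numpy as np

a = np.arange(15).reshape(3, 5)   # 15 elements arranged as a (3, 5) array
b = a.reshape(5, 3)               # same 15 elements rearranged to (5, 3)
c = a.transpose()                 # axes swapped: shape becomes (5, 3)
d = a.flatten()                   # collapsed to a 1-D array of 15 elements

print(b.shape, c.shape, d.shape)  # prints: (5, 3) (5, 3) (15,)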
Fig. 2.4 Visualization of an example of array manipulation where the shape is changed
2.3 Setting Up Environment
This section is a quick guide that explains the steps necessary to create an environment for deep learning algorithm evaluation using Python as the basis. The processing steps and resources will be explained as we walk through the process. With the availability of vast resources on the internet, the interested reader should be able to rapidly create a working demo script within a few hours, if not minutes. It is assumed in this book that the reader has basic knowledge of programming and of the Python environment. Deep knowledge of artificial intelligence, neural networks, or deep learning is not a prerequisite for running deep learning algorithms. Basically, it is possible to run deep learning algorithms either offline or online.
The examples that we will present are built on Python 3.7 and can run in a standalone Windows environment. However, we have confirmed that the setup is also straightforward using VirtualBox Ubuntu 20.04 LTS on a Windows host. The notable difference between the Windows environment and Ubuntu is that the Ubuntu Terminal is the basic tool for command-line operations, so no additional terminal installation is required. For package management, we recommend using Anaconda Navigator, which can be downloaded for free from the official website (Anaconda Navigator Installation).
Installation of the Anaconda Navigator on Windows is quite easy to perform and the
Navigator can be started from the Start Menu. Figure 2.5 below is an example of the
interface on Windows 10.
It is highly recommended to create a new environment for each classification task
or project using the following steps.
1. Click Environments in the Anaconda Navigator and select Create at the bottom left (Fig. 2.6).
2. Set the environment name (in this case env_maskrcnn as an example) and select the Python version (in this case 3.7) as in Fig. 2.7.
The environment will be shown in the list of environments, to which packages and tools can be added as necessary. In our example, we created "env_maskrcnn" and installed Spyder® and Jupyter Notebook, among other standard tools. Spyder® is a user-friendly interactive Python GUI, and, as one use-case example, Jupyter Notebook is good for visualizing demos available from GitHub and for creating new scripts before running them in Spyder. The Jupyter Notebook is also handy for interactive debugging as it can easily link to online resources such as Stack Overflow.
Although the Windows and Ubuntu/Linux platforms are convenient to use in terms of availability and control, the recent trend is to rely on online platforms, specifically Google Colab (https://research.google.com/colaboratory/). The advantages of Google Colab (Colab for short) are that very minimal or no setup effort is required, and it also provides the option to use free GPU/TPU resources once an account is created. The packages needed for most classification tasks are constantly updated, simplifying package management. In addition, it is very easy to share Notebooks and check algorithm performance online. For an affordable fee, it is possible to upgrade the account if more computational resources are required. In any case, it
is always possible to unsubscribe anytime and use the free Colab account for small
demo projects. An example of the online Colab interface is shown in Fig. 2.8.
2.4 Concluding Remarks
Wrapping up what we have learnt so far, we presented basic Python data structures and their manipulation. We ended with reference material on setting up the environment and also gave online options to consider.
1. What is a tensor, and how is it used in deep learning? Describe the difference
between a scalar, vector, and matrix, and give an example of each.
2. How do you create a one-dimensional (1-D) array in Python? Give an example of
how to create an array of integers, and describe how to access individual elements
of the array.
3. What is a matrix, and how do you create a two-dimensional (2-D) array in
Python? Give an example of how to create a 2-D array of floating-point numbers,
and describe how to perform basic operations on matrices (e.g., addition,
multiplication).
4. What is a data cube, and how is it used in deep learning? Describe how to create
a three-dimensional (3-D) array in Python, and give an example of how to access
individual elements of the array.
5. Describe the concept of multidimensional arrays in Python. What are some
common operations you can perform on multidimensional arrays? Give an
example of how to perform each of these operations on a multidimensional array.
Chapter 3
Building Deep Learning Models

In this chapter, we illustrate how to build deep learning models and how to train and evaluate them using the Keras framework in a simple and concise way. We briefly explain some of the concepts behind these models so as to give the reader a smooth entry into each section, concentrating mainly on how to use the models rather than on the details of the algorithms themselves. The entry point will be shallow networks, upon which deep neural networks are developed. We then touch on convolutional neural networks (CNNs), followed by recurrent neural networks (RNNs), and finally long short-term memory (LSTM)/gated recurrent units (GRUs). Along the way, we provide examples of how each of these can be used in order to cement the ideas behind them. After that, we take a quick look at the Keras library and give some references for further investigation.
3.1 Introduction: Neural Networks Basics

In recent terminology, neural networks can be categorized into deep and shallow neural networks. In this categorization, shallow neural networks can be thought of as the basic building blocks required to understand deep neural networks; they consist of a few hidden layers, normally one or two. In this subsection, we give a brief overview of shallow networks since they are an important part of artificial intelligence.
Artificial neural network models were originally inspired by biological neurons; the earliest such model is the perceptron [1]. A comprehensive treatment of the evolution of neural networks is beyond the scope of this section, but in its basic functionality, a perceptron takes several binary inputs and produces a single binary output, as illustrated in Fig. 3.1. The output can be computed using the following expression:
$$\mathrm{output} = \begin{cases} 0, & \sum_{i} w_i x_i \le \theta_0 \\ 1, & \sum_{i} w_i x_i > \theta_0 \end{cases} \tag{3.1}$$
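As a minimal sketch (our own illustration, not code from the text), Eq. (3.1) can be evaluated directly with NumPy:

import numpy as np

def perceptron_output(x, w, theta0):
    # Eq. (3.1): fire (output 1) only when the weighted sum exceeds theta0
    return int(np.dot(w, x) > theta0)

print(perceptron_output(np.array([1, 0, 1]), np.array([0.4, 0.4, 0.4]), 0.5))  # prints: 1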
The model can then be compiled and trained on the input data:
shallownet.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
shallownet.fit(train_images, train_labels, epochs=5, batch_size=128)

plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')
The above simple model gives a test accuracy of 86.23% (Fig. 3.2). The utility of Keras is that it is possible to quickly adjust hyperparameters to improve on the test accuracy. As an example, increasing the network size to 512 units, recompiling, and changing the training batch size to 128 increases the accuracy to 98.15%!
# Define the network model by adding two Dense layers, with the network size increased to 512
shallownetwork = models.Sequential()
shallownetwork.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
shallownetwork.add(layers.Dense(10, activation='softmax'))

# Compile the model
shallownetwork.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the network on the MNIST training dataset with an increased batch size of 128
history = shallownetwork.fit(train_images, train_labels, epochs=10, batch_size=128,
                             validation_data=(test_images, test_labels))
Fig. 3.2 Training and validation accuracy (network size 4, batch size 64)
3.1 Introduction: Neural Networks Basics 59
Figure 3.3 shows that for this shallow model, over-fitting starts after the first epoch, as indicated by the almost flat validation accuracy.
The perceptron model can be extended to multiple hidden layers of perceptrons to produce complex decisions, as shown in Fig. 3.4; this structure is often referred to as a multilayer perceptron (MLP) in the literature.
Fig. 3.3 Training and validation accuracy (network size 512, batch size 128)
The CNN is one of the most successful models used in deep neural networks, especially in image processing and computer vision. Taking a little deviation into history, deep learning networks differ from conventional neural networks in the number of node layers used, which brings in the concept of depth, and they can also have loops. Basically, neural networks normally have one to two hidden layers and are used for supervised prediction or classification. In contrast, deep learning networks can have several hidden layers with the possibility of unsupervised training. Figure 3.5 illustrates one example of such a network. Examples of widely used deep learning architectures include deep neural networks (DNNs), deep belief networks (DBNs), and recurrent neural networks (RNNs) [3, 4]. The main advantage of DNNs over traditional neural networks is the ability to learn complex tasks in an unsupervised manner. However, this advantage does not come without cost [5]: large amounts of training data are required for building the network, the high computational complexity is a big burden, the algorithms are difficult to analyze, and the output cannot be predicted precisely, among other challenges. For applications such as autonomous navigation, DNNs have a promising future, and integration into sensors like the automotive radar is currently under intensive research [6]. With advances in both computational power (GPUs/TPUs) and available resources (RAM/ROM) on the sensor devices, the realization of the so-called intelligent sensors is now possible.
For the interested reader, further details about DBMs and RNNs can be found in [7] and [8], respectively. It should be noted that RNNs have found greater success in natural language processing (NLP).
Coming back to the subject of this section, a convolutional neural network (CNN) is a neural network that uses at least one convolutional layer as part of the model. The construction of a CNN involves several layers between input and output, of which at least one is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers: convolutional layers, pooling layers, and fully connected/dense layers.
Convolutional layers apply convolution operations to their inputs to extract features of the input. Pooling operations are used to reduce the size of the convolutional layer outputs by either maximization or averaging; normally, the pooling is done over a 2 × 2 window. Fully connected layers usually come at the top of the network (close to the output) and are also sometimes referred to as dense layers.
CNNs have been successfully applied to computer vision, producing state-of-the-art performance in most applications.
Figure 3.6 illustrates the structure of a typical CNN architecture.
The typical CNN architecture shows the progression through convolution and pooling operations. The flattening operation produces a one-dimensional array that serves as input to the final fully connected top layers.
We continue with the MNIST data as a concrete example of how to implement a
CNN in Keras following [9].
# Example of CNN using the MNIST data set
# Import necessary packages
from keras import layers
from keras import models
# Define the CNN model with 3 convolution layers and 2 pooling layers
cnn_model = models.Sequential()
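The rest of the model definition is not reproduced above; a minimal sketch consistent with the comment (three convolution layers and two pooling layers, following the classic MNIST example in [9]) could look as follows:

cnn_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(10, activation='softmax'))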
Another popular type of neural network is the recurrent neural network, which has been used very successfully for applications such as natural language processing and speech recognition.
RNNs differ from CNNs in that they have memory, meaning that previous inputs have influence on the present input and output. We will not dwell much on RNNs in this text, but Fig. 3.9 gives a simplified visual illustration of how they work.
Fig. 3.7 Training and validation accuracy for the CNN model
Suffice it to say, Keras provides the SimpleRNN layer for model construction. Below is an example of an RNN with Keras.
# Import necessary packages
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

# Load the MNIST dataset from Keras
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Extract the number of labels
num_train_labels = len(np.unique(train_labels))

# Normalize data for training
train_images = train_images.reshape((60000, 28, 28))
train_images = train_images.astype("float32") / 255
Fig. 3.9 Illustration of a simplified RNN showing the rolled and unrolled representations
Fig. 3.10 Training and validation accuracy for the RNN model
With this simple 2-layer RNN model, a decent accuracy of 97.65% can be achieved
for the MNIST data (Figs. 3.10 and 3.11).
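The model definition itself is not shown above; a minimal sketch of such a 2-layer model (an assumption on our part, mirroring the LSTM construction presented below) is:

# Assumed 2-layer SimpleRNN model: one recurrent layer plus a softmax output
rnn_model = Sequential()
rnn_model.add(SimpleRNN(256, input_shape=(28, 28)))
rnn_model.add(Dense(num_train_labels, activation='softmax'))
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])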
The LSTM and GRU layers are designed to solve the vanishing-gradient problem that makes the SimpleRNN unsuitable for most practical problems [10]. This is achieved by carrying information from earlier time steps forward, using some form of forgetting factors, which considerably mitigates the vanishing-gradient problem. GRUs operate on the same principle as LSTMs, except that an LSTM uses three gates, namely the input, output, and forget gates, while a GRU requires only two gates, the reset and update gates. The choice between the two involves a trade-off between accuracy and computational complexity, with LSTM generally expected to provide higher accuracy [11, 12].
Employing the same approach as for the SimpleRNN model, we compare the
LSTM and GRU models built from Keras layers. We start with the LSTM model.
# Create LSTM model with 256 units
lstm_model = Sequential()
lstm_model.add(layers.LSTM(256, input_shape=(28, 28)))
lstm_model.add(Dense(num_train_labels, activation='softmax'))
lstm_model.summary()
This model has a total of 294,410 trainable parameters (Fig. 3.12).
# Train the LSTM model with a batch size of 128 and 20 epochs
lstm_model.compile(loss='categorical_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy'])
history = lstm_model.fit(train_images, train_labels,
                         epochs=20, batch_size=128,
                         validation_data=(test_images, test_labels))
Fig. 3.14 Training and validation accuracy for the LSTM model
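The GRU model, whose definition is not reproduced above, can be constructed analogously (a sketch assuming the same 256-unit configuration as the LSTM):

# Assumed GRU model mirroring the LSTM setup
gru_model = Sequential()
gru_model.add(layers.GRU(256, input_shape=(28, 28)))
gru_model.add(Dense(num_train_labels, activation='softmax'))
gru_model.summary()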
# Train the GRU model with a batch size of 128 and 20 epochs
gru_model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
history = gru_model.fit(train_images, train_labels,
                        epochs=20, batch_size=128,
                        validation_data=(test_images, test_labels))
Fig. 3.18 Training and validation accuracy for the GRU model
Comparing the above results obtained under similar conditions, it can be observed that the LSTM model achieves an average time of 70 s per epoch with a validation accuracy of 98.92% (Fig. 3.15). The GRU model achieves an average time of 55 s per epoch with an accuracy of 98.81% (Fig. 3.19), which means the LSTM is 0.11% better in this example. As stated above, this improvement comes at a computational expense, as reflected in the execution time. As shown in Figs. 3.14 and 3.18, the two models quickly achieve high accuracy in the first 5 epochs, after which over-fitting becomes visible. With the addition of more layers and hyperparameter tuning, further improvements can generally be achieved for any model, as will be seen in the next chapters.
Keras is a widely used Python framework for machine learning and deep neural network applications due to its intuitive logical flow, the ease with which one can get started, and its richness in ready-to-use packages. With very few lines of code, model evaluation on benchmark and new datasets can be accomplished efficiently. We will briefly explore the Keras framework here, but further details and the latest developments can be found at https://keras.io/.
The Keras API reference consists of the Models API, Layers API, Callbacks API, optimizers, metrics, applications, and many other utilities that greatly reduce the effort from concept to tangible results for engineers and scientists from various backgrounds and fields. The workflow can be reduced to three main steps: (1) define the model, (2) compile the model, and (3) evaluate the model. By continuously refining step (1), rapid evaluation of models is possible. Keras is also compatible with Ubuntu, Windows, and macOS, thus making it available to a wide audience. Among other characteristics, it can run on both CPU and GPU platforms.
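As a minimal sketch of these three steps (our own illustration, with an arbitrary toy architecture):

from keras import models, layers

# (1) Define the model
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(20,)))
model.add(layers.Dense(2, activation='softmax'))

# (2) Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# (3) Evaluate the model (after training it with model.fit)
# loss, accuracy = model.evaluate(x_test, y_test)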
3.3 Concluding Remarks

In this chapter, we provided a concise introduction to building deep learning models with practical examples. We discussed the distinctions between shallow and deep neural networks and demonstrated how to implement them using the Keras framework. Some of the popular deep learning architectures, namely CNNs and RNNs, were also illustrated. In the end, we provided some background on why it makes sense to start with Keras as a framework for building and evaluating deep neural networks.
1. Explain the concept of shallow networks and their limitations. Can shallow
networks be used for complex tasks such as image classification or natural
language processing?
2. What are Convolutional Neural Networks (CNNs)? How do they differ from fully
connected neural networks? Explain the architecture of a typical CNN.
3. Describe Recurrent Neural Networks (RNNs) and their ability to model sequential data. What are some limitations of standard RNNs, and how do Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks address these limitations?
4. Explain the Keras API and its advantages for building deep learning models. What
are some features of the Keras API that make it popular among developers?
5. Give an example of building a deep learning model using the Keras API. Describe
the steps involved in building a CNN or RNN using Keras, including data
preparation, model definition, training, and evaluation.
References
1. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Inc.,
New York
2. Deep-Learning-Models. https://github.com/sn-code-inside/Deep-Learning-Models
3. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18
4. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
5. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural
networks. Science 313(5786):504–507
6. Wheeler TA, Holder MF, Winner H, Kochenderfer MJ (2017) Deep stochastic radar models.
IEEE Intell Veh Symp IV
7. Salakhutdinov R, Hinton GE (2009) Deep Boltzmann machines. In: AISTATS, pp 448–455
8. Graves A, Mohamed A, Hinton GE (2013) Speech recognition with deep recurrent neural
networks. In: ICASSP, pp 6645–6649
9. Chollet F (2018) Deep learning with Python. Manning Publications Co.
10. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780
11. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y
(2014) Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In: Proceedings of the 2014 conference on empirical methods in natural language
processing, pp 1724–1734
12. Cahuantzi R, Chen X, Güttel S (2021) A comparison of LSTM and GRU networks for learning
symbolic sequences. https://arxiv.org/abs/2107.02248
13. Kaggle (2022) State of data science and machine learning 2022. https://www.kaggle.com/kag
gle-survey-2022
Chapter 4
The Building Blocks of Machine Learning and Deep Learning
4.1 Introduction
In this chapter, we take a look at the three main categories of machine learning and then move on to explore how machine learning models can be evaluated. The various metrics commonly used are explained. After that, we briefly address the important topic of data preprocessing, followed by standard methods of evaluating machine learning models. One of the reasons why most models fail to perform on unseen data is the problem of overfitting. We take a look at this problem and outline some of the strategies that can be applied to overcome it. The next topic is a discussion of the workflow for machine learning or deep learning. The chapter ends with concluding remarks to recap the covered topics.
4.3 Methods of Evaluating Machine Learning Models

The first step in evaluating machine learning models, after collecting the dataset, is to decide on the split or proportion of the dataset that will be used for the training, validation, and testing phases. In most algorithms, it is possible to first split the data into training and test datasets and then use a percentage of the training set for validation.
The training dataset is used to fit the model parameters in order to maximize prediction performance. The validation dataset is used to evaluate the model performance during the training phase in order to aid tuning of the model hyperparameters. The test dataset is used to evaluate the model produced during the training phase and should be completely separate from the training dataset.
An example of splitting the data for computer vision applications is to use a combination of StratifiedShuffleSplit from scikit-learn with ImageDataGenerator from Keras to first create training and test datasets and then partition the training data into training and validation portions.
Step 1: Import libraries and define the generators.

from keras.preprocessing.image import ImageDataGenerator

# Training generator with rescaling and a validation split.
# Only the tail of this call survived in the original listing;
# the settings shown here are assumptions consistent with the
# valid_generator below. TRAIN_DIR, TEST_DIR, BATCH_SIZE, and
# CLASS_MODE are assumed to be defined earlier.
train_gen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.3
)
train_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='training',
    color_mode='rgb',
    shuffle=True,
    seed=71
)
valid_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='validation',
    color_mode='rgb',
    shuffle=True,
    seed=71
)
# Test generator for evaluation purposes (only rescaling applied)
test_gen = ImageDataGenerator(
    rescale=1./255
)
test_generator = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=71
)
• Precision
It is the fraction of true positives among all samples classified as positive. It is also referred to as the positive predictive value (PPV).
• Recall
It is the fraction of true positives among all actual positive samples. It is also referred to as the true positive rate (TPR) or sensitivity.
• Specificity
It is the fraction of true negatives among all actual negative samples, also known as the true negative rate (TNR).
• F1-score
It is the harmonic mean of precision and recall, weighting the two equally.
• F2-score
The F2-score is a weighted average of recall and precision that gives more weight to recall than precision.
• Confusion Matrix
It is a table that summarizes, for each class, how many samples were correctly and incorrectly classified.
• Receiver Operating Characteristic (ROC) Curve
It is a plot of recall against the false positive rate (1 − specificity). It is used to judge the optimality of the model and has origins in radar processing, where the false positive rate is known as the probability of false alarm.
• Area under the ROC curve (AUC)
It is used to measure model performance based on the area under the ROC curve. It falls between 0 and 1, and values greater than 0.5 are desirable because 0.5 represents a random guess.
The above metrics are well-known and widely used. In addition, most of them can
be easily imported from the sklearn.metrics module. Below is an example of
how this can be achieved in a single line of code.
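For instance (the particular metrics imported here are our own choice):

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score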
The input data used for machine learning or deep learning comes in various formats such as text, images, and even videos. Before feeding this data into a deep learning model, it is necessary to put it into a format that makes the task of training tractable. This means that, besides denoising, the data will need to be vectorized and normalized as part of preprocessing. In Chap. 2, we briefly discussed some of the data structures that can be handled in machine learning algorithms. Normalization is especially important in image data processing, where the (0, 255) pixel range is transformed to the (0, 1) range used by most machine learning models. As can be seen in Sect. 4.3 of this chapter, normalization is incorporated into the ImageDataGenerator for this purpose.
Overfitting happens when the model performance on validation data stops improving compared to the performance on the training data. This is usually seen as the validation accuracy remaining constant while the training accuracy continues to approach 100%. Looking at it from the loss function side, the training loss decreases with each epoch while the validation loss stops decreasing or, even worse, increases. This behavior is an indication of poor generalization of the model to unseen data. Fighting overfitting is a common problem in machine learning, including deep learning. There are various strategies that can be considered to tackle the overfitting problem (Fig. 4.3).
Overfitting happens because the model fails to generalize to new or unseen data, and the simplest and most effective solution is to collect more data. However, this is not always possible, so we have to improve the situation with the limited data available. The way out of this problem is to employ methods such as regularization. To understand what is happening when overfitting occurs, it is instructive to imagine trying to fit noisy data to a quadratic function. The data will obviously contain outliers. With overfitting, the model tries to approximate a function that passes through all the data points, including the outliers. This is overfitting because the resulting function is only good for this particular dataset. The consequence is that if we get new data with the same quadratic behavior but different outliers, then our approximation will not fit properly. In the absence of additional data to smooth out the outliers, regularization is our next best solution because it will try to control the large swings in the approximating weights, thereby making generalization to new data possible.
When using Keras, the following regularization techniques can be applied, as sketched in the example after this list:
Layer weight regularization—There are three forms of regularizers, namely kernel_regularizer, where a penalty is applied on the layer's kernel; bias_regularizer, where a penalty is applied on the layer's bias; and activity_regularizer, where a penalty is applied on the layer's output. For all three, L1, L2, or a combination of L1 and L2 (L1_L2) can be used.
Dropout—Network nodes are randomly selected and removed during training in order to reduce the network complexity.
Network capacity reduction—Network units define the size of the output of the layer; therefore, a reduction in capacity will lead to fewer parameters and thereby increase the ability to generalize. Moreover, a network with a large number of parameters can be thought of as memorizing a large volume of data while performing poorly when required to make decisions on new data.
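A minimal sketch of the first two techniques, assuming an existing Sequential model named model (our own example):

from keras import layers, regularizers

# L2 penalty on the layer's kernel weights (layer weight regularization)
model.add(layers.Dense(512, activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)))
# Randomly drop 50% of the units during training (dropout)
model.add(layers.Dropout(0.5))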
For completeness' sake, underfitting is also a problem, occurring when the model performs well on neither the training data nor unseen data. It also leads to poor generalization, but it may be indicative of poor model selection or untrainable data. These kinds of problems must be solved before training starts.
4.4 The Machine Learning Workflow

Up to now we have not given any guidelines on how to attack a machine learning problem from the start. Here we explain the steps involved in the machine learning/deep learning workflow (Fig. 4.4).
Problem Definition
The first step is to define the problem at hand in terms of the required data and what we are trying to achieve as output. At this stage, it is good practice to decide whether the problem will be binary, multiclass, multilabel, etc. Most problems have an application domain with vast examples in the literature. It is advisable to make a survey of available approaches, among other things.
Data Collection
Data acquisition is one of the most tedious and time-consuming parts of the workflow. The data must be large enough to be representative of the problem under analysis. As previously stated in the overfitting section, lack of sufficient data contributes to lack of model generalization when the model is deployed on new data. So, how much data is enough? There is no straight answer to this question, but a moderate deep learning problem would require 10,000 to 100,000 data samples. On the other hand, more complex problems like machine translation would require up to one million samples. A general rule of thumb for computer vision is to collect at least 1000 data samples per class. When enough data is not available, methods of generating artificial data such as data augmentation can be implemented, as sketched below.
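For instance, the Keras ImageDataGenerator used earlier in this chapter can generate such artificial variations (a minimal sketch; the transformation settings are our own choice):

from keras.preprocessing.image import ImageDataGenerator

# Each generated batch contains randomly rotated, zoomed, and flipped copies
augmenter = ImageDataGenerator(rotation_range=20,
                               zoom_range=0.1,
                               horizontal_flip=True)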
Fig. 4.4 Illustration of the machine learning workflow from preparation to deployment
Data Preparation
Having collected enough data, the next step would be to transform the data into
machine-consumable format. This is where vectorization and normalization can be
applied before inputting the data to the model.
Defining Performance Measures
As described in the previous sections, how to measure performance should be decided before the models are selected. This is where metrics like precision, recall, accuracy, etc. come into the picture. Leaving the decision on metrics to later stages will result in wasted effort and time, and re-evaluation of the model performance may be necessary. Given that most deep learning models require a lot of time in terms of epochs per run, setting metrics from the start will help reduce the chances of doing this task repeatedly.
Model Selection
With the problem defined and the data available, including performance measures, the task of deciding on the model comes into place. There is no formula for this task, but it is always good to start from a simple model with a few layers and few units and increase the complexity until no gain in performance can be realized.
Train Model
Training is the critical part of the whole process, as it is at this point that we start to see the level of difficulty of the task at hand. During training, performance metrics can be monitored with tools like TensorBoard, and a decision on whether to keep the current model can be made as quickly and as early as possible.
Model Evaluation
If the model runs to the end, the remaining thing to do is to evaluate the model performance against benchmarks or target values. If, for example, it is required to achieve 99% accuracy but the model reaches only 70%, then it will be better to change the model or adjust the hyperparameter space. At this stage, we may decide to abandon the model or choose alternative performance measures.
Hyperparameter Tuning
The hyperparameters of the model can be tuned to achieve a certain level of performance. This includes changing the optimizer, the learning rate, etc., and including measures for reducing overfitting if this is a problem. After hyperparameter tuning, we then retrain the model to see if there are any gains to be achieved. By repeating this experimentation phase several times, we end up with the best model for the given data.
Deployment
When we are satisfied with the performance of the model on unseen data, it can be deployed into use.
Maintenance
In the maintenance phase, we keep checking the real performance of the model to
decide on whether additional data acquisition would be required.
1. What are the three main categories of machine learning? Explain the difference
between supervised, unsupervised, and reinforcement learning.
2. What are some common metrics used to evaluate machine learning models?
Describe the differences between accuracy, precision, recall, F1-score, and AUC-
ROC.
3. What is overfitting in machine learning? Why is it a problem, and how can it
be addressed? Describe some techniques for preventing overfitting in machine
learning models.
4. What is the typical workflow for building a machine learning model? Describe
the steps involved, including data preparation, feature selection, model selection,
hyperparameter tuning, and model evaluation.
5. Give an example of building a machine learning or deep learning model using a
specific framework or library (e.g., scikit-learn, TensorFlow, PyTorch). Describe
the steps involved in building the model, including data preparation, feature
selection, model definition, training, and evaluation.
Chapter 5
Remote Sensing Example for Deep Learning
5.1 Introduction
Recently, remote sensing has become heavily dependent on machine learning algorithms such as decision trees, random forests, support vector machines, and artificial neural networks. However, there is an increasing recognition that deep learning, which has been applied successfully in other areas such as computer vision and language processing, is a viable alternative to traditional machine learning methods [1]. With the availability of high-resolution imagery, it is becoming more attractive to venture into deep learning as a key technology to achieve previously unimaginable classification accuracies [2, 3]. In this chapter, we will work through a specific example of the application of deep learning algorithms to one important area of remote sensing data analysis, namely land cover classification. Land cover and land use change analysis is of importance in many practical areas such as urban planning, environmental degradation monitoring, and disaster management [4, 5].
The main goal of this chapter is to provide a detailed understanding of the performance of various deep learning models applied to the problem of land cover classification, starting from a known dataset. We divide the presentation into five main parts, including preliminary information on the models covering input data restrictions, followed by exploration of the EuroSAT data contents, preprocessing steps, and performance evaluation results for several selected models in Sect. 5.3. Finally, we test the performance of the models on a new dataset to get a clear picture of the limitations of the presented approach in the face of unseen data in Sect. 5.4.
This application example assumes basic knowledge of the Python programming
language. There is an abundance of easy-to-follow material for this topic for readers
of all backgrounds, publicly available starting from python.org. We therefore assume the reader is familiar with Python syntax and with how to get needed solutions from platforms such as Stack Overflow. In addition, it is not the intention of this chapter to provide mathematical details of the inner workings of the algorithms behind the presented models. Having said this, this chapter is meant to give the interested reader good insight into the performance of the Keras APIs (models) that are available for land cover classification. The techniques introduced here can be extended, improved, and applied to a broad range of problems.
The EuroSAT dataset is obtained from the openly and freely accessible Sentinel-2 satellite images provided by the Copernicus Earth observation program. It has been demonstrated in [2] that the RGB bands of the Sentinel data give the best results in terms of accuracy. We will therefore only use the RGB dataset in this chapter. This does not in any way mean that the other bands cannot be used for classification.
The folder structure for algorithm evaluation is shown in Fig. 5.1. We also give a
flow diagram of the approach used to train and test the data in Fig. 5.2.
Fig. 5.1 EuroSAT data folder structure for training and testing
Models Evaluated:
ResNet50
ResNet101
ResNet152
VGG16
VGG19
NasNetLarge
NasNetMobile
EfficientNet B0
EfficientNet B1
EfficientNet B2
EfficientNet B3
EfficientNet B4
EfficientNet B5
Fig. 5.2 Example processing flow of the EuroSAT data for deep learning algorithm evaluation
EfficientNet B6
EfficientNet B7
Keras offers many models under its Applications API that can be used as a base on top of which upper layers, including dense layers, can be added. Using this approach, we check how the models perform on the publicly available EuroSAT dataset. For more details about the dataset and how it was collected, refer to https://github.com/phelber/EuroSAT. We briefly discuss the preliminary information related to the Keras models.
The full details and arguments for each model can be found at the following link: https://keras.io/api/applications/. Here we are only interested in highlighting the limitations imposed on the input data for each model at the time of writing.
NasNetLarge has the highest top-1 and top-5 accuracy; the top-1 and top-5 accuracy refer to the model's performance on the ImageNet validation dataset. However, there is an issue with earlier implementations of the model, as described below.
During training, we found it necessary to modify the model library file's (in Keras applications) input shape argument "require_flatten" by setting it to "False" before running the training. Without this modification, an error message like "ValueError: When setting `include_top = True` and loading `imagenet` weights, `input_shape` should be (331, 331, 3)." will be thrown for NasNetLarge even if the argument "include_top" is set to False. The argument "require_flatten" is set to "True" by default, hence the need to make this adjustment to avoid the bug.
For the EfficientNet models, however, the input argument is set to "require_flatten = include_top" by default, with the restriction that min_size = 32. On the other hand, the min_size restriction was not documented in the above API link at the time of writing.
ResNet50
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (224, 224, 3) (with ‘channels_last’ data format) or (3,
224, 224) (with ‘channels_first’ data format). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
ResNet101
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (224, 224, 3) (with ‘channels_last’ data format) or (3,
224, 224) (with ‘channels_first’ data format). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
ResNet152
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (224, 224, 3) (with ‘channels_last’ data format) or (3,
224, 224) (with ‘channels_first’ data format). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
VGG16
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (224, 224, 3) (with channels_last data format) or (3,
224, 224) (with channels_first data format). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
VGG19
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (224, 224, 3) (with channels_last data format) or (3,
224, 224) (with channels_first data format). It should have exactly 3 inputs channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
NasNetLarge
input_shape: Optional shape tuple, only to be specified if include_top is False (other-
wise the input shape has to be (331, 331, 3) for NasNetLarge. It should have exactly
3 input channels, and width and height should be no smaller than 32, e.g., (224, 224,
3) would be one valid value.
5.4.2.4 NasNetMobile
5.4.2.5 EfficientNet B0 to B7
Below we give a visual summary of the results obtained by running the above models as convolution bases under similar settings for each class of models and using the same input shape (64 × 64 × 3). The first part of the simulation used a training–test split of 70/30, while the latter half used 80/20.
Set the EuroSAT dataset path and extract the labels. There are 10 classes, namely AnnualCrop, Pasture, PermanentCrop, Residential, Industrial, River, SeaLake, HerbaceousVegetation, Highway, and Forest.
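A sketch of this step (the root folder name is an assumption based on the folder structure in Fig. 5.1):

import os

DATASET = './EuroSAT'  # assumed root folder with one subfolder per class
LABELS = sorted(os.listdir(DATASET))
print(LABELS)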
Next, we plot the class distributions of the EuroSAT dataset. There are a total of 27,000 images distributed among the classes, as shown in Fig. 5.3.
Select 20 images arbitrarily from the whole dataset and show the classes to which they belong.
def plot_sat_imgs(paths):
    plt.figure(figsize=(15, 8))
    for i in range(20):
        plt.subplot(4, 5, i+1, xticks=[], yticks=[])
        img = PIL.Image.open(paths[i], 'r')
        plt.imshow(np.asarray(img))
        plt.title(paths[i].split('/')[-2])

plot_sat_imgs(img_paths)
Fig. 5.4 Samples arbitrarily selected from the dataset for visual inspection
Next, the data is split into training and test sets using stratified shuffle-split from scikit-learn. We also make use of the Keras ImageDataGenerator for data augmentation.
import os
import re
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from keras.preprocessing.image import ImageDataGenerator

# Execute this once to load split data into train and test folders respectively
data = {}
for l in LABELS:
    for img in os.listdir(DATASET + '/' + l):
        data.update({os.path.join(DATASET, l, img): l})

X = pd.Series(list(data.keys()))
y = pd.get_dummies(pd.Series(data.values()))

# The split itself is not shown in the original listing; a 70/30 stratified
# split (assumed from the description above) can be obtained as follows
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=71)
train_idx, test_idx = next(sss.split(X, y))

train_paths = X[train_idx]
test_paths = X[test_idx]
The next sections provide details of the deep learning algorithms that will be used in the evaluation. We first start by importing all the necessary packages, followed by the definition of a generic function for model compilation, and then some functions to plot and visualize results. The ResNet framework model will be taken as an example to demonstrate the evaluation procedure. After that, results of other selected models will be presented.
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D, BatchNormalization
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg19 import VGG19
from tensorflow.keras.applications.resnet import ResNet50, ResNet101, ResNet152
from tensorflow.keras.applications import ResNet50V2, ResNet101V2, ResNet152V2
from tensorflow.python.keras import regularizers
# NASNet and EfficientNet imports (assumed; not shown in the original listing)
from tensorflow.keras.applications import NASNetLarge, NASNetMobile
from tensorflow.keras.applications import (EfficientNetB0, EfficientNetB1, EfficientNetB2,
                                           EfficientNetB3, EfficientNetB4, EfficientNetB5,
                                           EfficientNetB6, EfficientNetB7)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
We then define the generic function for model selection and compilation using
the following function.
conv_base = NASNetMobile(include_top=False,
                         weights='imagenet',
                         input_shape=input_shape)
top_model = conv_base.output
top_model = GlobalAveragePooling2D()(top_model)
top_model = Dense(2048, activation='relu')(top_model)
top_model = BatchNormalization()(top_model)
top_model = Dropout(0.2)(top_model)
top_model = Dense(2048, activation='relu')(top_model)
top_model = BatchNormalization()(top_model)
top_model = Dropout(0.2)(top_model)
weights="imagenet",
input_shape=input_shape)
elif cnn_base == 'EfficientNetB4':
conv_base = EfficientNetB4(include_top=False,
weights="imagenet",
input_shape=input_shape)
elif cnn_base == 'EfficientNetB5':
conv_base = EfficientNetB5(include_top=False,
weights="imagenet",
input_shape=input_shape)
elif cnn_base == 'EfficientNetB6':
5.4 Background of Experimental Comparison of Keras Applications Deep … 103
conv_base = EfficientNetB6(include_top=False,
weights="imagenet",
input_shape=input_shape)
else:
conv_base = EfficientNetB7(include_top=False,
weights="imagenet",
input_shape=input_shape)
top_model = conv_base.output
top_model = GlobalAveragePooling2D()(top_model)
top_model = Dense(2048, activation='relu')(top_model)
top_model = BatchNormalization()(top_model)
top_model = Dropout(0.2)(top_model)
top_model = Dense(2048, activation='relu')(top_model)
top_model = BatchNormalization()(top_model)
top_model = Dropout(0.2)(top_model)

if type(fine_tune) == int:
    # Freeze everything below the fine_tune index, train the rest
    for layer in conv_base.layers[fine_tune:]:
        layer.trainable = True
else:
    for layer in conv_base.layers:
        layer.trainable = False

model.compile(optimizer=optimizer, loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
return model
def plot_history(history):
    acc = history.history['categorical_accuracy']
    val_acc = history.history['val_categorical_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(acc)
    plt.plot(val_acc)
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    plt.subplot(1, 2, 2)
    plt.plot(loss)
    plt.plot(val_loss)
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()
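The callbacks referenced below are not defined in the excerpt shown here; a plausible configuration (an assumption on our part, consistent with the imports above) is:

# Assumed callback configuration
early_stop = EarlyStopping(monitor='val_categorical_accuracy', patience=10,
                           restore_best_weights=True)
checkpoint = ModelCheckpoint('resnet50_best.h5', monitor='val_categorical_accuracy',
                             save_best_only=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)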
N_STEPS = train_generator.samples // BATCH_SIZE
N_VAL_STEPS = test_generator.samples // BATCH_SIZE
N_EPOCHS = 100

train_generator.reset()
test_generator.reset()

resnet50_history = resnet50_model.fit(train_generator,
                                      steps_per_epoch=N_STEPS,
                                      epochs=N_EPOCHS,
                                      callbacks=[early_stop, checkpoint, reduce_lr],
                                      validation_data=test_generator,
                                      validation_steps=N_VAL_STEPS)
# Evaluate the model on test data and compute precision, recall,
# F-score, and the confusion matrix
class_indices = train_generator.class_indices
class_indices = dict((v, k) for k, v in class_indices.items())

test_generator_new = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=69
)

predictions = resnet50_model.predict(test_generator_new,
                                     steps=len(test_generator_new.filenames))
predicted_classes = np.argmax(np.rint(predictions), axis=1)
true_classes = test_generator_new.classes
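The metric computation itself is not shown above; a sketch using scikit-learn (our own completion) could be:

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

precision, recall, fscore, _ = precision_recall_fscore_support(true_classes, predicted_classes)
cm = confusion_matrix(true_classes, predicted_classes)
print(precision, recall, fscore)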
By repeating the steps used for the ResNet50 model, all the other models can be evaluated similarly. We present some selected model evaluation results below.
Fig. 5.5 Training and validation loss and accuracy for ResNet50 model
There is a steady increase in both training and validation accuracy, with early stopping triggered by the lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.
Confusion Matrix
See Fig. 5.7.
Fig. 5.6 Precision, recall, and F-score of each of the classes for ResNet50 model
Fig. 5.7 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Fig. 5.8 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The
Highway, Pasture, and SeaLake classes show the lowest accuracy of 94%, while the Forest and
Residential can be classified with 100% accuracy.
ResNet101
Training/Validation Accuracy
See Fig. 5.9.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after a high initial validation loss.
Fig. 5.9 Training and validation loss and accuracy for ResNet101 model
Fig. 5.10 Precision, recall, and F-score of each of the classes for ResNet101 model
Confusion Matrix
See Fig. 5.11.
Training/Validation Accuracy
See Fig. 5.13.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.
Fig. 5.11 Confusion matrix showing number of hits for each of the classes for ResNet101 model
Fig. 5.12 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
Pasture class shows the lowest accuracy of 93%, while the Forest and Residential can be classified
with 100% accuracy
Fig. 5.13 Training and validation loss and accuracy for ResNet152 model
Confusion Matrix
See Fig. 5.15.
Fig. 5.14 Precision, recall, and F-score of each of the classes for ResNet152 model
Confusion Matrix
(Rows: true class; columns: predicted class, abbreviated in the same order as the rows: AC = AnnualCrop, F = Forest, HV = HerbaceousVegetation, H = Highway, I = Industrial, P = Pasture, PC = PermanentCrop, R = Residential, Ri = River, SL = SeaLake.)

                        AC     F    HV     H     I     P    PC     R    Ri    SL
AnnualCrop            1151     1     0     4     0     6    44     0     5     0
Forest                   0  1202     0     0     0     0     0     2     0     0
HerbaceousVegetation     9    19  1093     2     1     5    36    31     0     0
Highway                 15     0     5  1135    10     5     5    16    23     0
Industrial               0     0     0     1   720     0     0    28     1     0
Pasture                  6    25     8     0     1   553     5     0     2     0
PermanentCrop            3     1    15     2     9     0   799    12     1     0
Residential              0     0     0     0     0     0     0   993     0     0
River                    6     1     1    18     1     0     1     1   829     0
SeaLake                 12    37     1     0     0     1     0     0     4   959
Fig. 5.15 Confusion matrix showing number of hits for each of the classes for ResNet152 model
Fig. 5.16 Confusion matrix showing ratio of hits for each of the classes for ResNet152 model. The
HerbaceousVegetation class shows the lowest accuracy of 91% while the Forest and Residential
can be classified with 100% accuracy
Fig. 5.17 Training and validation loss and accuracy for VGG16 model
Training/Validation Accuracy
See Fig. 5.17.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 40th epoch due to lack of improvement in validation categorical accuracy. The loss gradually decreases to below 0.2.
Confusion Matrix
See Fig. 5.19.
Training/Validation Accuracy
See Fig. 5.21.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss gradually decreases to below 0.25.
Fig. 5.18 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.19 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.20 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
AnnualCrop class shows the lowest accuracy of 93% while the Residential class can be classified
with 100% accuracy
Fig. 5.21 Training and validation loss and accuracy for VGG19 model
Confusion Matrix
See Fig. 5.23.
Fig. 5.22 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.23 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.24 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop class shows the lowest accuracy of 94% while the Forest and Residential can be
classified with 100% accuracy
Training/Validation Accuracy
See Fig. 5.25.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 85th epoch due to lack of improvement in validation categorical accuracy. The loss remains close to zero after the initial large swings.
Confusion Matrix
See Fig. 5.27.
Fig. 5.25 Training and validation loss and accuracy for NasNetLarge model
Fig. 5.26 Precision, recall, and F-score of each of the classes for NasNetLarge model
Training/Validation Accuracy
See Fig. 5.29.
There is a steady increase in training accuracy. Over-fitting can be observed immediately after the 1st epoch. Early stopping comes in as early as the 10th epoch due to lack of improvement in validation categorical accuracy.
Fig. 5.27 Confusion matrix showing number of hits for each of the classes for NasNetLarge model
Fig. 5.28 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The Highway and Pasture classes show the lowest accuracy of 95%, while the Forest and Residential
can be classified with 100% accuracy
Fig. 5.29 Training and validation loss and accuracy for NasNetMobile model
Confusion Matrix
See Fig. 5.31.
Fig. 5.30 Precision, recall, and F-score of each of the classes for NasNetMobile model
Fig. 5.31 Confusion matrix showing number of hits for each of the classes for NasNetMobile
model
Training/Validation Accuracy
See Fig. 5.33.
There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.
Fig. 5.32 Confusion matrix showing ratio of hits for each of the classes for NasNetMobile model.
The Highway class shows the lowest accuracy of 35%, while the AnnualCrop class shows the
highest accuracy of 95%
Fig. 5.33 Training and validation loss and accuracy for EfficientNet B0 model
Confusion Matrix
See Fig. 5.35.
Fig. 5.34 Precision, recall, and F-score of each of the classes for EfficientNet B0 model
Fig. 5.35 Confusion matrix showing number of hits for each of the classes for EfficientNet B0
model
Fig. 5.36 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B0 model.
The Highway class shows the lowest accuracy of 36%, while the Residential class shows the highest
accuracy of 98%
EfficientNet B1
Training/Validation Accuracy
See Fig. 5.37.
There is an initial steady increase in both training and validation accuracy. Over-fitting begins to show after the 80th epoch, where the accuracy becomes unstable and large swings in validation accuracy are evident. The loss is small and close to zero after the initial dip.
Fig. 5.37 Training and validation loss and accuracy for EfficientNet B1 model
Fig. 5.38 Precision, recall, and F-score of each of the classes for EfficientNet B1 model
Confusion Matrix
See Fig. 5.39.
EfficientNet B2
Training/Validation Accuracy
See Fig. 5.41.
A gradual increase in validation accuracy can be observed. Training terminates at
the 30th epoch due to lack of improvement in accuracy. The loss remains close to
zero after the initial dip.
Fig. 5.39 Confusion matrix showing the number of hits for each of the classes for EfficientNet B1
model
Fig. 5.40 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B1 model.
The PermanentCrop class shows the lowest accuracy of 53% while the Residential class shows the
highest accuracy of 100%
Fig. 5.41 Training and validation loss and accuracy for EfficientNet B2 model
Confusion Matrix
See Fig. 5.43.
Fig. 5.42 Precision, recall, and F-score of each of the classes for EfficientNet B2 model
Fig. 5.43 Confusion matrix showing the number of hits for each of the classes for EfficientNet B2
model
Fig. 5.44 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B2 model.
The PermanentCrop class shows the lowest accuracy of 42% while the Forest class shows the highest
accuracy of 96%
EfficientNet B3
Training/Validation Accuracy
See Fig. 5.45.
Wild swings in validation accuracy can be observed. However, the loss remains
close to zero after the initial dip.
Fig. 5.45 Training and validation loss and accuracy for EfficientNet B3 model
Fig. 5.46 Precision, recall, and F-score of each of the classes for EfficientNet B3 model
The highest precision and recall were obtained for the Residential class (99.5% recall,
90.8% precision), followed by the SeaLake class (96.7% recall, 98.7% precision). The
Forest class achieves a decent performance of 95.8% recall and 97.0% precision. The
global F2-score, not shown here, is estimated to be 90.5%.
Confusion Matrix
See Fig. 5.47.
EfficientNet B4
Training/Validation Accuracy
See Fig. 5.49.
There is a steady increase in both training and validation accuracy with initial
fluctuations. Over-fitting begins to show after the 80th epoch, where the accuracy
becomes unstable. The loss is small and close to zero after the initial dip.
Fig. 5.47 Confusion matrix showing the number of hits for each of the classes for EfficientNet B3
model
Fig. 5.48 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B3 model.
The Highway class shows the lowest accuracy of 81%, while the Residential class shows the highest
accuracy of 99%
Fig. 5.49 Training and validation loss and accuracy for EfficientNet B4 model
Confusion Matrix
See Fig. 5.51.
Fig. 5.50 Precision, recall and F-score of each of the classes for EfficientNet B4 model
Fig. 5.51 Confusion matrix showing the number of hits for each of the classes for EfficientNet B4
model
Fig. 5.52 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B4 model.
The River class shows the lowest accuracy of 71% while the Residential class shows the highest
accuracy of 98%
EfficientNet B5
Training/Validation Accuracy
See Fig. 5.53.
There is a steady increase in both training and validation accuracy with shallow
initial fluctuations. Over-fitting begins to show after the 80th epoch. The loss is small
and close to zero after the initial dip.
Confusion Matrix
See Fig. 5.55.
Fig. 5.53 Training and validation loss and accuracy for EfficientNet B5 model
Fig. 5.54 Precision, recall, and F-score of each of the classes for EfficientNet B5 model
EfficientNet B6
Training/Validation Accuracy
See Fig. 5.57.
There is a steady increase in both training and validation accuracy. Over-fitting
begins to show after the 80th epoch.
Fig. 5.55 Confusion matrix showing the number of hits for each of the classes for EfficientNet B5
model
Fig. 5.56 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B5 model.
The Highway class shows the lowest accuracy of 78% while the Residential class shows the highest
accuracy of 99%
Fig. 5.57 Training and validation loss and accuracy for EfficientNet B6 model
Confusion Matrix
See Fig. 5.59.
Fig. 5.58 Precision, recall, and F-score of each of the classes for EfficientNet B6 model
Fig. 5.59 Confusion matrix showing the number of hits for each of the classes for EfficientNet B6
model
Fig. 5.60 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B6 model.
The Pasture class shows the lowest accuracy of 75%, while the Residential class shows the highest
accuracy of 100%
EfficientNet B7
Training/Validation Accuracy
See Fig. 5.61.
There is a steady increase in training accuracy, and validation accuracy becomes
stable after initial fluctuations. Over-fitting begins to show after the 50th epoch. The
same trend can be observed for the loss function.
Confusion Matrix
See Fig. 5.63.
Fig. 5.61 Training and validation loss and accuracy for EfficientNet B7 model
Fig. 5.62 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
Fig. 5.63 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7
model
Fig. 5.64 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The PermanentCrop class shows the lowest accuracy of 88% while the Forest and Residential classes
show the highest accuracy of 99%
We found that by adding BatchNormalization layers after the top Dense layers, the global
mean recall could be improved from 92.8 to 96.4% (+3.6%) and the global F2-score
from 93.0 to 96.7% (+3.7%). In addition, the highest precision and recall
were obtained for the Residential class (99.8% recall, 96.5% precision), a noticeable
improvement from (99.4% recall, 91.2% precision), followed by the Forest class (99.3%
recall, 95.7% precision), up from (99.0% recall, 92.0% precision). The results are
shown below.
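The modification itself is small. The following is a minimal sketch of a transfer-learning head with BatchNormalization inserted after the top Dense layer, assuming TensorFlow 2.x; the layer sizes and the frozen base are illustrative assumptions, not the exact configuration used in the experiments.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB7

# Pre-trained convolutional base, kept frozen while the head is trained.
base = EfficientNetB7(weights="imagenet", include_top=False,
                      input_shape=(64, 64, 3), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(2048, activation="relu"),
    layers.BatchNormalization(),             # the added normalization layer
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),  # 10 EuroSAT classes
])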
EfficientNet B7
Training/Validation Accuracy
See Fig. 5.65.
Fig. 5.65 Training and validation loss and accuracy for EfficientNet B7 model
Confusion Matrix
See Fig. 5.67.
Fig. 5.66 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
Fig. 5.67 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7
model
Fig. 5.68 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The Pasture class shows the lowest accuracy of 94% while the Residential class achieves the highest
accuracy of 100%
In short, with some slight modifications to the model, great benefits in performance
can be achieved, as shown above for the EfficientNet B7 model. This is also
applicable to other models and illustrates the advantage of the Keras framework for
experimental modeling when quick confirmations and decisions have to be made. In
fact, we checked the effect of similar changes on the ResNet50, ResNet101, VGG16,
VGG19, and NasNetLarge models. The results are shown below.
ResNet50
Training/Validation Accuracy
See Fig. 5.69.
Confusion Matrix
See Fig. 5.71.
Fig. 5.69 Training and validation loss and accuracy for ResNet50 model
Fig. 5.70 Precision, recall, and F-score of each of the classes for ResNet50 model
ResNet101
Training/Validation Accuracy
See Fig. 5.73.
Fig. 5.71 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Fig. 5.72 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The
PermanentCrop class shows the lowest accuracy of 95% while the Forest and Residential can be
classified with 100% accuracy
Fig. 5.73 Training and validation loss and accuracy for ResNet101 model
Fig. 5.74 Precision, recall, and F-score of each of the classes for ResNet101 model
Confusion Matrix
See Fig. 5.75.
VGG16
Training/Validation Accuracy
See Fig. 5.77.
Fig. 5.75 Confusion matrix showing number of hits for each of the classes for ResNet101 model
Fig. 5.76 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
HerbaceousVegetation class shows the lowest accuracy of 95% while the Forest and Residential
can be classified with 100% accuracy
Fig. 5.77 Training and validation loss and accuracy for VGG16 model
Confusion Matrix
See Fig. 5.79.
VGG19
Training/Validation Accuracy
See Fig. 5.81.
Fig. 5.78 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.79 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.80 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
Highway class shows the lowest accuracy of 96% while the Residential class can be classified with
100% accuracy
Fig. 5.81 Training and validation loss and accuracy for VGG19 model
Confusion Matrix
See Fig. 5.83.
Fig. 5.82 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.83 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.84 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop, HerbaceousVegetation, Industrial, Pasture, and PermanentCrop classes have the lowest
accuracy of 97% while the Forest and Residential can be classified with 100% accuracy
NasNetLarge
Training/Validation Accuracy
See Fig. 5.85.
Confusion Matrix
See Fig. 5.87.
Fig. 5.85 Training and validation loss and accuracy for NasNetLarge model
Fig. 5.86 Precision, recall, and F-score of each of the classes for NasNetLarge model
Changing the train–test split is one way to improve the validation accuracy. In
this case, we change the split from 70–30 to 80–20 and check some of the top-performing
models so far. The results of this change are shown below.
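One convenient way to realize this change is the validation_split argument; the sketch below assumes the data are read with ImageDataGenerator, and the dataset directory path is an illustrative assumption.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# validation_split=0.2 gives the 80-20 split; 0.3 reproduces the earlier 70-30.
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "EuroSAT/2750",               # illustrative dataset path
    target_size=(64, 64), batch_size=128,
    class_mode="categorical", subset="training")

val_gen = datagen.flow_from_directory(
    "EuroSAT/2750",
    target_size=(64, 64), batch_size=128,
    class_mode="categorical", subset="validation")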
ResNet50
Training/Validation Accuracy
See Fig. 5.89.
Fig. 5.87 Confusion matrix showing number of hits for each of the classes for NasNetLarge model
Fig. 5.88 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The PermanentCrop class shows the lowest accuracy of 94% while the Forest and Residential can
be classified with 100% accuracy
Confusion Matrix
See Fig. 5.91.
Fig. 5.89 Training and validation loss and accuracy for ResNet50 model
Fig. 5.90 Precision, recall, and F-score of each of the classes for ResNet50 model
ResNet101
Training/Validation Accuracy
See Fig. 5.93.
Confusion Matrix
See Fig. 5.95.
Fig. 5.91 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Fig. 5.92 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model.
The AnnualCrop class shows the lowest accuracy of 95% while the Forest and Residential can be
classified with 100% accuracy
Fig. 5.93 Training and validation loss and accuracy for ResNet101 model
Fig. 5.94 Precision, recall, and F-score of each of the classes for ResNet101 model
Fig. 5.95 Confusion matrix showing number of hits for each of the classes for ResNet101 model
VGG16
Training/Validation Accuracy
See Fig. 5.97.
Confusion Matrix
See Fig. 5.99.
Fig. 5.96 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
Highway class shows the lowest accuracy of 94% while the Forest and Residential can be classified
with 100% accuracy
Fig. 5.97 Training and validation loss and accuracy for VGG16 model
Training/Validation Accuracy
See Fig. 5.101.
Confusion Matrix
See Fig. 5.103.
Fig. 5.98 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.99 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.100 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
AnnualCrop and Highway classes show the lowest accuracy of 96% while the Forest and Residential
classes can be classified with 100% accuracy
Fig. 5.101 Training and validation loss and accuracy for VGG16 model
Fig. 5.102 Precision, recall, and F-score of each of the classes for VGG16 model
VGG19
Training/Validation Accuracy
See Fig. 5.105.
Fig. 5.103 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.104 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
PermanentCrop class shows the lowest accuracy of 95% while the Forest and Residential classes
can be classified with 100% accuracy
Fig. 5.105 Training and validation loss and accuracy for VGG19 model
Confusion Matrix
See Fig. 5.107.
Fig. 5.106 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.107 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.108 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop class shows the lowest accuracy of 94% while the Forest and Residential can be
classified with 100% accuracy
Training/Validation Accuracy
See Fig. 5.109.
Confusion Matrix
See Fig. 5.111.
NasNetLarge
Training/Validation Accuracy
See Fig. 5.113.
Fig. 5.109 Training and validation loss and accuracy for VGG19 model
Fig. 5.110 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.111 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.112 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop and Highway classes show the lowest accuracy of 96% while the Forest and Residential
can be classified with 100% accuracy
Fig. 5.113 Training and validation loss and accuracy for NasNetLarge model
Confusion Matrix
See Fig. 5.115.
Fig. 5.114 Precision, recall, and F-score of each of the classes for NasNetLarge model
Fig. 5.115 Confusion matrix showing number of hits for each of the classes for NasNetLarge
model
Fig. 5.116 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The Pasture class shows the lowest accuracy of 94% while the Forest and Residential can be classified
with 100% accuracy
EfficientNet B7
Training/Validation Accuracy
See Fig. 5.117.
Confusion Matrix
See Fig. 5.119.
Fig. 5.117 Training and validation loss and accuracy for EfficientNet B7 model
Fig. 5.118 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
Fig. 5.119 Confusion matrix showing the number of hits for each of the classes for EfficientNet
B7 model
Fig. 5.120 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The Highway class shows the lowest accuracy of 95% while the Residential class shows the highest
accuracy of 100%
Kernel Regularization
Training/Validation Accuracy
See Fig. 5.121.
Fig. 5.121 Training and validation loss and accuracy after applying kernel regularization to VGG16
model with a capacity of 2048 units
Fig. 5.122 Precision, recall, and F-score of each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units
Confusion Matrix
See Fig. 5.123.
Fig. 5.123 Confusion matrix showing number of hits for each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units
Fig. 5.124 Confusion matrix showing ratio of hits for each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units. The AnnualCrop, Industrial, and
SeaLake classes show the lowest accuracy of 97% while the Forest and Residential classes can be
classified with 100% accuracy
Activity Regularization
Training/Validation Accuracy
See Fig. 5.125.
Confusion Matrix
See Fig. 5.127.
Training/Validation Accuracy
See Fig. 5.129.
Fig. 5.125 Training and validation loss and accuracy after applying activity regularization to
VGG16 model with a capacity of 2048 units
Fig. 5.126 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units
Confusion Matrix
See Fig. 5.131.
Fig. 5.127 Confusion matrix showing number of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units
Fig. 5.128 Confusion matrix showing ratio of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units. The HerbaceousVegetation class
shows the lowest accuracy of 97% while the Forest and Residential classes can be classified with
100% accuracy
Fig. 5.129 Training and validation loss and accuracy after applying activity regularization to
VGG16 model with a capacity of 1024 units
Fig. 5.130 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units
Fig. 5.131 Confusion matrix showing number of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units
Training/Validation Accuracy
See Fig. 5.133.
Fig. 5.132 Confusion matrix showing ratio of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units. The HerbaceousVegetation class shows
the lowest accuracy of 97% while the Forest and Residential classes can be classified with 100%
accuracy
Fig. 5.133 Training and validation loss and accuracy after applying activity regularization to
VGG16 model with a capacity of 512 units
Confusion Matrix
See Fig. 5.135.
Fig. 5.134 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 512 units
Fig. 5.135 Confusion matrix showing number of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 512 units
Fig. 5.136 Confusion matrix showing ratio of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 512 units. The Highway class shows the lowest
accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
The preceding results illustrate the trade-off between overfitting and accuracy. It can
also be seen that in some cases the validation loss is lower than the training loss. This
can be due to the fact that regularization is applied only during training and not during
validation. Another valid reason normally given is that the training loss is evaluated
over the course of each epoch while the validation loss is evaluated at the end of it,
resulting in a shift between the two loss curves of about half an epoch. Some strategies
to avoid being too conservative during training would be lowering the regularization
constant, reducing the dropout rate, and increasing the model capacity. In our case, a
model capacity of 1024 seems to be the best so far, achieving an accuracy of greater
than 98% for 9 out of the 10 classes and also showing resistance to overfitting.
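For reference, the two regularization variants compared in this section differ only in which quantity the L2 penalty is attached to. The following is a minimal sketch, assuming TensorFlow 2.x; the 1e-4 constant matches the setting used in the batch-size experiments below, and the unit count is illustrative.

from tensorflow.keras import layers, regularizers

# L2 penalty on the layer weights (kernel regularization):
dense_kernel_reg = layers.Dense(
    1024, activation="relu",
    kernel_regularizer=regularizers.l2(1e-4))

# L2 penalty on the layer outputs (activity regularization):
dense_activity_reg = layers.Dense(
    1024, activation="relu",
    activity_regularizer=regularizers.l2(1e-4))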
Dropout
In this section we take a look at what happens if we vary the dropout rate for a
given network capacity of 1024, using the VGG16 model as the base. We evaluate
dropout rates of 0.2, 0.3, 0.4, and 0.5. In practice, dropout rates between 0.2 and
0.5 are recommended.
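A minimal sketch of this sweep is given below, assuming TensorFlow 2.x; the helper build_vgg16_head is a hypothetical convenience function, not code from the experiments.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_head(dropout_rate, units=1024, num_classes=10):
    # Frozen VGG16 base with a small trainable classification head.
    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(64, 64, 3), pooling="avg")
    base.trainable = False
    return models.Sequential([
        base,
        layers.Dense(units, activation="relu"),
        layers.Dropout(dropout_rate),   # the hyperparameter under study
        layers.Dense(num_classes, activation="softmax"),
    ])

for rate in (0.2, 0.3, 0.4, 0.5):
    model = build_vgg16_head(rate)
    # compile and fit as in the earlier experiments ...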
Dropout Rate 0.2
Training/Validation Accuracy
See Fig. 5.137.
Confusion Matrix
See Fig. 5.139.
Fig. 5.137 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.2
Fig. 5.138 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.2
Fig. 5.139 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.2
Dropout Rate 0.3
Training/Validation Accuracy
See Fig. 5.141.
Fig. 5.140 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.2. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.141 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.3
Confusion Matrix
See Fig. 5.143.
Dropout Rate 0.4
Training/Validation Accuracy
See Fig. 5.145.
Fig. 5.142 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.3
Fig. 5.143 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.3
Fig. 5.144 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.3. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Residential class can be classified with 100% accuracy
Fig. 5.145 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.4
Confusion Matrix
See Fig. 5.147.
Fig. 5.146 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.4
Fig. 5.147 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.4
Fig. 5.148 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.4. The AnnualCrop, HerbaceousVegetation, Highway, and SeaLake
classes show the lowest accuracy of 96% while the Forest and Residential classes can be classified
with 100% accuracy
Dropout Rate 0.5
Training/Validation Accuracy
See Fig. 5.149.
Confusion Matrix
See Fig. 5.151.
Fig. 5.149 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.5
Fig. 5.150 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.5
For the dropout rate of 0.5, the overall accuracy is about 98.17%. In terms of
classification performance for the resulting model, AnnualCrop, HerbaceousVegetation,
Pasture, and PermanentCrop have a recall of less than 98%.
We found that for this particular dataset there was no marked improvement
in model accuracy associated with a dropout rate increase from 0.2 to 0.5. A dropout
rate of 0.2 would still be sufficient to achieve a decent accuracy of 98.28%. This does
Fig. 5.151 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.5
Fig. 5.152 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.5. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Forest and Residential classes can be classified with 100% accuracy
not, however, imply that there is no merit in investigating dropout as part of a broader
strategy to improve validation accuracy through algorithm tuning.
Network Capacity Reduction
Another strategy to reduce overfitting is network capacity reduction. We check the
effect of reducing the network capacity from 2048 units, which is the default setting
in all the above simulations, to 1024 and 512 respectively. As in the above cases, we
evaluate the performance in terms of accuracy and F2-score for the VGG16 as an
example.
Capacity reduced from 2048 to 1024 units
The network capacity can be easily obtained from the model summary. In our
case, we use vgg16_model.summary() to get this information, since we defined
vgg16_model as the model name. The results for a network capacity of 1024 units
compared to 2048 units are shown in Table 5.4.
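As a minimal sketch (assuming TensorFlow 2.x; the head construction is illustrative), the unit count is changed in the Dense layer and the parameter counts are then read from the summary:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(64, 64, 3), pooling="avg")
base.trainable = False

vgg16_model = models.Sequential([
    base,
    layers.Dense(1024, activation="relu"),   # reduced from 2048 units
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])

# Prints the total, trainable, and non-trainable parameter counts (Table 5.4).
vgg16_model.summary()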
As can be seen from the numbers, the trainable parameters reduced by slightly
more than half from over 8 million to about 3 million parameters. The following
figures show the effect of the capacity reduction.
Training/Validation Accuracy
See Fig. 5.153.
Table 5.4 Comparison of parameters for 2048 and 1024 units of VGG16 model
Parameters                  2048 units     1024 units
Total parameters            23,144,266     17,880,906
Trainable parameters        8,421,386      3,162,122
Non-trainable parameters    14,722,880     14,718,784
Fig. 5.153 Training and validation loss and accuracy of VGG16 model with a reduced capacity of
1024 units
Confusion Matrix
See Fig. 5.155.
Fig. 5.154 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced
capacity of 1024 units
Fig. 5.155 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a reduced capacity of 1024 units
Fig. 5.156 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a
reduced capacity of 1024 units. The PermanentCrop class shows the lowest accuracy of 96% while
the Forest and Residential classes can be classified with 100% accuracy
Capacity reduced from 2048 to 512 units
Using 2048 units as the base for comparison, we can see a considerable reduction
of the trainable parameters, to less than one quarter of the original count, as shown in
Table 5.5. In terms of numbers, this translates to about 1.3 million parameters, down
from slightly over 8.4 million. The following figures show the effect of the capacity
reduction.
Training/Validation Accuracy
See Fig. 5.157.
Confusion Matrix
See Fig. 5.159.
Table 5.5 Comparison of parameters for 2048 and 512 units of VGG16 model
Parameters                  2048 units     512 units
Total parameters            23,144,266     16,035,658
Trainable parameters        8,421,386      1,318,922
Non-trainable parameters    14,722,880     14,716,736
Fig. 5.157 Training and validation loss and accuracy of VGG16 model with a reduced capacity of
512 units
Fig. 5.158 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced
capacity of 512 units
Fig. 5.159 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a reduced capacity of 512 units
Fig. 5.160 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a reduced capacity of 512 units. The Highway class shows the lowest accuracy of 96% while the
Forest and Residential classes can be classified with 100% accuracy
As with the reduction to 1024 units, there is a slight loss in performance and
improved robustness to overfitting when 512 units are used, as shown in the training/
validation loss graph above. In this case, an accuracy and F2-score of 98.13% were
obtained after network capacity reduction, which is comparable to 98.15% for 2048
units. This translates to a decrease of 0.02%, which we think is acceptable in most
practical situations.
The results for regularization are summarized in Table 5.6. Although the model
showed increased resistance to overfitting, the price to pay was a slight decrease in
accuracy. This is common in many machine learning and deep learning scenarios,
where a trade-off of some sort has to be made; more training data is always better to
have. The results also show that doubling the network capacity from 1024 to 2048
units did not give the additional benefit of increased accuracy.
Effect of Batch Size
Up to this point we have set the batch size for train/validation to 128. Adjusting
the batch size can have an impact on the accuracy of the resulting model. For some
easily trainable data like the standard MNIST dataset, reducing the batch size may
lead to improved performance. However, there is no general rule on the impact of
batch size as the effect can depend on the complexity of the problem under modeling.
This means that it is necessary to try a couple of batch sizes to see how much how it
affects the output model performance. In general, batch size of 32 is a good starting
point when using Keras and it is advisable to try other sizes like 64, 128, and 256.
Choosing batch sizes which are powers of 2 is recommended when using GPUs for
processing in order to exploit parallel execution. We changed the batch size to 32, 64,
and 256 and performed the training using the VGG16 model as the base with 1024
units, L2 activity regularization constant of 1e-4, dropout of 0.2, and early stopping
patience of 30. The number of epochs was set to 200. The results are shown below.
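A minimal sketch of this experimental loop is shown below, assuming TensorFlow 2.x; the dataset path and the build_vgg16_head helper are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

for batch_size in (32, 64, 256):
    train_gen = datagen.flow_from_directory(
        "EuroSAT/2750", target_size=(64, 64), batch_size=batch_size,
        class_mode="categorical", subset="training")
    val_gen = datagen.flow_from_directory(
        "EuroSAT/2750", target_size=(64, 64), batch_size=batch_size,
        class_mode="categorical", subset="validation")

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_categorical_accuracy", patience=30)
    # model = build_vgg16_head(0.2)  # 1024 units, dropout 0.2, L2 1e-4
    # model.fit(train_gen, validation_data=val_gen,
    #           epochs=200, callbacks=[early_stop])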
Batch size 32
Training/Validation Accuracy
See Fig. 5.161.
Table 5.6 Comparison of accuracy with various regularization methods and network sizes for the
VGG16 model
Regularization method Accuracy (%) F2-score (%)
No Regularization (2048 units) 98.18 98.08
L2 Kernel Regularization (2048 units) 98.15 98.15
L2 Kernel Reg + Network size 1024 98.11 98.11
L2 Kernel Reg + Network size 512 98.13 98.13
L2 Activity Regularization (2048 units) 98.28 98.28
L2 Activity Reg + Network size 1024 98.28 98.28
L2 Activity Reg + Network size 512 98.17 98.17
Fig. 5.161 Training and validation loss and accuracy of VGG16 model with a batch size of 32
Confusion Matrix
See Fig. 5.163.
Fig. 5.162 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 32
Fig. 5.163 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 32
Fig. 5.164 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 32. The AnnualCrop, HerbaceousVegetation, Pasture and PermanentCrop classes
show the lowest accuracy of 97% while the Forest and Residential classes can be classified with
100% accuracy
Batch size 64
Training/Validation Accuracy
See Fig. 5.165.
Confusion Matrix
See Fig. 5.167.
Batch size 256
Training/Validation Accuracy
See Fig. 5.169.
Fig. 5.165 Training and validation loss and accuracy of VGG16 model with a batch size of 64
Fig. 5.166 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 64
Confusion Matrix
See Fig. 5.171.
Fig. 5.167 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 64
Fig. 5.168 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 64. The AnnualCrop and Highway classes show the lowest accuracy of 97% while
the Forest, Residential, and SeaLake classes can be classified with 100% accuracy. The rest of the
classes are above 98% accuracy
Fig. 5.169 Training and validation loss and accuracy of VGG16 model with a batch size of 256
Fig. 5.170 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 256
Fig. 5.171 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 256
Fig. 5.172 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 256. The Highway class shows the lowest accuracy of 96%, while the Forest and
Residential classes can be classified with 100% accuracy
Table 5.7 shows that for a batch size of 64, a state-of-the-art accuracy of 98.46%
was achieved. Further reducing the batch size to 32 gave an accuracy of 98.31%. On
the other hand, increasing the batch size to 256 resulted in an accuracy of 98.15%. In
summary, as the batch size increases, accuracy was observed to decrease for sizes
of 64 and above. When a fixed training data sample size is available, a reduced batch
size will lead to an increase in the number of steps per epoch and therefore in training
time, depending on the available computation resources. In our case, a batch size of
64 was experimentally determined to be the best to employ.
Intermodel Type Comparison
Using recall and F2-score as performance metrics, with a train–test split of 70–30,
NasNetLarge gave the best performance with an accuracy of 97.4% and an F2-score
of 97.6%, followed by VGG16 (mean recall 97.14%, F2-score 97.1%), ResNet101
(mean recall 96.4%, F2-score 96.6%), and EfficientNetB7 (mean recall 92.8%, F2-
score 93.0%), in that order. We also explored the train–test split ratio of 80–20 and found
that VGG16 gave the best results, with an accuracy of 98.18% and an F2-score of 98.06%.
The above evaluation was performed with a batch size of 128. One of the known
strategies to fight overfitting is regularization. We investigated the effect of weight
regularization, network capacity reduction, and dropout. It was found that there
was minor degradation in performance with better resistance to overfitting for
the VGG16 model. In fact, regularization can produce meaningful results and a stable
validation loss. Specifically, L2 activity regularization produced a peak accuracy
of 98.28%. An investigation into the impact of batch size resulted in the final best
performance of 98.46% for the VGG16 model, with a batch size of 64.
There were varying degradations in accuracy with higher and lower batch sizes. It is
generally recommended to fix the batch size throughout model evaluations and also
to choose a value that is a power of 2, in order to exploit computation optimizations
in some GPU implementations. It should be possible to further increase the accuracy
of the models by further tuning and by acquiring more training data.
Based on the preceding evaluation, we can summarize the most important best-model
hyperparameters as shown in Table 5.8.
5.5 Application of EuroSAT Results to Uncorrelated Dataset
We applied the above model to separately acquired data. This dataset is also Sentinel-2
data, covering the areas surrounding Gweru city in Zimbabwe [7].
Gweru is a small city characterized by a dry, cool winter season from May to
July, a hot, dry period from August to early November, and a warm, rainy period from
early November to April. The hottest month is October, while the coldest is July.
Temperatures range from an average of 21 °C in July to 30 °C in October, while
the annual rainfall is about 684 mm. In this chapter, only median post-rainy-season
Sentinel-2 imagery is used for land cover classification. Although the median
post-rainy Sentinel-2 imagery (April–June 2020) comprises 13 spectral bands with
spatial resolutions ranging between 10 and 20 m, we use only the RGB bands,
in the same fashion as for the EuroSAT dataset. It has already been shown in [8] that
the RGB bands give the highest accuracy when deep learning algorithms are applied. As
preparation, the original GeoTIFF data is converted into 64 × 64 patches for
processing by the deep learning algorithm. Since we have already confirmed that
VGG16 is the best-performing model on the EuroSAT dataset, we evaluate only
this model on the Gweru dataset. It is obvious from the location information that this
data is completely uncorrelated with the EuroSAT data.
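As a minimal sketch of this preparation step (the file name and band order are assumptions, and the rasterio package is assumed to be available):

import numpy as np
import rasterio

# Read the three RGB bands from the GeoTIFF (band order is an assumption).
with rasterio.open("gweru_sentinel2_rgb.tif") as src:
    img = src.read([1, 2, 3])
img = np.transpose(img, (1, 2, 0))   # (bands, H, W) -> (H, W, bands)

# Tile the scene into non-overlapping 64 x 64 patches.
h, w, _ = img.shape
patches = [img[i:i + 64, j:j + 64, :]
           for i in range(0, h - 63, 64)
           for j in range(0, w - 63, 64)]
patches = np.stack(patches)          # (N, 64, 64, 3), ready for the model
print(patches.shape)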
Recap of results from the EuroSAT dataset with the same model.
See Fig. 5.176.
Result Summary:
5400 images belonging to 10 classes.
Accuracy: 0.9846296296296296
Global F2 Score: 0.9846296296296296
AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.15 0.00 0.07 0.02 0.02 0.03 0.02 0.25 0.06 0.38
Forest 0.08 0.00 0.57 0.04 0.00 0.00 0.00 0.22 0.03 0.05
HerbaceousVegetation 0.04 0.00 0.12 0.04 0.03 0.00 0.00 0.57 0.03 0.17
Highway 0.03 0.00 0.03 0.06 0.00 0.00 0.00 0.75 0.06 0.06
Industrial 0.00 0.00 0.03 0.03 0.14 0.00 0.00 0.81 0.00 0.00
Pasture 0.10 0.01 0.22 0.07 0.00 0.00 0.00 0.35 0.05 0.19
PermanentCrop 0.08 0.00 0.02 0.08 0.12 0.00 0.02 0.38 0.05 0.25
Residential 0.02 0.00 0.02 0.02 0.02 0.01 0.00 0.82 0.02 0.08
River 0.03 0.00 0.09 0.03 0.05 0.00 0.00 0.63 0.03 0.15
SeaLake 0.13 0.00 0.00 0.13 0.13 0.00 0.00 0.38 0.13 0.13
Fig. 5.176 Precision, recall, accuracy results from the vgg16_eurosat8breg_act_batch64.h5 with
EuroSAT dataset
Fig. 5.177 Confusion matrix results from the vgg16_eurosat8breg_act_batch64.h5 with the
EuroSAT dataset
Since we already have a working model, our best bet, and indeed the utility of the deep
learning approach, is to re-use this model as a starting point and see how much
improvement can be achieved. However, we are faced with a class imbalance problem:
the HerbaceousVegetation class accounts for about 47% (779/1648) of the whole dataset
and by far outnumbers the rest of the classes, while the minority class SeaLake has as
few as about 0.5% (8/1648) of the samples. Data scarcity also applies to Highway,
Industrial, PermanentCrop, and River, which have fewer than 100 data points per class.
See Fig. 5.178.
Some strategies for explicitly dealing with this class imbalance problem that have been
addressed in the literature include, but are not limited to, the following [9–12]:
Strategy 1: Merging near-identical classes in one class
Strategy 2: Downsizing majority samples
Strategy 3: Resampling specific classes
Strategy 4: Adjusting the loss function (a minimal sketch of this strategy is given below).
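As a minimal sketch of Strategy 4: the per-class counts shown are the two extremes quoted above (a full dictionary would cover all classes), and the class_weight mechanism follows the TensorFlow imbalanced-data tutorial [14].

# Inverse-frequency class weights; Keras scales each sample's loss term
# by the weight assigned to its class.
counts = {"HerbaceousVegetation": 779, "SeaLake": 8}   # extremes from the text
total = sum(counts.values())
n_classes = len(counts)

class_weight = {i: total / (n_classes * n)
                for i, (name, n) in enumerate(sorted(counts.items()))}

# model.fit(train_gen, validation_data=val_gen, class_weight=class_weight)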
Fig. 5.181 PRF and distribution of Gweru class data after Strategy 1 is applied. The Grassland,
Water, and Woodland classes have 0% recall
Some improvement for Water and Woodland can be observed, but Grassland still
has a recall and precision of zero, meaning prediction is not possible for that class.
Different combinations of precision and recall, which give you a better understanding
of how well your model is performing for a given class, are shown in Table 5.10 [13].
Table 5.10 Interpretation of precision and recall results with respect to a given class
Low precision, low recall     Class prediction unreliable (model cannot recall many precisely)
Low precision, high recall    Class prediction reliable but not others (model recalls many, imprecisely)
High precision, low recall    Class prediction reliable but detectability is low (model recalls few, precisely)
High precision, high recall   Class prediction reliable (model recalls many precisely)
It is known that accuracy is not the best measure of performance for imbalanced datasets.
We therefore introduce the AUC ROC metric as part of the evaluation, in addition to PRF,
as shown in the figures below.
See Fig. 5.188.
See Fig. 5.189.
Summary of Result:
Found 196 images belonging to 5 classes.
Accuracy: 0.6887755102040817
Global F2 Score: 0.6887755102040817
It can be seen that the strategy is effective in improving the PRF metrics across
all classes, while at the same time achieving a validation AUC of 94.23%. The
accuracy still remains around 68.88%.
This demonstrates the importance of using different metrics for different data
distributions.
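For reference, tracking AUC in Keras only requires attaching the metric at compile time. The following is a minimal sketch, assuming TensorFlow 2.x; the stand-in model and its 5 output classes are illustrative, not the retrained VGG16-based classifier itself.

import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative stand-in for the retrained classifier.
model = models.Sequential([
    layers.InputLayer(input_shape=(64, 64, 3)),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),   # 5 merged Gweru classes
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.CategoricalAccuracy(),
             tf.keras.metrics.AUC(name="auc")],  # reported each epoch as val_auc
)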
The classification of imbalanced data is not a simple task, especially when there
is a very limited number of samples, as in our case. The best chance of improving
performance is to start with a large dataset in which even the minority class is well
represented.
Fig. 5.188 Training/validation accuracy, loss, and AUC performance for the 5-class Gweru dataset
Fig. 5.189 PRF of Gweru class data after Strategy 1 is applied on 6 classes
Fig. 5.190 Confusion matrix of Gweru class data after Strategy 1 is applied on 6 classes
All models tested were shown to be good predictors for the Residential and Forest classes
on the EuroSAT dataset. This gives us a hint that they can be used to detect changes in
urban expansion where forest is converted to residential areas. EfficientNet models
tend to classify residential better than forest for the 70–30 train–test split. This is
the opposite of what was observed for the ResNet, VGG, and NasNet models. In general,
for the EuroSAT dataset we could see that the VGG models performed well on the
80–20 split with and without regularization. This leads us to explore further the
utility of the VGG models for land-cover classification.
In this investigation, we discovered that there are many opportunities
to improve the performance of the deep learning algorithms to achieve the
highest possible target. Most state-of-the-art algorithms are required to achieve an
accuracy of not less than 98%. Through data manipulation and algorithm hyperparameter
tuning, we could achieve an accuracy of 98.46% using VGG16 as
the base model without feature engineering. Other methods, such as model ensembles,
have been suggested in the literature as viable approaches, although they may lead to
increased training effort due to the huge number of parameters involved. If time
and computation resources are not an issue, this approach is surely worth trying.
We also evaluated the performance of the best EuroSAT model weights on a non-
EuroSAT dataset, specifically the Gweru dataset described in Sect. 5.3. Unfortunately,
we could not get good results using these model weights. However, on
retraining the VGG16 model we could get some reasonable results, albeit with
limitations due to imbalanced data. Next steps will be to explore emerging approaches,
including wide ResNet, and to extend the algorithms to non-EuroSAT datasets to
solve real problems. The journey has just started!
References
International Geoscience and Remote Sensing Symposium, pp. 204–207, 2018. doi: https://doi.
org/10.1109/IGARSS.2018.8519248
9. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-
sampling technique. J Artif Intell Res 16:321–357
10. Koßmann D, Wilhelm T, Fink GA (2021) Generation of attributes for highly imbalanced
land cover data. In: 2021 IEEE International Geoscience and Remote Sensing Symposium
IGARSS, pp 2616–2619. https://doi.org/10.1109/IGARSS47720.2021.9554331
11. Douzas G, Bação F, Fonseca J, Khudinyan M (2019) Imbalanced learning in land cover
classification: improving minority classes' prediction accuracy using the geometric SMOTE
algorithm. Remote Sens 11:3040. https://doi.org/10.3390/rs11243040
12. Buda M, Maki A, Mazurowski MA (2018) A systematic study of the class imbalance problem
in convolutional neural networks. Neural Netw 106:249–259. https://doi.org/10.1016/j.neunet.
2018.07.011
13. Scikit-learn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_
recall.html
14. TensorFlow: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data