
Transactions on Computer Systems and Networks

Jonah Gamba

Deep Learning Models
A Practical Approach for Hands-On Professionals

Transactions on Computer Systems and Networks

Series Editor
Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of
Information Technology, Kolkata, West Bengal, India

Editorial Board
Jürgen Becker, Institute for Information Processing–ITIV, Karlsruhe Institute of
Technology—KIT, Karlsruhe, Germany
Yu-Chen Hu, Department of Computer Science and Information Management,
Providence University, Taichung City, Taiwan
Anupam Chattopadhyay, School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Gaurav Tribedi, EEE Department, IIT Guwahati, Guwahati, India
Sriparna Saha, Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
Saptarsi Goswami, A. K. Choudhury School of Information Technology, Kolkata, India
Transactions on Computer Systems and Networks is a unique series that aims to capture advances in the evolution of computer hardware and software systems and progress in computer networks. Computing systems in the present world span from miniature IoT nodes and embedded computing systems to large-scale cloud infrastructures, which necessitates developing systems architecture, storage infrastructure, and process management to work at various scales. Present-day networking technologies provide pervasive global coverage and enable a multitude of transformative technologies. The new landscape of computing comprises self-aware autonomous systems, which are built upon a software-hardware collaborative framework. These systems are designed to execute critical and non-critical tasks involving a variety of processing resources like multi-core CPUs, reconfigurable hardware, GPUs, and TPUs, which are managed through virtualisation, real-time process management, and fault-tolerance. While AI, machine learning, and deep learning tasks are predominantly increasing in the application space, computing systems research aims at efficient means of data processing, memory management, real-time task scheduling, and scalable, secure, and energy-aware computing. The paradigm of computer networks also extends its support to this evolving application scenario through various advanced protocols, architectures, and services. This series aims to present leading works on advances in theory, design, behaviour, and applications in computing systems and networks. The Series accepts research monographs, introductory and advanced textbooks, professional books, reference works, and select conference proceedings.
Jonah Gamba

Deep Learning Models


A Practical Approach for Hands-On Professionals
Jonah Gamba
Tsukuba, Ibaraki, Japan

ISSN 2730-7484 ISSN 2730-7492 (electronic)


Transactions on Computer Systems and Networks
ISBN 978-981-99-9671-1 ISBN 978-981-99-9672-8 (eBook)
https://doi.org/10.1007/978-981-99-9672-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


For our extended family,
whose diversity enriches our lives and
broadens our horizons
And in loving memory of our departed ones,
whose presence is felt in every family
gathering
Preface

This book is the result of realizing the need for a practical approach to understanding deep learning models, since many existing books on the market tend to emphasize theoretical aspects, leaving newcomers and professionals seeking new solutions scrambling for effective guidelines to achieve their goals. Additionally, most available material does not take into account the important factor of rapid prototyping, where the goal is to quickly evaluate the performance of algorithms before going deep into consideration of the final implementation platforms on which the algorithms will run. The intention here is to address these problems by taking a different approach which focuses on practicality while keeping theoretical concepts to a necessary minimum.
In this book, we first build the necessary foundation on deep learning models, including their current status, and progressively go into actual examples of model evaluation. A dedicated chapter is allocated to evaluating the performance of multiple algo-
rithms on specific datasets, highlighting techniques and strategies that can address
real-world challenges when deep learning is employed. By consolidating all neces-
sary information into a single resource, readers can bypass the hassle of scouring
scattered online sources, gaining a one-stop solution to dive into deep learning for
object detection and classification.
To facilitate understanding, the book employs a rich array of illustrations, figures,
tables, and code snippets. Comprehensive code examples are provided, empowering
readers to grasp concepts quickly and develop practical solutions. The book covers
essential methods and tools, ensuring a complete and comprehensive treatment that
enables professionals to implement deep learning algorithms swiftly and effectively.
The book is also designed to equip professionals with the necessary skills to
thrive in the active field of deep learning, where it has the potential to revolutionize
traditional problem-solving approaches. This book serves as a practical companion,
enabling readers to grasp concepts swiftly and embark on building practical solutions.
The content presented in this book is based on several years of experience in research and development. The main idea is to give a quick start to those trying to find answers within a short period of time, irrespective of background. The chapters are organized as follows:


Chapter 1 Basic Approaches in Object Detection and Classification by Deep Learning
This chapter introduces the basics of object detection and classification as target areas of deep learning. It briefly covers traditional methods such as K-nearest neighbors (KNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machine (SVM), random forest (RF), and gradient boosting machine. We also give an overview of deep learning within the context of artificial intelligence. With this, we aim to introduce the reader to the subject. Object classification will be the main focus of this book.
Chapter 2 Requirements for Hands-On Approach to Deep Learning
This is a bridging chapter in which we introduce some of the concepts needed to start building deep learning models in Python. The chapter starts with basic principles related to data manipulation and ends with an explanation of how to set up the modelling environment. It is expected that the reader is familiar with some high-level programming concepts, which are very easy to acquire within a short space of time.
Deep learning models mostly deal with vectors and matrices as we know them from linear algebra. These objects are sometimes referred to as tensors, but from an engineering perspective they can be considered as subsets of multi-dimensional arrays, especially if one is already familiar with numerical processing tools like MATLAB, Scilab, Octave, etc. Like any other language, Python has a unique way of accessing and manipulating these arrays. This topic is concisely addressed.
We also include here a discussion on some environments supported for deep
learning model evaluation, both offline and online.
Chapter 3 Building Deep Learning Models
This chapter illustrates how to build deep learning models and how to train and evaluate them using the Keras framework in a simple and succinct way. We briefly explain some of the concepts behind these models so as to give the reader a smooth entry into each section, while concentrating mainly on how to use the models rather than on the details of the algorithms themselves. The entry point will be shallow networks, upon which the deep neural networks are developed. We then touch on convolutional neural networks (CNNs), followed by recurrent neural networks (RNNs) and finally long short-term memory (LSTM)/gated recurrent units (GRUs). Along the way, we provide examples of how each of these can be used in order to cement the ideas behind them. After that we give a quick look at the Keras library and some references for further investigation.
Chapter 4 The Building Blocks of Machine Learning and Deep Learning
In this chapter, we take a look at the three main categories of machine learning
and then move on to explore how the machine learning models can be evaluated.
The various metrics commonly used are explained. After that, we briefly address the
important topic of data preprocessing followed by standard methods of evaluating
machine learning models. One of the reasons why most models fail to perform on
unseen data is due to the problem of overfitting. We take a look at this problem and
outline some of the strategies that can be applied in order to overcome it. The next
topic is a discussion of the workflow for machine learning or deep learning. The
chapter ends with concluding remarks to recap the covered topics.
Chapter 5 Remote Sensing Example for Deep Learning

Recently, remote sensing has become heavily dependent on machine learning algorithms such as decision trees, random forests, support vector machines, and artificial neural networks. However, there is an increasing recognition that deep learning, which has been applied successfully in other areas such as computer vision and language processing, is a viable alternative to traditional machine learning. In this chapter, we will work through one specific example of the application of deep learning algorithms to one important area of remote sensing data analysis, namely land cover classification. Land cover and land use change analysis is of importance in many practical areas such as urban planning, environmental degradation monitoring, and disaster management.
The main goal of this chapter is to provide a detailed understanding of the performance of various deep learning models applied to the problem of land cover classification, starting from a known dataset. Although we use remote sensing as an example, the key point here is to show the level of hyperparameter tuning that is required to get desired results from any multiclass problem to which deep learning is applied. We divide the presentation into five main parts: preliminary information on the models including input data restrictions, exploration of the EuroSAT data contents, preprocessing steps, and performance evaluation results for several selected models. Finally, we test the performance of the models on a new dataset to get a clear picture of the limitations of the presented approach in the face of unseen data.
I hope that the material presented in the book will be valuable to all readers and enable them to move fast on employing deep learning models for various applications. Deep learning is now entering an exciting phase in which many scientists and enthusiasts are actively involved.
It is also desirable that the readers are able to extend the ideas covered here to
their particular situations with little effort.

Tsukuba, Japan Jonah Gamba


2023
Acknowledgements

I would like to express my gratitude to all the people who have positively contributed in various ways to the preparation of this book.
First and foremost I would like to thank my family members, Megumi, Sekai, and
Mirai, for their invaluable patience during this very long process. Their understanding
and accommodation made it possible for me to spare some time for putting together
the material required to complete the manuscript.
I would like to thank my extended family in various places and situations for the
emotional and physical support that they have given during the period of writing this
book.
Special thanks go to Dr. Courage Kamusoko, Prof. Hiromi Murakami, formerly of Seikei University, and Prof. Shuji Kawasaki of Iwate University for their continuous encouragement and advice during the process of putting the book together. Let me also take this opportunity to thank the LocaSense Research Systems team for their assistance in the preparation of part of the evaluation data used in Chap. 5.
I would also like to express my sincere gratitude to the Honjo Scholarship Foundation for always including old boys in their programs; it is through their kindness that it was possible for me to pursue my interest in information systems, which is the subject of this book.
Last but not least, many thanks also to Smith Chae, Sivananth S. Siva Chandran,
and Diya Ma of Springer for their very efficient and continuous support during the
process of creating this book.

Jonah Gamba

Contents

1 Basic Approaches in Object Detection and Classification
by Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Conventional Methods of Object Detection and Machine
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 K-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2.5 Gradient Boosting Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.3 Deep Learning as Part of Artificial Intelligence . . . . . . . . . . . . . . . . . 40
1.4 Frameworks for Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.5 Selection of Target Areas for This Book . . . . . . . . . . . . . . . . . . . . . . . 41
1.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.7 Self-evaluation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2 Requirements for Hands-On Approach to Deep Learning . . . . . . . . . . 47
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2 Basic Python Arrays for Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3 Setting Up Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3.1 OS Support for Offline Environments . . . . . . . . . . . . . . . . . . . 50
2.3.2 Windows Environment Creation Example . . . . . . . . . . . . . . . 51
2.3.3 Options to Consider for Online Environments . . . . . . . . . . . . 52
2.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5 Self-evaluation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


3 Building Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


3.1 Introduction: Neural Networks Basics . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.1 Shallow Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.2 Convolutional Neural Networks (CNNs) . . . . . . . . . . . . . . . . . 60
3.1.3 Recurrent Neural Networks (RNNs) . . . . . . . . . . . . . . . . . . . . 62
3.1.4 Long Short-Term Memory (LSTM)/Gated Recurrent
Units (GRUs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Using Keras as a Deep Learning Framework . . . . . . . . . . . . . . . . . . 70
3.2.1 Overview of Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.2 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Self-evaluation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 The Building Blocks of Machine Learning and Deep Learning . . . . . . 73
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Categorization of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Methods of Evaluating Machine Learning Models . . . . . . . . . . . . . . . 74
4.3.1 Data Preprocessing for Deep Learning . . . . . . . . . . . . . . . . . . 78
4.3.2 Problem of Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 The Machine Learning Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Self-evaluation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5 Remote Sensing Example for Deep Learning . . . . . . . . . . . . . . . . . . . . . . 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Background of the Remote Sensing Example . . . . . . . . . . . . . . . . . . . 85
5.3 Remote Sensing: Land Cover Classification . . . . . . . . . . . . . . . . . . . . 86
5.4 Background of Experimental Comparison of Keras
Applications Deep Learning Models Performance
on EuroSAT Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Information Input Data Requirements . . . . . . . . . . . . . . . . . . . 88
5.4.2 Input Restrictions (from Keras Application Page) . . . . . . . . . 89
5.4.3 Training and Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Application of EuroSAT Results to Uncorrelated Dataset . . . . . . . . . 189
5.5.1 Evaluation of 10-Classes with Best EuroSAT Weights . . . . . 189
5.5.2 Training Results with 6 Classes—Unbalanced/
Balanced Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.5.3 Training Results with 5 Classes . . . . . . . . . . . . . . . . . . . . . . . . 197
5.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Chapter 1
Basic Approaches in Object Detection
and Classification by Deep Learning

1.1 Introduction

Deep learning is a high-pace research and development topic spanning an ever-increasing number of application fields including, but not limited to, text recognition, speech recognition, natural language processing, image recognition, autonomous driving, and remote sensing [1–4]. Among these applications, one of the currently trending and exciting implementations of deep learning is ChatGPT by OpenAI, which provides a platform for interactively performing queries on a wide range of subjects and getting almost human-level answers [5]. The list of applications seems endless and depends only on how problems can be transformed into machine-processible and learnable representations. Since it is practically impossible to cover every application field of deep learning in great detail, we chose to limit the scope of this book to object detection and classification, which have strong connections to computer vision. The ideas from object detection and classification can be extended into other areas with some modifications.
So, what is object detection all about? In object detection, the aim is to localize objects in an image by some predefined algorithm. The algorithms take images as input and produce identified objects together with labels, usually superimposed on the input images with a bounding box (see Figs. 1.1 and 1.2). Classification, on the other hand, is mainly concerned with assigning predefined classes to detected objects. In this respect, it is reasonable to say that detection encompasses classification. However, as we will see later, classification can stand alone as the objective of the algorithm when applied to fields like remote sensing. Remote sensing is a fascinating field of research where data provided by Google and other players finds use in several applications (see Figs. 1.3 and 1.4). In Chap. 5, we will present a comprehensive example of remote sensing classification to illustrate its practical implementation.
Our focus here will be on algorithms coming from the deep learning area. We will explore and present the practical side of approaches which are state of the art. To be clear from the beginning, the approach that we will take here is to present

Fig. 1.1 Flow of object detection process

Fig. 1.2 An example of multiple object detection in a driving environment

Fig. 1.3 An example of classification of remote-sensed data for building footprint detection

Fig. 1.4 An example of classification of remote-sensed land use data

concepts in a manner that will allow interested readers to start working on real-world problems. Although theoretical concepts have a critical role to play in the performance of the algorithms, we will leave much of this discussion to dedicated professionals and instead concentrate on how to utilize the existing technology. The simple reason is that technology consumers normally experiment with existing methods at a very high level to confirm that they work as expected before deep diving into the theory behind them in order to improve performance. In some cases, existing algorithms work well without any modifications, thereby reducing the effort and money spent on further research. Moreover, the approach presented here is natural and will in turn allow a rapid entry into the realm of deep learning. There is an abundance of resources in terms of data, code, and theoretical materials available on Internet platforms. However, the existence of this enormous volume of information is a two-edged sword. On one hand, it is much easier to find information on any topic of interest, but on the other hand it makes it harder to figure out where to start and how to filter out all the clutter. Once you visit a particular site from any search engine of your choice, you are basically presented with multiple links which can be endlessly linked to other sites. To summarize, this book offers the following advantages:
Deep Understanding: It provides a more immersive and focused reading expe-
rience that allows the reader to delve deeply into a subject, offering in-depth expla-
nations, comprehensive coverage, and a cohesive narrative that helps build a solid
foundation of understanding. This depth is often missing in fragmented online search
results.
Credibility and Quality: It goes through a rigorous review process, ensuring a
certain level of quality, accuracy, and credibility. Referenced authors are experts in
the field and have invested significant time and effort into research and writing.
Structure and Organization: It is organized in a structured manner, with chap-
ters, sections, and an index that allows for easy navigation and reference. This makes
it convenient to follow a logical progression of concepts, find specific information,
and revisit previous sections.

Limited Distractions: With the material presented in the subsequent chapters, one
can concentrate on the content without the distractions of ads, pop-ups, or hyperlinks.
This helps maintain focus and promotes a deeper level of engagement with the
material.
The above are among the reasons why you need this book: to keep your focus on the ball and strike the target without a miss in the most efficient way. Of course, it may be necessary to occasionally check some material on the Internet, but it's best to quickly come back to the main text to avoid getting lost in distractions.
So how do we begin? As mentioned above, the best way to start with deep learning is to first limit the scope of the search area. The recommended approach is to make a quick survey of the publicly available material on the subject and then choose one source that matches your final objective. For quick results, it is also instructive to take a hands-on approach where one can work through example code along the way. This makes it possible to visualize the output and make further refinements for robustness. For example, Python code can be executed to cement the ideas and also to get a deeper understanding of how the concepts can be implemented. To this end, there are numerous stable, collaboratively debugged open-source packages and tools available that make it unnecessary to code algorithms from mathematical models, thereby reducing the learning curve and increasing productivity toward the intended goal, enabling rapid prototyping.

1.2 Conventional Methods of Object Detection


and Machine Learning

There are a variety of machine learning algorithms that have evolved over many years of research and development. An overview of some of these algorithms can be found in [6], and a concise presentation is given in this section. The performance of these so-called shallow machine learning algorithms depends heavily on the representation of the input data they are given [4].
All machine learning algorithms are constructed around mathematical concepts that make it possible to transform input data into a form that simplifies the task of classification. After transformation, it becomes a matter of applying a logical rule to cluster the data into their respective classes. Normally a series of affine and/or nonlinear operations is applied to the input data to arrive at the final result.
One example is when data becomes linearly separable by conversion from Cartesian coordinates to polar coordinates. It is important to recognize that the representation of the data can make a big difference in the classification or recognition task.
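To make this concrete, here is a minimal sketch (not from the book; the synthetic ring data and the threshold of 1.5 are illustrative assumptions). Two concentric rings cannot be separated by a straight line in Cartesian coordinates, but after conversion to polar form a single threshold on the radius separates them.

```python
# Minimal sketch: a change of representation makes data linearly separable.
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=200, noise=0.05):
    # Points scattered around a circle of the given radius.
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + rng.normal(0.0, noise, n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

inner, outer = ring(1.0), ring(2.0)
X = np.vstack([inner, outer])                       # Cartesian features (x, y)
y = np.array([0] * len(inner) + [1] * len(outer))   # class labels

# Cartesian -> polar: r = sqrt(x^2 + y^2), theta = atan2(y, x)
r = np.hypot(X[:, 0], X[:, 1])
theta = np.arctan2(X[:, 1], X[:, 0])
X_polar = np.column_stack([r, theta])               # transformed representation

# In polar coordinates a simple threshold on r (a linear rule) separates the rings.
print("accuracy of the rule r > 1.5:", np.mean((r > 1.5) == (y == 1)))
```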
In the following subsections, we briefly summarize some of these conventional
methods that have been found to be effective in some applications.

1.2.1 K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (KNN) algorithm is a simple and versatile supervised machine learning algorithm used for classification and regression tasks [7]. KNN is a nonparametric and instance-based learning algorithm, which means it doesn't make assumptions about the underlying data distribution and uses the entire training dataset for making predictions. KNN is particularly useful for tasks involving nonlinear relationships and complex decision boundaries.
KNN works on the principle that similar instances tend to have similar outcomes.
Given a new data point, KNN identifies the K-nearest data points (neighbors) from
the training dataset and assigns the majority class or computes the average value of
these neighbors to make predictions. Therefore, K is the important parameter that needs to be determined in order to avoid noisy predictions. Having set the value of K, the algorithm can then proceed to use a distance metric to judge the neighborliness of given query points.
Due to its simplicity, the KNN algorithm generally has low complexity. However, its performance decreases as the feature space dimension increases. In this respect, the algorithm is not well positioned for high-dimensional problems. In such cases, strategies such as parallelization, dimensionality reduction, and partitioning can be employed [8]. We will next take a close look at the steps followed in executing the KNN algorithm.

KNN Algorithm Steps

Several steps are involved in performing classification by the KNN algorithm, starting from data preparation through to making predictions on new data points. Below is a summary of the sequential steps to be followed for the classification task (Fig. 1.5).
Step 1: Data Preparation
Gather and preprocess a labeled dataset, which normally consists of features and their
corresponding class labels. At this stage, the dataset is split into a training set and a
test set for final model evaluation.
Step 2: Choosing K
Select an appropriate value for K, the number of neighbors to consider. This can be
determined through techniques such as cross-validation to find the optimal K value
for the given dataset. The choice of K is a trade-off between the bias and the variance
of the model. Smaller values of K can lead to more flexible models (low bias, high
variance), while larger K values can result in smoother decision boundaries (high
bias, low variance).
Step 3: Distance Metric Selection
Choose a suitable distance metric (e.g., Euclidean distance, Mahalanobis Distance,
Manhattan distance, Minkowski Distance, Chebyshev Distance, Cosine Distance,

Fig. 1.5 KNN processing flow

Hamming Distance, Jaccard Distance, Correlation Distance) to measure the similarity between data points. The choice of metric depends on the nature of the data.
Step 4: Normalization
Normalization is necessary in order to ensure that all features contribute equally to the distance calculations, preventing features with larger magnitudes from dominating the distance computation. Commonly applied normalization methods include Min–Max Scaling and Z-Score Scaling, illustrated in the short sketch below.
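As a brief illustration (the feature values below are hypothetical), both methods are available in scikit-learn's preprocessing module and follow the same fit-on-training, transform-everything pattern:

```python
# Sketch of Min-Max and Z-score normalization with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[180.0, 0.02], [165.0, 0.08], [172.0, 0.05]])  # mixed scales
X_test = np.array([[170.0, 0.04]])

# Min-Max Scaling: maps each feature to [0, 1] using the training-set min/max.
minmax = MinMaxScaler().fit(X_train)
print(minmax.transform(X_test))

# Z-Score Scaling: zero mean and unit variance per feature.
zscore = StandardScaler().fit(X_train)
print(zscore.transform(X_test))

# Scalers are fit on the training set only and then applied to test/query
# points, so that no information leaks from the test data.
```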

Step 5: Classification
For each data point in the test set, calculate the distances between the test point and
all data points in the training set using the chosen distance metric. Then sort the
distances in ascending order to identify the K-nearest neighbors.
Step 6: Voting
The first option is to count the occurrences of each class label among the K-nearest
neighbors. Then assign the test point to the class that appears most frequently among
its neighbors (majority voting).
The second option is to assign different weights to neighbors based on their
distance from the test point. Closer neighbors have a higher influence on the
prediction (weighted voting).
In case of ties (equal occurrences of multiple class labels among the K neighbors), tie-breaking mechanisms, such as selecting the class with the closest neighbor, can be applied.
Step 7: Evaluation
After classifying all test points, evaluate the model’s performance using appropriate
metrics like accuracy, precision, recall, F1-score, or confusion matrix.
Step 8: Hyperparameter Tuning
If the performance is not satisfactory, it is recommended to adjust hyperparameters like K or the distance metric and re-evaluate the model. This process can be repeated with multiple combinations of hyperparameters until the desired performance is achieved; a cross-validated grid search, as sketched below, automates this loop.
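A hedged sketch of such a tuning loop, using scikit-learn's GridSearchCV (the synthetic dataset and the candidate values for K and the metric are illustrative assumptions):

```python
# Sketch: searching over K and the distance metric with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan", "minkowski"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# Best hyperparameter combination and its cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```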
Step 9: Prediction
Once the model is tuned and evaluated, the final step is to make predictions on new, unseen data by following the same steps of distance calculation, neighbor selection, and voting. If predictions are erroneous, it may be necessary to re-examine Steps 2–8 above.
It's important to keep in mind that KNN is sensitive to noisy data, outliers, and the curse of dimensionality, just as with other techniques in the same category. Preprocessing and data cleaning steps can help alleviate some of these challenges. In summary, KNN is a straightforward algorithm that can be implemented relatively easily. In this respect, and as additional information for the interested reader, in Python the Scikit-learn API provides the sklearn.neighbors.KNeighborsClassifier class for performing KNN [9]. However, its performance and accuracy heavily depend on parameter tuning, distance metric selection, and data preprocessing. It's also important to balance computational efficiency with model accuracy, especially when dealing with large datasets. Figure 1.6 is an example of applying the KNN algorithm to classify a query point in the case of two classes.
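Putting the steps together, here is a minimal end-to-end sketch using the scikit-learn class mentioned above (the Iris dataset and the choice of K = 5 are illustrative assumptions, not the book's experiment):

```python
# End-to-end KNN classification following Steps 1-7 above.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step 1: split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Step 4: normalize features (fit on training data only).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 2-3 and 5-6: K = 5 neighbors, Euclidean distance, majority voting.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Step 7: evaluate on the held-out test set.
y_pred = knn.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```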

Fig. 1.6 An example of KNN applied to a query point (star), where the distance to each of the selected samples is computed

Merits of KNN
The KNN algorithm has several advantages that make it a popular and useful choice
for certain machine learning tasks. We outline some of the key advantages of the
KNN algorithm (Table 1.1).

Limitations of KNN
While KNN has several advantages, it's essential to consider its limitations, such as sensitivity to the choice of K, slow prediction times for large datasets, and the impact of irrelevant or redundant features. Proper preprocessing, parameter tuning, and validation are crucial for achieving optimal results with the KNN algorithm. The main limitations are summarized in Table 1.2.
To mitigate these limitations, it’s important to preprocess the data, choose an
appropriate K, and consider using KNN in combination with other algorithms or
techniques, such as dimensionality reduction or ensemble methods. Additionally,
understanding the characteristics of the dataset and the problem domain can help
determine whether KNN is a suitable choice for a specific task.

Improvements to KNN Algorithm

Several improvements and variations of the KNN algorithm have been proposed to address its limitations and enhance its performance in various scenarios. Here are some possible improvements and extensions for the KNN algorithm:
One category of improvement relates to distance computation. This includes distance weighting, distance metric learning, and distance decay [10]. With distance weighting, weights are assigned to neighbors based on their distance from the query point. Closer neighbors can be given higher weights, while farther neighbors receive lower weights.

Table 1.1 Merits of the KNN algorithm

Simplicity of implementation: KNN is straightforward to implement and understand. It doesn't require complex mathematical formulations or assumptions about the data distribution. This is a crucial advantage in real-time applications where algorithm resources are constrained.

Nonparametric approach: KNN is a nonparametric algorithm, meaning it imposes no assumptions on the specific functional form of the data's statistical distribution. This makes it versatile and suitable for a wide range of data patterns.

Flexibility for nonlinear data: KNN can capture complex, nonlinear relationships in the data. It's particularly useful when the decision boundary is irregular or when classes have intricate shapes.

Instance-based learning: KNN is an instance-based learning algorithm, meaning that the model learns from the specific instances in the training data instead of attempting to build a generalizable model based on abstract features. This makes it possible for KNN to handle diverse datasets, adapt to changing data, and make intuitive predictions without complex model training. This flexibility makes KNN an attractive choice for various real-world applications.

No training phase required: KNN doesn't have an explicit training phase. Once the dataset is prepared, the algorithm is ready for prediction, making it suitable for scenarios where new data arrives frequently.

Interpretability: The KNN algorithm provides easily interpretable results. Predictions are based on the actual instances in the dataset, which can help provide insights into the decision-making process.

Can handle multiclass problems: KNN can handle multiclass classification problems naturally by extending the idea of majority voting to multiple classes.

Suitable for small datasets: KNN performs well on small datasets, as it doesn't require large amounts of training data to make accurate predictions.

Robustness to noise: KNN can handle noisy data by considering multiple neighbors during prediction. Outliers or noisy instances have less impact on predictions due to the averaging effect of multiple neighbors. Another reason for noise robustness is the voting mechanism: when classifying a new data point, KNN looks at the labels of its K nearest neighbors and votes on the class. Noisy data points might have incorrect labels, but their influence is mitigated because KNN considers multiple neighbors, so if a couple of neighbors have incorrect labels due to noise, their impact is diluted by the correctly labeled neighbors.

No assumption of linearity: KNN doesn't assume linearity in the data, making it suitable for scenarios where relationships between features are not linear.

High recall in imbalanced classes: KNN can achieve high recall in imbalanced class distributions, as it is not biased toward any specific class and can capture minority class instances effectively.

Natural feature importance: By examining the nearest neighbors of a data point, KNN can provide insights into the importance of features for prediction.

Dynamic adaptation: KNN can adapt to changes in the data distribution without the need for retraining. As new data arrives, predictions can be updated using the existing model.

Ensemble and hybrid approaches: KNN can be used as a component in ensemble methods or hybrid models, combining its strengths with other algorithms for improved performance.

This can improve the accuracy of predictions, especially when some neighbors are more relevant than others. This approach is similar to distance decay, where a decay function reduces the influence of neighbors as their distance from the query point increases. Distance metric learning can also be applied, where the objective is to learn a customized distance metric that optimizes the neighbors' relevance for classification. Metric learning can improve the algorithm's performance when the standard distance metrics are not well suited to the data. A related approach is localized KNN, in which, instead of considering all data points equally, only a subset of neighbors that are closer to the query point is used. This can reduce the influence of irrelevant neighbors and improve computational efficiency.
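As a small illustration, scikit-learn exposes distance weighting through the weights="distance" option, which scales each neighbor's vote by the inverse of its distance to the query point. The comparison below on synthetic two-moons data is only a sketch (the dataset, noise level, and K = 15 are assumptions):

```python
# Sketch: uniform voting vs. distance-weighted voting in KNN.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

uniform = KNeighborsClassifier(n_neighbors=15, weights="uniform")
weighted = KNeighborsClassifier(n_neighbors=15, weights="distance")

print("uniform voting :", cross_val_score(uniform, X, y, cv=5).mean())
print("weighted voting:", cross_val_score(weighted, X, y, cv=5).mean())
```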
Kernel density estimation techniques can be incorporated to smooth the contribution of neighbors. This can result in more stable and robust predictions, particularly in noisy or irregular data [11].
Feature selection and dimensionality reduction, using techniques like principal component analysis (PCA), can reduce the dimensionality of the data before applying KNN. This helps mitigate the curse of dimensionality and improves computational efficiency [12, 13], as in the pipeline sketched below.
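A brief sketch of such a pipeline (the digits dataset and the choice of 16 components are illustrative assumptions):

```python
# Sketch: PCA compresses the features before KNN runs on the reduced space.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per sample

pipeline = make_pipeline(PCA(n_components=16),
                         KNeighborsClassifier(n_neighbors=5))
print("PCA(16) + KNN accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())
```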
Other possibilities that can be explored are ensemble approaches. This involves
combining predictions from multiple KNN models with different settings (e.g.,
different K values or distance metrics) to enhance accuracy and robustness. Ensemble
methods like Bagging or Boosting can be employed. This can be done in conjunc-
tion with approximate nearest neighbor search algorithms to accelerate the search
for neighbors in high-dimensional spaces. Techniques like k-d trees or ball trees can
significantly improve the algorithm’s efficiency.
Adaptive KNN can dynamically adjust the value of K based on the local density
of data points. In regions with high data density, a smaller K value can be used, while
a larger K value can be employed in sparse regions [14].
Other improvements try to detect and handle outliers before applying KNN. Outliers can significantly impact the algorithm's performance, so preprocessing steps like outlier removal or outlier correction can be beneficial. Hybrid models can also be applied, such as combining KNN with other algorithms like decision trees, support vector machines, or neural networks, to leverage their strengths and mitigate KNN's weaknesses. Finally, localized classifiers and incremental learning have been investigated. With localized classifiers, specialized classifiers are applied in specific regions of the feature space, depending on the data characteristics. This approach can improve classification accuracy in complex or overlapping regions.

Table 1.2 Limitations of the KNN algorithm

Computational complexity: As the dataset grows, the time and memory required to make predictions using KNN can increase significantly. Searching for the nearest neighbors among a large number of data points can be computationally expensive.

High storage requirements: KNN requires storing the entire training dataset for making predictions. This can be memory-intensive, especially for large datasets with many features.

Sensitivity to feature scaling: KNN is sensitive to the scale of features. Features with larger scales can dominate the distance calculations, leading to biased results. Feature normalization or standardization is crucial before applying KNN.

Choosing an appropriate K: The choice of the K parameter significantly affects the algorithm's performance. A small K can make predictions noisy and sensitive to outliers, while a large K can lead to oversmoothing and loss of important details.

Curse of dimensionality: KNN's performance can degrade in high-dimensional spaces. As the number of dimensions increases, the distinction between nearest and non-nearest neighbors becomes less meaningful, which can lead to reduced accuracy.

Unevenly distributed data: KNN may perform poorly when the data is unevenly distributed across classes. In cases where one class significantly outweighs the others, KNN can be biased toward the majority class.

Local optima and overfitting: KNN can be prone to overfitting, especially when the K value is small. Using a larger K can help alleviate this issue, but it may lead to underfitting or oversmoothing of the decision boundary.

Outliers and noisy data: Outliers and noisy data can disproportionately influence the prediction process, especially when K is small. Robustness to outliers can be improved by using larger K values.

Distance metric choice: The choice of distance metric can significantly impact the algorithm's performance. An inappropriate distance metric can lead to inaccurate predictions. Choosing the right metric depends on the data and the problem domain.

Boundary irregularities: KNN may struggle with irregular decision boundaries or classes with intricate shapes. It's not well suited for cases where classes overlap heavily.

Data imbalance: KNN can struggle with imbalanced class distributions, as it might favor the majority class when predicting the class of a new data point.

Lack of interpretability: While KNN provides accurate predictions, it may not provide insights into why a particular prediction was made. The algorithm lacks the interpretability of some other methods.

Efficiency in large datasets: KNN's efficiency can be compromised when working with large datasets, especially when the dimensionality is high. Approximation methods or other algorithms might be more suitable.

For incremental learning, new data points are incorporated into the model without retraining on the entire dataset. This is useful for scenarios with streaming data [15].
These improvements and variations address various challenges of the KNN algo-
rithm and can enhance its accuracy, efficiency, and applicability to a wide range
of machine learning tasks. The choice of improvement will depend on the specific
characteristics of the data and the goals of the analysis.

1.2.2 Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a statistical technique used for dimension-
ality reduction and classification in the field of machine learning and pattern recog-
nition. LDA aims to find a linear combination of features that maximizes the sepa-
ration between classes while minimizing the variation within each class [16]. It
is particularly useful for tasks such as feature extraction, data visualization, and
classification.
LDA assumes that the data is normally distributed and that the classes have similar
covariance matrices [17]. It takes a labeled dataset where each sample is associated
with a class label. The goal is to project this high-dimensional data onto a lower-
dimensional space while preserving class separability. For each class, LDA computes
the mean vector (average) of the feature values and the scatter matrix (covariance
matrix). The scatter matrix captures the spread of data within each class.
LDA aims to maximize the distance between class means (between-class scatter)
while minimizing the spread of data within each class (within-class scatter). These
scatter matrices provide insights into the separability of the classes in the trans-
formed space. To find the optimal projection for the data, LDA performs eigenvalue
decomposition on the matrix that is the result of inverting the within-class scatter
matrix and multiplying it with the between-class scatter matrix. The eigenvalues and
eigenvectors obtained from this decomposition are used to determine the directions
of the new feature space. The eigenvectors corresponding to the largest eigenvalues
are chosen as the directions of the new feature space. These eigenvectors are used to
transform the original feature vectors into a lower-dimensional space.
LDA reduces the dimensionality of the data by selecting a subset of the eigen-
vectors. The number of eigenvectors chosen corresponds to the desired number of
dimensions in the new feature space. Typically, the number of dimensions is set to
the number of classes minus one to prevent overfitting.
In a classification context, the reduced-dimensional data can be fed into a classifier
such as a linear classifier (e.g., logistic regression) for making predictions. In a
visualization context, the reduced-dimensional data can be used to visualize the data
distribution while preserving class separability.
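To make the computation concrete, here is a compact NumPy sketch of the procedure described above, computing the scatter matrices, eigendecomposing SW^(-1) SB, and projecting onto the two leading eigenvectors (the Iris dataset is an illustrative stand-in, not the book's example):

```python
# Sketch of the LDA projection: scatter matrices and eigendecomposition.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
n_features = X.shape[1]

SW = np.zeros((n_features, n_features))  # within-class scatter
SB = np.zeros((n_features, n_features))  # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    SW += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    SB += len(Xc) * diff @ diff.T

# Eigendecomposition of SW^(-1) SB; keep k = (classes - 1) = 2 directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:2]]

Y = X @ W  # data projected onto the 2-D discriminant space
print(Y.shape)  # (150, 2)
```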

In a nutshell, LDA provides a structured way to reduce the dimensionality of data while preserving information that is relevant for classification tasks. It is particularly effective when dealing with multiclass classification problems and can lead to improved classification accuracy by transforming the data into a space where class separability is maximized.
The following steps are taken for performing classification using Linear Discriminant Analysis (Fig. 1.7).
Step 1: Data Preparation
Gather and preprocess your labeled dataset. Each data point should have a set of
features and a corresponding class label. Ensure that the data satisfies the assumptions
of LDA, such as normal distribution and similar covariance matrices for different
classes.

Fig. 1.7 LDA processing flow

Step 2: Compute Class Means
Calculate the mean vector for each class by averaging the feature vectors of all data points belonging to that class.
Step 3: Compute Within-Class Scatter Matrix
Calculate the within-class scatter matrix (SW ) for each class. This matrix captures
the spread of data points within each class and is computed by summing the outer
product of the difference between each data point and its class mean.
Step 4: Compute Between-Class Scatter Matrix
Calculate the between-class scatter matrix (SB ) that measures the spread between
different classes. It is computed by summing the outer product of the difference
between class means and the overall mean.
Step 5: Eigenvalue Decomposition
Find the matrix St = SW^(-1) * SB and perform eigenvalue decomposition on it to obtain eigenvalues and eigenvectors. Sort the eigenvectors in descending order of their corresponding eigenvalues.
Step 6: Dimensions Selection
Select the top k eigenvectors (where k is the number of classes minus one, or a
smaller number chosen based on the desired dimensionality reduction) to form the
transformation matrix W. This means W is a matrix of eigenvectors corresponding
to k largest eigenvalues.
Step 7: Data Transformation
Transform the original data to the lower-dimensional space using the transformation
matrix W such that Y = X * W, where X is the original data matrix and Y is the
transformed data matrix.
Step 8: Classification
Apply a classification algorithm (e.g., logistic regression) on the reduced-
dimensional data Y to perform classification.
Step 9: Model Evaluation
Split your dataset into training and testing sets. Train the classification model on
the training data and evaluate its performance on the testing data using appropriate
metrics.
Step 10: Predictions
Once the model is trained, you can use it to make predictions on new, unseen data by
first transforming the new data using the transformation matrix W and then applying
the trained classifier.

Fig. 1.8 An illustration of the operating principle of LDA

Further details of how the above steps are accomplished can be found in [4]. Additionally, in Python the Scikit-learn API provides the sklearn.discriminant_analysis.LinearDiscriminantAnalysis class for performing LDA.
The basic principle of LDA is shown in Fig. 1.8.
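For completeness, here is a short sketch using that class (the wine dataset is an illustrative assumption); in scikit-learn, the same estimator both projects the data and classifies directly:

```python
# Sketch: LDA as dimensionality reducer and classifier in one estimator.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

lda = LinearDiscriminantAnalysis(n_components=2)  # classes - 1 = 2 dimensions
lda.fit(X_train, y_train)

print("projected shape:", lda.transform(X_test).shape)  # (n_samples, 2)
print("test accuracy  :", lda.score(X_test, y_test))
```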

Merits of LDA
Linear Discriminant Analysis (LDA) offers several advantages, making it a valu-
able technique in various machine learning and pattern recognition tasks. Some of
the key advantages of the Linear Discriminant Analysis algorithm are listed below
(Table 1.3).

Limitations of the LDA

Despite its advantages, LDA also has some limitations that should be considered when applying the algorithm to real-world problems. These limitations mainly stem from the underlying assumptions used by the algorithm, which in some cases may not fit the data under analysis. We briefly outline some of these limitations here.
LDA makes the assumption of a Gaussian distribution, which means that the data in each class is assumed to follow a Gaussian distribution. If this assumption is not met, the performance of LDA may degrade. In real-world datasets, the distribution of data may not always be Gaussian.
The assumption that the covariance matrices of the different classes are equal is also made to simplify the computation. This assumption may not hold true for all datasets, especially when classes have significantly different variances or covariance structures.
LDA is sensitive to outliers in the data. Outliers can disproportionately influence the estimation of class means and covariance matrices, leading to suboptimal results. As an example, outliers can greatly affect the calculation of class means. Since class means play a crucial role in LDA by determining the position of the decision boundary, outliers can pull the mean of a class in the direction of the outlier.

Table 1.3 Merits of the LDA algorithm

Dimensionality reduction with class separability: Feature dimensions are reduced while retaining most of the discriminatory information, since LDA aims to find a lower-dimensional space that maximizes the separation between classes, making it useful for visualization and feature extraction.

Improved classification performance: By focusing on class separability, LDA can lead to improved classification accuracy. It reduces the "curse of dimensionality" by projecting data onto a lower-dimensional space where classes are better distinguished.

Can handle multiclass problems: LDA can handle multiclass classification problems efficiently by transforming the data into a lower-dimensional space that optimally separates different classes, even when there are more than two classes.

Reduced overfitting: By projecting data onto a lower-dimensional space, LDA reduces the complexity of the model and mitigates the risk of overfitting. This is especially helpful when the number of training samples is limited.

Utilizes class information: LDA makes use of class labels during training, which allows it to capture information about class distributions. This results in a more informed transformation that improves class separability.

Data visualization: LDA can be used for data visualization, particularly in two- or three-dimensional space. It can help in understanding the distribution of data classes and their separability, aiding in exploratory data analysis.

Complementary to other algorithms: LDA can be used in conjunction with other classification algorithms, serving as a preprocessing step to improve the quality of features fed into subsequent classifiers.

Robustness to outliers: LDA is less sensitive to outliers compared to some other techniques like principal component analysis (PCA). It focuses on maximizing the ratio of between-class variance to within-class variance, which makes it more robust.

Interpretable results: The transformation matrix computed by LDA provides insights into how the features contribute to class separability. This transparency can be valuable for understanding the importance of different features.

Low computational cost: The computational complexity of LDA is generally lower than that of more complex algorithms like support vector machines (SVMs), making it efficient for large datasets.

Well-established theory: LDA has a strong theoretical foundation, which facilitates its understanding and implementation. It has been extensively studied and applied in various fields.

might lead to a decision boundary that does not accurately represent the actual distribution of the majority of the data. Similarly, outliers can lead to distorted within-class scatter, inaccurate between-class scatter, and loss of linearity, all of which negatively impact classification performance.
When the number of samples is small, LDA may overfit the data, especially if the number of features is large. In such cases, LDA can perform poorly due to the limited amount of data available for parameter estimation, effectively memorizing the available samples rather than generalizing. Another limitation is the restriction to linear separability: LDA aims to find linear boundaries that separate classes, and it may struggle when classes are not linearly separable, leading to reduced classification accuracy.
There are also problems associated with reduced performance in high-dimensional
data. In high-dimensional feature spaces, the “curse of dimensionality” can affect
the performance of LDA. This is because the assumptions of LDA become harder to
meet as the number of features increases.
LDA's primary focus lies in transforming data into a lower-dimensional space that enhances class separability. However, it lacks an inherent mechanism for explicit feature selection. This means that while it aims to improve classification accuracy through this transformation, it does not automatically identify or eliminate irrelevant or redundant features from the dataset. This leaves the algorithm implementer with the burden of identifying the set of features suitable for the problem at hand. This shortcoming is especially severe when the data has no obvious patterns that might give clues for feature selection.
Although LDA can be extended to address multiclass classification problems, its fundamental formulation is rooted in binary classification. Consequently, in complex scenarios involving multiple classes, the algorithm's binary origin may impact its behavior, and its core design remains influenced by this heritage even when techniques are used to expand it to multiclass scenarios.
The determination of the decision boundary by LDA takes into account the prior
probabilities of the different classes. This reliance on prior probabilities can be a
double-edged sword. When prior probabilities are accurate and unbiased, LDA can
produce effective results. However, if these probabilities are skewed or inaccurate,
it can lead to suboptimal outcomes, as the algorithm's performance is closely linked to these estimates.
LDA assumes a linear relationship between the features and the class labels. If the true relationship between these elements is nonlinear, LDA might struggle to capture the underlying patterns adequately. This limitation can affect its ability to accurately classify data in cases where nonlinearity is prevalent.
The LDA approach offers dimensionality reduction and feature transformation,
which are valuable for improving classification accuracy. However, the transformed
features generated by LDA might not be as intuitively interpretable as the original
features. This reduced interpretability can hinder its application in certain contexts
where understanding the relationships between features and classes is crucial.
It is essential to carefully consider the above limitations and assess whether LDA
is suitable for a particular problem. Depending on the nature of the data and the goals
of the analysis, alternative techniques like Quadratic Discriminant Analysis (QDA),
SVMs, or more advanced methods may be more appropriate choices. We will look
at these approaches in the following sections.

Improvements to the LDA Algorithm


Several extensions and improvements to the LDA algorithm have been proposed
to address its limitations. Here we provide a quick summary of these possible
improvements and alternative approaches:
Regularized Linear Discriminant Analysis is one way to address the shortcomings
of LDA. Regularization techniques, like Ridge or LASSO regression, can be applied
to LDA to address the issue of overfitting, especially when dealing with small sample
sizes and high-dimensional data [18]. As will be seen in the following subsection,
relaxing the assumption of equal covariance matrices for all classes by using diagonal
or different-shaped covariance matrices for each class can improve performance
when class covariances are not equal. In this class of solutions is the Quadratic
Discriminant Analysis (QDA). Instead of assuming a common covariance matrix
for all classes, QDA allows each class to have its own covariance matrix. This can
capture more complex relationships among features.
Another approach is the Regularized Discriminant Analysis (RDA). It is a varia-
tion of LDA that uses regularization to estimate class-specific covariance matrices,
addressing the issue of singularity when sample size is small [19].
Techniques such as Ledoit–Wolf shrinkage can be employed to improve covari-
ance matrix estimation, particularly when the number of features is large and sample
size is small. In addition, Kernel Discriminant Analysis (KDA) extends LDA to a
kernelized version, enabling nonlinear separation between classes. Kernel methods
can increase discriminative power in complex datasets.
On the other hand, Multiple Discriminant Analysis (MDA) is an extension of
LDA for multiple class problems that constructs a series of discriminant functions.
It provides a flexible framework for multiclass classification [20]. As usual, Dimen-
sionality Reduction Techniques like PCA or autoencoders can help mitigate issues
with high-dimensional data.

Innovative approaches combining deep learning and LDA such as Deep Linear
Discriminant Analysis (DLDA) have been proposed in the literature [21, 22]. Inte-
grating LDA into deep learning frameworks allows for learning more complex and
discriminative feature representations while retaining the benefits of LDA.
Other notable methods include ensemble approaches, where combining multiple LDA models, or LDA with other classification algorithms, can enhance classification performance and robustness; and Sparse Discriminant Analysis, which incorporates sparsity-inducing techniques to encourage feature selection and prioritize relevant features in the discriminant analysis process. More accurate estimation of class priors can also be employed to improve the performance of LDA, especially when prior probabilities are imbalanced.
The above improvements and extensions address various limitations of the tradi-
tional LDA algorithm and offer more flexibility, accuracy, and robustness in various
scenarios.
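As a concrete illustration of the regularization idea, scikit-learn's LinearDiscriminantAnalysis supports shrinkage estimation of the covariance matrix directly. The following minimal sketch (the synthetic dataset and parameter choices are ours, purely for illustration) compares plain LDA against Ledoit-Wolf shrinkage LDA on a small, high-dimensional problem:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Small-sample, high-dimensional data, where plain LDA tends to overfit
X, y = make_classification(n_samples=60, n_features=40, n_informative=10,
                           random_state=0)

# Plain LDA (default SVD solver) versus Ledoit-Wolf shrinkage LDA
plain = LinearDiscriminantAnalysis(solver="svd")
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

for name, clf in [("plain LDA", plain), ("shrinkage LDA", shrunk)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

With few samples and many features, the shrunk covariance estimate typically stabilizes the discriminant directions and improves cross-validated accuracy.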

Quadratic Discriminant Analysis (QDA)


Quadratic Discriminant Analysis (QDA) is a statistical classification algorithm
that extends the concept of Linear Discriminant Analysis (LDA) to accommodate
nonlinear relationships between features and class labels. QDA is a supervised
learning algorithm used for classification tasks, where the goal is to predict the
class of a data point based on its features. QDA models the distribution of each class
using quadratic functions, allowing it to capture more complex decision boundaries
compared to LDA.
We provide a detailed overview of the Quadratic Discriminant Analysis algorithm
below by explaining the steps involved in the processing (Fig. 1.9).
Step 1: Data Preparation
Gather a labeled dataset with features and corresponding class labels. It is vital to
ensure that the data satisfies the assumptions of QDA, including the assumption of
multivariate Gaussian distribution for each class. Data preparation can make a difference in the final model performance. We briefly describe some of the points that need to be considered in the data preparation phase.
The data cleaning process removes or corrects any missing values in the dataset.
QDA, like other algorithms, requires complete data to function properly. The target
is to identify and handle any outliers that could potentially skew the covariance
estimates used in QDA. During data preparation, feature selection and extraction are performed. This involves evaluating the relevance of each feature to the classification task. Redundant, irrelevant, or weakly discriminative features are removed at this stage. Techniques that can be considered for this include PCA, which can reduce the dimensionality of the feature space while retaining important information.
It should also be ensured that each class in the dataset has sufficient representation. Highly imbalanced classes might lead to biased results. Basically, it is ideal
to have equal data samples for each class if sufficient data is available. To achieve
this, techniques like oversampling or undersampling may be used if necessary.
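As a minimal sketch of the oversampling idea (the class sizes below are illustrative), the minority class can be resampled with replacement using scikit-learn's resample utility:

import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced dataset: 90 samples of class 0, 10 of class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]

# Oversample the minority class with replacement to match the majority size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(X_bal.shape, np.bincount(y_bal))  # now 90 samples per class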

Fig. 1.9 QDA processing flow

Normalization or standardization is performed during data preparation. Depending on the nature of the features, one might need to normalize or standardize them to ensure that they are on a similar scale.
the range 0–1. Normalization is also important for QDA’s covariance calculations.
If the dataset contains categorical variables, they need to be properly encoded as
numerical values. Techniques like one-hot encoding can be applied.
Having done the necessary data processing, the next step is to decide on the train-test split for the dataset. At this point the dataset is divided into training and testing sets; normally 80–20 or 70–30 splits are applied. The training set is used to estimate the parameters of the QDA model, while the testing set evaluates its performance. Although not of immediate use, appropriate evaluation metrics for the classification task can be chosen at this stage. Accuracy, precision, recall, F1-score, and the confusion matrix are common metrics used to assess QDA's performance.
As a preliminary step, data visualization can be important to see the nature of the data at hand. Visualization helps to understand the distribution of classes, potential overlaps, and the separation between classes. One example is the commonly used Iris dataset, where simply visualizing the class histograms can already give an idea of separable and non-separable features, thereby aiding the reduction of the feature space. Visualization can help you make informed decisions about whether QDA is suitable for your data. If the features are highly correlated, multicollinearity might impact the accuracy of covariance estimates in QDA. In this case one can consider addressing multicollinearity through feature transformation or regularization.
Finally, real data does not always come in the assumed distributions and desired size; in fact, this happens more often than not. In that respect, handling non-Gaussian distributions and small sample sizes should be considered. QDA assumes that the feature distributions within each class are multivariate normal. If this assumption is violated, consider transforming your data to approach normality. Additionally, if one has a small dataset, QDA should be applied cautiously, as it might lead to overfitting. Techniques like regularized discriminant analysis or dimensionality reduction can be used in such cases.
Step 2: Compute Class Statistics
Calculate the mean vector and covariance matrix for each class. These statistics
provide information about the distribution of data within each class.
Quadratic Discriminant Function:
For each class, QDA models the class distribution using a quadratic function.
The quadratic discriminant function d_j(x) can be represented as:

d_j(x) = -\frac{1}{2}\log\left|\Sigma_j\right| - \frac{1}{2}\left(x - \mu_j\right)^{T}\Sigma_j^{-1}\left(x - \mu_j\right) + \log p_j

\Sigma_j = \frac{1}{n_j}\sum_{k \in C_j}\left(x_k - \mu_j\right)\left(x_k - \mu_j\right)^{T}

where the index j denotes the class, and \Sigma_j, \mu_j, and p_j are the covariance matrix, mean, and prior probability of class j, respectively, given the data x.
The objective is to calculate the quadratic discriminant function for each class
and assign the point to the class with the highest discriminant score.
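For concreteness, a direct NumPy translation of these two formulas might look like the following sketch (the function names are ours, and numerical safeguards such as regularizing near-singular covariance matrices are omitted):

import numpy as np

def qda_fit(X, y):
    """Estimate per-class mean, covariance, and prior from training data."""
    params = {}
    for j in np.unique(y):
        Xj = X[y == j]
        mu = Xj.mean(axis=0)
        # Biased covariance estimate (divide by n_j), matching the formula above
        Sigma = (Xj - mu).T @ (Xj - mu) / len(Xj)
        params[j] = (mu, Sigma, len(Xj) / len(y))
    return params

def qda_predict(X, params):
    """Assign each point to the class with the highest discriminant score."""
    classes = sorted(params)
    scores = []
    for j in classes:
        mu, Sigma, p = params[j]
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        # d_j(x) = -0.5*log|Sigma_j| - 0.5*(x-mu_j)^T Sigma_j^{-1} (x-mu_j) + log p_j
        d = (-0.5 * np.log(np.linalg.det(Sigma))
             - 0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
             + np.log(p))
        scores.append(d)
    return np.array(classes)[np.argmax(scores, axis=0)]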
Step 3: Model Training
QDA does not have an explicit training phase like some other algorithms. The model
parameters (mean vectors and covariance matrices) are estimated directly from the
training data.
Step 4: Regularization
If the covariance matrices are ill-conditioned or if the number of training samples is
small, regularization techniques can be optionally applied to stabilize the parameter
estimation.
Step 5: Model Evaluation
Split the dataset into training and testing sets. Train the QDA model on the training
data and evaluate its performance on the testing data using appropriate metrics
(accuracy, precision, recall, F1-score, etc.).

Step 6: Hyperparameter Tuning


If needed, tune hyperparameters such as the regularization parameter to improve the
model’s performance.
Step 7: Prediction
Once the model is trained and evaluated, you can use it to make predictions on new,
unseen data by calculating the quadratic discriminant functions and assigning the
data points to the class with the highest score.
QDA is a versatile algorithm that can capture complex decision boundaries
and perform well when classes have different covariance structures. It can be a
powerful tool for classification tasks where the relationships between features and
class labels are nonlinear. However, it’s important to consider the assumptions of
QDA, such as Gaussian distribution and class-specific covariance matrices, to ensure
its applicability to the data at hand.
Details of how the above steps are accomplished can be found in [4].
Additionally, in Python the Scikit-learn API provides the sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis class for flexibly performing QDA. An example of such class separation is shown in Fig. 1.10.
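A minimal usage sketch of this class (using the Iris dataset and an 80-20 split purely for illustration) could be:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # the 80-20 split discussed above

qda = QuadraticDiscriminantAnalysis()  # reg_param can add regularization (Step 4)
qda.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, qda.predict(X_test)))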
It is important to keep in mind that QDA assumes class-specific covariance
matrices, which allows it to capture different covariance structures for each class.
This assumption should align with the characteristics of your data.
QDA is a powerful classification algorithm that can handle complex decision
boundaries and perform well when classes have distinct covariance structures. It
is particularly useful when the relationship between features and class labels is
nonlinear. However, as with any algorithm, it’s important to preprocess the data
appropriately and ensure that the assumptions of QDA are met for accurate results.

Fig. 1.10 Example of nonlinear decision boundary for the classification of a two-class problem by the QDA. Such cases cannot be handled by LDA

Merits of QDA
The Quadratic Discriminant Analysis (QDA) algorithm offers several advantages that
make it a valuable tool for certain classification tasks. Here are the key advantages
of the QDA algorithm (Table 1.4).
QDA is particularly useful when data exhibits nonlinear relationships and varying
covariance structures among classes. It can provide accurate and flexible classifica-
tion in scenarios where linear classifiers might not be suitable. However, it’s important
to consider the assumptions of QDA, such as Gaussian distribution and class-specific
covariance matrices, to ensure its applicability to the given data.

Limitations of QDA
The Quadratic Discriminant Analysis (QDA) algorithm, while advantageous in
many aspects, also has limitations that need to be considered when applying it to
classification tasks. Here are the main limitations of the QDA algorithm (Table 1.5).
Despite these limitations, QDA can still be a powerful tool for classification tasks,
especially when data exhibits nonlinear relationships and varying covariance struc-
tures. It’s important to carefully assess whether QDA is appropriate for a specific
problem and ensure that the assumptions of the algorithm are met for accurate and
reliable results.

Improvements of QDA
While Linear Quadratic Discriminant Analysis (LQDA) is a simplified version of
Quadratic Discriminant Analysis (QDA) that assumes equal covariance matrices for
all classes, there are some possible improvements and variations that can enhance its
performance and address its limitations. Some of these improvements are common
across multiple object detection and classification methods. Here we summarize
some of the common approaches.
Introducing regularization techniques to mitigate the effects of ill-conditioned covariance matrices or situations with limited data is almost standard for most classification methods, and QDA is no exception. Regularized Linear Quadratic Discriminant Analysis can stabilize parameter estimation and prevent overfitting. Another
approach is to modify the algorithm to allow for different covariance matrices in
localized regions of the feature space. This approach can improve accuracy by
accommodating varying data distributions [23].
Implementing feature selection or dimensionality reduction techniques before
applying LQDA is another common method. Reducing the number of features can
improve the algorithm's performance, especially in high-dimensional spaces. It is also worthwhile to consider ensemble methods: employing Bagging or Boosting with LQDA as the base classifier can lead to enhanced robustness and accuracy by combining multiple classifiers.
Hybrid models can also be investigated: for example, combining LQDA with other classification algorithms such as logistic regression, naive Bayes, or support vector machines; utilizing distance-based classifiers, like KNN, in conjunction

Table 1.4 Merits of QDA

Captures nonlinear relationships: Unlike Linear Discriminant Analysis (LDA), QDA can capture nonlinear relationships between features and class labels. This makes QDA suitable for classification problems where classes have complex decision boundaries.
Flexible decision boundaries: QDA allows for more flexible decision boundaries compared to linear methods like LDA or logistic regression. It can model curved decision boundaries, enabling it to handle a wider range of data distributions.
Takes into account different covariance structures: QDA assumes class-specific covariance matrices, which means it can capture varying covariance structures within different classes. This is particularly useful when classes have distinct variations or dispersions.
Makes no assumption of equal covariance matrices: Unlike LDA, QDA does not assume that all classes share the same covariance matrix. This makes QDA more robust when dealing with datasets where covariance matrices differ significantly.
Can handle multimodal distributions: QDA can effectively model and differentiate between classes with multiple peaks or modes in their distributions, which might be challenging for linear classifiers.
Can perform reasonably on small datasets: QDA can perform well even when the dataset is small, as it can leverage the available data to estimate class parameters more accurately.
Probabilistic classification: QDA inherently provides probabilistic classification. It estimates the class membership probabilities based on the Gaussian distributions, allowing for probabilistic interpretation of predictions.
Interpretability: QDA provides interpretable results, as the decision boundaries and classification probabilities are derived from explicit mathematical models.
Applicable to non-Gaussian data cases: While QDA assumes Gaussian distributions, it can still perform reasonably well on data that are approximately Gaussian or have distributions close to Gaussian.
Regularization may be optionally applied: QDA can be regularized to handle ill-conditioned covariance matrices or situations with limited data, improving stability and preventing overfitting.
Ensemble and hybrid approaches are possible: QDA can be integrated into ensemble methods or hybrid models, combining its strengths with other algorithms for improved performance.
Class-specific information modeling possible: By modeling class-specific distributions, QDA can uncover unique characteristics of each class, which might be important for interpreting and understanding the data.

with LQDA to incorporate local information for classification; and exploring semi-supervised or self-training techniques [24]. Hybrid models can leverage the strengths of each algorithm to improve overall performance.
Other improvements worth mentioning include Kernel Linear Quadratic Discrim-
inant Analysis which extends LQDA using kernel methods to allow for nonlinear
decision boundaries. Kernel LQDA can capture complex relationships between
features and classes. Finally, the development of interpretable variations of LQDA that provide insights into the decision-making process, similar to linear models, while still capturing more complex relationships, can be considered [25, 26].
As with LDA, it is important to note that while these improvements can enhance
LQDA’s performance, they might introduce additional complexity or computational
requirements. This trade-off between complexity and performance gains always
exists.

1.2.3 Support Vector Machine (SVM)

Due to the importance of the SVM algorithm, we first give a brief description of its
historical background. The foundation for SVMs was laid by the work of Vladimir
Vapnik and Alexey Chervonenkis in the late 1960s [27]. They introduced the concepts
of “structural risk minimization” and the “VC dimension,” which form the theoretical
basis for SVMs.
The concept of SVMs as we know them today was developed by Vapnik and his
team at AT&T Bell Laboratories in the 1990s. In 1992, Bernhard Boser, Isabelle
Guyon, and Vapnik introduced the first algorithm for training linear SVMs [17].
In 1995, Corinna Cortes and Vapnik introduced the Support Vector Classifi-
cation (SVC) algorithm. The “kernel trick,” a fundamental aspect of SVMs that
allows nonlinear classification by mapping data into higher-dimensional spaces, was
proposed by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik in 1992. This
enabled SVMs to tackle complex classification problems.
SVMs gained popularity in the early 2000s due to their strong theoretical founda-
tions and good generalization properties. The development of SVMs was intertwined
with the progress of kernel methods and machine learning in general. Researchers

Table 1.5 Limitations of QDA

Low performance for high-dimensional data: QDA can struggle with high-dimensional datasets, where the number of features is large. As the dimensionality increases, the number of parameters in the covariance matrices grows, which can lead to overfitting and computational challenges.
Large sample size required: QDA may not perform well with a small number of training samples for each class. Having too few samples relative to the dimensionality of the data can lead to unreliable covariance matrix estimates.
Curse of dimensionality problem: While QDA can capture nonlinear relationships, it is still susceptible to the curse of dimensionality. The performance of QDA may degrade as the number of features increases, especially when the data is sparse.
Assumption of Gaussian distribution: QDA assumes that each class follows a multivariate Gaussian distribution. If the data does not adhere to this assumption, QDA may provide suboptimal results.
Computational complexity: QDA involves the estimation of class-specific covariance matrices, which can be computationally intensive, especially for large datasets or datasets with many features.
Sensitivity to outliers: QDA can be sensitive to outliers, as they can significantly impact the estimation of covariance matrices and, consequently, the decision boundaries.
Higher variance with limited data: In situations where there is limited data per class, LDA might have an advantage due to its assumption of shared covariance matrices. QDA, with its lower bias, could suffer from higher variance due to the smaller sample size.
Tuning parameters: QDA may require the estimation of more parameters (covariance matrices) than Linear Discriminant Analysis (LDA), which could lead to overfitting when the training sample size is small.
Limited generalization to new data: If the assumptions of Gaussian distribution and class-specific covariance matrices are not met in new data, QDA's performance may degrade.
Not suitable for online learning: QDA typically requires retraining the entire model when new data arrives. This might not be efficient for scenarios where data streams in real time.
Limited interpretability: While QDA provides interpretable results, the nonlinear decision boundaries it generates might be harder to explain compared to linear methods like LDA.
Limited performance on linear data: In cases where the relationships between features and classes are mostly linear, QDA might not provide significant advantages over simpler linear classifiers.
introduced various enhancements, like the formulation of multiclass classification using One-vs-One and One-vs-Rest strategies, and extensions to handle regression
and ranking tasks. The field of machine learning saw the emergence of deep learning
methods, which led to some reduction in SVM’s popularity for certain tasks, but
SVMs still remain relevant for various applications. SVMs and their variants continue
to be an active area of research, with efforts focused on optimization techniques, large-
scale implementations, and the integration of SVMs with other machine learning
approaches. Today, SVMs are widely used in many fields such as image classifi-
cation, text categorization, bioinformatics, finance, and more. Their evolution from
theoretical foundations to practical applications has contributed significantly to the
advancement of machine learning and pattern recognition.
SVM is a powerful and versatile supervised machine learning algorithm used
for classification and regression tasks [4, 7, 17, 27–38]. It’s particularly effective
in scenarios where the data is not linearly separable and requires finding a clear
decision boundary between classes or predicting continuous values. SVM aims to
find the optimal hyperplane that best separates different classes in the feature space
while maximizing the margin between them.
SVM seeks to find the hyperplane that maximizes the margin between classes.
The margin is the distance between the hyperplane and the nearest data points from
each class, which are called support vectors.
The optimization problem involves finding the weights (w) and bias (b) that define the hyperplane or decision boundary w^T x + b = 0 while at the same time maximizing the margin m = 2/||w|| and minimizing the classification error.
In cases where the data is not linearly separable, SVM can transform the data into
a higher-dimensional space using a kernel function (e.g., polynomial, radial basis
function) to create a hyperplane that can separate classes.
In real-world scenarios, data might not be perfectly separable. Soft margin
SVM allows for some misclassification by introducing a penalty parameter (C) that
balances between maximizing the margin and minimizing the classification error.
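In its standard textbook form, the soft margin problem introduces slack variables \xi_i that measure margin violations and solves:

\min_{w,\,b,\,\xi}\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad y_i\left(w^{T}x_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\ldots,n

A large C punishes violations harshly, approaching the hard margin case, while a small C tolerates them in exchange for a wider margin.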
Moreover, Kernel SVM extends the algorithm to handle nonlinear classification.
It uses a kernel function to implicitly map the data into a higher-dimensional space,
making it possible to find a hyperplane that can separate classes.
SVM can be applied to regression tasks using support vector regression (SVR),
where the goal is to predict continuous values instead of discrete classes. SVR aims
to minimize the deviation of predicted values from the actual target values.
Appropriate hyperparameters, such as the type of kernel, kernel parameters, and the regularization parameter (C for C-SVM), must be selected; cross-validation is often used to find optimal values.
As with most classification algorithms, the dataset is split into training and testing sets. The SVM model is trained on the training data using the selected kernel and hyperparameters, and its performance is then evaluated on the testing data using appropriate metrics (accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression). Once trained, the SVM model can be used to predict the class label (classification) or continuous value (regression) of new, unseen data points.

SVM is known for its ability to handle complex decision boundaries and perform
well on various types of data. Its effectiveness in high-dimensional spaces and its
ability to handle nonlinear relationships through kernel functions make it a popular
choice in many machine learning applications. However, SVM’s training time and
complexity can increase with larger datasets, and the selection of appropriate kernels
and hyperparameters requires careful consideration to achieve optimal results.

Steps of the SVM Algorithm


Performing classification using the SVM algorithm involves several steps, from data
preparation to making predictions. Below is a step-by-step flow on how to perform
classification with SVM (Fig. 1.11).
Step 1: Data Preparation
Collect and preprocess your labeled dataset, consisting of features and corresponding
class labels.
Ensure the data is properly scaled and normalized to prevent features with larger
scales from dominating the optimization process.

Fig. 1.11 SVM processing flow

Step 2: SVM Type Selection


Decide whether you’re performing binary classification (separating two classes) or
multiclass classification (separating more than two classes).
For binary classification, choose between the standard SVM and the soft margin
SVM (C-SVM) depending on the data separability. C-SVM allows for some
misclassification to find a better overall separation.
Step 3: Kernel Selection
If your data is not linearly separable in its current form, choose an appropriate kernel
function to transform the data into a higher-dimensional space where it becomes sepa-
rable. Common kernels include Linear, Polynomial, Radial Basis Function (RBF),
and Sigmoid.
Kernel selection is a pivotal step in the SVM algorithm, especially when dealing
with complex, nonlinearly separable data. SVMs achieve their power by implic-
itly transforming the original data into a higher-dimensional space through kernel
functions, enabling them to find nonlinear decision boundaries. However, real-world
data often exhibits intricate patterns that cannot be separated linearly in the original
feature space. Kernels allow SVMs to map the data to a higher-dimensional space
where a linear boundary can separate the transformed points effectively. The choice
of kernel function determines how the data is transformed and how well the SVM
captures its underlying structure.
Four types of kernels are commonly used: linear, polynomial, Radial Basis Function (RBF), and sigmoid kernels. With linear kernels, the kernel performs no transformation and represents the original feature space; this is suitable for linearly separable data. Polynomial kernels transform data into a higher-dimensional space using polynomial functions. They are useful for capturing moderate levels of nonlinearity, and the degree parameter controls the degree of the polynomial. Radial Basis Functions, also known as Gaussian kernels, are the most widely used. They map data into an infinite-dimensional space, capturing complex nonlinear relationships. The gamma parameter determines the extent of influence of each data point. Lastly, sigmoid kernels use hyperbolic tangent functions to transform data. They can be effective, but their performance might depend heavily on the choice of parameters.
Selecting the appropriate kernel depends on the characteristics of the data under analysis. If the data is not linearly separable in its original space, polynomial, RBF, or sigmoid kernels, which capture nonlinear relationships, can be considered. High-dimensional data might benefit from kernels that are more flexible in capturing complex relationships. The choice of kernel parameters (such as degree for the polynomial kernel and gamma for the RBF kernel) is critical; proper tuning through techniques like cross-validation ensures optimal performance. Overfitting could be a problem when using highly flexible kernels like polynomial with high degrees or RBF with large gamma. Regularization techniques might be needed in such cases.
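To make the kernel choice concrete, the following sketch (toy dataset and settings are illustrative) compares the four common kernels by cross-validation:

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the original space
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Scale first, since SVM is sensitive to feature scales
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:8s} kernel: mean CV accuracy = {scores.mean():.3f}")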

Step 4: SVM Problem Formulation


For binary classification, the goal is to find the optimal hyperplane that maximizes
the margin between the two classes. This involves minimizing a cost function that
accounts for misclassified points while maximizing the margin.
For multiclass classification, techniques like One-vs-One (OvO) or One-vs-Rest
(OvR)/One-vs-All (OvA) are used to extend binary SVM to handle multiple classes.
In the One-vs-One approach, a separate binary classifier is trained for each pair of
classes. If there are N classes, this results in N × (N − 1)/2 binary classifiers. During
training, each classifier is trained using data from its respective pair of classes. When
classifying a new data point, each classifier predicts a class label. The class label with
the majority votes is chosen as the final prediction. OvO can handle complex decision
boundaries for each pair of classes, but it can become computationally expensive for
a large number of classes due to the need to train multiple classifiers.
In the OvR approach, a separate binary classifier is trained for each class, treating
it as the positive class and grouping all other classes as the negative class. If there
are N classes, this results in N binary classifiers. During training, each classifier is
trained using data from the positive class and the aggregated negative class. When
classifying a new data point, each classifier predicts whether the point belongs to its
positive class or not. The class associated with the classifier that gives the highest
confidence is chosen as the final prediction. OvR is more suitable for scenarios with
a large number of classes since it requires fewer classifiers than OvO.
In terms of computational complexity, OvR is usually more efficient than OvO when dealing with many classes, as it requires training only N classifiers compared to the N × (N − 1)/2 classifiers in OvO. For predictive performance,
the OvO can potentially lead to more accurate results because it focuses on binary
comparisons for each pair of classes. However, OvR might be more balanced in terms
of training data distribution. It can also be noted that OvR is simpler to implement,
as it involves training and evaluating a set of binary classifiers independently. Addi-
tionally, OvR might perform better in cases of class imbalance, as it balances class
distributions by treating each class as the positive class once.
Therefore, the choice between OvO and OvR depends on factors like computa-
tional efficiency, predictive performance, and class distribution. Example applica-
tions of these approaches can be found in [38, 39].
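Both strategies are available as generic wrappers in scikit-learn's multiclass module; a minimal sketch (Iris is used purely for illustration, so OvO trains 3 × 2/2 = 3 binary SVMs) is:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

for name, wrapper in [("One-vs-One", OneVsOneClassifier),
                      ("One-vs-Rest", OneVsRestClassifier)]:
    clf = wrapper(SVC(kernel="rbf", gamma="scale"))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")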
Step 5: Hyperparameter Tuning
Choose the hyperparameters for the SVM, such as the regularization parameter
(C in C-SVM), kernel parameters (if applicable), and other settings specific to the
chosen SVM variant. Then, perform cross-validation to find the best combination of
hyperparameters that yields the highest performance on validation data.
SVM Hyperparameters:
SVMs have hyperparameters that are not learned from the data itself but are set
before training the model. Some essential hyperparameters include the regularization
parameter (often denoted as C in C-SVM), the choice of kernel (linear, polynomial,

radial basis function, etc.), and the associated kernel-specific parameters (e.g., degree
for polynomial kernel, gamma for RBF kernel).
The selection of hyperparameters can dramatically impact the SVM’s ability to
generalize well on new, unseen data. An incorrect choice of hyperparameters can lead
to overfitting (when the model fits the training data too closely but doesn’t perform
well on new data) or underfitting (when the model is too simplistic to capture the
underlying patterns).
To determine the best combination of hyperparameters for your SVM, you typi-
cally use a technique called cross-validation. Cross-validation involves splitting your
training data into multiple subsets or folds. You train the SVM on several combi-
nations of hyperparameters and evaluate its performance on different folds. This
helps you understand how well the SVM generalizes to unseen data under various
hyperparameter settings.
One common approach for hyperparameter tuning is grid search. In grid search,
you define a range of possible values for each hyperparameter, and the algorithm
tries every possible combination of these values. For each combination, you train
the SVM using cross-validation and measure its performance. The combination of
hyperparameters that yields the best validation performance is selected as the optimal
set.
Grid search can be computationally expensive, especially when dealing with
multiple hyperparameters or large datasets. Random search is an alternative where
you randomly sample from the hyperparameter space. Bayesian optimization is
another approach that uses probabilistic models to find the next set of hyperpa-
rameters to evaluate based on past performance.
The regularization parameter (C) controls the trade-off between maximizing the margin and minimizing the classification error. Larger C values penalize misclassification more heavily, resulting in a smaller margin with fewer training points inside it, while smaller C values allow a wider margin at the cost of more margin violations. Kernel parameters like gamma in the RBF kernel influence the flexibility of the decision boundary. These parameters require careful tuning to prevent overfitting or underfitting.
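A grid search over C and gamma as described above might look like the following sketch (the grid values are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # regularization strength
    "gamma": [0.001, 0.01, 0.1, 1],  # RBF kernel width
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))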
Step 6: Training the SVM
Train the SVM model on the training data using the chosen hyperparameters and
kernel. During training, the SVM optimizer adjusts the weights and bias of the
hyperplane to create the optimal decision boundary that separates the classes.
Step 7: Model Evaluation
Evaluate the trained SVM model on a separate testing dataset to assess its perfor-
mance. Use appropriate evaluation metrics such as accuracy, precision, recall,
F1-score, or ROC curves to measure the model’s effectiveness.
Step 8: Fine-Tuning (Optional)
If the performance is not satisfactory, it is possible to go back to hyperparameter tuning, try different kernels, or consider adjusting the dataset to improve results.

Step 9: Prediction
Once the model’s performance becomes satisfactory, one can use it to make predic-
tions on new, unseen data points. Apply the same preprocessing steps (scaling,
normalization) to the new data before making predictions.
Step 10: Model Interpretation
Depending on the kernel used, SVM might offer insights into feature importance,
allowing one to understand which features contribute most to the classification
decision. This step can be optionally performed.
Step 11: Deployment
Deploy the trained SVM model into production environments for making real-time
predictions on new data.
SVM is a versatile algorithm that can handle a variety of classification tasks,
from linear to nonlinear and binary to multiclass. Its effectiveness relies on proper
data preprocessing, kernel selection, and hyperparameter tuning to achieve optimal
performance.
The application of SVM for linear and nonlinear cases is illustrated in Figs. 1.12
and 1.13.
More details of how the above steps are accomplished can be found in [4]. Addi-
tionally, in Python the Scikit-learn API provides the sklearn.svm module for handling
the multiple variations of the SVM algorithms.
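Tying the steps together, an end-to-end sketch with sklearn.svm (assuming an RBF kernel, an 80-20 split, and synthetic data purely for illustration) could be:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scaling (Step 1) followed by a soft margin RBF SVM (Steps 2-6)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

# Step 7: evaluate with standard classification metrics
print(classification_report(y_test, model.predict(X_test)))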

Merits of SVM Algorithm


The SVM algorithm offers several advantages that contribute to its popularity and
effectiveness in various machine learning applications. Below are some of the key
advantages of SVM (Table 1.6).

Fig. 1.12 This is a scenario where two classes can be separated by linear hyperplanes. In the illustration, white circles represent one class (Class 1), while solid circles represent another class (Class 2). The optimal hyperplane that distinctly separates these classes is depicted as a bold black line. Circles aligned along the dashed lines are known as support vectors, having a significant influence on defining the hyperplane

Fig. 1.13 In this case, the classes are separable by a nonlinear hyperplane

Kernel Trick for Dimensionality Reduction: The kernel trick can also be used for
dimensionality reduction, which can be useful when dealing with high-dimensional
data.
SVM’s combination of flexibility, generalization capability, and robustness makes
it a valuable tool in various domains, such as image classification, text categorization,
bioinformatics, and more. However, it’s important to fine-tune hyperparameters and
choose the appropriate kernel for each problem to achieve optimal results.

Limits of SVM
While the SVM algorithm offers numerous advantages, it also comes with some
limitations and challenges that need to be considered when applying it to different
machine learning tasks. Here are the main limitations of the SVM algorithm
(Table 1.7).
While SVM is a powerful algorithm with wide-ranging applications, it’s impor-
tant to be aware of its limitations and carefully consider whether it is the appropriate
choice for a specific problem. Addressing these limitations may involve using tech-
niques like feature engineering, kernel tuning, and model evaluation to ensure optimal
performance.

Improvements to the SVM Algorithm


Several improvements and variations have been proposed to enhance the performance
and address some limitations of the SVM algorithm. We give some of the possible
improvements and extensions for SVM.
Automated kernel selection strategies can be developed for choosing the most suitable kernel for a given dataset. This could involve exploring various kernels and measuring their impact on cross-validation performance. Additionally, combining SVM with Stochastic Gradient Descent (SGD) optimization to improve

Table 1.6 Merits of SVM algorithm

Effective in high-dimensional feature spaces: SVM performs well even in high-dimensional feature spaces, making it suitable for complex data that might be difficult to separate using linear methods.
Nonlinearity handling: SVM can efficiently handle nonlinear relationships between features and classes through the use of kernel functions, enabling it to capture complex decision boundaries.
Robust generalization: SVM aims to maximize the margin between classes, which helps it to generalize well to new, unseen data and reduces overfitting.
Flexibility in the choice of kernels: The choice of kernel functions (linear, polynomial, RBF, sigmoid, etc.) allows SVM to be adapted to different types of data and problem domains, enhancing its versatility.
Global optimization achievable: SVM's objective function aims to find the hyperplane that maximizes the margin, resulting in a globally optimal solution rather than getting stuck in local minima.
Robustness to overfitting: By minimizing the classification error and maximizing the margin, SVM is less prone to overfitting, even when the number of features is greater than the number of samples.
Effective in small datasets: SVM can perform well with small datasets, as it focuses on the most informative points (support vectors) rather than relying on the entire dataset.
Insensitivity to irrelevant features: SVM is relatively insensitive to irrelevant features, focusing on the most discriminative features that contribute to separating the classes.
Regularization: In the case of the soft margin SVM, the hyperparameter C controls the trade-off between achieving a large margin and allowing some misclassification. This built-in regularization prevents excessive model complexity.
Handling unbalanced data: SVM can handle class imbalance by assigning different weights to classes or by utilizing techniques such as cost-sensitive learning.
Interpretability (linear kernel): With the linear kernel, SVM can provide insights into feature importance, helping to understand the contributions of different features to the classification decision.
Well-studied theory: SVM is built on solid mathematical principles, and its theoretical foundations are well-established, making it easier to understand, analyze, and implement.
Support for different problems: SVM can be applied to both classification and regression tasks, making it a versatile tool for a wide range of machine learning problems.
Consistency in high-dimensional data: SVM's ability to maximize margins helps maintain classification consistency, even when the number of features is much larger than the sample size.

scalability and speed up training, especially for large datasets, is one possible approach [40]. Recently, incremental learning has been a subject of investigation: the creation of incremental or online SVM algorithms that can adapt to new data without retraining the entire model could be effective. This is particularly useful for real-time or streaming data scenarios [41].
Applying advanced regularization techniques to SVM helps handle noisy data and improve generalization. Techniques like L1 regularization or Elastic Net can help with feature selection and reduce model complexity [42].
Another option is the SVM ensemble approach, which involves building ensemble models using multiple SVM classifiers to enhance performance; techniques like Bagging or Boosting can combine multiple SVMs to achieve better generalization [43]. Hybrid models combine SVM with other algorithms, such as Decision Trees or Neural Networks, to leverage their strengths and achieve improved performance. For multiclass SVM, the idea is to develop specialized algorithms for multiclass classification that go beyond One-vs-One and One-vs-Rest approaches; hierarchical classification or direct optimization methods could be explored [44].
For kernel learning, investigating methods for automatically learning the optimal kernel from the data, potentially using unsupervised learning techniques to uncover meaningful transformations, has a chance to improve performance. In the case of imbalanced data, techniques to adapt SVM to imbalanced class distributions, such as cost-sensitive learning, adjusting class weights, or generating synthetic samples for the minority class, can be considered.
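One readily available form of cost-sensitive learning is class weighting, which in scikit-learn is a single argument; the sketch below (illustrative 9:1 imbalance) compares a plain and a class-weighted SVM on minority-class F1:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Roughly 9:1 class imbalance, purely illustrative
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

plain = SVC(kernel="rbf")
weighted = SVC(kernel="rbf", class_weight="balanced")  # cost-sensitive variant

for name, clf in [("plain SVM", plain), ("class-weighted SVM", weighted)]:
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean CV F1 (minority class) = {f1.mean():.3f}")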
Designing interpretable kernels that provide insights into the decision boundary and feature importance, making SVM results easier to understand, and extending SVM to handle structured data like graphs or sequences by incorporating domain-specific similarity measures or defining custom kernel functions are possible avenues to follow.
Scalability improvements are also important for SVM. The development of parallel and distributed versions of SVM algorithms to accelerate training and improve scalability on distributed computing frameworks can be considered.
Other approaches include multilabel classification, which extends SVM to handle problems where instances can belong to multiple classes

Table 1.7 Limitations of SVM

High computational complexity for large datasets: For large datasets, training an SVM can be computationally intensive. The time complexity can become prohibitive as the dataset size increases.
Sensitivity to noise: SVM is sensitive to noisy data, especially outliers. Outliers can have a significant impact on the position of the hyperplane and the margin, potentially leading to overfitting.
High memory resource usage: SVM models can require significant memory, especially when dealing with large datasets and high-dimensional feature spaces.
Choice of kernel function affects performance: The choice of the kernel function can greatly affect the performance of the SVM. Selecting an inappropriate kernel for the data can lead to suboptimal results.
Difficult to train with large datasets: SVM's training time increases significantly with the number of data points, making it less suitable for very large datasets.
Limited interpretability (nonlinear kernels): While linear SVMs provide interpretable feature importance, nonlinear kernels (e.g., polynomial, RBF) can make the model's decision boundary difficult to understand and visualize.
Model selection and tuning difficult: Selecting the appropriate kernel and hyperparameters can be challenging. Improper choices may lead to poor generalization or overfitting.
No probabilistic output for interpretation of results: SVM does not provide direct probabilistic outputs like some other algorithms (e.g., logistic regression), making it less straightforward to interpret classification confidence.
Not suitable for noisy labels: Performance of SVM degrades in situations with mislabeled data or uncertain class assignments.
Limited to binary and multiclass classification: While SVM is primarily designed for binary classification, it can be extended to multiclass classification using techniques like One-vs-One (OvO) or One-vs-Rest (OvR). However, direct support for multilabel classification is limited.
Impact of class imbalance: In scenarios with imbalanced class distributions, SVM may have difficulty capturing the minority class adequately, affecting the overall model performance.
Feature scaling necessary to ensure scale-invariance: SVM is sensitive to the scale of features, so proper feature scaling is necessary to prevent features with larger scales from dominating the optimization process.
Lack of robustness in feature selection: SVM does not inherently perform feature selection. Feature selection should be performed separately, and the choice of features can impact SVM's performance.
Domain knowledge required for kernel selection: Choosing the right kernel requires domain knowledge and experimentation, which can be time-consuming and may not always lead to optimal results.
Limited applicability to non-Euclidean data: SVM assumes a Euclidean space, which may not be suitable for all types of data (e.g., structured data with graph-like relationships).

simultaneously, and kernel approximation techniques to reduce the computational burden of SVM training while maintaining reasonable accuracy.
Finally, SVM can be combined with feature selection techniques to identify the most relevant features, improving both model efficiency and generalization; semi-supervised learning can be incorporated into SVM; and ways to integrate SVM with deep learning techniques can be explored, creating hybrid models that benefit from both SVM's robustness and deep learning's feature representation capabilities [45].
These improvements and extensions showcase the ongoing research and devel-
opment efforts aimed at enhancing the capabilities and addressing the challenges of
the SVM algorithm in various contexts. The choice of improvement depends on the
specific problem, dataset characteristics, and computational resources available.

1.2.4 Random Forest

Random forest is a decision tree approach used to successfully solve many shallow machine learning tasks. It was the top algorithm of choice until the mid-2010s, when another decision tree-based algorithm, gradient boosting machines, took over.
A random forest is composed of a large ensemble of decision trees that perform prediction tasks individually. The results of these decision trees are then combined into the final result by some form of voting, meaning that the class with the most votes becomes the output. Figure 1.14 illustrates how the random forest algorithm works in principle [33].
This approach works well because the individual decision trees are largely uncorrelated, so prediction errors from some trees can be compensated by correct results from the majority of the trees. Random forests offer several advantages over decision trees:
Improved Accuracy: Random forests generally provide higher accuracy compared
to individual decision trees. By aggregating the predictions of multiple decision trees,

Fig. 1.14 Illustration of random forest algorithm where the majority determines the final predicted
class

the ensemble approach reduces overfitting and variance, resulting in more robust and
accurate predictions.
Reduced Overfitting: Decision trees tend to overfit the training data, capturing
noise and specific patterns that may not generalize well to unseen data. Random
forests mitigate overfitting by using random subsets of the data and features for each
tree, reducing the risk of memorizing noise.
Robustness: Random forests are less sensitive to outliers and noisy data points
compared to single decision trees. The averaging of multiple trees reduces the impact
of individual noisy predictions, leading to more robust models.
Feature Importance: Random forests can assess the importance of features in
the model, providing insights into which features are most influential in making
predictions. This information is valuable for feature selection and understanding the
underlying data patterns.
Parallelism: Random forests can be easily parallelized, making them efficient for
training on large datasets and taking advantage of multicore processors or distributed
computing.
No Need for Feature Scaling: Random forests are not sensitive to the scale of
features. Unlike some algorithms that require feature scaling, random forests can
handle features of different scales without impacting performance.

Handling Missing Data: Random forests can handle missing data without
requiring imputation. Missing values can be efficiently dealt with during the
tree-building process.
Versatility: Random forests can be used for both classification and regression
tasks, making them a versatile choice for various machine learning problems.
In short, the ensemble nature of random forests, where multiple decision trees are
combined, leads to more accurate, robust, and stable models compared to individual
decision trees.
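A minimal random forest sketch with scikit-learn (the number of trees and the synthetic dataset are illustrative) is:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with random feature subsets;
# the forest prediction is the majority vote of the individual trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_.round(3))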

1.2.5 Gradient Boosting Machines

As stated in the preceding subsection, the recent champion of decision tree algorithms
is the gradient boosting machine. The variant called XGBoost (extreme gradient
boosting) got a boost by winning a Kaggle competition. Gradient boosting relies on
boosting where weak learners are converted into stronger learners. In this technique,
the gradient descent method is applied to the loss function to determine the model
parameters [34]. XGBoost is a powerful and widely used algorithm for both regres-
sion and classification tasks, known for its speed, scalability, and high predictive
performance. Gradient boosting is an ensemble learning technique that combines
multiple weak learners (usually decision trees) to create a strong predictive model. It
builds the models sequentially, with each new model attempting to correct the errors
made by the previous ones.
XGBoost extends the traditional gradient boosting algorithm by introducing
several enhancements, which contribute to its effectiveness and efficiency:
Regularization: Includes L1 (Lasso) and L2 (Ridge) regularization terms in the
objective function, which helps prevent overfitting and improve model generalization.
Tree Pruning: It uses a depth-first approach to build decision trees and prunes
branches that contribute little to the overall model’s performance. This helps reduce
the complexity of the model and enhance its efficiency.
Weighted Quantile Sketch: Employs an optimized data structure called the
“weighted quantile sketch” to efficiently handle data summary statistics during tree
construction, improving the speed of the algorithm.
Handling Missing Values: It automatically handles missing data during tree
construction, eliminating the need for explicit data imputation.
Cross-validation: Includes built-in cross-validation capabilities to assess model
performance and tune hyperparameters effectively.
Parallel Processing: It can be parallelized, taking advantage of multicore proces-
sors and distributed computing environments, making it highly efficient for large
datasets.
Due to these optimizations and improvements, XGBoost has gained significant
popularity in machine learning competitions, real-world applications, and academic
research. It is often regarded as one of the most powerful and versatile algorithms in
the gradient boosting family.

Early boosting variants include AdaBoost (Adaptive Boosting), which has been
extensively employed to solve classification problems [35]. For conventional
machine learning tasks where the underlying problem is non-vision related,
gradient boosting can be considered the best choice at this point in time.
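To make the above concrete, the following minimal sketch fits an XGBoost classifier with explicit L1/L2 regularization terms; it assumes the xgboost Python package is installed, and the dataset and parameter values are illustrative assumptions only.

# Minimal XGBoost sketch with explicit L1 (reg_alpha) and L2 (reg_lambda) regularization
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=200,    # boosting rounds: trees added sequentially, each correcting its predecessors
    max_depth=4,         # depth of each tree
    learning_rate=0.1,   # shrinkage applied to each new tree
    reg_alpha=0.1,       # L1 (Lasso) regularization term
    reg_lambda=1.0)      # L2 (Ridge) regularization term
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))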

1.3 Deep Learning as Part of Artificial Intelligence

Here we give a concise background of deep learning and its origins. Although deep
learning has been intensively under the spotlight in recent years, it has been in the
literature for a long time under different terminology [36, 37, 46, 47]. In fact, all
machine learning algorithms can be broadly classified under the artificial intelligence
umbrella. The field of artificial intelligence is a superset of both machine learning and
deep learning, while deep learning is a subset of machine learning. This can be visualized
as shown in Fig. 1.15.
Artificial intelligence has several definitions, but the general consensus is the
desire to make machines have some level of human intelligence. While the Britannica
encyclopedia defines human intelligence as the mental quality that consists of
the abilities to learn from experience, adapt to new situations, understand and handle
abstract concepts, and use knowledge to manipulate one's environment, we can all
agree that artificial intelligence is still far from achieving this level of ability. The
biggest limitations remain adaptation to new situations and handling abstract
concepts. To some extent, machines are able to learn certain patterns and manipulate the
environment. Given the above outstanding hurdles, computer scientists and engineers
define artificial intelligence as the ability of computer systems to perform intelligent
tasks. Some notable examples of these tasks include computer vision, natural
language processing, text processing, and pattern recognition.

Fig. 1.15 Relationship between artificial intelligence, machine learning, and deep learning

By definition, machine learning (ML) is concerned with the study of computer
algorithms and statistical models that can accomplish intelligent tasks. These algorithms
can be notably categorized into supervised learning, semi-supervised learning,
unsupervised learning, and reinforcement learning. Supervised and semi-supervised
learning can be combined into one category, effectively resulting in three categories.
Reinforcement learning differs from supervised and unsupervised learning in
that it does not rely on labeled or unlabeled examples of correct behavior, but is
interactive and tries to maximize a reward signal, as opposed to finding hidden structures,
which is the basis of unsupervised learning [48]. On the other hand, deep learning
has roots in artificial neural networks, which in turn are modeled based on inspiration
from human neurons or perceptrons [49], although it would be an oversimplification
to say that neurons operate like artificial neural networks.
For object detection and classification, we have come a long way in formulating
very useful algorithms, up to deep learning. Some of the popular deep learning algorithms
that have been successfully used in solving practical problems include region
proposals (Region-Based Convolutional Neural Networks (R-CNN), Fast R-CNN,
Faster R-CNN) [50], You Only Look Once (YOLO) [51], deformable convolutional
networks [52], Refinement Neural Network for Object Detection (RefineDet) [53],
RetinaNet [54, 55], and many others. The number of algorithms keeps growing
rapidly, but the CNN has proven to be the most widely used network architecture so far.
The VGG16 architecture [56], which is built on the CNN, is one example.

1.4 Frameworks for Deep Learning

There are three main competing frameworks for implementing and evaluating deep
learning algorithms, namely Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/),
and PyTorch (https://pytorch.org/). Keras and TensorFlow can be
viewed as complementary frameworks, which then boils down to two frameworks in
reality. Each of the frameworks has its own pros and cons in terms of usability
and performance, so choices can be made on a need basis. While Keras offers a quick
start by hiding most of the programmatic details in TensorFlow, PyTorch takes one a level
deeper into Python's strengths. So, for a quick start, Keras would be the way to go, and
then at some point one can venture into PyTorch. Therefore, in this book we will be building
all the examples on the Keras framework.

1.5 Selection of Target Areas for This Book

This book is mainly focused on the application of deep learning to the classification of
objects, targeting remote-sensed data as a representative example. However,
the algorithms described here are not limited to this area of application as they are
generic in nature and can be flexibly extended to general object detection for such
tasks as text recognition and object detection and recognition in autonomous driving
environments.

1.6 Concluding Remarks

Wrapping up what we have learnt so far, we briefly introduced conventional methods


of object detection and machine learning principles. We also touched on deep learning
to understand its roots as part of artificial intelligence, which actually began in the
early 1950s. Finally, we ended by presenting the deep learning frameworks, among
which Keras was chosen as the basis for building application examples in the rest of
the book.

1.7 Self-evaluation Exercises

1. What is the difference between object detection and object classification? How
can deep learning be used to solve both of these tasks?
2. Explain the difference between support vector machines, random forests and
gradient boosting. What are some advantages and disadvantages of each
approach?
3. How can convolutional neural networks (CNNs) be used for object detection and
classification? Describe the architecture of a typical CNN-based object detection
system.
4. Investigate object detection methods and explain their strengths and weak-
nesses. What are bounding boxes used for in object detection? How are they used
to improve the accuracy of object detection models?
5. What is transfer learning, and how can it be used for object detection and clas-
sification? Give an example of how a pretrained model could be fine-tuned for a
specific object detection task.

References

1. Francois C (2018) Deep learning with Python. Manning Publications Co.


2. Jiang X, Hadid A, Pang Y, Granger E, Feng X (2019) Deep learning in object detection and
recognition, 1 edn. Springer
3. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
4. Gamba J (2020) Radar signal processing for autonomous driving. Springer
5. ChatGPT. https://chat.openai.com/
6. Gamba J (2020) Radar signal processing for autonomous Driving. Springer, Berlin/Heidelberg,
Germany

7. Cortes C, Vapnik V (1995) Support-vector network. Mach Learn 20(3):273–297
8. Ukey N et al (2023) Survey on exact kNN queries over high-dimensional data space. Sensors 23(2):629. https://doi.org/10.3390/s23020629
9. scikit-learn. https://scikit-learn.org
10. Zhang S, Li J (2023) KNN classification with one-step computation. IEEE Trans Knowl Data Eng 35(3):2711–2723. https://doi.org/10.1109/TKDE.2021.3119140
11. Zhao P, Lai L (2022) Analysis of KNN density estimation. IEEE Trans Inf Theory 68(12):7971–7995. https://doi.org/10.1109/TIT.2022.3195870
12. Liu Y, Chen H, Wang B (2020) DOA estimation of underwater acoustic signals based on PCA-kNN algorithm. In: 2020 international conference on computer information and Big Data applications (CIBDA), Guiyang, China, 2020, pp 486–490. https://doi.org/10.1109/CIBDA50819.2020.00115
13. Rashid NEA, Nor YAIM, Sharif KKM, Khan ZI, Zakaria NA (2021) Hand gesture recognition using continuous wave (CW) radar based on hybrid PCA-KNN. In: 2021 IEEE symposium on wireless technology & applications (ISWTA), Shah Alam, Malaysia, 2021, pp 88–92. https://doi.org/10.1109/ISWTA52208.2021.9587404
14. Zheng X et al (2021) Adaptive nearest neighbor machine translation. https://arxiv.org/abs/2105.13022
15. Zhang J, Wang T, Ng WWY, Pedrycz W, KNNENS: a k-nearest neighbor ensemble-based method for incremental learning under data stream with emerging new classes. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3149991
16. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
17. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Fifth annual workshop on computational learning theory. ACM, pp 144–152
18. Li C-N, Li Y, Meng Y-H, Ren P-W, Shao Y-H (2023) L2,1-Norm regularized robust and sparse linear discriminant analysis via an alternating direction method of multipliers. IEEE Access 11:34250–34259. https://doi.org/10.1109/ACCESS.2023.3264688
19. Dai D-Q, Yuen PC (2007) Face recognition by regularized discriminant analysis. IEEE Trans Syst, Man, Cybern, Part B (Cybernetics) 37(4):1080–1085. https://doi.org/10.1109/TSMCB.2007.895363
20. Duda R, Hart P, Stork D (2001) Pattern classification, 2nd edn. Wiley, New York
21. Lu W (2022) Regularized deep linear discriminant analysis. https://arxiv.org/abs/2105.07129
22. Chang C-C (2023) Fisher's linear discriminant analysis with space-folding operations. IEEE Trans Pattern Anal Mach Intell 45(7):9233–9240. https://doi.org/10.1109/TPAMI.2022.3233572
23. Elkhalil K, Kammoun A, Couillet R, Al-Naffouri TY, Alouini M-S (2020) A large dimensional study of regularized discriminant analysis. IEEE Trans Signal Process 68:2464–2479. https://doi.org/10.1109/TSP.2020.2984160
24. Cai D, He X, Han J (2007) Semi-supervised discriminant analysis. In: 2007 IEEE 11th International conference on computer vision, Rio de Janeiro, Brazil, 2007, pp 1–7. https://doi.org/10.1109/ICCV.2007.4408856
25. Wang J, Plataniotis KN, Lu J, Venetsanopoulos AN (2008) Kernel quadratic discriminant analysis for small sample size problem. Pattern Recogn 41(5):1528–1538
26. Pȩkalska E, Haasdonk B (2009) Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans Pattern Anal Mach Intell 31(6):1017–1032. https://doi.org/10.1109/TPAMI.2008.290
27. Vapnik VN (1998) Statistical learning theory. Wiley, New York
28. Huang Z, Lee BG (2004) Combining non-parametric models for multisource predictive forest mapping. Photogramm Eng Remote Sens 70:415–425
29. Vapnik VN (1998) The nature of statistical learning theory. Wiley, New York
30. Camps-Valls G, Bruzzone L (2005) Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens 43(6):1351–1362

31. Bruzzone L, Persello C (2009) A novel context-sensitive semi-supervised SVM classifier robust to mislabeled training samples. IEEE Trans Geosci Remote Sens 47(7)
32. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Kluwer Academic Publishers, Boston, pp 1–43
33. Breiman L (2001) Random forests. Mach Learn 45:5–32
34. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: KDD'16: proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, August 2016, pp 785–794
35. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139
36. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
37. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18
38. Kamusoko C, Gamba J (2014) Mapping woodland cover in the Miombo ecosystem: a comparison of machine learning classifiers. Land 3:524–540
39. Schultheis E, Babbar R (2021) Speeding-up one-vs-all training for extreme classification via smart initialization. https://arxiv.org/abs/2109.13122
40. Abeykoon VL, Fox GC, Kim M (2019) Performance optimization on model synchronization in parallel stochastic gradient descent based SVM. In: 2019 19th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID), Larnaca, Cyprus, 2019, pp 508–517. https://doi.org/10.1109/CCGRID.2019.00065
41. Pesala V, Kalakanti AK, Paul T, Ueno K, Kesarwani A, Bugata HGSP (2019) Incremental learning of SVM using backward elimination and forward selection of support vectors. In: 2019 International conference on applied machine learning (ICAML), Bhubaneswar, India, 2019, pp 9–14. https://doi.org/10.1109/ICAML48257.2019.00010
42. Xie L, Luo Y, Su S-F, Wei H (2023) Graph regularized structured output SVM for early expression detection with online extension. IEEE Trans Cybern 53(3):1419–1431. https://doi.org/10.1109/TCYB.2021.3108143
43. Cao Y, Sun Y, Li P, Su S, Vibration-based fault diagnosis for railway point machines using multi-domain features, ensemble feature selection and SVM. IEEE Trans Veh Technol. https://doi.org/10.1109/TVT.2023.3305603
44. Liu H, Yu Z, Shum CK, Man Q, Wang B (2023) A new hierarchical multiplication and spectral mixing method for quantification of forest coverage changes using Gaofen (GF)-1 imagery in Zhejiang Province, China. IEEE Trans Geosci Remote Sens 61:1–10, Art no. 4407210. https://doi.org/10.1109/TGRS.2023.3303078
45. Su Y, Li X, Yao J, Dong C, Wang Y (2023) A spectral–spatial feature rotation-based ensemble method for imbalanced hyperspectral image classification. IEEE Trans Geosci Remote Sens 61:1–18, Art no. 5515918. https://doi.org/10.1109/TGRS.2023.3282064
46. Furukawa H (2018) Deep learning for end-to-end automatic target recognition from synthetic aperture radar imagery. IEICE Tech Rep 117(403):35–40, SANE 2017-92
47. Angelov A, Robertson A, Murray-Smith R, Fioranelli F (2018) Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar, Sonar Navig 12(10):1082–1089
48. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. MIT Press
49. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc., New York
50. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on computer vision and pattern recognition, 2014, pp 580–587. https://doi.org/10.1109/CVPR.2014.81
51. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR), 2016, pp 779–788. https://doi.org/10.1109/CVPR.2016.91

52. Dai J et al (2017) Deformable convolutional networks. In: 2017 IEEE International conference on computer vision (ICCV), 2017, pp 764–773. https://doi.org/10.1109/ICCV.2017.89
53. Zhang S, Wen L, Lei Z, Li SZ (2021) RefineDet++: single-shot refinement neural network for object detection. IEEE Trans Circuits Syst Video Technol 31(2):674–687. https://doi.org/10.1109/TCSVT.2020.2986402
54. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: 2017 IEEE international conference on computer vision (ICCV), 2017, pp 2999–3007. https://doi.org/10.1109/ICCV.2017.324
55. Del Prete R, Graziano MD, Renga A (2021) RetinaNet: a deep learning architecture to achieve a robust wake detector in SAR images. In: 2021 IEEE 6th International forum on research and technology for society and industry (RTSI), 2021, pp 171–176. https://doi.org/10.1109/RTSI50628.2021.9597297
56. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. https://arxiv.org/abs/1409.1556
Chapter 2
Requirements for Hands-On Approach
to Deep Learning

2.1 Introduction

This is a bridging chapter in which we introduce some of the concepts needed to
start building deep learning models in Python. We will start with basic principles
related to data manipulation and end with an explanation of how to set up the modeling
environment. Of course, this chapter is not meant to replace any detailed course on
Python but to be a stepping stone for those already familiar with or new to data
structures in Python. We recommend visiting the https://www.python.org/ site
for comprehensive materials on Python. There is also a vast amount of online material,
both text and video, on the Internet to aid the learning process, but due diligence is
necessary to avoid falling into the trap mentioned in Chap. 1.
In deep learning, we are mostly dealing with vectors and matrices as we know them
from linear algebra. These objects are sometimes referred to as tensors but from an
engineering perspective, they can be considered as subsets of multidimensional arrays
especially if one is already familiar with numerical processing tools like MATLAB,
Scilab, Octave, etc. Like any other language, Python has a unique way of accessing
and manipulating these arrays.

2.2 Basic Python Arrays for Deep Learning

In Python, vectors, matrices, arrays, and tensors are all data structures used to represent
and manipulate multidimensional data. In the deep learning models presented
later, we will be processing data in numerical format using Python's NumPy
library. Therefore, for our purposes we will treat tensors as multidimensional NumPy
arrays [1].


Fig. 2.1 Visualization of scalar (0-D tensor) and 1-D array (1-D tensor, row vector)

In fact, tensors are a generalization of vectors and matrices to higher dimensions.


They can have any number of dimensions and are used to represent multidimen-
sional data in deep learning and scientific computing. Tensors are commonly used
by libraries such as TensorFlow and PyTorch. In Python, tensors are often represented
using multidimensional NumPy arrays or specialized tensor libraries. The TensorFlow
library is specifically geared to perform operations on tensors since tensors
have the attractive property of being efficiently manipulable on a GPU. In
deep learning, GPU processing makes operations drastically faster, thereby reducing
the required time from hours to minutes.
Please refer to the companion Notebook (Chapter02.ipynb) to get better insight
into the nature of the data and also as part of the hands-on experience [2].

Scalars and 1-D Arrays (Vector)


Scalars have zero dimensions while vectors are single-dimensional as shown in
Fig. 2.1. Specifically, vectors are one-dimensional arrays that store elements in a
single row or column. They can be considered as a special case of a matrix with
either a single row (row vector) or a single column (column vector). In Python,
vectors are often represented using one-dimensional NumPy arrays or lists.
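A minimal NumPy sketch of these two cases (the variable names are ours, not from the companion Notebook) might look as follows:

import numpy as np

scalar = np.array(5)              # 0-D tensor (scalar)
vector = np.array([1, 2, 3])      # 1-D tensor (vector)
print(scalar.ndim, scalar.shape)  # 0 ()
print(vector.ndim, vector.shape)  # 1 (3,)
print(vector[0])                  # accessing an individual element: 1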

2-D Arrays (Matrices), 3-D Arrays (Data Cubes)


Matrices are presented as 2-D arrays and data cubes as 3-D arrays as shown in
Fig. 2.2. Matrices are two-dimensional data structures that store elements in rows
and columns. They are used to represent tabular or grid-like data. Matrices can be
created using nested lists or two-dimensional NumPy arrays in Python.
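For example, a 2-D and a 3-D array can be created in NumPy as follows (the values are illustrative only):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])    # 2-D tensor, shape (2, 3)
cube = np.zeros((2, 3, 4))        # 3-D tensor (data cube), shape (2, 3, 4)
print(matrix.ndim, matrix.shape)  # 2 (2, 3)
print(cube.ndim, cube.shape)      # 3 (2, 3, 4)
print(matrix[1, 2])               # element at row 1, column 2: 6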

Multidimensional Arrays (ndarray)


Multidimensional arrays are normally visualized with a dimension greater than or equal
to 3, but by definition the only requirement is that the dimension must be non-negative.
We will use image data as an example to explain multidimensional arrays.
An image can be represented by (height, width, color depth), and a collection of
images stacked together (such as frames in a video) would have a fourth dimension
(frame number, height, width, color depth). For multiple video sequences, we end
up with a fifth dimension and have the representation (video number, frame number,
height, width, color depth), as illustrated in Fig. 2.3.
In general, two forms of representation are used for 3D image data, namely the
channels-last convention (height, width, color depth) supported by TensorFlow and the
channels-first convention (color depth, height, width) supported by Theano. In this
book, we will be focusing on the Keras framework, which supports both conventions.
In any case, switching between the two conventions is possible by data transposition.
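The following sketch illustrates both points with a hypothetical batch of RGB video frames; the shapes are arbitrary illustrative choices:

import numpy as np

# 10 RGB frames, channels-last: (frame number, height, width, color depth)
frames = np.zeros((10, 128, 128, 3))

# Switch to channels-first (frame number, color depth, height, width) by transposition
frames_cf = np.transpose(frames, (0, 3, 1, 2))
print(frames.shape, '->', frames_cf.shape)  # (10, 128, 128, 3) -> (10, 3, 128, 128)

# Stacking 5 such sequences adds the video-number dimension (a 5-D tensor)
videos = np.stack([frames] * 5)
print(videos.shape)  # (5, 10, 128, 128, 3)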

Fig. 2.2 Visualization of 2-D array (2-D tensor, matrix) and 3-D array (3-D tensor, cube)

Fig. 2.3 Visualization of an example of a multidimensional array (4-D array, 4-D tensor)

Array Manipulation
Figure 2.4 is an example of reshaping an array from size (3, 5) to size (5, 3). The key
point is that the total number of elements implied by the new shape must equal the
total number of elements in the original shape.
Besides reshape, other array manipulation operations such as resize, transpose,
squeeze, flatten, etc. can be performed on NumPy arrays.
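A minimal sketch of these operations (with illustrative shapes matching Fig. 2.4):

import numpy as np

a = np.arange(15).reshape(3, 5)  # 15 elements arranged as a (3, 5) array
b = a.reshape(5, 3)              # valid: 5 * 3 still equals 15 elements
c = a.T                          # transpose: shape (5, 3) with rows/columns swapped
d = a.flatten()                  # back to a 1-D array of 15 elements
print(a.shape, b.shape, c.shape, d.shape)
# a.reshape(4, 4) would raise a ValueError because 16 != 15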

Fig. 2.4 Visualization of an example of array manipulation where the shape is changed

2.3 Setting Up Environment

This section is a quick guide that explains the necessary steps to create the environment
using Python as the basis for deep learning algorithm evaluation. The processing
steps and resources will be explained as we walk through the process. With the
availability of vast resources on the Internet, the interested reader should be able to
rapidly create a working demo script within a few hours, if not minutes. It is assumed
in this book that the reader has basic knowledge of programming and the Python
environment. Deep knowledge of artificial intelligence, neural networks, or deep
learning is not a prerequisite to run deep learning algorithms. Basically, it is possible
to run deep learning algorithms either offline or online.

2.3.1 OS Support for Offline Environments

The examples that we will present are built on Python 3.7 and can run in a standalone
Windows environment. However, we have confirmed that the setup is also
straightforward using VirtualBox Ubuntu 20.04 LTS on a Windows host. The notable
difference between the Windows environment and Ubuntu is that the Ubuntu Terminal
is the basic tool for command line operations, and no additional terminal installation
is required. For package management, we recommend using Anaconda Navigator,
which can be downloaded for free from the official website (Anaconda Navigator
Installation).

2.3.2 Windows Environment Creation Example

Installation of the Anaconda Navigator on Windows is quite easy to perform and the
Navigator can be started from the Start Menu. Figure 2.5 below is an example of the
interface on Windows 10.
It is highly recommended to create a new environment for each classification task
or project using the following steps.
1. Click the Environments on the Anaconda Navigator and select Create on the
bottom left side (Fig. 2.6)
2. Set the environment name (in this case env_maskrcnn as an example) and select
the Python version (in this case 3.7) as in Fig. 2.7.
The environment will be shown in the list of environments, to which packages and
tools can be added as necessary. In our example, we created "env_maskrcnn" and
installed Spyder® and Jupyter Notebook, among other standard tools. Spyder® is
a user-friendly interactive Python GUI, while Jupyter Notebook is good for visualizing
demos available from GitHub and for creating new scripts before running them in
Spyder, as one use-case example. The Jupyter Notebook is also handy for interactive
debugging as it can easily link to online resources like Stack Overflow, etc.

Fig. 2.5 Anaconda Navigator interface



Fig. 2.6 An example of creating an environment

Fig. 2.7 An example of setting environment properties

2.3.3 Options to Consider for Online Environments

Although the Windows and Ubuntu/Linux platforms are convenient to use in terms
of availability and control, recent trends are to rely on online platforms, specifically
Google Colab (https://research.google.com/colaboratory/). The advantages of
Google Colab (Colab for short) are that very minimal or no setup effort is required,
and it also provides the option to use free GPU/TPU resources once an account is
created. The packages needed for most classification tasks are constantly updated,
simplifying package management. In addition, it is very easy to share Notebooks
and check algorithm performance online. For an affordable fee, it is possible to
upgrade the account if more computational resources are required. In any case, it
is always possible to unsubscribe anytime and use the free Colab account for small
demo projects. An example of the online Colab interface is shown in Fig. 2.8.

Fig. 2.8 An example of Google Colab interface

2.4 Concluding Remarks

Wrapping up what we have learnt so far, we presented basic Python data structures and
their manipulation. We ended with reference material on setting up the environment
and also gave online options to consider.

2.5 Self-evaluation Exercises

1. What is a tensor, and how is it used in deep learning? Describe the difference
between a scalar, vector, and matrix, and give an example of each.
2. How do you create a one-dimensional (1-D) array in Python? Give an example of
how to create an array of integers, and describe how to access individual elements
of the array.
3. What is a matrix, and how do you create a two-dimensional (2-D) array in
Python? Give an example of how to create a 2-D array of floating-point numbers,
and describe how to perform basic operations on matrices (e.g., addition,
multiplication).
4. What is a data cube, and how is it used in deep learning? Describe how to create
a three-dimensional (3-D) array in Python, and give an example of how to access
individual elements of the array.
5. Describe the concept of multidimensional arrays in Python. What are some
common operations you can perform on multidimensional arrays? Give an
example of how to perform each of these operations on a multidimensional array.

References

1. The N-dimensional array (ndarray). https://numpy.org/doc/stable/reference/arrays.ndarray.html
2. Deep-Learning-Models. https://github.com/sn-code-inside/Deep-Learning-Models
Chapter 3
Building Deep Learning Models

3.1 Introduction: Neural Networks Basics

In this chapter, we illustrate how to build deep learning models and how to train and
evaluate them using the Keras framework in a simple and concise way. We briefly explain
some of the concepts behind these models so as to give the reader a smooth entry
into each section, while concentrating mainly on how to use them rather than on the
details of the algorithms themselves. The entry point will be shallow networks, upon
which deep neural networks are developed. We then touch on convolutional neural
networks (CNNs), followed by recurrent neural networks (RNNs), and finally long
short-term memory (LSTM)/gated recurrent units (GRUs). Along the way, we provide
examples of how each of these can be used in order to cement the ideas behind them.
After that, we take a quick look at the Keras library and give some references for
further investigation.

3.1.1 Shallow Networks

In recent terminology, neural networks can be categorized into deep and shallow
neural networks. In this categorization, shallow neural networks can be thought of
as the basic building blocks required to understand deep neural networks and they
consist of a few hidden layers, normally one or two. In this subsection, we will give
a brief overview of shallow networks since they are an important part of artificial
intelligence.
Artificial neural network models were originally inspired by human neurons; the
resulting basic model is referred to as the perceptron [1]. A comprehensive treatment
of the evolution of neural networks is beyond the scope of this section, but in its basic
functionality, a perceptron takes several binary inputs and produces a single binary
output, as illustrated in Fig. 3.1. The output can be computed using the following
expression:


Fig. 3.1 Illustration of a simple perceptron model

$$
\text{output} =
\begin{cases}
0, & \sum_{i=0}^{n} w_i x_i \le \theta_0 \\
1, & \sum_{i=0}^{n} w_i x_i > \theta_0
\end{cases}
\qquad (3.1)
$$

where $x_0 = 1$ and $\theta_0$ is a predetermined threshold.


The main components of the perceptron can be summarized as follows:
1. Inputs: The data to be processed.
2. Weights: Values which determine the importance of each input.
3. Processing layer: It is the part that performs mathematical operations on the
inputs by applying weights to them.
4. Activation function: A nonlinear output selection function which can be a
sigmoid, rectified linear unit (ReLU), tanh or any other appropriate function.
5. Output: The result of applying the activation function to the processed input.
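A direct NumPy transcription of Eq. (3.1) might look as follows; the weights and threshold are arbitrary illustrative values:

import numpy as np

def perceptron(x, w, theta0):
    # x and w include the bias term, i.e., x[0] = 1 as in Eq. (3.1)
    return 1 if np.dot(w, x) > theta0 else 0

x = np.array([1, 0, 1, 1])           # x[0] = 1 is the bias input
w = np.array([0.5, -0.6, 0.3, 0.8])  # arbitrary illustrative weights
print(perceptron(x, w, theta0=1.0))  # 0.5 + 0.3 + 0.8 = 1.6 > 1.0, so output is 1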
In the Keras framework, it is a simple matter to create a shallow neural network
using Dense layers. A popular and well-known standard example that can be used to
illustrate the shallow neural network in a concrete manner uses the established
MNIST dataset packaged with the Keras library. The MNIST dataset consists
of two sets of data, with 60,000 training images and 10,000 testing images. The images
are 28 × 28 pixel grayscale (intensities 0–255) and are a collection of handwritten
images of the digits 0–9 (10 classes).
MNIST, which stands for Modified National Institute of Standards and Technology,
is a dataset created from the original NIST data with some modification and is
extensively studied in the computer vision and machine learning literature. To those
familiar with image processing algorithm evaluation, MNIST can be considered
the Lena of image classification.
Using this MNIST dataset, the goal is to classify the handwritten digits into one
of the 10 classes (0–9).
Please refer to the companion Notebook (Chapter03.ipynb) to get better insight
into the nature of the data and also as part of the hands-on experience [2].
For this purpose, the shallow neural network model can be defined as follows:

from keras import models
from keras import layers

shallownet = models.Sequential()
shallownet.add(layers.Dense(4, activation='relu', input_shape=(28 * 28,)))
shallownet.add(layers.Dense(10, activation='softmax'))

The model can then be compiled and trained on the input data.

shallownet.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
shallownet.fit(train_images, train_labels, epochs=5, batch_size=128)

The full example is given below:


# Import the necessary libraries
from keras import models
from keras import layers

# Load MNIST dataset from Keras
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Define the model by adding two Dense layers
shallownetwork = models.Sequential()
shallownetwork.add(layers.Dense(4, activation='relu', input_shape=(28 * 28,)))
shallownetwork.add(layers.Dense(10, activation='softmax'))

# Compile the model
shallownetwork.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Preprocess the data by scaling it from the [0, 255] range to the [0, 1] range
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# Prepare the training and test labels
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Perform training of the network using the MNIST training dataset
history = shallownetwork.fit(train_images, train_labels, epochs=10, batch_size=64, validation_data=(test_images, test_labels))

# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = shallownetwork.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

The above simple model gives a test accuracy of 86.23% (Fig. 3.2). The utility
of Keras is that it is possible to quickly adjust hyperparameters to improve the test
accuracy. As an example, increasing the size of the hidden layer to 512 units,
recompiling, and changing the training batch size to 128 increases the accuracy to 98.15%!
# Define the network model by adding two Dense layers, with increased network size of 512
shallownetwork = models.Sequential()
shallownetwork.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
shallownetwork.add(layers.Dense(10, activation='softmax'))

# Compile the model
shallownetwork.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Perform training of the network using the MNIST training dataset with increased batch size of 128
history = shallownetwork.fit(train_images, train_labels, epochs=10, batch_size=128, validation_data=(test_images, test_labels))

Fig. 3.2 Training and validation accuracy (network size 4, batch size 64)

# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = shallownetwork.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

Figure 3.3 shows that for this shallow model, over-fitting starts after the first epoch,
as shown by the almost flat validation accuracy.
The perceptron model can be extended to multiple hidden layers of perceptrons to
produce complex decisions, as shown in Fig. 3.4; this structure is often referred to as
a multilayer perceptron (MLP) in the literature.

Fig. 3.3 Training and validation accuracy (network size 512, batch size 128)

Fig. 3.4 An illustration of the construction of the multilayer (2-layer) perceptron model

3.1.2 Convolutional Neural Networks (CNNs)

The CNN is one of the most successful models used in deep neural networks, espe-
cially in image processing and computer vision. Taking a brief detour into history,
deep learning networks differ from conventional neural networks in the number of
node layers used, which brings in the concept of depth, and can also contain loops.
Basically, neural networks normally have one to two hidden layers and are used
for supervised prediction or classification. In contrast, deep learning networks can
have several hidden layers with the possibility of unsupervised training. Figure 3.5
illustrates one example of such a network. Examples of widely used deep learning
architectures include deep neural networks (DNNs), deep belief networks (DBNs),
and recurrent neural networks (RNNs) [3, 4]. The main advantage of DNNs over
traditional neural networks is the ability to learn complex tasks in an unsupervised
manner. However, this advantage does not come without cost [5]: large amounts of
training data are required to build the network, the computational complexity is a
heavy burden, the algorithms are difficult to analyze, and the output cannot be
predicted precisely, among other challenges. For applications such as autonomous
navigation, DNNs have a promising future, and their integration into sensors like
automotive radar is currently under intensive research [6]. With advances in both
computational power (GPUs/TPUs) and available resources (RAM/ROM) on sensor
devices, the realization of so-called intelligent sensors is now possible.
For the interested reader, further details about DBMs (deep Boltzmann machines)
and RNNs can be found in [7] and [8], respectively. It should be noted that RNNs
have found greater success in natural language processing (NLP).
Fig. 3.5 An illustration of the components of a deep neural network model

Coming back to the subject of this section, a convolutional neural network (CNN)
is a neural network in which at least one layer is a convolutional layer. The
construction of a CNN involves several layers between input and output. A typical
convolutional neural network consists of some

combination of the following layers: convolutional layers, pooling layers, and fully
connected/dense layers.
Convolutional layers apply convolution operations to their inputs to extract
features. Pooling operations reduce the size of the convolutional layer outputs by
either maximization or averaging. Normally, the pooling is done over a 2 × 2
window. Fully connected layers usually come at the top
of the network (close to output) and are also sometimes referred to as dense layers.
CNNs have been successfully applied to computer vision, producing state-of-the-
art performance in most applications.
Figure 3.6 illustrates the structure of a typical CNN architecture.
The typical CNN architecture shows the progression through convolution and
pooling operations. The flattening operation produces a one-dimensional array for
inputting it to the final fully connected top layers.
We continue with the MNIST data as a concrete example of how to implement a
CNN in Keras following [9].
Fig. 3.6 Typical CNN architecture

# Example of CNN using the MNIST dataset
# Import necessary packages
from keras import layers
from keras import models

# Define the CNN model with 3 convolution layers and 2 pooling layers
cnn_model = models.Sequential()
cnn_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(10, activation='softmax'))

# View the model summary
cnn_model.summary()

# Training the convnet on MNIST images
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = cnn_model.fit(train_images, train_labels, epochs=10, batch_size=64, validation_data=(test_images, test_labels))
# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
#plt.ylim([0.5, 1])
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = cnn_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

Figure 3.7 shows the progression of training/validation accuracy. The validation
accuracy seems to be higher than the training accuracy, which indicates good
generalization of the model. A decent test accuracy of 96.5% is achieved (Fig. 3.8).

3.1.3 Recurrent Neural Networks (RNNs)

Another popular type of neural network is the recurrent neural network, which has
been used very successfully for applications like natural language processing and
speech recognition.
RNNs differ from CNNs in that they have memory, meaning that previous inputs
influence the present output. We will not dwell much on RNNs in this text, but
Fig. 3.9 gives a simplified visual illustration of how they work.

Fig. 3.7 Training and validation accuracy for the CNN model


Fig. 3.8 Test accuracy results for CNN

Suffice it to say, Keras provides the SimpleRNN layer for model construction.
Below is an example of an RNN with Keras.
Fig. 3.9 Illustration of a simplified RNN showing the rolled and unrolled representations

# Import necessary packages
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.utils import to_categorical

# Load MNIST dataset from Keras
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Extract the number of labels
num_train_labels = len(np.unique(train_labels))

# Normalize data for training
train_images = train_images.reshape((60000, 28, 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28))
test_images = test_images.astype("float32") / 255

# Prepare the training and test labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Create RNN model with 256 units
rnn_model = Sequential()
rnn_model.add(SimpleRNN(256, input_shape=(28, 28)))
rnn_model.add(Dense(num_train_labels, activation='softmax'))
rnn_model.summary()

# Train the RNN model with batch size of 128 and 20 epochs
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = rnn_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))
# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
#plt.ylim([0.5, 1])
plt.legend(loc='lower right')

Fig. 3.10 Training and validation accuracy for the RNN model


Fig. 3.11 Test accuracy results for RNN

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = rnn_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

With this simple 2-layer RNN model, a decent accuracy of 97.65% can be achieved
for the MNIST data (Figs. 3.10 and 3.11).

3.1.4 Long Short-Term Memory (LSTM)/Gated Recurrent Units (GRUs)

The LSTM and GRU layers are designed to solve the vanishing-gradient problem that
makes the SimpleRNN unsuitable for most practical problems [10]. This is achieved
by carrying information from earlier timesteps forward through gated connections
with learned forgetting factors, which mitigates the vanishing-gradient problem
considerably. GRUs operate on the same principle as LSTMs, except that an LSTM uses
three gates, namely the input, output, and forget gates, while a GRU requires only two
gates, the reset and update gates. The choice between the two involves a
trade-off between accuracy and computational complexity, with LSTM generally
expected to provide higher accuracy [11, 12].
Employing the same approach as for the SimpleRNN model, we compare the
LSTM and GRU models built from Keras layers. We start with the LSTM model.
# Create LSTM model with 256 units
lstm_model = Sequential()
lstm_model.add(layers.LSTM(256, input_shape=(28, 28)))
lstm_model.add(Dense(num_train_labels, activation='softmax'))
lstm_model.summary()

The model has a total of 294,410 trainable parameters (Fig. 3.12).


Fig. 3.12 Model parameters summary for LSTM




Fig. 3.13 Training of LSTM progress for each epoch

# Train the LSTM model with batch size of 128 and 20 epochs
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = lstm_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))

The validation accuracy progressively increases as the validation loss falls


(Figs. 3.13, 3.14 and 3.15).
# Evaluate the model using the MNIST test dataset
test_loss, test_acc = lstm_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

Next, we construct and train the GRU model.
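The construction mirrors the LSTM model above, swapping in Keras's GRU layer. The following minimal sketch (with the 256-unit size carried over from the LSTM example) is consistent with the parameter summary in Fig. 3.16:

# Create GRU model with 256 units (mirrors the LSTM construction above)
gru_model = Sequential()
gru_model.add(layers.GRU(256, input_shape=(28, 28)))
gru_model.add(Dense(num_train_labels, activation='softmax'))
gru_model.summary()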


The GRU model has a total of 221,450 trainable parameters (Fig. 3.16).
# Train the GRU model with batch size of 128 and 20 epochs
gru_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = gru_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))

The validation accuracy progressively increases as the validation loss falls


(Figs. 3.17 and 3.18).

Fig. 3.14 Training and validation accuracy for the LSTM model


Fig. 3.15 Test results for LSTM

Fig. 3.16 Model parameters summary for GRU




Fig. 3.17 Training of GRU progress for each epoch

Fig. 3.18 Training and validation accuracy for the GRU model

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = gru_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)


Fig. 3.19 Test accuracy results for GRU

Comparing the above results obtained under similar conditions, it can be observed
that the LSTM model achieves an average speed of 70 s per epoch with a validation
accuracy of 98.92% (Fig. 3.15). The GRU model achieves an average speed of 55 s per
epoch and 98.81% accuracy (Fig. 3.19), which means the LSTM is 0.11% better in this
example. As stated above, this improvement comes at a computational expense, as
reflected in the execution speed. As shown in Figs. 3.14 and 3.18, the two models
quickly achieve high accuracy in the first 5 epochs, after which over-fitting becomes
visible. With the addition of more layers and hyperparameter tuning, further
improvements can generally be achieved for any model, as will be seen in the next
chapters.

3.2 Using Keras as a Deep Learning Framework

Keras is a widely used Python framework for machine learning and deep neural
network applications due to its intuitive logical flow, quick learning curve, and rich
set of ready-to-use packages. With very few lines of code, model evaluation on
benchmark and new datasets can be accomplished efficiently. We will briefly explore
the Keras framework here, but further details and the latest developments can be found
at https://keras.io/.

3.2.1 Overview of Library

The Keras API reference consists of the Models API, Layers API, Callbacks API,
optimizers, metrics, applications, and many other utilities that greatly reduce the effort
from concept to tangible results for engineers and scientists from various backgrounds
and fields. The workflow can be reduced to three main steps: (1) define the
model, (2) compile the model, and (3) evaluate the model. By continuously refining
step (1), rapid evaluation of models is possible. Keras is also compatible with Ubuntu,
Windows, and macOS, making it available to a wide audience. Among other
characteristics, it can run on both CPU and GPU platforms.

Fig. 3.20 Survey results showing popularity of Keras in Kaggle competitions

3.2.2 Usability

The usability of Keras is evidenced by the data available at https://keras.io/why_keras/.
Keras has been used by the majority of the top-5 winners of Kaggle competitions,
based on a 2019 survey. Additionally, the results of the 2022 state of data science and
machine learning survey published by Kaggle showed that TensorFlow, which
is the backend engine of Keras, has broad adoption in both industry and research
circles, reaching approximately 61% (Fig. 3.20) [13].

3.3 Concluding Remarks

Here we give the highlights of this chapter. In this chapter, we provided a concise
introduction to building deep learning models with practical examples. We discussed
the distinctions between shallow and deep neural networks and demonstrated how
to implement them using the Keras framework. Some of the popular deep learning
architectures, namely CNN and RNN, were also illustrated. In the end, we provided
some background on why it makes sense to start with Keras as a framework for building
and evaluating deep neural networks.

3.4 Self-evaluation Exercises

1. Explain the concept of shallow networks and their limitations. Can shallow
networks be used for complex tasks such as image classification or natural
language processing?
2. What are Convolutional Neural Networks (CNNs)? How do they differ from fully
connected neural networks? Explain the architecture of a typical CNN.
3. Describe Recurrent Neural Networks (RNNs) and their ability to model sequen-
tial data. What are some limitations of standard RNNs, and how do Long
Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) address these
limitations?
4. Explain the Keras API and its advantages for building deep learning models. What
are some features of the Keras API that make it popular among developers?
5. Give an example of building a deep learning model using the Keras API. Describe
the steps involved in building a CNN or RNN using Keras, including data
preparation, model definition, training, and evaluation.

References

1. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Inc.,
New York
2. Deep-Learning-Models. https://github.com/sn-code-inside/Deep-Learning-Models
3. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18
4. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
5. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural
networks. Science 313(5786):504–507
6. Wheeler TA, Holder MF, Winner H, Kochenderfer MJ (2017) Deep stochastic radar models.
IEEE Intell Veh Symp IV
7. Salakhutdinov R, Hinton GE (2009) Deep Boltzmann machines. In: AISTATS, pp 448–455
8. Graves A, Mohamed A, Hinton GE (2013) Speech recognition with deep recurrent neural
networks. In: ICASSP, pp 6645–6649
9. Francois C (2018) Deep learning with Python. Manning Publications Co.
10. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780
11. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y
(2014) Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In: Proceedings of the 2014 conference on empirical methods in natural language
processing, pp 1724–1734
12. Cahuantzi R, Chen X, Güttel S (2021) A comparison of LSTM and GRU networks for learning
symbolic sequences. https://arxiv.org/abs/2107.02248
13. Kaggle (2022) State of data science and machine learning 2022. https://www.kaggle.com/kaggle-survey-2022
Chapter 4
The Building Blocks of Machine
Learning and Deep Learning

4.1 Introduction

In this chapter, we take a look at the three main categories of machine learning and
then move on to explore how the machine learning models can be evaluated. The
various metrics commonly used are explained. After that, we briefly address the
important topic of data preprocessing followed by standard methods of evaluating
machine learning models. One of the reasons why most models fail to perform on
unseen data is the problem of overfitting. We take a look at this problem and
outline some of the strategies that can be applied in order to overcome it. The next
topic is a discussion of the workflow for machine learning or deep learning. The
chapter ends with concluding remarks to recap the covered topics.

4.2 Categorization of Machine Learning

As introduced in Chap. 1, there are three major categories of machine learning:


supervised machine learning, unsupervised machine learning, and reinforcement learning [1,
2] (see Fig. 4.1).
A supervised machine learning algorithm uses labeled input data to learn a
mapping function which generates an appropriate output when given new unlabeled
data. The term supervised learning comes from the fact that the algorithm learns from
a training dataset that can be viewed as an instructor supervising the learning process.
Supervised learning can be divided into classification and regression. The
classification process results in discrete or categorized outputs such as car, bicycle,
pedestrian, or truck in the case of road object classification. The output class can be
labeled as an integer. On the other hand, regression results in real-valued outputs
such as height or width. By far, supervised learning is currently the most widely used
type of machine learning, including in deep learning.


Fig. 4.1 Main branches of machine learning with representative examples

An unsupervised machine learning algorithm utilizes input data without
explicitly provided labels. The algorithms work by themselves to discover patterns
within the unlabeled data. To avoid spurious results, human intervention is required
for validation.
Reinforcement learning is a comparatively new branch which has its roots in game
development and has been extended to autonomous driving and other applications.
Reinforcement learning differs from the other two branches described above in that
intelligent agents interact with an environment to maximize a reward and require no
labeled data. The concept of reward maximization makes reinforcement learning
distinct from unsupervised learning [3, 4].
In this book, we will be mainly focusing on supervised learning. Supervised
machine learning finds application in many areas including image recognition, speech
recognition, object detection, remote sensing, and autonomous driving, just to name a
few [5–8]. There is a vast amount of reference material in the literature on recognition
and classification [9].

4.3 Methods of Evaluating Machine Learning Models

The first step in evaluating machine learning models after collecting the dataset is to
decide on the split, or proportion, of the dataset that will be used for the training, validation,
and testing phases. In most algorithms, it is possible to first split the data into training
and test datasets and then use a percentage of the training set for validation.
The training dataset is used for fitting the model parameters in order to maximize
prediction performance. The validation dataset is used to evaluate the model
performance during the training phase in order to aid tuning of model hyperparameters.
The test dataset is used for evaluating the model produced during the training
phase and should be completely separate from the training dataset.
An example of splitting the data for computer vision applications is to
use a combination of StratifiedShuffleSplit from scikit-learn with
ImageDataGenerator from Keras to first create training and test datasets and
then partition the training data into training and validation portions.

Fig. 4.2 Training/validation split example
Step 1: Import libraries

from sklearn.model_selection import StratifiedShuffleSplit
from keras.preprocessing.image import ImageDataGenerator

Step 2: Split with scikit-learn: 20% test and 80% training.

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)

Then create training and test folders based on the split.


Step 3: Using the ImageDataGenerator, define the proportion of the training
data that will be used for validation during training via the validation_split
argument. This is the fraction of images reserved for validation and must be between
0 and 1. In this example, the value is set to 0.1, which means that 10% of the samples
will be reserved for the validation set and the remaining 90% for the training set (see Fig. 4.2).
# training generator – reserve 0.1 as the validation subset
train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=60,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    horizontal_flip=True,
    validation_split=0.1
)

Using the .flow_from_directory method, create the train, validation, and
test generators. TRAIN_DIR, BATCH_SIZE, and CLASS_MODE are predefined
values.
train_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='training',
    color_mode='rgb',
    shuffle=True,
    seed=71
)
valid_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='validation',
    color_mode='rgb',
    shuffle=True,
    seed=71
)
# Test generator for evaluation purposes (only rescaling applied)
test_gen = ImageDataGenerator(
    rescale=1./255
)
test_generator = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=71
)

The above approach is referred to as hold-out validation, where a single validation
subset is created. Another common method of cross-validation is K-fold
validation, where the training dataset is divided into K equal portions; at each
training cycle, one of the K portions is reserved for validation and the remaining
K-1 portions are used for training. The performance score is calculated by averaging
the K results. This approach is effective when the data size is very small. A more
computationally intensive approach would be to shuffle the data and perform K-fold
validation several times, one run for each shuffled dataset. The final result can be
computed by averaging over all K-fold validations.
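As a minimal sketch of K-fold validation (not taken from the companion code), the following uses scikit-learn's StratifiedKFold on a synthetic dataset, with a simple logistic-regression classifier as a stand-in for a real model:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy stand-in dataset; replace with real features X and labels y
X, y = make_classification(n_samples=500, n_features=10, random_state=69)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=69)
scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(clf.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

print(f"Mean 5-fold accuracy: {np.mean(scores):.3f}")  # average over the K results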
Performance evaluation is an important part of any model evaluation. Here we
list the common methods of evaluation, but it should be kept in mind that these
metrics can be used in combination and new ones can be defined if the data cannot
be correctly evaluated by any of them. It will be instructive to start by defining the
terminology used in performance evaluation from a classification perspective.
True Negative (TN) = correct prediction of non-existence of a class
False Negative (FN) = incorrect prediction of non-existence of a class
False Positive (FP) = incorrect prediction of class when it actually exists
True Positive (TP) = correct prediction of existence of class
• Accuracy

It indicates the ratio of correct classifications, whether positive or negative.

Accuracy = (TP + TN)/(TP + FP + FN + TN)



• Precision

It is the fraction of true positives among all samples classified as positive. It is also
referred to as the positive predictive value (PPV).

Precision = TP/(TP + FP)

• Recall

It is a measure of how correctly positives were classified as positive. It is also
referred to as sensitivity or the true positive rate (TPR).

Recall = TP/(TP + FN)

• Specificity

It is a measure of how correctly negatives were actually classified as negatives. It is
also referred to as the true negative rate.

Specificity = TN/(FP + TN)

• F1-score

It is a measure of accuracy and is the harmonic mean of Recall and Precision.

F1 = 2TP/(2TP + FP + FN) = 2 ∗ Precision ∗ Recall/(Precision + Recall)
   = 2/(1/Recall + 1/Precision)

• F2-score

The F2 score is a weighted average of recall and precision that gives more weight to
recall than precision.

F2 = 5 ∗ Precision ∗ Recall/(4 ∗ Precision + Recall)

• Confusion Matrix

It is a matrix showing classification results by comparing actual classes and predicted
classes. It is most informative in multiclass scenarios.
• Precision-Recall (PR) curve

It is a plot of precision against recall as the classification threshold is varied.
• Receiver Operating Characteristics (ROC) curve

It is a plot of Recall against the false positive rate (1 − Specificity). It is used to judge
the optimality of the model and has origins in radar processing, where the false positive
rate is known as the probability of false alarm.
• Area under the ROC curve (AUC)

It is used to measure model performance based on the area under the ROC curve. It
falls between 0 and 1; values well above 0.5 are desirable because 0.5 represents a
random guess.
The above metrics are well-known and widely used. In addition, most of them can
be easily imported from the sklearn.metrics module. Below is an example of
how this can be achieved in a single line of code.

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, fbeta_score, accuracy_score, roc_curve, roc_auc_score, auc
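To make the definitions above concrete, the following short sketch applies these functions to a hypothetical binary problem; the printed values follow directly from the formulas given earlier:

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, fbeta_score)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical ground truth
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])  # hypothetical predictions

print(accuracy_score(y_true, y_pred))        # (TP + TN)/(TP + FP + FN + TN) = 6/8 = 0.75
print(confusion_matrix(y_true, y_pred))      # rows = actual classes, columns = predicted
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
print(p, r, f1)                              # TP=4, FP=1, FN=1 -> 0.8, 0.8, 0.8
print(fbeta_score(y_true, y_pred, beta=2))   # F2 weights recall above precision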

4.3.1 Data Preprocessing for Deep Learning

The input data used for machine learning or deep learning comes in various formats
such as text, images, and even videos. Before feeding this data into a deep learning
model, it is necessary to put it into a format that makes the task of training tractable.
This means that, besides denoising, the data will need to be vectorized and normalized
as part of preprocessing. In Chap. 2, we briefly discussed some of the data structures
that can be handled in machine learning algorithms. Normalization is especially
important in image data processing, where the [0, 255] pixel range is transformed
to the [0, 1] range used in most machine learning models. As can be seen in Sect. 4.3 of this
chapter, normalization is incorporated into the ImageDataGenerator for this
purpose.
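As a minimal sketch, this normalization of a hypothetical batch of 8-bit images to the [0, 1] range can be done directly in NumPy:

import numpy as np

# Hypothetical batch of 8-bit RGB images: (batch, height, width, channels)
images = np.random.randint(0, 256, size=(32, 64, 64, 3), dtype=np.uint8)

x = images.astype('float32') / 255.0  # vectorized rescale from [0, 255] to [0, 1]
print(x.min(), x.max())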

4.3.2 Problem of Overfitting

Overfitting happens when the model performance on validation data stops improving
while the performance on the training data continues to improve. This is usually seen
as validation accuracy remaining constant while training accuracy continues to approach 100%.
Viewed from the loss function side, the training loss decreases with each epoch
while the validation loss stops decreasing or, even worse, increases. This behavior is an
indication of poor generalization of the model to unseen data. Fighting overfitting is
a common problem in machine learning, including deep learning. There are various
strategies that can be considered to tackle the overfitting problem (Fig. 4.3).
4.3 Methods of Evaluating Machine Learning Models 79

Fig. 4.3 Illustration of overfitting

Overfitting happens because the model fails to generalize to new or unseen data,
and the simplest and most effective solution is to collect more data. However, this
is not always possible, so we have to deal with the available limited data to improve
the situation. The way out of this problem is to employ methods such as regularization.
To understand what is happening when overfitting occurs, it is constructive to imagine
trying to fit noisy data to a quadratic function. The data will obviously contain outliers.
With overfitting, the model tries to approximate a function that passes through all the
data points, including the outliers. This is overfitting because the resulting function is
only good for this particular dataset. The consequence is that if we get new data with
the same quadratic behavior but different outliers, then our approximation will not fit
properly. In the absence of additional data to smooth out the outliers, regularization is
our next best solution because it will try to control the large swings in the approximating
weights, thereby making generalization to new data possible.
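The quadratic intuition can be reproduced in a few lines. The following sketch, which assumes scikit-learn and is not part of the book's companion code, fits the same noisy quadratic with an unregularized high-degree polynomial (which chases the noise) and with an L2-penalized ridge model (which damps the weight swings):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(0, 1.0, size=30)  # noisy quadratic data

# An unregularized degree-15 polynomial chases every point, including outliers
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(x, y)

# Ridge adds an L2 penalty on the weights, damping the large swings
smooth = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(x, y)

print(overfit.score(x, y), smooth.score(x, y))  # overfit model fits the training noise more closely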
When using Keras, the following regularization techniques can be applied (see the
sketch after this list):
Layer weight regularization—There are three forms of regularizers, namely kernel_
regularizer, where a penalty is applied on the layer's kernel; bias_regularizer, where
a penalty is applied on the layer's bias; and activity_regularizer, where a penalty is
applied on the layer's output. For all three, L1, L2, or a combination of L1 and
L2 (l1_l2) can be used.
Dropout—Network nodes are randomly selected and removed during training in order
to reduce the network complexity.
Network capacity reduction—Network units define the size of the output of the
layer, therefore a reduction in capacity will lead to fewer parameters and thereby
increase the ability to generalize. Moreover, a large network can be thought
of as having the ability to memorize a large volume of data but may not perform well
when required to make decisions on new data.
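A minimal sketch of the first two techniques on a toy dense network follows; the layer sizes and penalty strengths are illustrative assumptions only:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,),
                 kernel_regularizer=regularizers.l2(1e-4),       # penalty on the layer's kernel
                 bias_regularizer=regularizers.l1(1e-5),         # penalty on the layer's bias
                 activity_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),  # penalty on the output
    layers.Dropout(0.5),  # randomly drops units during training
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])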

For completeness' sake, underfitting is also a problem, occurring when the model
performs well on neither the training data nor unseen or new data. It also leads to poor
generalization, but this may be indicative of poor model selection or untrainable data.
These kinds of problems must be solved before training starts.

4.4 The Machine Learning Workflow

Up to now we have not given any guidelines on how to attack a machine learning
problem from the start. Here we explain the steps involved in the machine learning/
deep learning workflow (Fig. 4.4).
Problem Definition
The first step is to define the problem at hand in terms of the required data and what
we are trying to achieve as output. At this stage, it is good practice to decide on
whether the problem will be binary, multiclass, multilabel, etc. Most problems have
an application domain with vast examples in the literature. It is advisable to make a
survey of available approaches, among other things.
Data Collection
Data acquisition is one of the most tedious and time-consuming parts of the workflow.
The data must be large enough to be representative of the problem under analysis.
As previously stated in the overfitting section, lack of sufficient data contributes to
lack of model generalization when the model is deployed on new data. So, how much
data is enough? There is no straight answer to this question, but a moderate deep
learning problem would require 10,000 to 100,000 data samples. On the other hand,
more complex problems like machine translation would require up to one million
samples. A general rule of thumb for computer vision is to collect at least 1000 data
samples per class. When enough data is not available, methods of generating artificial
data such as data augmentation can be implemented.

Fig. 4.4 Illustration of the machine learning workflow from preparation to deployment

Data Preparation
Having collected enough data, the next step would be to transform the data into
machine-consumable format. This is where vectorization and normalization can be
applied before inputting the data to the model.
Defining Performance Measures
As described in the previous sections, how to measure performance should be decided
before the models are selected. Metrics like precision, recall, accuracy, etc. come
into the picture here. Leaving the decision on metrics to later stages will result in
wasted effort and time, and re-evaluation of the model performance may become necessary.
Given that most deep learning models require a lot of time in terms of epochs per
run, setting metrics from the start will help reduce the chances of doing this task
repeatedly.
Model Selection
With the problem defined and the data and performance measures available, the
task of deciding on the model comes into place. There is no formula for this task, but
it is always good to start from a simple model with a few layers and few units and
increase the complexity until no gain in performance can be realized.
Train Model
Training is the critical part of the whole process, as it is at this point that we start to
see the level of difficulty of the task at hand. During training, performance metrics
can be monitored with tools like TensorBoard, and the decision on whether to keep
the current model can be made as quickly and as early as possible.
Model Evaluation
If the model runs to the end, the remaining thing to do would be to evaluate model
performance against benchmarks or target values. If, for example, it is required to
achieve a 99% accuracy but the model reaches only 70%, then it will be better to
change the model or adjust the hyperparameter space. At this stage, we may decide
to abandon the model or choose alternative performance measures.
Hyperparameter Tuning
The hyperparameters of the model can be tuned to achieve a certain level of
performance. This includes changing the optimizer, learning rate, etc., and including
measures for reducing overfitting if this is a problem. After hyperparameter tuning,
we then retrain the model to see if there are any gains to be achieved. By
repeating this experimentation phase several times, we end up with the best model for
the given data.
Deployment
When we are satisfied with the performance of the model on unseen data, it can be
deployed into use.

Maintenance
In the maintenance phase, we keep checking the real performance of the model to
decide on whether additional data acquisition would be required.

4.5 Concluding Remarks

Wrapping up what we have learnt so far, we gave a categorization of the branches of
machine learning. This was followed by an introduction to machine learning evaluation methods.
After that, we touched on data preprocessing for deep learning, followed by the problem of
overfitting, which is of importance in all machine learning algorithms. We then ended
by presenting the workflow of machine learning development. The above content
should give a pretty good picture of how general deep learning algorithms are
constructed.

4.6 Self-evaluation Exercises

1. What are the three main categories of machine learning? Explain the difference
between supervised, unsupervised, and reinforcement learning.
2. What are some common metrics used to evaluate machine learning models?
Describe the differences between accuracy, precision, recall, F1-score, and AUC-
ROC.
3. What is overfitting in machine learning? Why is it a problem, and how can it
be addressed? Describe some techniques for preventing overfitting in machine
learning models.
4. What is the typical workflow for building a machine learning model? Describe
the steps involved, including data preparation, feature selection, model selection,
hyperparameter tuning, and model evaluation.
5. Give an example of building a machine learning or deep learning model using a
specific framework or library (e.g., scikit-learn, TensorFlow, PyTorch). Describe
the steps involved in building the model, including data preparation, feature
selection, model definition, training, and evaluation.

References

1. Francois C (2018) Deep learning with Python. Manning Publications Co.
2. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
3. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285
4. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction (Adaptive Computation and Machine Learning series), A Bradford Book, 2nd edn
5. Gamba J (2020) Radar signal processing for autonomous driving. Springer
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc., New York
7. Scikit-learn. https://scikit-learn.org/stable/
8. OpenCV. https://www.learnopencv.com/
9. El Mrabet MA, El Makkaoui K, Faize A (2021) Supervised machine learning: a survey. In: 2021 4th International conference on advanced communication technologies and networking (CommNet), pp 1–10. https://doi.org/10.1109/CommNet52204.2021.9641998
Chapter 5
Remote Sensing Example for Deep Learning

5.1 Introduction

Recently, remote sensing has become heavily dependent on machine learning algorithms
such as decision trees, random forests, support vector machines, and artificial
neural networks. However, there is an increasing recognition that deep learning, which
has been applied successfully in other areas such as computer vision and language
processing, is a viable alternative to traditional machine learning methods [1]. With
the availability of high-resolution imagery, it is becoming more attractive to venture
into deep learning as a key technology to achieve previously unimaginable classification
accuracies [2, 3]. In this chapter, we will work through a specific example of the
application of deep learning algorithms to one important area of remote sensing data
analysis, namely land cover classification. Land cover and land use change analysis
is of importance in many practical areas such as urban planning, environmental
degradation monitoring, and disaster management [4, 5].

5.2 Background of the Remote Sensing Example

The main goal of this chapter is to provide a detailed understanding of the performance
of various deep learning models applied to the problem of land cover classification,
starting from a known dataset. We divide the presentation into five main parts:
preliminary information on the models covering input data restrictions, followed by
exploration of the EuroSAT data contents, preprocessing steps, and performance
evaluation results for several selected models in Sect. 5.3. Finally, we test the
performance of the models with a new dataset to get a clear picture of the limitations
of the presented approach in the face of unseen data in Sect. 5.4.
This application example assumes basic knowledge of the Python programming
language. There is an abundance of easy-to-follow material on this topic for readers

of all backgrounds, publicly available starting from python.org. We therefore assume the reader
is familiar with Python syntax and with how to find needed solutions on platforms
such as Stack Overflow. In addition, it is not the intention of this chapter to
provide the mathematical details of the inner workings of the algorithms behind the presented
models. Having said this, this chapter is meant to give the interested reader a good
insight into the performance of the Keras Applications models that are available for land
cover classification. The techniques introduced here can be extended, improved, and
applied to a broad range of problems.

5.3 Remote Sensing: Land Cover Classification

The EuroSAT dataset is obtained from the openly and freely accessible Sentinel-2
satellite images provided by the Earth observation program Copernicus. It has been
demonstrated in [2] that the RGB bands of the Sentinel data give the best results in terms
of accuracy. We will therefore only use the RGB dataset in this chapter. This does
not in any way mean that the other bands cannot be used for classification.
The folder structure for algorithm evaluation is shown in Fig. 5.1. We also give a
flow diagram of the approach used to train and test the data in Fig. 5.2.

5.4 Background of Experimental Comparison of Keras Applications Deep Learning Models Performance on EuroSAT Dataset

We make a comparison of the performance of various classes of Keras models
using the publicly available EuroSAT dataset (https://github.com/phelber/EuroSAT)
as input. We also build on the Kaggle (land-cover-classification-with-eurosat-dataset)
example, which is under the Apache 2.0 open-source software license, to expand
the range of models evaluated, and utilize Google Colab for convenient and efficient
execution of the models by taking advantage of the available GPU resources.
The high computing power is important during the training phase, where multiple
epochs have to be executed. In this chapter, specifically, we will evaluate ResNet,
VGG, NasNet, and EfficientNet V1 models to see how they compare under similar
conditions, except where model-specific treatment is necessary. Since our approach is
hands-on, the code for this chapter is available at [6]. We should also emphasize that
the scope of this material is the evaluation of model performance; detailed analysis
of the implications of these results will be left for later treatment. However, we will
also briefly compare how the models perform on completely uncorrelated data to
understand the limitations and challenges of deep learning algorithm applications
across various datasets. So, this material is just the beginning of a long journey. The
details of the evaluated models are as follows.

Fig. 5.1 EuroSAT data folder structure for training and testing

Models Evaluated:
ResNet50
ResNet101
ResNet152
VGG16
VGG19
NasNetLarge
NasNetMobile
EfficientNet B0
EfficientNet B1
EfficientNet B2
EfficientNet B3
EfficientNet B4
EfficientNet B5
EfficientNet B6
EfficientNet B7

Fig. 5.2 Example processing flow of the EuroSAT data for deep learning algorithm evaluation

Comparison Methods and Metrics:

• Training/Validation accuracy and loss (visualization)
• Precision, Recall, F2 score (PRF)
• Confusion matrix

5.4.1 Input Data Requirements

Keras offers many models under the Applications API that can be used as a base on top
of which upper layers, including dense layers, can be added. Using this approach, we
check how the models perform on the publicly available EuroSAT dataset. For more
details about the dataset and how it was collected, refer to https://github.com/phelber/EuroSAT.
We briefly discuss the preliminary information related to the Keras
models below.

5.4.2 Input Restrictions (from Keras Application Page)

The full details and arguments for each model can be found at the following link:
https://keras.io/api/applications/
Here we are only interested in highlighting the limitations imposed on the input data
for each model at the time of writing.
NasNetLarge has the highest top-1 and top-5 accuracy. The top-1 and top-5 accuracy
refers to the model's performance on the ImageNet validation dataset. However, there
is an issue with earlier implementations of the model, as described below.
During training, we found it necessary to modify the model library file's (in
Keras applications) input shape argument "require_flatten" by setting it to
"False" before running the training. Without this modification, an error message like
"ValueError: When setting `include_top = True` and loading `imagenet` weights,
`input_shape` should be (331, 331, 3)." will be thrown for NasNetLarge even if the
argument "include_top" is set to False. The argument "require_flatten" is set
to "True" by default, hence the need to make this adjustment to avoid the bug.
However, for EfficientNet models the input argument is set to "require_
flatten = include_top" by default, with the restriction that min_size
= 32. On the other hand, the min_size restriction was not documented in the
above API link at the time of writing.
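As a hedged illustration, instantiating NASNetLarge as a convolution base with a non-default input shape looks as follows (subject to the require_flatten workaround described above):

from tensorflow.keras.applications.nasnet import NASNetLarge

# Convolution base only; with include_top=False, input shapes other than
# (331, 331, 3), e.g., (224, 224, 3), are accepted once the require_flatten
# workaround described above is in place
conv_base = NASNetLarge(include_top=False,
                        weights='imagenet',
                        input_shape=(224, 224, 3))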

5.4.2.1 ResNet50, ResNet101, ResNet152

ResNet50
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) (with 'channels_last' data format) or (3,
224, 224) (with 'channels_first' data format)). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
ResNet101
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) (with 'channels_last' data format) or (3,
224, 224) (with 'channels_first' data format)). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
ResNet152
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) (with 'channels_last' data format) or (3,
224, 224) (with 'channels_first' data format)). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.

5.4.2.2 VGG16 and VGG19

VGG16
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) (with channels_last data format) or (3,
224, 224) (with channels_first data format)). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.
VGG19
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) (with channels_last data format) or (3,
224, 224) (with channels_first data format)). It should have exactly 3 input channels,
and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one
valid value.

5.4.2.3 NasNetLarge

NasNetLarge
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (331, 331, 3) for NasNetLarge). It should have exactly
3 input channels, and width and height should be no smaller than 32, e.g., (224, 224,
3) would be one valid value.

5.4.2.4 NasNetMobile

input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise
the input shape has to be (224, 224, 3) for NasNetMobile). It should have exactly
3 input channels, and width and height should be no smaller than 32, e.g., (224, 224,
3) would be one valid value.

5.4.2.5 EfficientNet B0 to B7

No input width and height restriction.
input_shape: Optional shape tuple, only to be specified if include_top is False. It
should have exactly 3 input channels.

5.4.3 Training and Test Results

Below we give a visual summary of results obtained by running the above as
convolution base models under similar settings for each class of models and using the
same input shape (64 × 64 × 3). The first part of the simulation used a training–test split
of 70/30 while the latter half used 80/20.

Please refer to the companion Notebook (eurosat-projectbook-blg.ipynb) for
further hands-on experience [6].

5.4.3.1 Data Exploration

Import the required libraries for data loading and preprocessing.

import os                        # file and directory manipulation
import shutil                    # file copying, etc.
import random                    # random number generation
from tqdm import tqdm            # execution progress bar
import numpy as np               # array processing
import pandas as pd              # data folder manipulation
import PIL                       # image visualization and processing tool
import matplotlib.pyplot as plt  # plotting functions

Mount Google Drive to access files from Google Colab.

from google.colab import drive
drive.mount('/content/drive')

Set the EuroSAT dataset path and extract labels. There are 10 classes, namely
AnnualCrop, Pasture, PermanentCrop, Residential, Industrial, River, SeaLake,
HerbaceousVegetation, Highway, and Forest.

DATASET = "/content/drive/My Drive/Colab Notebooks/EuroSAT/2750"
LABELS = os.listdir(DATASET)
print(LABELS)

Next, we plot the class distribution of the EuroSAT dataset. There are a total of 27,000
images distributed among the classes, as shown in Fig. 5.3; a sketch of the plotting code is given below.
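The plotting code itself is not reproduced here; a minimal sketch, assuming the DATASET and LABELS variables defined above, could look like this:

# DATASET and LABELS are defined in the previous cell
counts = [len(os.listdir(os.path.join(DATASET, l))) for l in LABELS]

plt.figure(figsize=(10, 4))
plt.bar(LABELS, counts)                 # one bar per land cover class
plt.xticks(rotation=45, ha='right')
plt.ylabel('number of images')
plt.title('EuroSAT class distribution')
plt.tight_layout()
plt.show()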
Select 20 images arbitrarily from the whole dataset and show the classes to which
they belong.

Fig. 5.3 Distribution of data among the classes

img_paths = [os.path.join(DATASET, l, l + '_1000.jpg') for l in LABELS]
img_paths = img_paths + [os.path.join(DATASET, l, l + '_2000.jpg') for l in LABELS]

def plot_sat_imgs(paths):
    plt.figure(figsize=(15, 8))
    for i in range(20):
        plt.subplot(4, 5, i + 1, xticks=[], yticks=[])
        img = PIL.Image.open(paths[i], 'r')
        plt.imshow(np.asarray(img))
        plt.title(paths[i].split('/')[-2])

plot_sat_imgs(img_paths)

Fig. 5.4 Samples arbitrarily selected from the dataset for visual inspection

Figure 5.4 shows the result of the selected samples.
The sample data shows the variability in the contents of the classes, from AnnualCrop
to Forest. Some similarities can be observed, for example, between the Highway and
Industrial classes. The challenge for the deep learning algorithms is to
distinguish these classes by minimizing false positives and false negatives, among
other metrics that can be used. Although NIR band data is available, our evaluation
will solely use the RGB bands.

5.4.3.2 Data Preprocessing

Next, the data is split into training and test sets using StratifiedShuffleSplit from
scikit-learn. We also make use of the Keras ImageDataGenerator for data augmentation.

import re
from sklearn.model_selection import StratifiedShuffleSplit
from keras.preprocessing.image import ImageDataGenerator

TRAIN_DIR = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/training'
TEST_DIR = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/testing'
BATCH_SIZE = 64
NUM_CLASSES = len(LABELS)
INPUT_SHAPE = (64, 64, 3)
CLASS_MODE = 'categorical'

# Create training and testing directories
for path in (TRAIN_DIR, TEST_DIR):
    if not os.path.exists(path):
        os.mkdir(path)

Next we create a subdirectory for each class label under the training and testing folders,
then copy the train and test data into their respective folders.

# Create class label subdirectories under the train and test folders
for l in LABELS:
    if not os.path.exists(os.path.join(TRAIN_DIR, l)):
        os.mkdir(os.path.join(TRAIN_DIR, l))
    if not os.path.exists(os.path.join(TEST_DIR, l)):
        os.mkdir(os.path.join(TEST_DIR, l))

# Execute this once to load split data into train and test folders respectively
data = {}
for l in LABELS:
    for img in os.listdir(DATASET + '/' + l):
        data.update({os.path.join(DATASET, l, img): l})

X = pd.Series(list(data.keys()))
y = pd.get_dummies(pd.Series(data.values()))

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)

# Split the list of image paths
for train_idx, test_idx in split.split(X, y):
    train_paths = X[train_idx]
    test_paths = X[test_idx]

# Define a new path for each image depending on training or testing
s_in = '/content/drive/My Drive/Colab Notebooks/EuroSAT/2750'
s_train = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/training'
s_test = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/testing'
new_train_paths = [re.sub(s_in, s_train, i) for i in train_paths]
new_test_paths = [re.sub(s_in, s_test, i) for i in test_paths]
train_path_map = list(zip(train_paths, new_train_paths))
test_path_map = list(zip(test_paths, new_test_paths))

# Move the files
print("moving training files..")
for i in tqdm(train_path_map):
    if not os.path.exists(i[1]):
        if not os.path.exists(re.sub('training', 'testing', i[1])):
            shutil.copy(i[0], i[1])
print("moving testing files..")
for i in tqdm(test_path_map):
    if not os.path.exists(i[1]):
        if not os.path.exists(re.sub('testing', 'training', i[1])):
            shutil.copy(i[0], i[1])

# Create an ImageDataGenerator instance which can be used for data augmentation
train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=60,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True
)
train_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR, target_size=(64, 64),
    batch_size=BATCH_SIZE, class_mode=CLASS_MODE,
    color_mode='rgb', shuffle=True,
    seed=69
)

# Test generator for evaluation purposes with no augmentations, just rescaling
test_gen = ImageDataGenerator(
    rescale=1./255,
)
test_generator = test_gen.flow_from_directory(
    directory=TEST_DIR, target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    color_mode='rgb',
    shuffle=False,
    seed=69
)

Confirm and save the class indices.

# Print class indices
print(train_generator.class_indices)

# Save class indices
np.save('class_indices', train_generator.class_indices)
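As a usage note, the saved mapping can later be restored, for example at inference time; allow_pickle is needed because the .npy file stores a Python dict:

# Restore the mapping later, e.g., at inference time
class_indices = np.load('class_indices.npy', allow_pickle=True).item()
print(class_indices)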

The next sections provide details of the deep learning algorithms that will be used in
the evaluation. We will first start by importing all the necessary packages, followed
by the definition of a generic function for model compilation and then some functions to
plot and visualize results. The ResNet framework model will be taken as an example
to demonstrate the evaluation procedure. After that, results of other selected models
will be presented.

import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D, BatchNormalization
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg19 import VGG19
from tensorflow.keras.applications.resnet import ResNet50, ResNet101, ResNet152
from tensorflow.keras.applications import ResNet50V2, ResNet152V2
from tensorflow.keras.applications import Xception, InceptionV3  # used in compile_model below

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, fbeta_score, accuracy_score
from tensorflow.keras.applications.nasnet import NASNetLarge, NASNetMobile
from tensorflow.keras.applications import EfficientNetB0, EfficientNetB1, EfficientNetB2
from tensorflow.keras.applications import EfficientNetB3, EfficientNetB4, EfficientNetB5
from tensorflow.keras.applications import EfficientNetB6, EfficientNetB7
from tensorflow.keras.regularizers import l2, l1, l1_l2
from tensorflow.python.keras import regularizers

Configure GPUs for processing if available. It is recommended to use the first
available GPU for TensorFlow processing (https://www.tensorflow.org/guide/gpu#using_multiple_gpus).

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)

We then define a generic function for model selection and compilation.

# Note that for different CNN models we will be using a different setup of dense layers
def compile_model(cnn_base, input_shape, n_classes, optimizer, fine_tune=None):
    if cnn_base in ('ResNet50', 'ResNet50V2', 'ResNet101', 'ResNet152', 'ResNet152V2'):
        resnets = {'ResNet50': ResNet50, 'ResNet50V2': ResNet50V2,
                   'ResNet101': ResNet101, 'ResNet152': ResNet152,
                   'ResNet152V2': ResNet152V2}
        conv_base = resnets[cnn_base](include_top=False,
                                      weights='imagenet',
                                      input_shape=input_shape)
        top_model = conv_base.output
        top_model = Flatten()(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = Dense(1024, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)  # added
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = Dense(1024, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)  # added
        top_model = Dropout(0.2)(top_model)

    elif cnn_base in ('VGG16', 'VGG19'):
        if cnn_base == 'VGG16':
            conv_base = VGG16(include_top=False,
                              weights='imagenet',
                              input_shape=input_shape)
        else:
            conv_base = VGG19(include_top=False,
                              weights='imagenet',
                              input_shape=input_shape)
        top_model = conv_base.output
        top_model = Flatten()(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)

    elif cnn_base in ('Xception', 'InceptionV3'):
        if cnn_base == 'Xception':
            conv_base = Xception(include_top=False,
                                 weights='imagenet',
                                 input_shape=input_shape)
        else:
            conv_base = InceptionV3(include_top=False,
                                    weights='imagenet',
                                    input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = Dropout(0.2)(top_model)

    elif cnn_base in ('NASNetLarge', 'NASNetMobile'):
        if cnn_base == 'NASNetLarge':
            conv_base = NASNetLarge(include_top=False,
                                    weights='imagenet',
                                    input_shape=input_shape)
        else:
            conv_base = NASNetMobile(include_top=False,
                                     weights='imagenet',
                                     input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)

    elif cnn_base.startswith('EfficientNet'):
        effnets = {'EfficientNetB0': EfficientNetB0, 'EfficientNetB1': EfficientNetB1,
                   'EfficientNetB2': EfficientNetB2, 'EfficientNetB3': EfficientNetB3,
                   'EfficientNetB4': EfficientNetB4, 'EfficientNetB5': EfficientNetB5,
                   'EfficientNetB6': EfficientNetB6, 'EfficientNetB7': EfficientNetB7}
        conv_base = effnets[cnn_base](include_top=False,
                                      weights='imagenet',
                                      input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)

    output_layer = Dense(n_classes, activation='softmax')(top_model)

    model = Model(inputs=conv_base.input, outputs=output_layer)

    # Freeze the convolution base, optionally unfreezing layers from index fine_tune onwards
    if type(fine_tune) == int:
        for layer in conv_base.layers[fine_tune:]:
            layer.trainable = True
    else:
        for layer in conv_base.layers:
            layer.trainable = False

    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])

    return model

Utility functions to plot training/validation progress and display results.

def plot_history(history):
    acc = history.history['categorical_accuracy']
    val_acc = history.history['val_categorical_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(acc)
    plt.plot(val_acc)
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')

    plt.subplot(1, 2, 2)
    plt.plot(loss)
    plt.plot(val_loss)
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')

    plt.show()

def display_results(y_true, y_preds, class_labels):
    results = pd.DataFrame(precision_recall_fscore_support(y_true, y_preds),
                           columns=class_labels).T
    results.rename(columns={0: 'Precision',
                            1: 'Recall',
                            2: 'F-Score',
                            3: 'Support'}, inplace=True)
    conf_mat = pd.DataFrame(confusion_matrix(y_true, y_preds),
                            columns=class_labels,
                            index=class_labels)
    f2 = fbeta_score(y_true, y_preds, beta=2, average='micro')
    accuracy = accuracy_score(y_true, y_preds)
    print(f"Accuracy: {accuracy}")
    print(f"Global F2 Score: {f2}")
    return results, conf_mat

def plot_predictions(y_true, y_preds, test_generator, class_indices):
    fig = plt.figure(figsize=(20, 10))
    for i, idx in enumerate(np.random.choice(test_generator.samples, size=20, replace=False)):
        ax = fig.add_subplot(4, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(np.squeeze(test_generator[idx]))
        pred_idx = np.argmax(y_preds[idx])
        true_idx = y_true[idx]
        plt.tight_layout()
        ax.set_title("{}\n({})".format(class_indices[pred_idx], class_indices[true_idx]),
                     color=("green" if pred_idx == true_idx else "red"))

Example of evaluating the ResNet50 model.



# Model compilation and summary display
resnet50_model = compile_model('ResNet50', INPUT_SHAPE, NUM_CLASSES,
                               Adam(learning_rate=1e-2), fine_tune=None)
resnet50_model.summary()

# Initialize the train and test generators
train_generator.reset()
test_generator.reset()

N_STEPS = train_generator.samples // BATCH_SIZE
N_VAL_STEPS = test_generator.samples // BATCH_SIZE
N_EPOCHS = 100

# Define model callbacks
checkpoint = ModelCheckpoint(filepath='/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5',
                             monitor='val_categorical_accuracy',
                             save_best_only=True,
                             verbose=1)
early_stop = EarlyStopping(monitor='val_categorical_accuracy',
                           patience=10,
                           restore_best_weights=True,
                           mode='max')
reduce_lr = ReduceLROnPlateau(monitor='val_categorical_accuracy', factor=0.5,
                              patience=3, min_lr=0.00001)

# First perform pretraining of the dense layers
resnet50_history = resnet50_model.fit(train_generator,
                                      steps_per_epoch=N_STEPS,
                                      epochs=50,
                                      callbacks=[early_stop, checkpoint],
                                      validation_data=test_generator,
                                      validation_steps=N_VAL_STEPS)

# Re-train the whole network end to end
resnet50_model = compile_model('ResNet50', INPUT_SHAPE, NUM_CLASSES,
                               Adam(learning_rate=1e-4), fine_tune=0)

resnet50_model.load_weights('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5')

train_generator.reset()
test_generator.reset()

resnet50_history = resnet50_model.fit(train_generator,
                                      steps_per_epoch=N_STEPS,
                                      epochs=N_EPOCHS,
                                      callbacks=[early_stop, checkpoint, reduce_lr],
                                      validation_data=test_generator,
                                      validation_steps=N_VAL_STEPS)

# Plot loss and accuracy
plot_history(resnet50_history)
resnet50_model.load_weights('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5')

# Evaluate the model on the test data and compute precision, recall, F-score,
# and the confusion matrix. Display the PRF table.
class_indices = train_generator.class_indices
class_indices = dict((v, k) for k, v in class_indices.items())

test_generator_new = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=69
)

predictions = resnet50_model.predict(test_generator_new, steps=len(test_generator_new.filenames))
predicted_classes = np.argmax(np.rint(predictions), axis=1)
true_classes = test_generator_new.classes

prf, conf_mat = display_results(true_classes, predicted_classes, class_indices.values())
prf

# Display confusion matrix
conf_mat

# Save the model and the weights
resnet50_model.save('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/ResNet50_eurosat.h5')
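As a brief usage sketch (assuming the same Keras version used above), the saved model can later be restored for inference without repeating the training steps:

from keras.models import load_model

# Restore the saved model for inference; paths as used above
restored_model = load_model('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/ResNet50_eurosat.h5')
predictions = restored_model.predict(test_generator_new, steps=len(test_generator_new.filenames))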

Repeating the above steps for the ResNet50 model, all the other models can be
similarly evaluated. We present some selected model evaluation results below.

Basic Assumptions and Definitions:

The training/validation accuracy and loss are defined for each model. The models
are compiled with the Adam optimizer and categorical cross-entropy loss, and use the
categorical accuracy metric. This can be easily accomplished in Keras with one line
of code, shown below. No fine-tuning is applied.
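For reference, the single compile call used inside compile_model above is:

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])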
Precision is the ratio of true positives to all predicted positives of a given class.
Recall is the ratio of true positives to all actual positives of a given class that should
have been identified, i.e., it accounts for false negatives.
The F2-score is a weighted average of recall and precision that gives more weight
to recall than precision. It is defined by 5 * precision * recall/(4 * precision + recall).
For a given dataset, the support is defined as the number of occurrences of each
class. The expectation is to have balanced data in which there are no huge differences
in support.
As some informative details in terms of simulation settings, 100 or 200 epochs were
set for all models. Additionally, an EarlyStopping patience of 10 and a ReduceLROnPlateau
patience of 5 were used as defaults. The Adam optimizer with default values
was also applied to all models.
ResNet Models
ResNet50

Training/Validation Accuracy and Loss
See Fig. 5.5.
There is a steady increase in both training and validation accuracy. No overfitting
can be observed, although early stopping kicks in only after the 40th epoch due to
lack of improvement in validation categorical accuracy. The loss is small and close
to zero after the initial swings.

Fig. 5.5 Training and validation loss and accuracy for ResNet50 model

Precision, Recall, F-Score, Support
See Fig. 5.6.
High recall (> 99%) is obtained for the Forest class (99.9% recall, 95.1% precision)
and the Residential class (100% recall, 94.7% precision). The global F2-score
(not shown here) is estimated at 96.4%.

Confusion Matrix
See Fig. 5.7.

Confusion Matrix (%)
See Fig. 5.8.

Fig. 5.6 Precision, recall, and F-score of each of the classes for ResNet50 model

Fig. 5.7 Confusion matrix showing number of hits for each of the classes for ResNet50 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.95 0.00 0.00 0.00 0.03 0.01 0.00 0.00
Highway 0.01 0.00 0.00 0.94 0.01 0.00 0.00 0.01 0.03 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.03 0.01 0.00 0.00 0.94 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.02 0.00 0.01 0.00 0.95 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.02 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.94

Fig. 5.8 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The
Highway, Pasture, and SeaLake classes show the lowest accuracy of 94%, while the Forest and
Residential classes can be classified with 100% accuracy

ResNet101

Training/Validation Accuracy
See Fig. 5.9.
There is a steady increase in both training and validation accuracy. No overfitting
can be observed, although early stopping kicks in only after the 35th epoch due to
lack of improvement in validation categorical accuracy. The loss is small and close
to zero after a high initial validation loss.

Precision, Recall, F-Score, Support
See Fig. 5.10.
High recall (≥ 99%) is obtained for the Forest class (100% recall, 94.2% precision)
and the Residential class (99.8% recall, 94.1% precision). The global F2-score
(not shown here) is estimated at 96.6%.

Fig. 5.9 Training and validation loss and accuracy for ResNet101 model

Fig. 5.10 Precision, recall, and F-score of each of the classes for ResNet101 model

Confusion Matrix
See Fig. 5.11.

Confusion Matrix (%)
See Fig. 5.12.

ResNet152

Training/Validation Accuracy
See Fig. 5.13.
There is a steady increase in both training and validation accuracy. No overfitting
can be observed, although early stopping kicks in only after the 35th epoch due to
lack of improvement in validation categorical accuracy. The loss is small and close
to zero after the initial swings.

Fig. 5.11 Confusion matrix showing number of hits for each of the classes for ResNet101 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.00 0.03 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.96 0.00 0.00 0.00 0.01 0.01 0.00 0.00
Highway 0.00 0.00 0.00 0.96 0.01 0.00 0.00 0.00 0.02 0.00
Industrial 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.04 0.00 0.00
Pasture 0.01 0.03 0.01 0.00 0.00 0.93 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.02 0.00 0.00 0.00 0.95 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.95

Fig. 5.12 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
Pasture class shows the lowest accuracy of 93%, while the Forest and Residential classes can be
classified with 100% accuracy

Fig. 5.13 Training and validation loss and accuracy for ResNet152 model

Precision, Recall, F-Score, Support
See Fig. 5.14.
High recall (> 99%) is obtained for the Forest class (99.8% recall, 93.5% precision)
and the Residential class (100% recall, 91.7% precision). The global F2-score
(not shown here) is estimated at 95.5%.

Confusion Matrix
See Fig. 5.15.

Confusion Matrix (%)
See Fig. 5.16.
VGG Models
VGG16

Fig. 5.14 Precision, recall, and F-score of each of the classes for ResNet152 model

Confusion Matrix

AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 1151 1 0 4 0 6 44 0 5 0
Forest 0 1202 0 0 0 0 0 2 0 0
HerbaceousVegetation 9 19 1093 2 1 5 36 31 0 0
Highway 15 0 5 1135 10 5 5 16 23 0
Industrial 0 0 0 1 720 0 0 28 1 0
Pasture 6 25 8 0 1 553 5 0 2 0
PermanentCrop 3 1 15 2 9 0 799 12 1 0
Residential 0 0 0 0 0 0 0 993 0 0
River 6 1 1 18 1 0 1 1 829 0
SeaLake 12 37 1 0 0 1 0 0 4 959

Fig. 5.15 Confusion matrix showing number of hits for each of the classes for ResNet152 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.95 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.02 0.91 0.00 0.00 0.00 0.03 0.03 0.00 0.00
Highway 0.01 0.00 0.00 0.93 0.01 0.00 0.00 0.01 0.02 0.00
Industrial 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.04 0.00 0.00
Pasture 0.01 0.04 0.01 0.00 0.00 0.92 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.02 0.00 0.01 0.00 0.95 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.97 0.00
SeaLake 0.01 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.95

Fig. 5.16 Confusion matrix showing ratio of hits for each of the classes for ResNet152 model. The
HerbaceousVegetation class shows the lowest accuracy of 91%, while the Forest and Residential
classes can be classified with 100% accuracy

Fig. 5.17 Training and validation loss and accuracy for VGG16 model

Training/Validation Accuracy
See Fig. 5.17.
There is a steady increase in both training and validation accuracy. No over-fitting
can be observed, although early stopping comes in only after the 40th epoch due to
lack of improvement in validation categorical accuracy. The loss gradually decreases
to below 0.2.
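Curves such as those in Fig. 5.17 can be drawn directly from the history object returned by model.fit. A minimal sketch, assuming the model was compiled with the categorical_accuracy metric (the dictionary keys would differ otherwise):

import matplotlib.pyplot as plt

hist = history.history  # per-epoch metrics recorded by model.fit

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
ax_loss.plot(hist["loss"], label="training loss")
ax_loss.plot(hist["val_loss"], label="validation loss")
ax_loss.set_xlabel("epoch")
ax_loss.legend()

ax_acc.plot(hist["categorical_accuracy"], label="training accuracy")
ax_acc.plot(hist["val_categorical_accuracy"], label="validation accuracy")
ax_acc.set_xlabel("epoch")
ax_acc.legend()
plt.show()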

Precision, Recall, F-Score, Support


See Fig. 5.18.
High recall (> 99%) can be obtained for the Forest class (99.2% recall, 97.9%
precision) and the Residential class (99.9% recall, 96.1% precision). The global
F2-score (not shown here) is estimated to be 97.1%.

Confusion Matrix
See Fig. 5.19.

Confusion Matrix (%)


See Fig. 5.20.
VGG19

Training/Validation Accuracy
See Fig. 5.21.
There is a steady increase in both training and validation accuracy. No over-fitting
can be observed, although early stopping comes in only after the 35th epoch due to
lack of improvement in validation categorical accuracy. The loss gradually decreases
to below 0.25.

Fig. 5.18 Precision, recall, and F-score of each of the classes for VGG16 model

Fig. 5.19 Confusion matrix showing number of hits for each of the classes for VGG16 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.93 0.00 0.00 0.00 0.00 0.01 0.06 0.00 0.00 0.00
Forest 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.01 0.95 0.00 0.00 0.01 0.01 0.01 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.02 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.01 0.00 0.96 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98

Fig. 5.20 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
AnnualCrop class shows the lowest accuracy of 93% while the Residential class can be classified
with 100% accuracy

Fig. 5.21 Training and validation loss and accuracy for VGG19 model

Precision, Recall, F-Score, Support


See Fig. 5.22.
High recall (> 99%) can be obtained for the Forest class (99.5% recall, 97.8%
precision) and the Residential class (99.5% recall, 97.6% precision). The global
F2-score (not shown here) is estimated to be 96.8%.

Confusion Matrix
See Fig. 5.23.

Fig. 5.22 Precision, recall, and F-score of each of the classes for VGG19 model

Fig. 5.23 Confusion matrix showing number of hits for each of the classes for VGG19 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.94 0.00 0.00 0.00 0.00 0.01 0.04 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.01 0.96 0.00 0.00 0.01 0.00 0.00 0.00 0.00
Highway 0.02 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.00 0.01 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00
PermanentCrop 0.02 0.00 0.03 0.00 0.01 0.00 0.93 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.01 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.96

Fig. 5.24 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
PermanentCrop class shows the lowest accuracy of 93%, while the Forest and Residential classes
can be classified with 100% accuracy

Confusion Matrix (%)


See Fig. 5.24.
NasNet Models
NasNetLarge

Training/Validation Accuracy
See Fig. 5.25.
There is a steady increase in both training and validation accuracy. No over-fitting
can be observed, although early stopping comes in only after the 85th epoch due to
lack of improvement in validation categorical accuracy. The loss remains close to
zero after the initial large swings.

Precision, Recall, F-Score, Support


See Fig. 5.26.
High recall (> 99%) can be obtained for the Forest class (99.9% recall, 97.9%
precision) and the Residential class (100% recall, 95.0% precision). The global
F2-score (not shown here) is estimated to be 97.6%.

Confusion Matrix
See Fig. 5.27.

Fig. 5.25 Training and validation loss and accuracy for NasNetLarge model

Fig. 5.26 Precision, recall, and F-score of each of the classes for NasNetLarge model

Confusion Matrix (%)


See Fig. 5.28.
NasNetMobile

Training/Validation Accuracy
See Fig. 5.29.
There is a steady increase in training accuracy. Over-fitting can be observed immediately
after the 1st epoch, and early stopping comes in as early as the 10th epoch due to
lack of improvement in validation categorical accuracy.

Fig. 5.27 Confusion matrix showing number of hits for each of the classes for NasNetLarge model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.98 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.98 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.95 0.00 0.00 0.05 0.00 0.00
Pasture 0.01 0.01 0.02 0.00 0.00 0.96 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.02 0.00 0.00 0.00 0.95 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.97 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.98

Fig. 5.28 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The Industrial and PermanentCrop classes show the lowest accuracy of 95%, while the Forest and
Residential classes can be classified with 100% accuracy

Fig. 5.29 Training and validation loss and accuracy for NasNetMobile model

The validation loss diverges despite the decreasing training loss. In this case, training
stopped much earlier than expected compared to the other models. Further model
tuning is required to improve both accuracy and loss and to avoid over-fitting.

Precision, Recall, F-Score, Support


See Fig. 5.30.
The precision and recall were lower than expected. The global F2-score (not shown
here) is estimated to be 63.0%. Further investigation is necessary to improve the
performance.

Confusion Matrix
See Fig. 5.31.

Fig. 5.30 Precision, recall, and F-score of each of the classes for NasNetMobile model

Fig. 5.31 Confusion matrix showing number of hits for each of the classes for NasNetMobile
model

Confusion Matrix (%)


See Fig. 5.32.
The NasNetMobile model appears unsuitable for this classification task, as it
shows the worst performance among all models evaluated so far.
EfficientNet Models
EfficientNet models trade off performance for speed. Results are therefore expected
to show lower performance compared to the VGG and NasNet models.
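The EfficientNet variants are available from keras.applications in the same way as the other families compared here. A minimal transfer-learning sketch for EfficientNet B0 on the ten classes; the classifier head is illustrative and may differ from the exact configuration used in these experiments:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

# Pre-trained backbone without the ImageNet classification head.
base = EfficientNetB0(weights="imagenet", include_top=False,
                      input_shape=(64, 64, 3))  # 64 x 64 RGB patches (image size assumed)
base.trainable = False  # freeze the backbone for feature extraction

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),  # the ten land-cover classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])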
EfficientNet B0

Training/Validation Accuracy
See Fig. 5.33.
There is a steady increase in both training and validation accuracy. No over-fitting
can be observed, although early stopping comes in only after the 35th epoch due to
lack of improvement in validation categorical accuracy.

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.95 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.01 0.00
Forest 0.14 0.79 0.01 0.00 0.00 0.01 0.00 0.01 0.00 0.04
HerbaceousVegetation 0.22 0.04 0.69 0.00 0.00 0.00 0.01 0.03 0.00 0.00
Highway 0.51 0.00 0.02 0.35 0.01 0.01 0.02 0.02 0.06 0.00
Industrial 0.31 0.00 0.00 0.00 0.43 0.00 0.01 0.23 0.01 0.00
Pasture 0.47 0.03 0.05 0.01 0.00 0.38 0.04 0.03 0.00 0.00
PermanentCrop 0.38 0.00 0.10 0.00 0.01 0.01 0.43 0.05 0.00 0.00
Residential 0.22 0.00 0.03 0.00 0.00 0.00 0.00 0.74 0.00 0.00
River 0.40 0.00 0.01 0.03 0.01 0.00 0.00 0.01 0.53 0.00
SeaLake 0.11 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.85

Fig. 5.32 Confusion matrix showing ratio of hits for each of the classes for NasNetMobile model.
The Highway class shows the lowest accuracy of 35%, while the AnnualCrop class shows the
highest accuracy of 95%

Fig. 5.33 Training and validation loss and accuracy for EfficientNet B0 model

The loss is small and close to zero after the initial swings.

Precision, Recall, F-Score, Support


See Fig. 5.34.
The highest precision and recall were obtained for the Forest class (96.3% recall,
88.5% precision), followed by the Residential class (97.6% recall, 75.1% precision). The
global F2-score (not shown here) is estimated to be 73.1% (see Table 5.1).

Confusion Matrix
See Fig. 5.35.

Confusion Matrix (%)


See Fig. 5.36.

Fig. 5.34 Precision, recall, and F-score of each of the classes for EfficientNet B0 model

Fig. 5.35 Confusion matrix showing number of hits for each of the classes for EfficientNet B0
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.89 0.00 0.01 0.01 0.00 0.02 0.03 0.00 0.02 0.01
Forest 0.01 0.96 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.14 0.03 0.68 0.00 0.02 0.01 0.03 0.09 0.00 0.00
Highway 0.35 0.00 0.02 0.36 0.06 0.01 0.04 0.06 0.11 0.00
Industrial 0.08 0.00 0.00 0.00 0.81 0.00 0.01 0.10 0.00 0.00
Pasture 0.22 0.11 0.03 0.01 0.00 0.59 0.01 0.00 0.03 0.00
PermanentCrop 0.29 0.00 0.12 0.03 0.04 0.00 0.47 0.05 0.00 0.00
Residential 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.98 0.00 0.00
River 0.29 0.02 0.01 0.05 0.01 0.03 0.00 0.02 0.57 0.01
SeaLake 0.04 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.91

Fig. 5.36 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B0 model.
The Highway class shows the lowest accuracy of 36%, while the Residential class shows the highest
accuracy of 98%

EfficientNet B1

Training/Validation Accuracy
See Fig. 5.37.
There is an initial steady increase in both training and validation accuracy. Over-
fitting begins to show after the 80th epoch, where the accuracy becomes unstable and
large swings in validation accuracy are evident. The loss is small and close to zero
after the initial dip.

Precision, Recall, F-Score, Support


See Fig. 5.38.
The highest precision and recall were obtained for the Residential class (99.7% recall,
74.4% precision), followed by the Forest class (96.9% recall, 86.9% precision). The
global F2-score (not shown here) is estimated to be 77.8%.

Fig. 5.37 Training and validation loss and accuracy for EfficientNet B1 model

Fig. 5.38 Precision, recall, and F-score of each of the classes for EfficientNet B1 model

Confusion Matrix
See Fig. 5.39.

Confusion Matrix (%)


See Fig. 5.40.
EfficientNet B2

Training/Validation Accuracy
See Fig. 5.41.
A gradual increase in validation accuracy can be observed. Training terminates at
the 30th epoch due to lack of improvement in accuracy. The loss remains close to
zero after the initial dip.

Fig. 5.39 Confusion matrix showing the number of hits for each of the classes for EfficientNet B1
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.89 0.00 0.01 0.01 0.00 0.01 0.03 0.00 0.04 0.01
Forest 0.01 0.97 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00
HerbaceousVegetation 0.07 0.02 0.82 0.00 0.01 0.00 0.02 0.06 0.00 0.00
Highway 0.24 0.00 0.04 0.49 0.05 0.00 0.04 0.05 0.08 0.00
Industrial 0.04 0.00 0.00 0.00 0.83 0.00 0.00 0.13 0.00 0.00
Pasture 0.21 0.14 0.05 0.01 0.00 0.54 0.01 0.02 0.04 0.00
PermanentCrop 0.21 0.00 0.10 0.02 0.04 0.00 0.53 0.11 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.16 0.02 0.02 0.07 0.01 0.02 0.02 0.01 0.67 0.00
SeaLake 0.03 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.91

Fig. 5.40 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B1 model.
The Highway class shows the lowest accuracy of 49%, while the Residential class shows the
highest accuracy of 100%

Fig. 5.41 Training and validation loss and accuracy for EfficientNet B2 model

Precision, Recall, F-Score, Support


See Fig. 5.42.
The highest precision and recall were obtained for the Residential class (97.9% recall,
74.5% precision), followed by the Forest class (95.8% recall, 87.2% precision). The
global F2-score (not shown here) is estimated to be 73.5%.

Confusion Matrix
See Fig. 5.43.

Confusion Matrix (%)


See Fig. 5.44.
EfficientNet B3

Fig. 5.42 Precision, recall, and F-score of each of the classes for EfficientNet B2 model

Fig. 5.43 Confusion matrix showing the number of hits for each of the classes for EfficientNet B2
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.92 0.00 0.01 0.01 0.00 0.01 0.02 0.00 0.02 0.00
Forest 0.02 0.96 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.17 0.03 0.64 0.00 0.01 0.01 0.03 0.10 0.00 0.00
Highway 0.33 0.00 0.00 0.43 0.06 0.01 0.04 0.05 0.08 0.00
Industrial 0.06 0.00 0.00 0.01 0.81 0.00 0.00 0.11 0.00 0.00
Pasture 0.20 0.08 0.03 0.01 0.00 0.64 0.02 0.00 0.02 0.00
PermanentCrop 0.37 0.00 0.08 0.02 0.03 0.01 0.42 0.06 0.01 0.00
Residential 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.98 0.00 0.00
River 0.28 0.02 0.01 0.07 0.02 0.02 0.00 0.01 0.57 0.00
SeaLake 0.03 0.07 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.89

Fig. 5.44 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B2 model.
The PermanentCrop class shows the lowest accuracy of 42%, while the Residential class shows the
highest accuracy of 98%

Training/Validation Accuracy
See Fig. 5.45.
Wild swings in validation accuracy can be observed. However, the loss remains
close to zero after the initial dip.

Precision, Recall, F-Score, Support


See Fig. 5.46.

Fig. 5.45 Training and validation loss and accuracy for EfficientNet B3 model

Fig. 5.46 Precision, recall, and F-score of each of the classes for EfficientNet B3 model

The highest precision and recall were obtained for the Residential class (99.5% recall,
90.8% precision), followed by the SeaLake class (96.7% recall, 98.7% precision). The
Forest class achieves a decent performance of 95.8% recall and 97.0% precision. The
global F2-score (not shown here) is estimated to be 90.5%.

Confusion Matrix
See Fig. 5.47.

Confusion Matrix (%)


See Fig. 5.48.
EfficientNet B4

Training/Validation Accuracy
See Fig. 5.49.
There is a steady increase in both training and validation accuracy with initial
fluctuations. Over-fitting begins to show after the 80th epoch, where the accuracy becomes
unstable. The loss is small and close to zero after the initial dip.

Fig. 5.47 Confusion matrix showing the number of hits for each of the classes for EfficientNet B3
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.92 0.00 0.01 0.01 0.00 0.01 0.04 0.00 0.01 0.01
Forest 0.01 0.96 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.02 0.00 0.94 0.00 0.00 0.00 0.02 0.01 0.00 0.00
Highway 0.06 0.00 0.02 0.81 0.02 0.00 0.02 0.02 0.04 0.00
Industrial 0.01 0.00 0.00 0.00 0.92 0.00 0.01 0.06 0.00 0.00
Pasture 0.04 0.04 0.04 0.00 0.00 0.84 0.03 0.00 0.01 0.00
PermanentCrop 0.07 0.00 0.08 0.01 0.02 0.00 0.80 0.02 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00
River 0.06 0.00 0.01 0.08 0.00 0.00 0.00 0.00 0.84 0.00
SeaLake 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.97

Fig. 5.48 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B3 model.
The PermanentCrop class shows the lowest accuracy of 80%, while the Residential class shows the
highest accuracy of 99%

Fig. 5.49 Training and validation loss and accuracy for EfficientNet B4 model

Precision, Recall, F-Score, Support


See Fig. 5.50.
The highest precision and recall were obtained for the Residential class (98.5% recall,
82.0% precision), followed by the Forest class (96.8% recall, 91.8% precision). The
global F2-score (not shown here) is estimated to be 83.5%.

Confusion Matrix
See Fig. 5.51.

Fig. 5.50 Precision, recall and F-score of each of the classes for EfficientNet B4 model

Fig. 5.51 Confusion matrix showing the number of hits for each of the classes for EfficientNet B4
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.93 0.00 0.00 0.00 0.00 0.02 0.02 0.00 0.01 0.01
Forest 0.01 0.97 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.07 0.00 0.81 0.00 0.00 0.01 0.05 0.05 0.00 0.00
Highway 0.20 0.00 0.02 0.59 0.03 0.01 0.06 0.03 0.05 0.00
Industrial 0.03 0.00 0.00 0.00 0.86 0.00 0.00 0.10 0.00 0.00
Pasture 0.07 0.04 0.02 0.00 0.00 0.85 0.02 0.00 0.00 0.00
PermanentCrop 0.13 0.00 0.04 0.01 0.04 0.01 0.72 0.05 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.98 0.00 0.00
River 0.14 0.01 0.01 0.06 0.00 0.04 0.01 0.00 0.71 0.01
SeaLake 0.03 0.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.90

Fig. 5.52 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B4 model.
The Highway class shows the lowest accuracy of 59%, while the Residential class shows the highest
accuracy of 98%

Confusion Matrix (%)


See Fig. 5.52.
EfficientNet B5

Training/Validation Accuracy
See Fig. 5.53.
There is a steady increase in both training and validation accuracy with shallow
initial fluctuations. Over-fitting begins to show after the 80th epoch. The loss is small
and close to zero after the initial dip.

Precision, Recall, F-Score, Support


See Fig. 5.54.
The highest precision and recall were obtained for the Residential class (98.9% recall,
89.9% precision), followed by the Forest class (97.6% recall, 90.0% precision). The
global F2-score (not shown here) is estimated to be 88.8%.

Confusion Matrix
See Fig. 5.55.

Fig. 5.53 Training and validation loss and accuracy for EfficientNet B5 model

Fig. 5.54 Precision, recall, and F-score of each of the classes for EfficientNet B5 model

Confusion Matrix (%)


See Fig. 5.56.
EfficientNet B6

Training/Validation Accuracy
See Fig. 5.57.
There is a steady increase in both training and validation accuracy. Over-fitting
begins to show after the 80th epoch.

Fig. 5.55 Confusion matrix showing the number of hits for each of the classes for EfficientNet B5
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.93 0.00 0.00 0.00 0.00 0.01 0.04 0.00 0.01 0.00
Forest 0.01 0.98 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.03 0.01 0.89 0.00 0.00 0.01 0.04 0.02 0.00 0.00
Highway 0.08 0.00 0.02 0.78 0.02 0.01 0.04 0.01 0.04 0.00
Industrial 0.03 0.00 0.00 0.01 0.87 0.00 0.01 0.09 0.00 0.00
Pasture 0.04 0.09 0.03 0.00 0.00 0.82 0.02 0.00 0.01 0.00
PermanentCrop 0.05 0.00 0.05 0.01 0.01 0.01 0.86 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.99 0.00 0.00
River 0.08 0.01 0.02 0.06 0.00 0.01 0.00 0.00 0.82 0.00
SeaLake 0.03 0.05 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.90

Fig. 5.56 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B5 model.
The Highway class shows the lowest accuracy of 78% while the Residential class shows the highest
accuracy of 99%

Fig. 5.57 Training and validation loss and accuracy for EfficientNet B6 model

Precision, Recall, F-Score, Support


See Fig. 5.58.
The highest precision and recall were obtained for the Residential class (99.7% recall,
90.7% precision), followed by the Forest class (97.9% recall, 88.4% precision). The
global F2-score (not shown here) is estimated to be 89.3%.

Confusion Matrix
See Fig. 5.59.

Confusion Matrix (%)


See Fig. 5.60.
EfficientNet B7

Fig. 5.58 Precision, recall, and F-score of each of the classes for EfficientNet B6 model

Fig. 5.59 Confusion matrix showing the number of hits for each of the classes for EfficientNet B6
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.93 0.00 0.00 0.01 0.00 0.01 0.04 0.00 0.01 0.00
Forest 0.00 0.98 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.02 0.01 0.93 0.00 0.00 0.00 0.02 0.01 0.00 0.00
Highway 0.07 0.00 0.03 0.79 0.01 0.00 0.03 0.01 0.04 0.00
Industrial 0.01 0.00 0.00 0.01 0.90 0.00 0.01 0.08 0.00 0.00
Pasture 0.06 0.10 0.05 0.01 0.00 0.75 0.03 0.00 0.01 0.00
PermanentCrop 0.05 0.00 0.06 0.01 0.01 0.00 0.85 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.05 0.01 0.00 0.07 0.00 0.00 0.00 0.00 0.85 0.00
SeaLake 0.03 0.07 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.88

Fig. 5.60 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B6 model.
The Pasture class shows the lowest accuracy of 75%, while the Residential class shows the highest
accuracy of 100%

Training/Validation Accuracy
See Fig. 5.61.
There is a steady increase in training accuracy, and validation accuracy becomes
stable after initial fluctuations. Over-fitting begins to show after the 50th epoch. The same
trend can be observed for the loss curves.

Precision, Recall, F-Score, Support


See Fig. 5.62.
The highest precision and recall were obtained for the Residential class (99.4% recall,
91.2% precision), followed by the Forest class (99.0% recall, 92.0% precision). The
global F2-score (not shown here) is estimated to be 93.0%.

Confusion Matrix
See Fig. 5.63.

Fig. 5.61 Training and validation loss and accuracy for EfficientNet B7 model

Fig. 5.62 Precision, recall, and F-score of each of the classes for EfficientNet B7 model

Fig. 5.63 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7
model

Confusion Matrix (%)


See Fig. 5.64.
There is a progressive increase in accuracy for the EfficientNet class of models from
EfficientNet B0 to EfficientNet B7. However, the per-class accuracy varies from
model to model.
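Since the ratio matrices are row-normalized, the per-class accuracies compared throughout this section are simply their diagonal entries. A short sketch, assuming the row-normalized matrix cm_ratio from earlier and a class_names list (both hypothetical names):

import numpy as np

per_class_acc = np.diag(cm_ratio)       # diagonal = per-class accuracy
worst = int(np.argmin(per_class_acc))
best = int(np.argmax(per_class_acc))
print(f"Lowest accuracy:  {class_names[worst]} ({per_class_acc[worst]:.0%})")
print(f"Highest accuracy: {class_names[best]} ({per_class_acc[best]:.0%})")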
Model Performance Comparison and Analysis
Intra-Model type comparison (Table 5.1)
There were no drastic improvements in the ResNet models from ResNet50 to
ResNet152. The VGG16 model showed slightly better performance than VGG19.
NasNetLarge showed the overall best performance with an F2-score of 97.6%. The
EfficientNet models showed a gradual increase in accuracy from B0 to B7, with
anomalies at B2 and B3. The reasons for this need further investigation but could be
due to model initialization issues in the case of B2.

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.93 0.00 0.00 0.01 0.00 0.01 0.04 0.00 0.01 0.00
Forest 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.02 0.01 0.92 0.00 0.00 0.01 0.02 0.02 0.00 0.00
Highway 0.03 0.00 0.01 0.89 0.01 0.00 0.01 0.01 0.03 0.00
Industrial 0.01 0.00 0.00 0.00 0.94 0.00 0.00 0.05 0.00 0.00
Pasture 0.02 0.04 0.02 0.01 0.00 0.90 0.01 0.00 0.01 0.00
PermanentCrop 0.04 0.00 0.04 0.00 0.01 0.00 0.88 0.02 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00
River 0.02 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.91 0.00
SeaLake 0.01 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.93

Fig. 5.64 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The PermanentCrop class shows the lowest accuracy of 88% while the Forest and Residential classes
show the highest accuracy of 99%

Table 5.1 Comparison of accuracy of the evaluated family of models

Model           Accuracy (%)   F2-score (%)
ResNet50        96.34          96.41
ResNet101       96.42          96.62
ResNet152       95.40          95.47
VGG16           97.14          97.06
VGG19           96.84          96.78
NasNetLarge     97.39          97.61
NasNetMobile    61.41          62.96
EfficientNetB0  72.10          73.07
EfficientNetB1  76.38          77.82
EfficientNetB2  72.58          73.50
EfficientNetB3  90.00          90.47
EfficientNetB4  83.32          83.45
EfficientNetB5  88.26          88.78
EfficientNetB6  88.51          89.30
EfficientNetB7  92.77          93.02

For EfficientNet B7, it was found that by adding BatchNormalization layers after the
top Dense layers, the global mean recall could be improved from 92.8 to 96.4%
(+3.6%) and the global F2-score from 93.0 to 96.7% (+3.7%). In addition, the highest
precision and recall were obtained for the Residential class (99.8% recall, 96.5%
precision), a noticeable improvement from (99.4% recall, 91.2% precision), followed
by the Forest class (99.3% recall, 95.7% precision), up from (99.0% recall, 92.0%
precision). The results are shown below.
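The modification itself amounts to a few extra lines in the classifier head. A minimal sketch, assuming a frozen base backbone as before; the layer sizes are illustrative, with BatchNormalization inserted after each top Dense layer as described above:

from tensorflow.keras import layers, models

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1024, activation="relu"),
    layers.BatchNormalization(),   # added after the first top Dense layer
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),   # added after the second top Dense layer
    layers.Dense(10, activation="softmax"),
])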
EfficientNet B7

Training/Validation Accuracy
See Fig. 5.65.

Fig. 5.65 Training and validation loss and accuracy for EfficientNet B7 model

Precision, Recall, F-Score, Support


See Fig. 5.66.

Confusion Matrix
See Fig. 5.67.

Confusion Matrix (%)


See Fig. 5.68.

Fig. 5.66 Precision, recall, and F-score of each of the classes for EfficientNet B7 model

Fig. 5.67 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.01 0.00
Forest 0.00 0.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.97 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.03 0.02 0.00 0.00 0.94 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.03 0.00 0.01 0.00 0.94 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.96 0.00
SeaLake 0.01 0.02 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.94

Fig. 5.68 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The Pasture, PermanentCrop, and SeaLake classes show the lowest accuracy of 94%, while the
Residential class achieves the highest accuracy of 100%

In short, with some slight modifications to the model, great benefits in performance
can be achieved, as shown above for the EfficientNet B7 model. This is also
applicable to other models and illustrates the advantage of the Keras framework for
experimental modeling when quick confirmations and decisions have to be made. In
fact, we checked the effect of similar changes on the ResNet50, ResNet101, VGG16,
VGG19, and NasNetLarge models. The results are shown below.
ResNet50

Training/Validation Accuracy
See Fig. 5.69.

Precision, Recall, F-Score, Support


See Fig. 5.70.

Confusion Matrix
See Fig. 5.71.

Confusion Matrix (%)


See Fig. 5.72.

Fig. 5.69 Training and validation loss and accuracy for ResNet50 model

Fig. 5.70 Precision, recall, and F-score of each of the classes for ResNet50 model

ResNet101

Training/Validation Accuracy
See Fig. 5.73.

Precision, Recall, F-Score, Support


See Fig. 5.74.

Fig. 5.71 Confusion matrix showing number of hits for each of the classes for ResNet50 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.97 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.02 0.01 0.00 0.00 0.96 0.00 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.02 0.00 0.00 0.00 0.95 0.02 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.97

Fig. 5.72 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The
PermanentCrop class shows the lowest accuracy of 95%, while the Forest and Residential classes
can be classified with 100% accuracy

Fig. 5.73 Training and validation loss and accuracy for ResNet101 model

Fig. 5.74 Precision, recall, and F-score of each of the classes for ResNet101 model

Confusion Matrix
See Fig. 5.75.

Confusion Matrix (%)


See Fig. 5.76.
VGG16

Training/Validation Accuracy
See Fig. 5.77.

Precision, Recall, F-Score, Support


See Fig. 5.78.
Forest, Residential, and River have a recall of greater than 99%, which can
be considered state of the art. The improvements gained from applying batch
normalization can be summarized as follows.

Fig. 5.75 Confusion matrix showing number of hits for each of the classes for ResNet101 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.95 0.00 0.00 0.00 0.02 0.01 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.00 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.02 0.01 0.00 0.00 0.96 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98

Fig. 5.76 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
HerbaceousVegetation class shows the lowest accuracy of 95%, while the Forest and Residential
classes can be classified with 100% accuracy

Fig. 5.77 Training and validation loss and accuracy for VGG16 model

Before: Average Accuracy: 97.14%, Global F2-Score: 97.06%
After: Average Accuracy: 98.07% (+0.93%), Global F2-Score: 98.05% (+0.99%)

Confusion Matrix
See Fig. 5.79.

Confusion Matrix (%)


See Fig. 5.80.
VGG19

Training/Validation Accuracy
See Fig. 5.81.

Fig. 5.78 Precision, recall, and F-score of each of the classes for VGG16 model

Fig. 5.79 Confusion matrix showing number of hits for each of the classes for VGG16 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.98 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.96 0.02 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.97 0.00 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.01 0.00 0.97 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.98

Fig. 5.80 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
Highway class shows the lowest accuracy of 96%, while the Forest and Residential classes can be
classified with 100% accuracy

Fig. 5.81 Training and validation loss and accuracy for VGG19 model

Precision, Recall, F-Score, Support


See Fig. 5.82.

Confusion Matrix
See Fig. 5.83.

Confusion Matrix (%)


See Fig. 5.84.

Fig. 5.82 Precision, recall, and F-score of each of the classes for VGG19 model

Fig. 5.83 Confusion matrix showing number of hits for each of the classes for VGG19 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.00 0.97 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.01 0.01 0.00 0.00 0.97 0.00 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.01 0.00 0.97 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.98

Fig. 5.84 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop, HerbaceousVegetation, Industrial, Pasture, and PermanentCrop classes have the lowest
accuracy of 97%, while the Forest and Residential classes can be classified with 100% accuracy

NasNetLarge

Training/Validation Accuracy
See Fig. 5.85.

Precision, Recall, F-Score, Support


See Fig. 5.86.

Confusion Matrix
See Fig. 5.87.

Confusion Matrix (%)


See Fig. 5.88.
In terms of mean accuracy and F2-score, the results can be summarized as shown
in Table 5.2.
The VGG models give the best performance in terms of both accuracy and F2-
score. VGG16 achieves an accuracy of 98.07%, while VGG19 is slightly lower at 97.89%.
The same trend is reflected in the F2-scores, which are 98.05% and 98.02% for the
respective models. NasNetLarge showed a lower than expected performance.
Train–test Split 80–20 Result

Fig. 5.85 Training and validation loss and accuracy for NasNetLarge model

Fig. 5.86 Precision, recall, and F-score of each of the classes for NasNetLarge model

Changing the train–test split is one way to improve the validation accuracy. In
this case, we change the split from 70–30 to 80–20 and check some of the top-
performing models so far. The results of this change are shown below.
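A minimal sketch of how the new split can be configured with Keras generators, assuming the images are arranged in one sub-directory per class under data_dir (a hypothetical path); model-specific preprocessing is omitted for brevity:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hold out 20% of the images for validation (the 80-20 split).
datagen = ImageDataGenerator(validation_split=0.2)

train_gen = datagen.flow_from_directory(
    data_dir, target_size=(64, 64), batch_size=32,
    class_mode="categorical", subset="training")

val_gen = datagen.flow_from_directory(
    data_dir, target_size=(64, 64), batch_size=32,
    class_mode="categorical", subset="validation")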
ResNet50

Training/Validation Accuracy
See Fig. 5.89.

Fig. 5.87 Confusion matrix showing number of hits for each of the classes for NasNetLarge model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.98 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.02 0.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.96 0.01 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.03 0.00 0.00
Pasture 0.01 0.02 0.02 0.00 0.00 0.95 0.01 0.00 0.00 0.00
PermanentCrop 0.02 0.00 0.02 0.00 0.00 0.00 0.94 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.98

Fig. 5.88 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The PermanentCrop class shows the lowest accuracy of 94%, while the Forest and Residential
classes can be classified with 100% accuracy

Table 5.2 Comparison of the accuracy of top models of each family

Model           Accuracy (%)   F2-score (%)
ResNet50        97.48          97.57
ResNet101       97.57          97.64
VGG16           98.07          98.05
VGG19           97.89          98.02
NasNetLarge     97.21          97.45
EfficientNetB7  96.44          96.68

Precision, Recall, F-Score, Support


See Fig. 5.90.

Confusion Matrix
See Fig. 5.91.

Confusion Matrix (%)


See Fig. 5.92.
ResNet101

Fig. 5.89 Training and validation loss and accuracy for ResNet50 model

Fig. 5.90 Precision, recall, and F-score of each of the classes for ResNet50 model

Training/Validation Accuracy
See Fig. 5.93.

Precision, Recall, F-Score, Support


See Fig. 5.94.

Confusion Matrix
See Fig. 5.95.

Fig. 5.91 Confusion matrix showing number of hits for each of the classes for ResNet50 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.98 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.95 0.00 0.00 0.05 0.00 0.00
Pasture 0.01 0.02 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98

Fig. 5.92 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model.
The Industrial class shows the lowest accuracy of 95%, while the Forest and Residential classes
can be classified with 100% accuracy

Fig. 5.93 Training and validation loss and accuracy for ResNet101 model

Fig. 5.94 Precision, recall, and F-score of each of the classes for ResNet101 model

Fig. 5.95 Confusion matrix showing number of hits for each of the classes for ResNet101 model

Confusion Matrix (%)


See Fig. 5.96.
VGG16

Training/Validation Accuracy
See Fig. 5.97.

Precision, Recall, F-Score, Support


See Fig. 5.98.

Confusion Matrix
See Fig. 5.99.

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.98 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.94 0.00 0.00 0.00 0.00 0.04 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.99

Fig. 5.96 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The
Highway class shows the lowest accuracy of 94%, while the Forest and Residential classes can be
classified with 100% accuracy

Fig. 5.97 Training and validation loss and accuracy for VGG16 model

Confusion Matrix (%)


See Fig. 5.100.
Retraining the above VGG16 model using the previous weights as input does not
result in notable improvements in accuracy. The results are shown below.

Training/Validation Accuracy
See Fig. 5.101.

Precision, Recall, F-Score, Support


See Fig. 5.102.

Confusion Matrix
See Fig. 5.103.

Fig. 5.98 Precision, recall, and F-score of each of the classes for VGG16 model

Fig. 5.99 Confusion matrix showing number of hits for each of the classes for VGG16 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.98 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.96 0.01 0.00 0.00 0.01 0.02 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.00 0.02 0.01 0.00 0.00 0.96 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.99

Fig. 5.100 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
AnnualCrop, Highway, and Pasture classes show the lowest accuracy of 96%, while the Forest and
Residential classes can be classified with 100% accuracy

Fig. 5.101 Training and validation loss and accuracy for VGG16 model

Fig. 5.102 Precision, recall, and F-score of each of the classes for VGG16 model

Confusion Matrix (%)


See Fig. 5.104.
VGG19

Training/Validation Accuracy
See Fig. 5.105.

Fig. 5.103 Confusion matrix showing number of hits for each of the classes for VGG16 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.98 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00
Pasture 0.00 0.01 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.01 0.00 0.95 0.02 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.99

Fig. 5.104 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The
PermanentCrop class shows the lowest accuracy of 95% while the Forest and Residential classes
can be classified with 100% accuracy

Fig. 5.105 Training and validation loss and accuracy for VGG19 model

Precision, Recall, F-Score, Support


See Fig. 5.106.

Confusion Matrix
See Fig. 5.107.

Confusion Matrix (%)


See Fig. 5.108.
VGG19 Test2—rerun
A rerun of the model, as with VGG16, is performed to check whether any improvements
in accuracy can be obtained; the weights from the previous training are re-used, as
sketched below. Again, no significant improvements could be obtained from this approach. The
reason could be that no additional learning is possible with the current parameter set.
The results are shown below.
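A minimal sketch of such a rerun, assuming the first run saved its weights to a checkpoint file (vgg19_best.h5 is a hypothetical name) and re-using the generators and early-stopping callback from earlier:

# Warm-start from the weights of the previous run and continue training.
model.load_weights("vgg19_best.h5")  # hypothetical checkpoint file

history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=50,                # illustrative budget for the rerun
    callbacks=[early_stop],
)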

Fig. 5.106 Precision, recall, and F-score of each of the classes for VGG19 model

Fig. 5.107 Confusion matrix showing number of hits for each of the classes for VGG19 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.94 0.00 0.00 0.01 0.00 0.01 0.04 0.00 0.01 0.00
Forest 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.00 0.94 0.00 0.00 0.01 0.03 0.01 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.01 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.01 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.96

Fig. 5.108 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop and HerbaceousVegetation classes show the lowest accuracy of 94%, while the
Residential class can be classified with 100% accuracy

Training/Validation Accuracy
See Fig. 5.109.

Precision, Recall, F-Score, Support


See Fig. 5.110.

Confusion Matrix
See Fig. 5.111.

Confusion Matrix (%)


See Fig. 5.112.
NasNetLarge

Training/Validation Accuracy
See Fig. 5.113.

Fig. 5.109 Training and validation loss and accuracy for VGG19 model

Fig. 5.110 Precision, recall, and F-score of each of the classes for VGG19 model

Fig. 5.111 Confusion matrix showing number of hits for each of the classes for VGG19 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.00 0.03 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.96 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.98

Fig. 5.112 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The
AnnualCrop and Highway classes show the lowest accuracy of 96%, while the Forest and Residential
classes can be classified with 100% accuracy

Fig. 5.113 Training and validation loss and accuracy for NasNetLarge model

Precision, Recall, F-Score, Support


See Fig. 5.114.

Confusion Matrix
See Fig. 5.115.

Fig. 5.114 Precision, recall, and F-score of each of the classes for NasNetLarge model

Fig. 5.115 Confusion matrix showing number of hits for each of the classes for NasNetLarge
model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.98 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00 0.02 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.00 0.03 0.02 0.00 0.00 0.94 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.97 0.00
SeaLake 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.97

Fig. 5.116 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model.
The Pasture class shows the lowest accuracy of 94%, while the Forest and Residential classes can
be classified with 100% accuracy

Confusion Matrix (%)


See Fig. 5.116.
EfficientNetB7

Training/Validation Accuracy
See Fig. 5.117.

Precision, Recall, F-Score, Support


See Fig. 5.118.

Confusion Matrix
See Fig. 5.119.

Confusion Matrix (%)


See Fig. 5.120.
Table 5.3 summarizes the results of this subsection on the train–test split.
There is a general improvement in accuracy and F2-score compared to the
70–30 split. Specifically, the VGG family of models showed the best performance,
with VGG16 being slightly better than VGG19.
Weight Regularization

Fig. 5.117 Training and validation loss and accuracy for EfficientNet B7 model

Fig. 5.118 Precision, recall, and F-score of each of the classes for EfficientNet B7 model

To reduce overfitting, one of the strategies that can be used is L2 weight regularization.
Three kinds of regularization exist: kernel regularization, where a penalty
is applied on the layer's kernel; bias regularization, where a penalty is applied on
the layer's bias; and activity regularization, where a penalty is applied on the layer's
output. Kernel regularization is the most commonly used, but here we will check how
L2 kernel regularization and L2 activity regularization affect the performance, using
VGG16 as an example.
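In Keras, the three kinds of regularization correspond to three keyword arguments of a layer. A minimal sketch of the two variants compared below, applied to a 2048-unit Dense layer; the L2 factor is illustrative, as the constant used in these experiments is not stated:

from tensorflow.keras import layers, regularizers

# L2 kernel regularization: penalty on the layer's weights.
dense_kernel_reg = layers.Dense(
    2048, activation="relu",
    kernel_regularizer=regularizers.l2(1e-4))    # illustrative factor

# L2 activity regularization: penalty on the layer's output.
dense_activity_reg = layers.Dense(
    2048, activation="relu",
    activity_regularizer=regularizers.l2(1e-4))

# bias_regularizer would penalize the layer's bias in the same way.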

Fig. 5.119 Confusion matrix showing the number of hits for each of the classes for EfficientNet
B7 model

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.97 0.00 0.00 0.00 0.01 0.01 0.00 0.00
Highway 0.01 0.00 0.00 0.95 0.01 0.00 0.01 0.02 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.04 0.00 0.00
Pasture 0.01 0.01 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.00 0.00 0.96 0.02 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98

Fig. 5.120 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model.
The Highway class shows the lowest accuracy of 95%, while the Forest and Residential classes
show the highest accuracy of 100%

Table 5.3 Comparison of results for the two splits of data

                70–30 Split                   80–20 Split
Model           Accuracy (%)   F2-score (%)   Accuracy (%)   F2-score (%)
ResNet50        97.48          97.57          97.81          97.80
ResNet101       97.57          97.64          98.01          98.00
VGG16           98.07          98.05          98.18          98.09
VGG19           97.89          98.02          98.14          98.06
NasNetLarge     97.21          97.45          97.65          97.65
EfficientNetB7  96.44          96.68          97.45          97.43

Kernel Regularization (2048 units)

Training/Validation Accuracy
See Fig. 5.121.

Precision, Recall, F-Score, Support


See Fig. 5.122.

Fig. 5.121 Training and validation loss and accuracy after applying kernel regularization to VGG16
model with a capacity of 2048 units

Fig. 5.122 Precision, recall, and F-score of each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units

Confusion Matrix
See Fig. 5.123.

Confusion Matrix (%)


See Fig. 5.124.
Activity Regularization (2048 units)

Fig. 5.123 Confusion matrix showing number of hits for each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.03 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.98 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture 0.00 0.00 0.01 0.00 0.00 0.99 0.00 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.97

Fig. 5.124 Confusion matrix showing ratio of hits for each of the classes after applying kernel
regularization to VGG16 model with a capacity of 2048 units. The AnnualCrop, Industrial, and
SeaLake classes show the lowest accuracy of 97% while the Forest and Residential classes can be
classified with 100% accuracy

Training/Validation Accuracy
See Fig. 5.125.

Precision, Recall, F-Score, Support


See Fig. 5.126.

Confusion Matrix
See Fig. 5.127.

Confusion Matrix (%)


See Fig. 5.128.
Activity regularization seems to provide better results than kernel regularization.
We will therefore consider improvements by reducing the network capacity with
Activity regularization.
Activity Regularization (1024 units)

Training/Validation Accuracy
See Fig. 5.129.

Fig. 5.125 Training and validation loss and accuracy after applying activity regularization to
VGG16 model with a capacity of 2048 units

Fig. 5.126 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units

Precision, Recall, F-Score, Support


See Fig. 5.130.

Confusion Matrix
See Fig. 5.131.

Fig. 5.127 Confusion matrix showing number of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.99 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.00 0.96 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.128 Confusion matrix showing ratio of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 2048 units. The HerbaceousVegetation class
shows the lowest accuracy of 96%, while the Forest and Residential classes can be classified with
100% accuracy

Fig. 5.129 Training and validation loss and accuracy after applying Activity regularization to
VGG16 model with a capacity of 1024 units

Fig. 5.130 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units

Fig. 5.131 Confusion matrix showing number of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units

Confusion Matrix (%)


See Fig. 5.132.
A global accuracy of 98.27% is achieved when the capacity is reduced to 1024 units.
In this case, there is no benefit in increasing the network capacity to 2048 units, since the
same accuracy is obtained. There is a 1% increase in the accuracy of the HerbaceousVegetation
class, from 96 to 97%. Additionally, all the other nine classes have an accuracy of 98%
and above.
Activity Regularization (512 units)

Training/Validation Accuracy
See Fig. 5.133.

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.98 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.97 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.01 0.01 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98

Fig. 5.132 Confusion matrix showing ratio of hits for each of the classes after applying activity
regularization to VGG16 model with a capacity of 1024 units. The HerbaceousVegetation class shows
the lowest accuracy of 97%, while the Forest and Residential classes can be classified with 100%
accuracy

Fig. 5.133 Training and validation loss and accuracy after applying Activity regularization to
VGG16 model with a capacity of 512 units

Precision, Recall, F-Score, Support


See Fig. 5.134.

Confusion Matrix
See Fig. 5.135.

Confusion Matrix (%)


See Fig. 5.136.
The overall accuracy reduces to 98.17%, compared to the 98.27% obtained for
1024 units. Therefore, there is no merit in reducing the capacity in this particular
case.
Comparing L2 kernel regularization with L2 activity regularization, we see that
activity regularization tends to give better performance in terms of both combating
overfitting and accuracy.

Fig. 5.134 Precision, recall, and F-score of each of the classes after applying activity
regularization to VGG16 model with a capacity of 512 units

Fig. 5.135 Confusion matrix showing number of hits for each of the classes after applying Activity
regularization to VGG16 model with a capacity of 512 units

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.96 0.00 0.00 0.00 0.01 0.02 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00
Pasture 0.00 0.00 0.00 0.00 0.00 0.99 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.136 Confusion matrix showing ratio of hits for each of the classes after applying Activity
regularization to VGG16 model with a capacity of 512 units. The Highway class shows the lowest
accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy

Comparing L2 kernel regularization with L2 activity regularization, we see that
activity regularization tends to give better performance in terms of both combating
overfitting and accuracy. It can also be seen that in some cases the validation loss
is lower than the training loss. This can be due to the fact that regularization is
applied only during training and not during validation. Another commonly given reason
is that the training loss is evaluated continuously during the epoch, while the validation
loss is evaluated at the end of the epoch, which shifts the loss curves in time by about
half an epoch. Some strategies to avoid being too conservative during training are
lowering the regularization constant, reducing the dropout rate, and increasing the
model capacity. In our case, a model capacity of 1024 units seems to be the best so far,
achieving an accuracy of greater than 98% for 9 out of the 10 classes and also showing
resistance to overfitting.
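
For reference, the two variants compared above differ only in where the L2 penalty is attached in the classification head. A minimal Keras sketch follows; the layer size and the 1e-4 constant match the settings used in this section, while everything else is illustrative:

from tensorflow.keras import layers, regularizers

# L2 kernel regularization: penalizes large weight values
dense_kernel_reg = layers.Dense(
    1024, activation="relu",
    kernel_regularizer=regularizers.l2(1e-4))

# L2 activity regularization: penalizes large layer outputs (activations)
dense_activity_reg = layers.Dense(
    1024, activation="relu",
    activity_regularizer=regularizers.l2(1e-4))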
Dropout
In this section we take a look at what happens if we vary the dropout rate for a
given network capacity of 1024 units, using the VGG16 model as base. We evaluate
dropout rates of 0.2, 0.3, 0.4, and 0.5. In practice, dropout rates between 0.2 and
0.5 are recommended.
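
In Keras, the dropout rate is the fraction of a layer's units randomly zeroed out during training only. A sketch of how the rate can be exposed as a parameter of the classification head is shown below; the helper name build_head and the exact layer ordering are our assumptions, not code from the companion notebook:

from tensorflow.keras import layers, models

def build_head(base_model, units=1024, dropout_rate=0.2, num_classes=10):
    """Illustrative classification head on a frozen convolutional base."""
    x = layers.GlobalAveragePooling2D()(base_model.output)
    x = layers.Dense(units, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout_rate)(x)  # active in training, bypassed at inference
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base_model.input, outputs)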
Dropout Rate 0.2

Training/Validation Accuracy
See Fig. 5.137.

Precision, Recall, F-Score, Support


See Fig. 5.138.

Confusion Matrix
See Fig. 5.139.

Fig. 5.137 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.2

Fig. 5.138 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.2

Fig. 5.139 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.2

Confusion Matrix (%)


See Fig. 5.140.
For the dropout rate of 0.2, the overall accuracy is about 98.28%. In terms of
classification performance for the resulting model, HerbaceousVegetation, Highway,
and PermanentCrop have a recall of less than 98%.
Dropout Rate 0.3

Training/Validation Accuracy
See Fig. 5.141.

Precision, Recall, F-Score, Support


See Fig. 5.142.

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.96 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.00 0.01 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.01 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.140 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.2. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Forest and Residential classes can be classified with 100% accuracy

Fig. 5.141 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.3

Confusion Matrix
See Fig. 5.143.

Confusion Matrix (%)


See Fig. 5.144.
For the dropout rate of 0.3, the overall accuracy is about 98.13%. In terms of clas-
sification performance for the resulting model, AnnualCrop, HerbaceousVegetation,
and Highway have a recall of less than 98%.
Dropout Rate 0.4

Training/Validation Accuracy
See Fig. 5.145.

Fig. 5.142 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.3

Fig. 5.143 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.3

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.96 0.00 0.00 0.01 0.03 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.97 0.01 0.00 0.01 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00
Pasture 0.00 0.01 0.00 0.00 0.00 0.99 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.144 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.3. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Residential class can be classified with 100% accuracy

Fig. 5.145 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.4

Precision, Recall, F-Score, Support


See Fig. 5.146.

Confusion Matrix
See Fig. 5.147.

Fig. 5.146 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.4

Fig. 5.147 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.4

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.96 0.00 0.00 0.00 0.00 0.00 0.03 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.01 0.96 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.96 0.00 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00
Pasture 0.00 0.02 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.96

Fig. 5.148 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.4. The AnnualCrop, HerbaceousVegetation, Highway, and SeaLake
classes show the lowest accuracy of 96% while the Forest and Residential classes can be classified
with 100% accuracy

Confusion Matrix (%)


See Fig. 5.148.
For the dropout rate of 0.4, the overall accuracy is about 97.74%. In terms of clas-
sification performance for the resulting model, AnnualCrop, HerbaceousVegetation,
Highway, and SeaLake have a recall of less than 98%.
Dropout Rate 0.5

Training/Validation Accuracy
See Fig. 5.149.

Precision, Recall, F-Score, Support


See Fig. 5.150.

Confusion Matrix
See Fig. 5.151.

Confusion Matrix (%)


See Fig. 5.152.

Fig. 5.149 Training and validation loss and accuracy of the VGG16 model with a dropout rate of
0.5

Fig. 5.150 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout
rate of 0.5

For the dropout rate of 0.5, the overall accuracy is about 98.17%. In terms of clas-
sification performance for the resulting model, AnnualCrop, HerbaceousVegetation,
Pasture, and PermanentCrop have a recall of less than 98%.
We found that for this particular dataset there was no marked improvement in
model accuracy as the dropout rate was increased from 0.2 to 0.5. A dropout rate
of 0.2 would still be sufficient to achieve a decent accuracy of 98.28%.

Fig. 5.151 Confusion matrix showing number of hits for each of the classes using the VGG16
model as base with a dropout rate of 0.5

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.01 0.01 0.96 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.00 0.00 0.97 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.99

Fig. 5.152 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model
as base with a dropout rate of 0.5. The HerbaceousVegetation class shows the lowest accuracy of
96% while the Forest and Residential classes can be classified with 100% accuracy

This does not, however, imply that there is no merit in investigating dropout as part of a
broader strategy to improve validation accuracy by algorithm tuning.
Network Capacity Reduction
Another strategy to reduce overfitting is network capacity reduction. We check the
effect of reducing the network capacity from 2048 units, which is the default setting
in all the above simulations, to 1024 and 512 units, respectively. As in the above cases,
we evaluate the performance in terms of accuracy and F2-score, using the VGG16
model as an example.
Capacity reduced from 2048 to 1024 units
The network capacity can be easily obtained from the model summary. In our
case, we use vgg16_model.summary() to get this information, since we defined
vgg16_model as the model name. The results for a network capacity of 1024 units
compared with 2048 units are shown in Table 5.4.
As can be seen from the numbers, the number of trainable parameters is reduced by
slightly more than half, from over 8 million to about 3 million. The following
figures show the effect of the capacity reduction.
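
A hedged sketch of the kind of model this refers to is given below; only the vgg16_model name and the summary() call come from the text, while the head layout and the frozen base are illustrative assumptions. The summary printout ends with the total, trainable, and non-trainable parameter counts compared in Table 5.4:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
base.trainable = False  # freeze the convolutional base; only the head trains

units = 1024  # reduced from the default 2048 used in the earlier experiments
x = layers.Flatten()(base.output)
x = layers.Dense(units, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)
vgg16_model = models.Model(base.input, outputs)

vgg16_model.summary()  # prints per-layer shapes and the parameter totals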

Training/Validation Accuracy
See Fig. 5.153.

Table 5.4 Comparison of parameters for 2048 and 1024 units of the VGG16 model

Parameters                2048 units    1024 units
Total parameters          23,144,266    17,880,906
Trainable parameters      8,421,386     3,162,122
Non-trainable parameters  14,722,880    14,718,784

Fig. 5.153 Training and validation loss and accuracy of VGG16 model with a reduced capacity of
1024 units

Precision, Recall, F-Score, Support


See Fig. 5.154.

Confusion Matrix
See Fig. 5.155.

Confusion Matrix (%)


See Fig. 5.156.
In summary, it can be observed that, in comparison with the case in which we applied
weight regularization without network capacity reduction, there was a slight decrease
in performance. With regularization only, an accuracy and F2-score of 98.15% were
obtained. With the reduced capacity, an accuracy and F2-score of 98.11% were obtained,
which translates to a decrease of 0.04%. It can be said that there is hardly a noticeable
impact on performance in this case. When a large amount of training data is available,
it can be beneficial to reduce the capacity to overcome overfitting, as shown by the
reduced difference between the training and validation loss up to about the 20th epoch.
Capacity reduced to 512 units

Fig. 5.154 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced
capacity of 1024 units

Fig. 5.155 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a reduced capacity of 1024 units

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.01 0.01 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.01 0.00 0.01 0.00 0.01 0.00 0.96 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.156 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a
reduced capacity of 1024 units. The PermanentCrop class shows the lowest accuracy of 96% while
the Forest and Residential classes can be classified with 100% accuracy

Using 2048 units as the base for comparison, we can see a considerable reduction
of the trainable parameters to less than one quarter, as shown in Table 5.5.
In terms of numbers, this translates to about 1.3 million parameters compared
with slightly over 8.4 million. The following figures show the effect of the capacity
reduction.

Training/Validation Accuracy
See Fig. 5.157.

Precision, Recall, F-Score, Support


See Fig. 5.158.

Confusion Matrix
See Fig. 5.159.

Confusion Matrix (%)


See Fig. 5.160.

Table 5.5 Comparison of parameters for 2048 and 512 units of the VGG16 model

Parameters                2048 units    512 units
Total parameters          23,144,266    16,035,658
Trainable parameters      8,421,386     1,318,922
Non-trainable parameters  14,722,880    14,716,736

Fig. 5.157 Training and validation loss and accuracy of VGG16 model with a reduced capacity of
512 units

Fig. 5.158 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced
capacity of 512 units

Fig. 5.159 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a reduced capacity of 512 units

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.96 0.02 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.00 0.01 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.00 0.00 0.01 0.00 0.98 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.99

Fig. 5.160 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a reduced capacity of 512 units. The Highway class shows the lowest accuracy of 96% while the
Forest and Residential classes can be classified with 100% accuracy

As with the case of reduction to 1024 units, there is a slight loss in performance and
improved robustness to overfitting when 512 units are used, as shown in the training/
validation loss graph above. In this case, an accuracy and F2-score of 98.13% were
obtained after network capacity reduction, which is comparable to 98.15% for 2048
units. This translates to a decrease of 0.02%, which we think is acceptable in
most practical situations.
The results for regularization are summarized in Table 5.6. Although the model
showed increased resistance to overfitting, the price to pay was a slight decrease in
accuracy. This is common in many machine learning and deep learning scenarios,
where a trade-off of some sort has to be made; more training data is always better
to have. The results also show that doubling the network capacity from 1024 to
2048 units did not give the additional benefit of increased accuracy.
Effect of Batch Size
Up to this point we have set the batch size for training/validation to 128. Adjusting
the batch size can have an impact on the accuracy of the resulting model. For some
easily trainable data, like the standard MNIST dataset, reducing the batch size may
lead to improved performance. However, there is no general rule on the impact of
batch size, as the effect can depend on the complexity of the problem being modeled.
This means that it is necessary to try a couple of batch sizes to see how much they
affect the performance of the output model. In general, a batch size of 32 is a good
starting point when using Keras, and it is advisable to also try other sizes like 64, 128,
and 256. Choosing batch sizes that are powers of 2 is recommended when using GPUs
for processing, in order to exploit parallel execution. We changed the batch size to 32, 64,
and 256 and performed the training using the VGG16 model as the base with 1024
units, an L2 activity regularization constant of 1e-4, dropout of 0.2, and an early stopping
patience of 30. The number of epochs was set to 200. The results are shown below.
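
The batch size is typically fixed where the data generators are created. A minimal sketch is shown here; the directory path, rescaling, and validation split are placeholder assumptions, not the notebook's exact settings:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 64  # best-performing value in this experiment; also try 32, 128, 256

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
train_gen = datagen.flow_from_directory(
    "EuroSAT/", target_size=(64, 64), batch_size=batch_size,
    class_mode="categorical", subset="training")
val_gen = datagen.flow_from_directory(
    "EuroSAT/", target_size=(64, 64), batch_size=batch_size,
    class_mode="categorical", subset="validation")

# With a fixed dataset size, halving the batch size doubles the steps per epoch:
# model.fit(train_gen, validation_data=val_gen, epochs=200)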
Batch size 32

Training/Validation Accuracy
See Fig. 5.161.

Table 5.6 Comparison of accuracy with various regularization kernels and network sizes for the
VGG16 model
Regularization method Accuracy (%) F2-score (%)
No Regularization (2048 units) 98.18 98.08
L2 Kernel Regularization (2048 units) 98.15 98.15
L2 Kernel Reg + Network size 1024 98.11 98.11
L2 Kernel Reg + Network size 512 98.13 98.13
L2 Activity Regularization (2048 units) 98.28 98.28
L2 Activity Reg + Network size 1024 98.28 98.28
L2 Activity Reg + Network size 512 98.17 98.17

Fig. 5.161 Training and validation loss and accuracy of VGG16 model with a batch size of 32

Precision, Recall, F-Score, Support


See Fig. 5.162.

Confusion Matrix
See Fig. 5.163.

Confusion Matrix (%)


See Fig. 5.164.

Fig. 5.162 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 32

Fig. 5.163 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 32

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.02 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.01 0.00 0.00
Pasture 0.01 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.00 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.164 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 32. The AnnualCrop, HerbaceousVegetation, Pasture and PermanentCrop classes
show the lowest accuracy of 97% while the Forest and Residential classes can be classified with
100% accuracy

Batch size 64

Training/Validation Accuracy
See Fig. 5.165.

Precision, Recall, F-Score, Support


See Fig. 5.166.

Confusion Matrix
See Fig. 5.167.

Confusion Matrix (%)


See Fig. 5.168.
Batch size 256

Training/Validation Accuracy
See Fig. 5.169.

Fig. 5.165 Training and validation loss and accuracy of VGG16 model with a batch size of 64

Fig. 5.166 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 64

Precision, Recall, F-Score, Support


See Fig. 5.170.

Confusion Matrix
See Fig. 5.171.

Confusion Matrix (%)


See Fig. 5.172.

Fig. 5.167 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 64

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.97 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.98 0.00 0.00 0.02 0.01 0.00 0.00 0.00
Highway 0.00 0.00 0.00 0.97 0.01 0.00 0.01 0.00 0.00 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.00 0.00 0.98 0.00 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

Fig. 5.168 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 64. The AnnualCrop and Highway classes show the lowest accuracy of 97% while
the Forest, Residential, and SeaLake classes can be classified with 100% accuracy. The rest of the
classes are above 98% accuracy

Fig. 5.169 Training and validation loss and accuracy of VGG16 model with a batch size of 256

Fig. 5.170 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size
of 256

Fig. 5.171 Confusion matrix showing number of hits for each of the classes of VGG16 model with
a batch size of 256

Confusion Matrix (%)


AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.98 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.01 0.00
Forest 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation 0.00 0.00 0.97 0.00 0.00 0.01 0.01 0.00 0.00 0.00
Highway 0.01 0.00 0.00 0.96 0.01 0.00 0.00 0.01 0.01 0.00
Industrial 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.02 0.00 0.00
Pasture 0.01 0.00 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop 0.00 0.00 0.01 0.00 0.01 0.00 0.97 0.01 0.00 0.00
Residential 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00
SeaLake 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

Fig. 5.172 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with
a batch size of 256. The Highway class shows the lowest accuracy of 96%, while the Forest and
Residential classes can be classified with 100% accuracy

Table 5.7 Impact of batch size on accuracy when using 1024 units with the VGG16 model as base

Batch size    Accuracy (%)
32            98.31
64            98.46
128           98.28
256           98.15

Table 5.7 shows that for a batch size of 64, a state-of-the-art accuracy of 98.46%
was achieved. Further reducing the batch size to 32 gave an accuracy of 98.31%. On
the other hand, increasing the batch size to 256 resulted in an accuracy of 98.15%. In
summary, as the batch size increases, accuracy was observed to decrease for sizes
of 64 and above. When a fixed training data sample size is available, a reduced batch
size will lead to an increase in the number of steps per epoch and therefore in training
time, depending on the available computation resources. In our case, a batch size of
64 was experimentally determined to be the best to employ.
Intermodel type comparison
Using recall and F2-score as performance metrics, with a train–test split of 70–30,
NasNetLarge gave the best performance with an accuracy of 97.4% and F2-score
of 97.6%, followed by VGG16 (mean recall 97.14%, F2-score 97.1%), ResNet101
(mean recall 96.4%, F2-score 96.6%), and EfficientNetB7 (mean recall 92.8%, F2-
score 93.0%), in that order. We explored the train–test split ratio of 80–20 and found
that VGG16 gave the best results with an accuracy of 98.18% and F2-score of 98.06%.
The above evaluation was performed with a batch size of 128. One of the known
strategies to fight overfitting is regularization. We investigated the effects of weight
regularization, network capacity reduction, and dropout. It was found that there
was minor degradation in performance with better resistance to overfitting for
the VGG16 model. In fact, regularization can produce meaningful results and a stable
validation loss. Specifically, L2 activity regularization produced a peak accuracy
of 98.28%. An investigation into the impact of batch size resulted in the final best
performance of 98.46% for the VGG16 model; this was with a batch size of 64.
There were varying degradations in accuracy with higher and lower batch sizes. It is
generally recommended to fix the batch size throughout model evaluations and also
to choose a value that is a power of 2 in order to exploit computation optimizations
in some GPU implementations. It should be possible to further increase the accuracy
of the models by further tuning them and also by acquiring more training data.
Based on the preceding evaluation, we can summarize the best model
hyperparameters as shown in Table 5.8.

Table 5.8 Setting of best model hyperparameters with the VGG16 model as base model

Parameter           Setting
Units               1024
Batch size          64
Dropout             0.2
Regularizer         activity_regularizer, l2(1e-4)
Normalization       BatchNormalization
EarlyStopping       monitor = 'val_categorical_accuracy', patience = 30, restore_best_weights = True, mode = 'max'
Learning rate       1e-4 (Adam optimizer)
Epochs              200
ReduceLROnPlateau   monitor = 'val_categorical_accuracy', factor = 0.5, patience = 5, min_lr = 0.00001
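
The settings in Table 5.8 translate directly into Keras objects. A sketch assembled from the table is given below; the generator names train_gen and val_gen are placeholders:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

callbacks = [
    EarlyStopping(monitor="val_categorical_accuracy", patience=30,
                  restore_best_weights=True, mode="max"),
    ReduceLROnPlateau(monitor="val_categorical_accuracy", factor=0.5,
                      patience=5, min_lr=0.00001),
]

# model.compile(optimizer=Adam(learning_rate=1e-4),
#               loss="categorical_crossentropy",
#               metrics=["categorical_accuracy"])
# model.fit(train_gen, validation_data=val_gen, epochs=200, callbacks=callbacks)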

5.5 Application of EuroSAT Results to Uncorrelated Dataset

We applied the above model to separately acquired data. This dataset is also Sentinel-2
data and covers the areas surrounding Gweru city in Zimbabwe [7].
Gweru is a small city characterized by a dry, cool winter season from May to
July, a hot, dry period from August to early November, and a warm, rainy period from
early November to April. The hottest month is October, while the coldest is July.
Temperatures range from an average of 21 °C in July to 30 °C in October, while
the annual rainfall is about 684 mm. In this chapter, only median post-rainy-season
Sentinel-2 imagery will be used for land cover classification. Although the median
post-rainy Sentinel-2 imagery (April–June 2020) comprises 13 spectral bands with
spatial resolutions that range between 10 and 20 m, we will only use the RGB bands,
in a similar fashion to the EuroSAT dataset. It has already been shown in [8] that
the RGB bands give the highest accuracy when deep learning algorithms are considered.
As preparation, the original GeoTIFF data is converted into 64 × 64 patches for
processing by the deep learning algorithm. Since we have already confirmed that the
VGG16 is the best performing model on the EuroSAT dataset, we will evaluate only
this model on the Gweru dataset. It is obvious from the location information that the
data is completely uncorrelated with the EuroSAT data.

5.5.1 Evaluation of 10 Classes with Best EuroSAT Weights

Please refer to the companion notebook (zimsat-projectbook-blg.ipynb) to get a
better insight into the nature of the data and also as part of the hands-on experience [6].

Below is the class distribution of the Gweru dataset.


See Fig. 5.173.
Using the best model (vgg16_eurosat8breg_act_batch64.h5) from the EuroSAT
training data, the following results are obtained.
See Fig. 5.174.
See Fig. 5.175.
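
A hedged sketch of this evaluation step is shown below; the model file name comes from the text, while the Gweru directory path and the generator settings are assumptions:

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = load_model("vgg16_eurosat8breg_act_batch64.h5")

# shuffle=False keeps predictions aligned with test_gen.classes
test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "gweru_patches/", target_size=(64, 64), batch_size=64,
    class_mode="categorical", shuffle=False)

results = model.evaluate(test_gen)  # [loss, metrics...] as compiled
print(dict(zip(model.metrics_names, results)))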
The accuracy barely exceeds 20%, and some classes like Forest and Pasture
cannot be correctly classified at all. The Gweru test dataset consisted of only 1648
images, which makes it difficult to judge whether the same trend would hold with a
larger dataset, although the expectation is that dataset size should not be a factor in
testing. We note that EuroSAT was evaluated with 5400 test images and achieved an
accuracy of 98.46% (see the figure below). This observation reflects the well-known
fact that high validation accuracy does not always translate to high accuracy when
the model is exposed to completely unseen data. So, what to do next in this situation?
Results Summary:
GweruData: model = vgg16_eurosat8breg_act_batch64.h5
1648 images belonging to 10 classes.
Accuracy: 0.20449029126213591
Global F2 Score: 0.20449029126213591
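
The Global F2 Score in these summaries can be reproduced from the predictions with scikit-learn; the sketch below reuses the model and test_gen names from the evaluation sketch above. Note that the micro-averaged F-beta score equals accuracy for single-label multiclass data, which is why the two numbers coincide:

import numpy as np
from sklearn.metrics import fbeta_score

y_pred = np.argmax(model.predict(test_gen), axis=1)
f2 = fbeta_score(test_gen.classes, y_pred, beta=2, average="micro")
print("Global F2 Score:", f2)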

Recap of results from the EuroSAT dataset with the same model.
See Fig. 5.176.
Result Summary:
5400 images belonging to 10 classes.
Accuracy: 0.9846296296296296
Global F2 Score: 0.9846296296296296

See Fig. 5.177.

Fig. 5.173 Class distribution of original Gweru data



Fig. 5.174 PRF results for the Gweru dataset

AnnualCrop Forest HerbaceousVegetation Highway Industrial Pasture PermanentCrop Residential River SeaLake
AnnualCrop 0.15 0.00 0.07 0.02 0.02 0.03 0.02 0.25 0.06 0.38
Forest 0.08 0.00 0.57 0.04 0.00 0.00 0.00 0.22 0.03 0.05
HerbaceousVegetation 0.04 0.00 0.12 0.04 0.03 0.00 0.00 0.57 0.03 0.17
Highway 0.03 0.00 0.03 0.06 0.00 0.00 0.00 0.75 0.06 0.06
Industrial 0.00 0.00 0.03 0.03 0.14 0.00 0.00 0.81 0.00 0.00
Pasture 0.10 0.01 0.22 0.07 0.00 0.00 0.00 0.35 0.05 0.19
PermanentCrop 0.08 0.00 0.02 0.08 0.12 0.00 0.02 0.38 0.05 0.25
Residential 0.02 0.00 0.02 0.02 0.02 0.01 0.00 0.82 0.02 0.08
River 0.03 0.00 0.09 0.03 0.05 0.00 0.00 0.63 0.03 0.15
SeaLake 0.13 0.00 0.00 0.13 0.13 0.00 0.00 0.38 0.13 0.13

Fig. 5.175 Confusion matrix (percentage) for the Gweru dataset

Fig. 5.176 Precision, recall, accuracy results from the vgg16_eurosat8breg_act_batch64.h5 with
EuroSAT dataset

Fig. 5.177 Confusion matrix results from the vgg16_eurosat8breg_act_batch64.h5 with the
EuroSAT dataset

Since we already have a working model, our best bet, and indeed the utility of the deep
learning approach, is to re-use this model as a starting point and see how much
improvement can be achieved. However, we are faced with a class imbalance problem:
the HerbaceousVegetation class makes up about 47% (779/1648) of the whole dataset
and by far outnumbers the rest of the classes, while the minority class SeaLake has as
few as about 0.5% (8/1648) of the samples. The data scarcity issue also applies to Highway,
Industrial, PermanentCrop, and River, which have fewer than 100 data points per class.
See Fig. 5.178.
Some strategies to explicitly deal with this class imbalance problem that have been
addressed in the literature include, but are not limited to [9–12]:
Strategy 1: Merging near-identical classes into one class
Strategy 2: Downsizing majority samples
Strategy 3: Resampling specific classes
Strategy 4: Adjusting the loss function.

Fig. 5.178 Distribution of Gweru class data by numbers (1648 images in total; HerbaceousVegetation dominates with 779 samples, while SeaLake has only 8)



We first try a combination of Strategies 1 and 2. To realize Strategy 1, we
define 6 classes as in [7]: Built-up, Bare areas, Cropland, Woodland, Grass/open
areas, and Water, and then map the 10 EuroSAT classes into these classes as described
in Table 5.9. As for Strategies 3 and 4, it has been observed that they bring no increase
in information, so the gains from these approaches are minimal; we will therefore leave
them for future consideration. In any case, there is nothing better than having more
real data for each class, if time and resources allow.
Mapping Strategy 1:
Cropland (Cr) = AnnualCrop + PermanentCrop
Built-up (BU) = Residential + Industrial + Highway
Woodland (Wd) = Forest
Grass/open areas (Gr) = Pasture
Water (Wt) = River + SeaLake
Bare areas (BA) = HerbaceousVegetation.
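
Expressed in code, this is a simple dictionary that can be used to relabel images or predictions; a sketch (the class-name strings follow the figures in this section):

# Mapping Strategy 1: 10 EuroSAT classes -> 6 Gweru land-cover classes
class_map = {
    "AnnualCrop": "Cropland",
    "PermanentCrop": "Cropland",
    "Residential": "BuiltUp",
    "Industrial": "BuiltUp",
    "Highway": "BuiltUp",
    "Forest": "Woodland",
    "Pasture": "Grassland",
    "River": "Water",
    "SeaLake": "Water",
    "HerbaceousVegetation": "BareAreas",
}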
With the above mapping, the distribution of the Gweru dataset is as shown in Fig. 5.179.
As a result of this operation, the majority class becomes BareAreas at about 44%, while the
minority class is Grassland (Grass/open areas) at 5%. There is a slight improvement
in the class balance, but not in the classification results, as reflected in the PRF results
below.
See Fig. 5.179.
Samples images from the 6 classes:
See Fig. 5.180.

Table 5.9 Mapping Gweru dataset land cover classes to merged EuroSAT dataset classes

Land cover | Description | Corresponding merged EuroSAT classes
Built-up (BU) | Residential, commercial, services, industrial, transportation, communication and utilities, and construction sites | Residential + Industrial + Highway
Bare areas (BA) | Bare sparsely vegetated area with > 60% soil background; includes sand and gravel mining pits and rock outcrops | HerbaceousVegetation
Cropland (Cr) | Cultivated land or cropland under preparation, fallow cropland, and cropland under irrigation | AnnualCrop + PermanentCrop
Woodland (Wd) | Woodlands, riverine vegetation, shrub and bush | Forest
Grass/Open areas (Gr) | Grass cover, open grass areas, golf courses, and parks | Pasture
Water (Wt) | Rivers, reservoirs, and lakes | River + SeaLake

Fig. 5.179 Distribution of Gweru class data after Strategy 1 is applied

Fig. 5.180 Sample images from 6 classes after applying Strategy 1



5.5.2 Training Results with 6 Classes—Unbalanced/Balanced Case

See Fig. 5.181.


Summary of Result:
Found 312 images belonging to 6 classes.
Accuracy: 0.6474358974358975
Global F2 Score: 0.6474358974358975

See Fig. 5.182.


Despite the inability to classify Grassland, Water, and Woodland, a drastic increase
in overall accuracy of 44.29 percentage points, from 20.45% for 10 classes to 64.74%
for 6 classes, has been achieved. It can therefore be confirmed that Strategy 1 is effective
for accuracy. However, the precision and recall are not yet acceptable for all classes.

Fig. 5.181 PRF of Gweru class data after Strategy 1 is applied. The Grassland, Water,
and Woodland classes have 0% recall

BareAreas BuiltUp Cropland Grassland Water Woodland


BareAreas 0.92 0.06 0.02 0.00 0.00 0.00
BuiltUp 0.18 0.82 0.00 0.00 0.00 0.00
Cropland 0.52 0.03 0.45 0.00 0.00 0.00
Grassland 0.83 0.06 0.11 0.00 0.00 0.00
Water 0.79 0.13 0.08 0.00 0.00 0.00
Woodland 0.88 0.04 0.08 0.00 0.00 0.00

Fig. 5.182 Confusion matrix results for the reduction from 10 classes to 6



We experimentally apply Strategy 2 to the data used in Strategy 1 and reduce the
sample size of BareAreas to 200 by carefully selecting the data; we end up with
the distribution shown below.
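
As a sketch, downsizing the majority class amounts to setting aside the surplus files before the generators are built. The paths below are assumptions, and a random sample stands in for the careful manual selection described above:

import random
from pathlib import Path

random.seed(42)
src = Path("gweru_6class/BareAreas")          # assumed directory layout
excluded = src.parent / "BareAreas_excluded"  # surplus images parked here
excluded.mkdir(exist_ok=True)

files = sorted(src.glob("*.png"))
keep = set(random.sample(files, 200))  # the book selects these carefully
for f in files:
    if f not in keep:
        f.rename(excluded / f.name)  # move, don't delete, the surplus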
See Fig. 5.183.
See Fig. 5.184.
Summary of Result:
Found 196 images belonging to 6 classes.
Accuracy: 0.6275510204081632
Global F2 Score: 0.6275510204081632

Some improvement for Water and Woodland can be observed, but Grassland still
has recall and precision of zero, meaning prediction is not possible.
The different combinations of precision and recall, which give a better understanding
of how well the model is performing for a given class, are interpreted in
Table 5.10 [13].

Fig. 5.183 Data distribution after applying Strategy 2 to Strategy 1 dataset

Fig. 5.184 PRF of Gweru class data after Strategy 2 is applied

Table 5.10 Interpretation of precision and recall results with respect to a given class

               | Low recall | High recall
Low precision  | Class prediction unreliable (model cannot recall many precisely) | Class prediction reliable but not others (model recalls imprecisely)
High precision | Class prediction reliable but detectability is low (model recalls few precisely) | Class prediction reliable (model recalls many precisely)

BareAreas BuiltUp Cropland Grassland Water Woodland


BareAreas 0.85 0.00 0.03 0.00 0.00 0.13
BuiltUp 0.07 0.89 0.00 0.00 0.02 0.02
Cropland 0.39 0.03 0.55 0.00 0.00 0.03
Grassland 0.78 0.06 0.11 0.00 0.00 0.06
Water 0.58 0.00 0.08 0.00 0.33 0.00
Woodland 0.38 0.00 0.04 0.00 0.04 0.54

Fig. 5.185 ConfMat of Gweru class data after Strategy 2 is applied

See Fig. 5.185.


We are now able to classify Water and Woodland, but still not Grassland, at the cost
of an accuracy decrease of about 2%, to about 62.75%. We also note that when precision
is high, recall is low, and vice versa. This could be the impact of the limited data size
for all classes: there is not enough information to learn all the classes accurately.
So, what to do next? We observe that it makes sense to use Strategy 1 again and this
time merge the Grassland and BareAreas classes into one class, GrassBareAreas. We
end up with the class distribution shown below.
See Fig. 5.186.
See Fig. 5.187.

5.5.3 Training Results with 5 Classes

It is known that accuracy is not the best measure of performance for imbalanced datasets.
We therefore introduce the ROC AUC metric as part of the evaluation, alongside PRF, as
shown in the figure below.
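
In Keras, AUC can be tracked alongside accuracy by adding it at compile time. A minimal sketch follows; the helper wrapper is ours, and whether the book tracks AUC exactly this way is an assumption:

import tensorflow as tf

def compile_with_auc(model, lr=1e-4):
    """Track ROC AUC next to categorical accuracy during fit/evaluate."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss="categorical_crossentropy",
        metrics=["categorical_accuracy", tf.keras.metrics.AUC(name="auc")],
    )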
See Fig. 5.188.
See Fig. 5.189.
Summary of Result:
Found 196 images belonging to 5 classes.
Accuracy: 0.6887755102040817
Global F2 Score: 0.6887755102040817

See Fig. 5.190.



Fig. 5.186 Distribution after applying Strategy 1

Fig. 5.187 Sample images from 5 classes after applying Strategy 1

It can be seen that the strategy is effective in improving the PRF metrics across
all classes, while at the same time achieving a validation AUC of 94.23%. The
accuracy still remains around 68.88%.
This demonstrates the importance of using different metrics for different data
distributions.

Fig. 5.188 Training/Validation accuracy and loss and AUC performance for 5-class Gweru dataset

Fig. 5.189 PRF of Gweru class data after Strategy 1 is applied to the 6 classes

BuiltUp Cropland GrassBareAreas Water Woodland


BuiltUp 0.96 0.02 0.00 0.02 0.00
Cropland 0.21 0.64 0.00 0.06 0.09
GrassBareAreas 0.24 0.07 0.53 0.10 0.05
Water 0.33 0.00 0.04 0.63 0.00
Woodland 0.38 0.00 0.00 0.04 0.58

Fig. 5.190 Confusion matrix of Gweru class data after Strategy 1 is applied to the 6 classes

The classification of imbalanced data is not a simple task, especially when there
is a very limited number of samples, as in our case. The best chance of improving
performance is to start with a large dataset in which even the minority class is well
represented. Additionally, relying on traditional evaluation metrics may be counter-
productive, as the expected results cannot be obtained. In that case, it may become
necessary to try new metrics or create new ones [14].

5.6 Concluding Remarks

All models tested were shown to be good predictors for the Residential and Forest classes
on the EuroSAT dataset. This gives us a hint that they can be used to detect changes in
urban expansion where forest is converted to residential areas. EfficientNet models
tend to classify Residential better than Forest for the 70–30 train–test split, which is
the opposite of what was observed for the ResNet, VGG, and NasNet models. In general,
for the EuroSAT dataset we could see that the VGG models performed well on the
80–20 split, with and without regularization. This leads us to explore further the
utility of the VGG models for land-cover classification.
In this investigation, we discovered that there are many opportunities
to improve the performance of deep learning algorithms in order to achieve the
highest possible target. Most state-of-the-art algorithms are required to achieve an
accuracy of not less than 98%. Through data manipulation and algorithm hyper-
parameter tuning, we could achieve an accuracy of 98.46% using VGG16 as
the base model, without feature engineering. Other methods, such as model ensembles,
have been suggested in the literature as viable approaches, although they may lead to
increased training effort due to the huge number of parameters involved. If time
and computation resources are not an issue, this approach is surely worth trying.
We also evaluated the performance of the best EuroSAT model weights on a non-
EuroSAT dataset, specifically the Gweru dataset described in Sect. 5.5. Unfor-
tunately, we could not get good results using these model weights. However, on
retraining the VGG16 model we could get some reasonable results, albeit with limi-
tations due to the imbalanced data. Next steps will be to explore emerging approaches,
including wide ResNets, and to extend the algorithms to non-EuroSAT datasets to
solve real problems. The journey has just started!

References

1. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
2. Chollet F (2018) Deep learning with Python. Manning Publications Co.
3. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. https://arxiv.org/abs/1610.02357
4. Zhu XX et al (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geosci Remote Sens Mag 5(4):8–36. https://doi.org/10.1109/MGRS.2017.2762307
5. Maggiori E, Tarabalka Y, Charpiat G, Alliez P (2017) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Trans Geosci Remote Sens 55(2):645–657. https://doi.org/10.1109/TGRS.2016.2612821
6. Deep-Learning-Models: https://github.com/sn-code-inside/Deep-Learning-Models
7. Kamusoko C, Kamusoko OW, Chikati E, Gamba J (2021) Mapping urban and peri-urban land cover in Zimbabwe: challenges and opportunities. Geomatics 1(1):114–147. https://doi.org/10.3390/geomatics1010009
8. Helber P, Bischke B, Dengel A, Borth D (2018) Introducing EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. In: IGARSS 2018 IEEE International Geoscience and Remote Sensing Symposium, pp 204–207. https://doi.org/10.1109/IGARSS.2018.8519248
9. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
10. Koßmann D, Wilhelm T, Fink GA (2021) Generation of attributes for highly imbalanced land cover data. In: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pp 2616–2619. https://doi.org/10.1109/IGARSS47720.2021.9554331
11. Douzas G, Bação F, Fonseca J, Khudinyan M (2019) Imbalanced learning in land cover classification: improving minority classes' prediction accuracy using the geometric SMOTE algorithm. Remote Sens 11:3040. https://doi.org/10.3390/rs11243040
12. Buda M, Maki A, Mazurowski MA (2018) A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106:249–259. https://doi.org/10.1016/j.neunet.2018.07.011
13. Scikit-learn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
14. TensorFlow: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
