Witold Pedrycz
Shyi-Ming Chen Editors
Deep Learning:
Algorithms and
Applications
Studies in Computational Intelligence
Volume 865
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new develop-
ments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
The books of this series are submitted for indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and SpringerLink.
Editors
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
Shyi-Ming Chen, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Activation Functions
Abstract Activation functions lie at the core of deep neural networks, allowing them to learn arbitrarily complex mappings. Without any activation function, a neural network can only learn a linear relation between the input and the desired output. The chapter introduces the reader to why activation functions are useful and to their immense importance in making deep learning successful. A detailed survey of several existing activation functions is provided, covering their functional forms, original motivations, merits, and demerits. The chapter also discusses the domain of learnable activation functions and proposes a novel activation, 'SLAF', whose shape is learned during the training of a neural network. A working model for SLAF is provided and its performance is experimentally shown on XOR and MNIST classification tasks.
1 Introduction
Neural Networks (NNs) are powerful information processing tools that can learn data representations [1] implicitly and efficiently. As a result, these networks have been shown to give excellent performance on tasks like statistical pattern recognition, classification, and time series prediction [2, 3].

Fig. 1 A simplified model of a biological neuron: impulses collected by the dendrites are integrated in the cell body and carried along the axon to the axon terminals, where an activation function models the firing of the neuron

NNs are functionally inspired by the working of the human brain. Just as the billions of neurons in a brain are heavily interconnected, so are the nodes in NNs, providing paths for information to flow through. Similar to human beings, NNs can learn from examples and can make predictions or decisions based on observed trends. Analogous to the firing of only specific neurons in the brain, only some of the nodes in an NN's hidden layer are activated in response to an input stimulus. This firing of a neuron is described by the term action potential [4], which plays an important role in cell-to-cell communication by assisting the propagation of signals along the neuron's axon. A very simplistic model is
shown in Fig. 1 (derived from [5]). A neuron receives signals ($x_0$) from other neurons' axon terminals, which are collected through the dendrites. These signals undergo a multiplicative interaction in the dendrites ($W_0 x_0$) based on the synaptic strength ($W_0$). Note that the $W$'s can be excitatory when they are positive as well as inhibitory when they are negative. If the final sum obtained ($\sum_i W_i x_i + b$) is greater than a threshold, the neuron fires. This frequency of firing is modeled by an activation function in an NN. This model does not generalize to all kinds of neurons and makes certain assumptions which might not be satisfied in actual neurons. Interested readers can refer to [6] for more insights.
One of the earliest forms of neural networks is the perceptron. It was developed at the Cornell Aeronautical Laboratory by Frank Rosenblatt [7] for an image classification task. The weights of the network were stored in physical potentiometers, and learning occurred with the help of electric motors which would update these weights during the training phase of the neural network. Technically, the perceptron was designed to learn a mapping from the input space, i.e., the pixels of an image, to a binary space corresponding to the class that image belongs to. To realize this mapping, a threshold function with only two possible outcomes was required. Hence, the step function was used at the output, giving a unit value for positive inputs and zero for negative inputs. Later, when stochastic gradient descent [8] became popular, a differentiable and continuous approximation of the step function, called the sigmoid, took its place. It is a doubly saturating non-linear function whose output is bounded between zero and one. This gave rise to logistic regression, which became the de facto standard for classification tasks. Unlike the step function, which is a hard classifier, the output of the sigmoid can be interpreted as the probability of belonging to a particular class.
Fig. 2 a Linear decision boundary for XOR problem b Best decision boundary for XOR problem
Though simple and easy to use, logistic regression exhibits a major downside when the classes are not linearly separable. To better understand linear separability of classes, let us look at the XOR (exclusive OR) classification problem. It is easy to see that it is not linearly separable (Fig. 2), and a linear classifier would only classify seventy-five percent of the points correctly (Fig. 2a). On the other hand, the perfect decision boundary would have to be non-linear. Figure 2b shows one possible boundary for the XOR problem, defined by two straight lines dividing the complete feature space into three regions. The middle region belongs to the class labelled as one and the other two belong to the class labelled as zero. To increase the representation capability of NNs so that they can learn non-linear boundaries, a hidden layer is introduced between input and output, as readers have learned in the earlier parts of the book. This provides neural networks (NNs) with the ability to learn more complex and highly non-linear functions, which in this chapter is informally called an increase in the capacity of NNs. It is well known that increasing width and depth increases the representation capacity of an NN in general. Hence, it becomes important to ask the question: 'does the choice of activation affect the relationship between capacity and width or depth of a deep neural network (DNN)?'. The universal function approximation theorem says that we can learn any function with a wide enough architecture using any of the accepted activations. But does there exist a better activation function, and if it does, how do we characterize it?
Another objective of this chapter is to allow readers not only to choose the correct activation function, but also to understand the reasons why a certain activation might perform better. It is well known that Long Short-Term Memory networks (LSTMs), variants of recurrent neural networks, use the hyperbolic tangent as the activation function and the sigmoid activation function for the gating mechanism. Since the sigmoid has significance in terms of its interpretability as a probability, it is intuitive to use it for gating or binary classification problems. In the later part of the chapter, we will see that the sigmoid suffers from the vanishing gradient problem. The hyperbolic tangent, on the other hand, converges in fewer iterations due to its mathematical properties, which is why it is used in LSTMs over the sigmoid. Such properties are important to understand for researchers developing new architectures and activations.
Tuning the width and depth of the network, applying regularization penalties [9, 10], and using normalization methods [11] have been shown to help in improving the performance of Deep Neural Networks (DNNs). Another important parameter that affects the performance of NNs is the type of activation function. As we will see in the later parts of the chapter, no two activations perform equivalently even with the same architecture, so it can become critical to choose the right activation function. Hence, it can be seen as a hyperparameter which can be tuned specifically for the task and data set. One possible way is to try all existing activation functions on the architecture and compare their performance. However, this becomes impractical when we consider using a different activation function at each layer, or maybe even at each node. Another problem with fixed activation functions is that they inevitably impose some non-linearity on the architecture. For example, it is impossible to learn the identity operation exactly using such networks. With the activations kept fixed, it is impossible to recover the input without loss of information. This can be seen as an important issue with autoencoders [12]: even with a high dimensional encoding, the resulting output is always lossy. Had the activation function been linear, this problem would not occur. These factors serve as motivation for using and studying adaptive/learnable activation functions. In the last section of the chapter, we will see the design of such activation functions and understand their practical importance.
In this section, we will try to understand various activation functions present in the literature. We will study how they were proposed, why they are advantageous, and how newer activations were developed from them. Though we are far from fully understanding what actually happens inside a neural network, mathematical analysis of these non-linearities can provide a proper basis and intuition for what might be happening inside neural networks. One must understand activation functions not just as a tool to make NNs work but also for their theoretical implications.
Neural networks equipped with linear activation functions are generally called linear neural networks. They have been widely used to systematically study the learning dynamics of deep learning [13, 14]. They have analytic importance and are useful to understand the nature of neural networks. Linear neural networks with multiple hidden layers are equivalent to neural networks with a single hidden layer.
Fig. 3 a Linear activation function and its gradient b Binary step activation function and its gradient
They are used for linear regression tasks and a closed form solution exists for such
networks. Linear activation functions are defined as
y = x, (1)
where $x$ is the input to the activation function and $y$ is its output. It is clear from Eq. (1)
that the gradient of linear activation functions is unity. Figure 3a shows the activation
function and its gradient pictorially.
A binary step function is a threshold function that can be used for binary classification tasks [15, 16]. Although this function is quite old, it has historical importance in machine learning and is used in classification tasks done with signal processing methods. It is mathematically defined (for a threshold at 0) as:
$$\mathrm{B.Step}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x \le 0 \end{cases}$$
This function is non-differentiable at zero (or, in general, at the threshold) and has zero gradient at all other points. Hence, it cannot be used in conjunction with stochastic gradient descent. This led to the discovery of a smoother, differentiable approximation known as
sigmoid through which gradients could flow backwards. Let’s see what sigmoid and
its multi-class variant look like.
The sigmoid was originally designed for binary classification but now has wide application in tasks related to attention models and bounded-output regression as well [17, 18]. Mathematically, the sigmoid is defined as:
$$\mathrm{sigmoid}(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$
The output of the sigmoid lies in the range (0, 1) and hence it can be interpreted as a probability. Its gradient is interestingly easy to compute, as it can be expressed in terms of the original activation itself:
$$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x) \cdot (1 - \sigma(x))$$
The major problem with the sigmoid, which we discuss below, is the vanishing gradient problem. The gradient of the sigmoid activation function, as shown in Fig. 4, vanishes away from zero and is upper bounded by the value 0.25. This characteristic makes the training of deep neural networks slower when all of their activations are sigmoids.
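As a quick numerical illustration (a NumPy sketch added here, not part of the chapter), the snippet below evaluates the sigmoid and its derivative and confirms that the derivative peaks at 0.25 at x = 0 and decays rapidly away from it:

```python
import numpy as np

def sigmoid(x):
    # Logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 2001)
print("max gradient   :", sigmoid_grad(x).max())   # 0.25, attained at x = 0
print("gradient at x=5:", sigmoid_grad(5.0))       # ~6.6e-3, i.e. the function saturates
```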
Before we discuss the performance details of the sigmoid, one might ask: why would anyone specifically use only this form of sigmoid? There are multiple ways to approximate the binary step function, so why this one? To answer that, we can look at the argument proposed in [19].
Consider a generative model for data denoted by the random variable X, where x (one realization of X) can belong either to class C1 or to class C2. This model is expressed by the prior probabilities P(C1), P(C2) and the class-conditional probability density functions p(x|C1), p(x|C2). Now, the posterior probability of an input belonging to either of the classes can be written with the help of Bayes' theorem as:
$$P(C_1|x) = \frac{p(x|C_1) \cdot P(C_1)}{p(x|C_1) \cdot P(C_1) + p(x|C_2) \cdot P(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a) \qquad (2)$$
where $a = \ln\!\big(\tfrac{P(C_1|x)}{P(C_2|x)}\big)$ is frequently referred to as the 'logit'.
The above idea is extended to write the posterior probability for multiple classes
(let’s say K classes (C1 , C2 , . . . , CK )). This gives rise to the softmax activation
function [19]:
$$P(C_k|x) = \frac{p(x|C_k) \cdot P(C_k)}{\sum_{i=1}^{K} p(x|C_i) \cdot P(C_i)} = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)} = \mathrm{softmax}(\bar z)_k \qquad (3)$$
where $\bar z$ is a $K$-dimensional vector containing the $K$ logits $z_k = \ln\big(p(x|C_k)\,P(C_k)\big)$, one for each class. This wraps up the two most important activation functions, sigmoid and softmax. The gradient of the softmax can also be written in terms of the original function itself:
$$\frac{\partial\,\mathrm{softmax}(\bar z)_i}{\partial z_j} = \frac{\partial s(z_i)}{\partial z_j} = \begin{cases} s(z_i) \cdot (1 - s(z_i)) & \text{if } i = j \\ -s(z_i) \cdot s(z_j) & \text{if } i \neq j \end{cases} \qquad (4)$$
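The identity in Eq. (4) can be checked numerically. The NumPy sketch below (an illustration added for this discussion, not code from the chapter) builds the softmax Jacobian as diag(s) − s sᵀ and compares it against finite differences:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = d softmax_i / d z_j = s_i * (delta_ij - s_j), cf. Eq. (4)
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, -1.0, 0.5])
J = softmax_jacobian(z)

# Finite-difference check: column j holds the derivative w.r.t. z_j
eps = 1e-6
J_num = np.stack([(softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
                  for j in range(3)], axis=1)
print(np.allclose(J, J_num, atol=1e-6))   # True
```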
Given the important meaning associated with sigmoid activation functions, their use inside hidden layers was initially a heuristic (biologically inspired). Later, it was found that the sigmoid leads to vanishing gradient problems. For cascaded sigmoid activations across m hidden layers of one node each (calling the outermost layer $h_1$, its activated output $o_1$ and the incoming weight of the edge $w_1$, such that $h_k = w_k \cdot o_{k+1}$ and $o_k = \sigma(h_k)$), the corresponding gradient at the kth layer is:
$$\frac{do_1}{dh_k} = \frac{do_1}{dh_1} \cdot w_1 \cdot \ldots \cdot \frac{do_{k-1}}{dh_{k-1}} \cdot w_{k-1} \cdot \frac{do_k}{dh_k} \qquad (5)$$
Now, each $\frac{do_i}{dh_i}$ is the derivative of a sigmoid, which is upper bounded by 0.25. These successive multiplications lead to smaller and smaller values in the shallower layers, which makes learning slower for the shallower parameters. This disadvantage of the sigmoid activation function can be overcome by using the hyperbolic tangent activation function. It has interesting mathematical properties even though its characteristics are close to those of the sigmoid activation function.
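To see Eq. (5) in action, the following toy NumPy sketch (an illustration under arbitrary assumptions on depth and weights, not the chapter's code) pushes a value through a chain of single-unit sigmoid layers with random weights and accumulates the factors |w_k · σ'(h_k)|; each factor is at most 0.25 |w_k|, so the product collapses with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
depth = 20
o = 0.5                  # output of the deepest layer, fed forward through the chain
product = 1.0
for k in range(depth):
    w = rng.normal()      # weight w_k of the k-th edge in the chain
    h = w * o             # pre-activation h_k = w_k * o_{k+1}
    o = sigmoid(h)        # activated output o_k = sigma(h_k)
    # each factor |w_k * sigma'(h_k)| is at most 0.25 * |w_k|
    product *= abs(w) * sigmoid(h) * (1.0 - sigmoid(h))
    if k in (0, 4, 9, 19):
        print(f"after {k + 1:2d} layers, gradient factor product ~ {product:.2e}")
```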
The hyperbolic tangent is another activation function, which can also be called the symmetric sigmoid [20]. It is a zero-centered, doubly saturating activation function (saturating away from zero in both directions). It is mathematically defined as:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (6)$$
$$\frac{d\tanh(x)}{dx} = \frac{4}{(e^{x} + e^{-x})^2} = 1 - \tanh^2(x) \le 1.0 \qquad (7)$$
The above equations show that the gradient of tanh is upper bounded by one, which
is four times that of sigmoid activation function. Therefore, due to larger gradients,
symmetric sigmoids are often seen to converge faster (Fig. 5).
Proof Consider a multi-layered neural network with N hidden layers and a non-linear vector transformation F(·). This transformation takes a vector and applies an activation function f(·) to each of its components. The relation between the (n − 1)th layer (with k units) of the neural network and the nth layer (with r units) can be written as:
$$Y_n = W_n X_{n-1} \qquad (8)$$
$$X_n = F(Y_n) \qquad (9)$$
where $W_n$ is the weight matrix of shape (r × k), and $X_n$ is the nth hidden layer after applying the activation F(·) to $Y_n$. The derivatives can be calculated from the following recurrence relations.
$$\frac{\partial E^p}{\partial y_n^i} = f'(y_n^i)\,\frac{\partial E^p}{\partial x_n^i} \qquad (10)$$
$$\frac{\partial E^p}{\partial x_{n-1}^i} = \sum_{j} w_n^{j,i}\,\frac{\partial E^p}{\partial y_n^j} \qquad (11)$$
$$\frac{\partial E^p}{\partial w_n^{i,j}} = x_{n-1}^j\,\frac{\partial E^p}{\partial y_n^i} \qquad (12)$$
The hyperbolic tangent is a zero-centered activation function and has major application in recurrent neural networks, where the problem of vanishing gradients is much more prominent. One should note that if the gradient of an activation function is larger than one, this can result in a phenomenon termed exploding gradients [22]. This leads to instability during training, and the network may never achieve steady-state convergence. Techniques such as gradient clipping [22] ensure that the gradient is appropriately scaled to lower values in the event of gradient explosion.
ReLU is one of the most popular and widely used activation functions. Many versions of the ReLU activation function have been proposed. Below we discuss the activation functions belonging to this family.
2.5.1 ReLU
ReLU was proposed in [23, 24] and was shown to stabilize the feedback in analogue circuits. The authors proved that, under certain conditions, networks equipped with the ReLU non-linearity always converge to a steady state. Mathematically, ReLU is defined as
$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (13)$$
1. Gradient stability: From Eq. (5), it is easy to see that no matter how deep we go, the gradient of ReLU (when the input is positive) is always one and hence it will not vanish. A more generic formulation can be written using Eqs. (10), (11), (12). The gradient of ReLU only dies when all of the ReLU outputs across a hidden layer are zero.
2. Computationally cheaper: Other activation functions require evaluating exponentials and performing division operations. ReLU, however, is simpler because it only requires a max operation to generate its output. The same argument is valid for the gradient calculation. This is another reason ReLU is preferable over other, more complex activation functions.
3. Sparsity: ReLUs induce sparsity in layers: when a unit's output becomes zero, its connection becomes irrelevant to the model. This allows for analyzing which features or variables are important. However, whether this actually improves generalization or performance is still an open question.
In contrast to the above advantages of ReLU, there exist some drawbacks as well. ReLU can result in dead neurons: if the output of a particular activation becomes zero, then its gradient might die forever. Since the gradient flowing backward would also be zero, a neuron which could have contributed greatly might never recover from that state. Another problem with ReLU is its non-zero-centered nature, because of which the activations are biased towards being positive. Had ReLU been zero centered, it would likely have accelerated training.
Fig. 7 a LReLU and its corresponding gradient. b PReLU with α = 0.3 and its corresponding
gradient
A new activation function called leaky ReLU (LReLU) [25] is used to avoid the dying-ReLU problem. The major disadvantage of ReLU is that it saturates at zero whenever it is not activated, which leads to a zero gradient and slower training. To solve this issue, a leak is added to the activation instead of a hard zero. Mathematically, LReLU is defined as:
$$\mathrm{LReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.01x & \text{if } x < 0 \end{cases} \qquad (15)$$
This leak helps to increase the range of the ReLU. The gradient of the leaky ReLU is obtained as
$$\frac{d\,\mathrm{LReLU}(x)}{dx} = \begin{cases} 1 & \text{if } x \ge 0 \\ 0.01 & \text{if } x < 0 \end{cases} \qquad (16)$$
The characteristics of Leaky ReLU and its gradient are shown in Fig. 7a.
The parametric ReLU (PReLU) [26] generalizes this leak:
$$\mathrm{PReLU}(x_i) = \begin{cases} x_i & \text{if } x_i \ge 0 \\ \alpha_i x_i & \text{if } x_i < 0 \end{cases} \qquad (17)$$
where $\alpha_i$ is a parameter that is learned during training. Here, the subscript i refers to the ith channel in the convolutional neural network. $\alpha_i$ is shared across each feature map, thereby reducing the chance of over-fitting. LReLU uses $\alpha_i = 0.01$, but PReLU
adaptively learns the slope of the negative part. There is a trade-off: PReLU incurs a small additional computational cost, with the motivation of learning a better, specialized activation.
The gradient of the PReLU is
$$\frac{d\,\mathrm{PReLU}(x_i)}{dx_i} = \begin{cases} 1 & \text{if } x_i \ge 0 \\ \alpha_i & \text{if } x_i < 0 \end{cases} \qquad (18)$$
The characteristics of PReLU and its gradient are shown in Fig. 7b. The update
formulas for the parameter $\alpha_i$ are derived from the chain rule. To write the update rules for the parameters $\alpha_i$, consider the equation below,
$$\frac{\partial E}{\partial \alpha_i} = \sum_{x_i} \frac{\partial E}{\partial\,\mathrm{PReLU}(x_i)}\,\frac{\partial\,\mathrm{PReLU}(x_i)}{\partial \alpha_i} \qquad (19)$$
where $E$ represents the objective function. The term $\frac{\partial E}{\partial\,\mathrm{PReLU}(x_i)}$ is the gradient back-propagated from the deeper layers. The gradient of the activation with respect to $\alpha_i$ is given by
$$\frac{\partial\,\mathrm{PReLU}(x_i)}{\partial \alpha_i} = \begin{cases} 0 & \text{if } x_i > 0 \\ x_i & \text{if } x_i \le 0 \end{cases} \qquad (20)$$
The gradient of α, when it is shared across channels as well as feature maps, is
$$\frac{\partial E}{\partial \alpha} = \sum_{i} \frac{\partial E}{\partial\,\mathrm{PReLU}(x_i)}\,\frac{\partial\,\mathrm{PReLU}(x_i)}{\partial \alpha} \qquad (21)$$
The authors of the PReLU activation function suggest using a momentum optimizer for updating the parameters $\alpha_i$. Hence, the update at the nth iteration can be given as
$$\delta\alpha_i^{\,n} = \mu\,\delta\alpha_i^{\,n-1} + \frac{\partial E}{\partial \alpha_i} \qquad (22)$$
Fig. 8 a The exponential linear unit (ELU) with α = 1 and its corresponding gradient. b The scaled
exponential linear unit (SELU) with α = 1.6733 and λ = 1.0507, with its corresponding gradient
The Exponential Linear Unit (ELU) [27] is defined as
$$\mathrm{elu}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases} \qquad (24)$$
where α is a hyperparameter that controls the value to which the activation saturates for negative inputs. The gradient of ELU is given by
$$\frac{d\,\mathrm{elu}(x)}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ \mathrm{elu}(x) + \alpha & \text{if } x \le 0 \end{cases} \qquad (25)$$
Like ReLU, ELU solves the problem of vanishing gradients by having a unit gradient for all positive inputs. Moreover, its gradient is non-zero for negative inputs, and its negative outputs push the mean of the activation closer to zero; as discussed in Proposition 1, this leads to faster training of the neural network. The major improvement of ELU over LReLU and PReLU comes from its saturating behavior for negative inputs. This results in relatively less variation in the activation, which in turn makes it robust to noise. Moreover, the ELU activation has shown improved results over the ReLU activation on both supervised and unsupervised machine learning tasks. Figure 8a shows the ELU activation function and its gradient for the hyperparameter α = 1.
It is well known that, even though Fully Connected Neural Networks (FNNs) are highly sophisticated machine learning models, they often fail to live up to their reputation in real-life applications. With CNNs and RNNs, however, NNs can achieve state-of-the-art results. This can be attributed to CNNs and RNNs sharing parameters across feature maps and cells respectively, combined with normalization techniques like batch normalization and layer normalization. Since image and time series data have structure across space and time respectively, these parameters are efficiently shared, which greatly reduces the model complexity. In [28], the authors find
that the reason behind the lacking performance of FNNs is their high variance across different training examples and their sensitivity to perturbations. This brings us to the concept of SNNs. Self-Normalizing Neural Networks (SNNs) keep the activations normalized as they propagate through the layers of the network. SNNs need two things to work: a custom weight initialization, and SELUs as the activation function. The weight initialization keeps the mean and variance at each layer close to zero and one respectively.
SNNs cannot be constructed with ReLU, sigmoid, tanh or leaky ReLU activations. SELUs are obtained by multiplying the exponential linear unit by a parameter λ which is kept greater than 1 to ensure a slope greater than one. The authors of SELU provide four conditions which an activation function should ideally possess (and which SELU satisfies):
1. Both positive and negative values should be present in the range of the activation
function so that the mean is zero.
2. Saturating regime in the activation function to dampen or reduce the variance of
the output of activation.
3. A regime with slope greater than one so as to increase the variance if needed by
the network.
4. The activation function should be continuous.
SELU (Scaled Exponential Linear Unit) is mathematically defined as
$$\mathrm{selu}(x) = \begin{cases} \lambda x & \text{if } x > 0 \\ \lambda\alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases} \qquad (26)$$
where α and λ are two fixed parameters. For standard-scaled inputs (zero mean, unit variance), the authors provide the optimal values α = 1.6733 and λ = 1.0507. The gradient
of SELU can be written as
$$\frac{d\,\mathrm{selu}(x)}{dx} = \begin{cases} \lambda & \text{if } x > 0 \\ \mathrm{selu}(x) + \lambda\alpha & \text{if } x \le 0 \end{cases} \qquad (27)$$
The characteristics of SELU are shown in Fig. 8b. SELU has a self-normalizing property that allows networks with many layers to be trained with high learning rates. This activation function does not suffer from exploding or vanishing gradient problems.
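The self-normalizing behaviour is easy to observe empirically. The NumPy sketch below (an illustration under the stated assumptions, not the authors' code) propagates standard-scaled data through several dense layers with LeCun-normal initialisation and the SELU constants quoted above, and prints the per-layer mean and variance, which stay close to 0 and 1:

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507   # the fixed SELU constants quoted in the text

def selu(x):
    return np.where(x > 0, LAMBDA * x, LAMBDA * ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(10000, 256))   # standard-scaled inputs

# Dense layers with LeCun-normal initialisation (variance 1/fan_in),
# the weight initialisation that SNNs rely on.
for layer in range(8):
    W = rng.normal(0.0, np.sqrt(1.0 / x.shape[1]), size=(x.shape[1], 256))
    x = selu(x @ W)
    print(f"layer {layer}: mean={x.mean():+.3f}, var={x.var():.3f}")
```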
2.6 Softplus
Softplus is a smooth approximation of the ReLU, defined as $\mathrm{Softplus}(x) = \ln(1 + e^{x})$. Its gradient is the sigmoid function:
$$\frac{d\,\mathrm{Softplus}(x)}{dx} = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \sigma(x) \qquad (29)$$
2.7 Maxout
Fig. 10 Image Courtesy: [29]. How maxout can approximate arbitrary uni-variate functions. Sim-
ilarly, multivariate functions can also be approximated by maxout
So far, we have seen various activations that have been mostly hand-designed to improve certain characteristics of the ReLU activation function. In [30], the authors present a method to discover novel activation functions using an automated reinforcement
learning based search. The core idea is to design composite combinations of various existing activation functions and empirically compare them to find the best one. In other words, a search space consisting of composite combinations of existing functions is designed, and each candidate is then tested on a standard data set so that the candidates can be compared with each other. The crux of the method lies in designing the composite combinations of existing functions. An exhaustive search, i.e., trying all combinations of activation functions, would result in a very large search space, making the search practically infeasible. Hence, in [30], the authors used an RNN controller to predict the different components of an activation function, feeding each predicted component back to predict the remaining components of the same new activation function.
Once an activation function has been found, it is tested on the CIFAR-10 data set using a child network. A list of top performing functions is maintained to keep track of the best performing activations. This method resulted in various novel activation functions which outperformed ReLU on the CIFAR-10 data set (explained in Sect. 3), at least using the child network. It was found that the activation function f(x) = x · σ(βx) outperformed ReLU on both CIFAR-10 and CIFAR-100 on various deep architectures. The function f(x) = x · σ(βx) is called the Swish activation function. Here, β is a constant or trainable parameter and σ(z) = (1 + exp(−z))⁻¹. Figure 11 shows the swish activation function and its first derivative for various values of β. It can be seen that for β = 0, f(x) = x/2, so swish behaves like a scaled linear function. As β → ∞, f(x) → max(0, x), i.e., f(x) acts as the ReLU activation function. This suggests that the swish activation function can be loosely viewed as a smooth version of the ReLU activation function.
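The limiting behaviour described above can be verified directly; the short NumPy check below (added purely as an illustration) confirms that swish reduces to x/2 for β = 0 and approaches ReLU for large β:

```python
import numpy as np

def swish(x, beta):
    # f(x) = x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-5, 5, 11)
print(np.allclose(swish(x, 0.0), x / 2))                              # beta = 0  -> x / 2
print(np.allclose(swish(x, 20.0), np.maximum(0.0, x), atol=1e-6))     # large beta -> ~ ReLU
```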
The swish activation function was compared with other activations on the CIFAR data sets using 3 different state-of-the-art architectures. The CIFAR-10 and CIFAR-100 data sets contain 60,000 colored images belonging to 10 and 100 classes respectively; 50,000 of these images belong to the training set and the remaining 10,000 are test images. The task is to accurately classify the test images based on the training data set. The 3 different architectures used are ResNet-164 [31], Wide ResNet 28-10 (WRN) [32] and DenseNet 100-12 [33]. The results shown are taken from [30].
Tables 1 and 2 showcase the test accuracy of several non linear activation functions
on three different architectures. It is evident that no single activation works best on
every architecture. For example, in case of CIFAR-10 data set, Softplus outperforms
all other activation functions when ResNet is used as the underlying architecture but
performs poorly on Wide ResNet as compared to other activations. This creates a
dilemma around how to select the optimal activation function for any architecture.
Hence, we want an activation function which can adapt itself depending on the data
set and architecture. In the next section, we will focus on such activation functions
and discuss two different approaches that can be used to learn activation functions.
So far, we have discussed various fixed activation functions, from the identity function to Swish. In this section, we shift our focus from fixed activation functions to those which can be learned. Learning an activation function means that the shape of the function is not fixed but is parameterized by learnable weights which are learned during the training of the NN. Owing to their capability of adaptation, they are also known as 'Adaptive Activation Functions (AAFs)'. References [34–37] show different approaches to designing AAFs. Below, we discuss two separate approaches to learning activation functions. The first method uses an 'Adaptive Piecewise Linear (APL) Activation Function' [37] which is learned independently for each neuron using gradient descent. Next, we propose a unique technique which aims to learn an activation function using techniques of non-linear approximation. We call it the 'Self Learnable Activation Function (SLAF)'.
As the name suggests, this method learns a piecewise linear activation function for
each neuron in the neural network. It formulates an activation function h(x) as,
$$h_i(x) = \max(0, x) + \sum_{s=1}^{S} a_i^{s} \max(0, -x + b_i^{s}) \qquad (32)$$
where S is a hyperparameter and the variables $a_i^{s}$ and $b_i^{s}$, for $s \in \{1, \ldots, S\}$, are learned with the same algorithm as the other network weights. Note that the method aims to learn the best piecewise linear function for a given data set and architecture. Hence, Eq. (32) should span the entire space of continuous piecewise linear functions.
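For concreteness, a minimal NumPy sketch of Eq. (32) is given below; the hinge parameters a and b are shown as plain arguments here, whereas in the actual method they are per-neuron parameters trained jointly with the network weights:

```python
import numpy as np

def apl(x, a, b):
    """Adaptive piecewise linear unit, Eq. (32): h(x) = max(0, x) + sum_s a_s * max(0, -x + b_s)."""
    x = np.asarray(x, dtype=float)
    out = np.maximum(0.0, x)
    for a_s, b_s in zip(a, b):
        out = out + a_s * np.maximum(0.0, -x + b_s)
    return out

# S = 2 hinges with arbitrary example values for a and b
x = np.linspace(-3, 3, 7)
print(apl(x, a=[0.3, -0.5], b=[1.0, -1.0]))
```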
Theorem 1 Any continuous piecewise linear function g(x) can be expressed by Eq. (32) for some S if it satisfies the following two conditions:
1. There exists a scalar u such that g(x) = x for x ≥ u
2. There exist two scalars α and v such that ∇x g(x) = α for all x < v
We do not provide the proof of the above theorem; the reader may refer to [37] for further details on this result. Figure 12 shows the APL activation function when the summation in Eq. (32) has only one term. It is very interesting to note that all the curves except (b) show non-monotonic behavior, which is contrary to the behavior observed in fixed activation functions (other than the swish activation function). Moreover, Fig. 12b shows non-convex behavior of the activation function. This shows the freedom of APL activation functions to adapt themselves depending on the task. The APL activation function also outperforms ReLU on various data sets. As
Fig. 12 Adaptive piece wise linear activation function with different a, b parameters for S = 1
mentioned in [37], the best error rate on CIFAR-10 using APL activation function is
7.51% whereas ReLU had an error rate of 7.73% using Network in Network (NIN)
architecture [38]. Similarly, on the same architecture for CIFAR-100 data set, APL
outperforms ReLU by around 2% by achieving an error rate of 30.83%.
The above method enables neural networks to learn a diverse set of activation functions. However, all of these learned activation functions will be piecewise linear; the search space explored by this method is thus limited to piecewise linear functions. Below, we discuss a more general method of learning activation functions which does not make any inherent assumption about the nature of the activation function.
A function f(x) can be written as a linear combination of basis functions [39]:
$$f(x) = \sum_{i=0}^{\infty} a_i\,\phi_i(x) \qquad (33)$$
Here, the $a_i$ are the coefficients of the basis elements and are unique to f(x). If we fix the basis functions and learn their coefficients using a suitable algorithm, we can effectively learn f(x). The only problem is that the expression contains infinitely many elements and it is practically impossible to learn all of them. Restricting the number of elements in the basis results in an approximation of the function. A suitable approximation of a function f(x) with N basis elements can be given by:
$$\tilde f(x) = \sum_{i=0}^{N-1} a_i\,\phi_i(x) \qquad (34)$$
where the $\phi_i$ are the basis elements and the $a_i$ are the corresponding coefficients. If we take f(x) to be the activation function which we aim to learn, learning $\{a_0, \ldots, a_{N-1}\}$ effectively learns $\tilde f(x)$. Since this method learns the approximation $\tilde f(x)$ and not the actual function f(x), it becomes a prime concern to find a basis which provides a good approximation with N elements. Although there can be many choices of basis functions, we use the Even Mirror Fourier Non-linear (EMFN) filter basis owing to its strong approximation capabilities.
EMFN filters [40] can be used for the approximation of any continuous function f(x) on the interval [−1, 1]. To extend f(x) in the EMFN basis to the entire real axis $\mathbb{R}$, its periodic even mirror repetition is considered: the values of f(x) on [−1, 1] are taken and repeated over the entire real line in such a way that the extension is an even mirror of f(x) and is periodic with period 4.
The EMFN filters use sine and cosine functions as basis elements. Since f(x) is periodic with period 4, it is easy to write the Fourier series expansion of f(x). The Fourier series expansion contains the following basis elements:
$$\Big\{1, \cos\big(\tfrac{\pi}{2}x\big), \sin\big(\tfrac{\pi}{2}x\big), \cos(\pi x), \sin(\pi x), \cos\big(\tfrac{3\pi}{2}x\big), \sin\big(\tfrac{3\pi}{2}x\big), \cos(2\pi x), \sin(2\pi x), \cos\big(\tfrac{5\pi}{2}x\big), \ldots\Big\} \qquad (36)$$
Now, since the basis elements $\{\cos(\tfrac{\pi}{2}x), \sin(\pi x), \cos(\tfrac{3\pi}{2}x), \sin(2\pi x), \ldots\}$ do not satisfy the even mirror property of f(x), they can be removed from the basis.
Hence, the resultant basis and the corresponding function approximation can be
given as:
$$\Big\{1, \sin\big(\tfrac{\pi}{2}x\big), \cos(\pi x), \sin\big(\tfrac{3\pi}{2}x\big), \cos(2\pi x), \ldots\Big\} \qquad (37)$$
$$f(x) \approx a_0 + a_1 \sin\big(\tfrac{\pi}{2}x\big) + a_2 \cos(\pi x) + a_3 \sin\big(\tfrac{3\pi}{2}x\big) + \cdots \qquad (38)$$
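As an illustration of how well a truncated expansion of the form (38) can represent a non-linearity, the NumPy sketch below fits the first five coefficients to ReLU on [−1, 1] by ordinary least squares (this is an offline example added here; the target function and the number of terms are arbitrary choices):

```python
import numpy as np

def emfn_basis(x):
    # First five EMFN basis elements from Eq. (37)
    return np.stack([np.ones_like(x),
                     np.sin(np.pi * x / 2),
                     np.cos(np.pi * x),
                     np.sin(3 * np.pi * x / 2),
                     np.cos(2 * np.pi * x)], axis=1)

# Least-squares fit of the coefficients a_i in Eq. (38) to a target function on [-1, 1]
x = np.linspace(-1, 1, 401)
target = np.maximum(0.0, x)                       # e.g. approximate ReLU
coeffs, *_ = np.linalg.lstsq(emfn_basis(x), target, rcond=None)
approx = emfn_basis(x) @ coeffs
print("max abs error:", np.abs(approx - target).max())
```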
Since the EMFN approximation is valid only on [−1, 1], the input tensor X to the activation is first scaled as
$$\hat X_1 = \frac{X}{m \cdot \max(|X|)} \qquad (39)$$
It is possible that during the training phase max(|X|) might be 0 or a very small positive quantity, which would lead to division by 0 in Eq. (39). Hence, we add a small learnable parameter ε to the denominator of Eq. (39). This gives us the final transformed tensor $\hat X$, which is defined as follows:
$$\hat X = \frac{X}{m \cdot \max(|X|) + \epsilon} \qquad (40)$$
We keep both m and ε as learnable parameters, as they are both data-set dependent. Now, we can use this scaled input tensor for the approximation of our activation function, defined as
$$f(\hat x) = \sum_{i=0}^{N-1} W_i\,\phi_i(\hat x) \qquad (41)$$
where the $W_i$ are learnable, the $\phi_i$ belong to the EMFN basis, and $\hat x$ is one element of $\hat X$ (which has the same shape as X). Note that the $W_i$ are shared across the complete tensor X to avoid over-fitting. This is pictorially depicted in Fig. 13.
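A possible implementation sketch of Eqs. (39)–(41) as a trainable layer is shown below in PyTorch (the chapter does not fix a framework; the module name, the initialisation of the coefficients, and the use of four basis elements are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

class SLAF(nn.Module):
    """Sketch of a self-learnable activation function in the spirit of Eqs. (39)-(41)."""

    def __init__(self):
        super().__init__()
        # Coefficients W_i of Eq. (41); starting with weight 1 on sin(pi x / 2) is an arbitrary choice
        self.coeffs = nn.Parameter(torch.tensor([0.0, 1.0, 0.0, 0.0]))
        self.m = nn.Parameter(torch.tensor(1.0))      # learnable scaling m, Eq. (40)
        self.eps = nn.Parameter(torch.tensor(1e-3))   # learnable epsilon, Eq. (40)

    def basis(self, x):
        # First EMFN basis elements of Eq. (37): {1, sin(pi x/2), cos(pi x), sin(3 pi x/2)}
        return torch.stack([torch.ones_like(x),
                            torch.sin(math.pi * x / 2),
                            torch.cos(math.pi * x),
                            torch.sin(3 * math.pi * x / 2)], dim=-1)

    def forward(self, x):
        x_hat = x / (self.m * x.abs().max() + self.eps)        # Eq. (40), shared over the tensor
        return (self.basis(x_hat) * self.coeffs).sum(dim=-1)   # Eq. (41), coefficients shared

act = SLAF()
print(act(torch.randn(8, 16)).shape)   # torch.Size([8, 16])
```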
EMFN filters have strong expressive power, which means that the coefficients of the basis elements learned by the model can heavily overfit the training data set, resulting in poor generalization. To avoid this, we need improved training routines. We propose the following methods, which can be used along with SLAF to improve generalization.
1. Regularization is a standard technique to reduce over-fitting in machine learning. L2 regularization [9] is used on the coefficients being learned. Reference [41] points out a problem with using L2 regularization together with the Adam optimizer [42]. Hence, if the Adam optimizer is being used for optimizing the network weights, a separate optimizer is used for the regularization loss. The experiments in the sections below use the SGD optimizer for the regularization loss and the Adam optimizer for minimizing the cross entropy or mean squared loss, depending on the task.
2. Learning rate decay: Learning rate decay [43, 44] is essential for most tasks when using the self-learnable activation function in a neural network. It helps in avoiding poor local minima.
3. Tuning the number of basis elements: The number of basis elements changes the expressive power of the network. A high number of basis elements can not only lead to over-fitting but also raise convergence issues. The experiments in the subsection below use only 3 or 4 basis elements.
4.2.4 Experiments
In this subsection, we present a series of experiments in which the results of fixed activation functions and the self-learnable activation function are compared on different tasks.
1. XOR: XOR is a logical operator which takes two binary inputs and whose output is also a binary value. Table 3 shows this classification ("x" and "o" denote the two separate classes). It is clear that the XOR operation is not linearly separable and, therefore, it is impossible to learn it without a hidden layer, i.e., simply with a "Perceptron". The architecture used for this experiment first takes a weighted combination of the inputs and then applies an activation function (acting as a non-linearity) on the output of this weighted combination. The main reason for using this sort of architecture is to see whether the existing activation functions have enough capacity to learn this decision boundary or not (a minimal numerical illustration is given after this list). Table 4 shows that the maximum accuracy that can be achieved using ReLU is 75%, whereas SLAF can classify this data set with 100% accuracy. This is because SLAF can adapt to the task depending on the data set. The final decision boundaries learned by both activation functions are shown in Fig. 14.
2. MNIST: The MNIST data set contains 70,000 images of 28 × 28 pixels, each containing a handwritten digit from 0 to 9. The task is to classify these images into 10 classes depending on the digit written in the image. We use a Convolutional Neural Network (CNN) [45] consisting of 2 convolutional layers followed by 2 fully connected layers to train our model. The architecture uses 3 activation functions; we replace all three by SLAF. Using L2 regularization on the SLAF weights and learning rate decay, we achieve an accuracy of 99.46%. The
Table 4 Results on the XOR problem. k1, k2 are the number of basis elements used for the sine and cosine terms respectively

Activation function                 Accuracy
ReLU                                0.75
EMFN filter (k1 = 3, k2 = 3)        1.0
Fig. 14 Comparison of decision boundaries learned with training epochs by using two different
activation functions. ‘o’ (dot) refers to the class labeled as zero and ‘x’ (cross) refers to the class
labelled as one. a Using ReLU b Using SLAF
Fig. 15 Test accuracy versus iterations on MNIST data set using different activation functions
maximum accuracy achieved using ReLU on the same architecture was 99.34%. Figure 15 shows that the test accuracy using SLAF is almost always better than with ReLU. Moreover, the accuracy curve of the neural network using SLAF shows negligible fluctuation compared to ReLU, indicating the stronger generalization capability of SLAF.
5 Conclusion
6 Future Works
The chapter highlights the importance of learnable activation functions and discusses two completely different methods to learn them. The performance of SLAF is shown on simple data sets. To empirically validate the usefulness of SLAF, one must conduct experiments on more complex data sets such as CIFAR-100 and ImageNet. The basic methodology used by EMFN filters to learn a non-linear approximation of the activation function is discussed in Sect. 4.2. Different basis functions can be used in place of EMFN filters to empirically find the most suitable one. A proper model setup must be designed for every basis. Further training routines, such as applying dropout and different regularization techniques on the activation coefficients, can be proposed to achieve faster optimization of neural networks. The chapter focuses on only one block of DNNs, viz. activation functions. One can always extend the concept of learning to other non-linear components present in neural networks.
References
1. Bengio, Y., Courville, A., Vincent, P.: Unsupervised feature learning and deep learning: a review and new perspectives. In: CoRR abs/1206.5538 (2012). arXiv:1206.5538. http://arxiv.org/abs/1206.5538
2. Zhang, G.P.: Neural networks for classification: a survey . IEEE Trans. Syst. Man Cybern. Part
C Appl. Rev. 30(4), 451–462 (2000). ISSN 1094-6977. https://doi.org/10.1109/5326.897072
3. Tian, G.P., Pan, L.: Predicting short-term traffic flow by long short-term memory recurrent
neural network. In: 2015 IEEE International Conference on Smart City/SocialCom/SustainCom
(SmartCity), pp. 153–158 (2015). https://doi.org/10.1109/SmartCity.2015.63
4. Wiki. Activation Potential | Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/
Action_potential (2018). Accessed 31 Dec 2018
5. Stanford CS231n—Convolutional neural networks for visual recognition. http://cs231n.github.
io/neural-networks-1/. Accessed 01 May 2019
6. London, M., Hausser, M.: Dendritic computation. Annu. Rev. Neurosci. 28(1), 503–532 (2005).
https://doi.org/10.1146/annurev.neuro.28.061604.135703
7. Wiki. Activation Function | Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/
Activation_function (2018). Accessed 31 Dec 2018
8. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Statist. 22(3), 400–
407 (1951). https://doi.org/10.1214/aoms/1177729586
9. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Proceedings of
the 4th International Conference on Neural Information Processing Systems, NIPS’91, pp.
950–957. Morgan Kaufmann Publishers Inc., Denver, Colorado. http://dl.acm.org/citation.
cfm?id=2986916.2987033 (1991). ISBN 1-55860-222-4
10. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc.
Ser. B 67, 301–320 (2005)
11. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing
internal covariate shift. In: CoRR abs/1502.03167 (2015). arXiv:1502.03167. http://arxiv.org/
abs/1502.03167
12. Autoencoders. https://en.wikipedia.org/wiki/Autoencoder. Accessed 05 Sept 2019
13. Saxe, A.M., Mcclelland, J., Ganguli, G.: Exact solutions to the nonlinear dynamics of learning
in deep linear neural networks (2013)
14. Arora, S., et al.: A convergence analysis of gradient descent for deep linear neural networks.
In: CoRR abs/1810.02281 (2018). arXiv:1810.02281. http://arxiv.org/abs/1810.02281
15. Toms, D.J.: Training binary node feedforward neural networks by back propagation of error.
Electron. Lett. 26(21), 1745–1746 (1990)
16. Muselli, M.: On sequential construction of binary neural networks. IEEE Trans. Neural Netw.
6(3), 678–690 (1995)
17. Ito, Y.: Representation of functions by superpositions of a step or sigmoid function and their
applications to neural network theory. Neural Netw. 4(3), 385–394 (1991)
18. Kwan, H.K.: Simple sigmoid-like activation function suitable for digital hardware implemen-
tation. Electron. Lett. 28(15), 1379–1380 (1992)
19. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics.
Springer, Berlin (2006). ISBN 0387310738
20. Parkes, E.J., Duffy, B.R.: An automated tanh-function method for finding solitary wave solu-
tions to non-linear evolution equations. Comput. Phys. Commun. 98(3), 288–300 (1996)
21. LeCun, Y., et al.: Efficient backprop In: Neural Networks: Tricks of the Trade, This Book is an
Outgrowth of a 1996 NIPS Workshop, pp. 9–50. Springer, Berlin (1998). ISBN: 3-540-65311-2.
http://dl.acm.org/citation.cfm?id=645754.668382
22. Pascanu, R., Mikolov, R., Bengio, Y.: Understanding the exploding gradient problem. In: CoRR
abs/1211.5063 (2012). arXiv:1211.5063. http://arxiv.org/abs/1211.5063
23. Hahnloser, R.H.R., et al.: Digital selection and analogue amplification coexist in a cortex-
inspired silicon circuit. Nature 405, 947 (2000). https://doi.org/10.1038/35016072
24. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Pro-
ceedings of the 27th International Conference on International Conference on Machine Learn-
ing. ICML’10, pp. 807–814. Omnipress, Haifa, Israel (2010). ISBN 978-1-60558-907-7. http://
dl.acm.org/citation.cfm?id=3104322.3104425
25. Maas, A.L.: Rectifier nonlinearities improve neural network acoustic models. In: ICML, vol.
30 (2013)
26. He, K., et al.: Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp.
1026–1034 (2015)
27. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by
exponential linear units (elus). In: arXiv preprint (2015). arXiv:1511.07289
28. Klambauer, G., et al.: Self-normalizing neural networks. In: Advances in Neural Information
Processing Systems, pp. 971–980 (2017)
29. Goodfellow, I., et al.: Maxout networks. In: Dasgupta, S., McAlleste, D. (eds.) Proceedings of
the 30th International Conference on Machine Learning, vol. 28. Proceedings of Machine
Learning Research 3. Atlanta, Georgia, USA: PMLR, June 2013, pp. 1319–1327. http://
proceedings.mlr.press/v28/goodfellow13.html
30. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions (2018)
31. He, K., et al.: Identity mappings in Deep residual networks. In: CoRR abs/1603.05027 (2016).
arXiv:1603.05027. http://arxiv.org/abs/1603.05027
32. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: CoRR abs/1605.07146 (2016).
arXiv:1605.07146. http://arxiv.org/abs/1605.07146
33. Huang, G., et al.: Densely connected convolutional networks. In: 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269 (2017). https://doi.org/10.
1109/CVPR.2017.243.
34. Yu, C.C., Tang, Y.C., Liu, B.D.: An adaptive activation function for multilayer feedforward
neural networks. In: 2002 IEEE Region 10 Conference on Computers, Communications, Con-
trol and Power Engineering. TENCOM ’02. Proceedings, vol. 1, pp. 645–650 (2002). https://
doi.org/10.1109/TENCON.2002.1181357.
35. Qian, S., et al.: Adaptive activation functions in convolutional neural networks. Neurocomput
272, 204–212 (2018). ISSN 0925-2312. https://doi.org/10.1016/j.neucom.2017.06.070.
36. Hou, L., et al.: ConvNets with smooth adaptive activation functions for regression. In: Singh,
A., Zhu, J. (eds.) Proceedings of the 20th International Conference on Artificial Intelligence
and Statistics. vol. 54. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA:
PMLR, pp. 430–439 (2017). http://proceedings.mlr.press/v54/hou17a.html
37. Agostinelli, F., et al.: Learning activation functions to improve deep neural networks. In: CoRR
abs/1412.6830 (2014). arXiv:1412.6830. http://arxiv.org/abs/1412.6830
38. Lin, M., Chen Q., Yan, S.: Network in network. In: 2nd International Conference on Learning
Representations, ICLR 2014, Banff, AB, Canada, 14–16, 2014 Conference Track Proceedings,
(2014). http://arxiv.org/abs/1312.4400
39. Basis function. June 2018. https://en.wikipedia.org/wiki/Basis_function
40. Carini, A., Sicuranza, G.L.: Even mirror Fourier nonlinear filters. In: IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5608–5612 (2013)
41. Loshchilov, L., Hutter, F.: Fixing weight decay regularization in adam. In: CoRR
abs/1711.05101 (2017). arXiv:1711.05101. http://arxiv.org/abs/1711.05101
42. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Con-
ference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Con-
ference Track Proceedings, (2015). http://arxiv.org/abs/1412.6980
43. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic
optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). ISSN 1532-4435. http://dl.acm.org/
citation.cfm?id=1953048.2021068
44. Zeiler, M.D.: ADADELTA: An adaptive learning rate method. In: CoRR abs/1212.5701 (2012).
arXiv:1212.5701. http://arxiv.org/abs/1212.5701
45. Krizhevsky, A., Sutskever, I., Hinton, G.E.: imagenet classification with deep convolutional
neural networks. In: Pereira, F., et al. (eds.) Advances in Neural Information Processing Sys-
tems 25. Curran Associates, Inc., pp. 1097–1105 (2012). http://papers.nips.cc/paper/4824-
imagenet-classification-with-deep-convolutional-neural-networks.pdf
Adversarial Examples in Deep Neural
Networks: An Overview
1 Introduction
Artificial intelligence is on the rise and Deep Neural Networks (DNNs) are an impor-
tant part of it. Whether it is in speech analysis [29] or visual tasks [26, 34, 58, 68],
they shine with a performance beyond what was imagined a decade ago. Their suc-
cess is undeniable, nevertheless a flaw has been spotted in their performance. They
are not stable under adversarial perturbations [69]. Adversarial perturbations are intentionally designed, worst-case noise that aims at changing the output of a DNN to an incorrect one. The perturbations are most of the time so small that an ordinary observer may not even notice them, and yet even state-of-the-art DNNs are highly confident in their wrong classification of these adversarial examples. This phenomenon is depicted in Fig. 1, borrowed from [24], where a subtle adversarial perturbation is
able to change the classification outcome. Robustness to adversarial perturbations
is different from robustness to random noise [19], a trait that can be achieved by
DNNs. The existence of adversarial perturbations was known for machine learning
algorithms [9], however, they were first noticed in deep learning research in [69].
These discoveries generated interest among researchers to understand the instability
of DNNs, to explore various attacks and devise multiple defenses. Although it is
very difficult to keep up with the pace of results in this area, there are many excel-
lent surveys on the topic. For instance, the surveys [1, 78] cover many interesting
instances for which adversarial examples exist. In this chapter, we overview as well
some of the most important findings regarding adversarial examples for DNNs. How-
ever, we adopt a different approach. Instead of creating a catalog of existing attacks
and defenses, we present an adequately general framework which can recover many
existing attacks. Theoretical findings regarding the nature of adversarial examples
are additionally addressed. In this light, we address three problems in this chapter,
namely, adversarial attacks, their theoretical explanation and adversarial defenses.
The first question is about generating adversarial examples and designing attacks.
This is discussed in the first part of this chapter. Historically these examples were first
found for classification tasks and were based on first order approximations of DNNs.
These methods require knowledge of model parameters and are therefore sometimes
called white-box attacks. We overview some of the most important attacks including
Fig. 1 A demonstration from [24] of adversarial examples generated using the FGSM. By adding
an imperceptibly small vector, we can change GoogLeNet’s classification of the image
iterative and non-iterative methods, as well as single and multiple pixel attacks.
Instead of listing different attacks, our goal is to present a unifying framework for
generating adversarial examples. The framework, which goes beyond classification
problems, is based on a convex optimization formulation of adversarial input gener-
ation. We overview, furthermore, black-box attacks where only partial knowledge of
model parameters is available for generating adversarial examples. Universal adver-
sarial perturbations and the transferability of adversarial examples are other topics
discussed in this part.
The second question is about the nature of adversarial examples. Why are DNNs
and other machine learning models vulnerable to adversarial examples? In the sec-
ond part, we overview some of the attempts to investigate theoretically this ques-
tion. In many works, the adversarial vulnerability is attributed to some properties
of machine learning models. Some examples are linearity of models, curvature of decision boundaries of classifiers, and low $\ell_1$-norm of weight matrices. After review-
ing some of these theories, we discuss statistical learning theoretic approaches that
explore the relation between adversarial robustness and generalization capabilities
of machine learning models. Out of this study come new guidelines for designing
adversarially robust algorithms, which brings us to the third question of this chapter.
How can we design effective defenses against adversarial examples?
The defenses take up different approaches from modifying the training process
by changing the training set to adding new regularizations or considering new DNN
architectures or a combination of preceding approaches. Some of the most recent
contributions in this direction are discussed in the last part.
We introduce first the notation used in this chapter and some of the basic definitions
needed throughout this chapter. The letters x, y, . . . are used for vectors, A, B, . . .
for matrices and X , Y, . . . for sets. We denote the set {1, . . . , n} by [n] for n ∈ N.
For any vector $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ and $p \in \mathbb{N}$, the $\ell_p$-norm of $x$ is defined by
$$\|x\|_p := \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}.$$
When p tends to zero, the above definition converges to the number of non-zero entries of the vector. This is called, with an abuse of terminology, the $\ell_0$-norm. The explicit definition is given as
$$\|x\|_0 := \sum_{i=1}^{n} \mathbb{1}(x_i \neq 0).$$
The $\ell_0$-norm gives the sparsity order of the vector x. The $\ell_\infty$-norm of a vector x is obtained when $p \to \infty$. It is defined as
$$\|x\|_\infty := \max_{i \in [n]} |x_i|.$$
The Schatten p-norm of a matrix X is equal to the $\ell_p$-norm of the singular value vector $(\sigma_1, \ldots, \sigma_{\min(m,n)})$ of X, namely:
$$\|X\|_p := \Big(\sum_{i=1}^{\min\{m,n\}} \sigma_i^{p}\Big)^{1/p}.$$
The Frobenius norm of X is equivalent to the $\ell_2$-norm of the singular value vector. The $\ell_1$-norm of the singular value vector is called the nuclear norm. The $\ell_0$-norm is similarly defined and gives the rank of X.
Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$ given by $f(x) = (f_1(x), \ldots, f_m(x))$ for m functions $f_i: \mathbb{R}^n \to \mathbb{R}$. The Jacobian of f at x is denoted by $J_f(x)$ and defined as
$$J_f(x) := \Big[\frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x)\Big] = \Big(\frac{\partial f_i}{\partial x_j}(x)\Big)_{i \in [m],\, j \in [n]}.$$
black-box attacks, which require no information about the target neural network, see
for instance [62].
In the pioneering work of [69], the attack is based on finding adversarial perturbations that maximize the prediction error at the output. The perturbations are approximated by minimizing the $\ell_2$-norm of the perturbation. If the multi-class classifier mapping is defined by $f: \mathbb{R}^n \to [K]$, Szegedy et al. [69] minimize the $\ell_2$-norm of the perturbation η such that the classifier output is changed to the target label $l \in [K]$, i.e., $f(x + \eta) = l$. The perturbed input $x + \eta$ is constrained to lie inside the box $[0, 1]^n$. The adversarial example is obtained by adding the perturbation η to the input vector x. In the next attack, the FGSM of [24], the sign of the gradient of the cost function is used for designing perturbations, which are scaled to have bounded $\ell_\infty$-norm and therefore to be almost undetectable. If the cost function used for training is given by c(x), the perturbation is given by $\eta = \epsilon\,\mathrm{sign}(\nabla_x c(x))$. The $\ell_\infty$-norm of the perturbation is $\epsilon$. An example of the FGSM is shown in Fig. 1. Iterative procedures or random-
izations can significantly strengthen adversarial attacks. An iterative linearization
of the DNN is proposed in the algorithm DeepFool [47] to generate minimal p -
norm perturbations for p > 1. The iterative approach continues to add perturbations
with bounded p -norm until the classifier’s output is altered. An iterative version
of FGSM, called Basic Iterative Method (BIM) is proposed in [35]. The Projected
Gradient Descent (PGD) attack is an extension of previous techniques, proposed in
[43], where randomness is additionally introduced in the computation of adversarial
perturbations. The PGD attack can bypass many defenses and is employed in [43] to
devise a defense against adversarial examples. An iterative algorithm based on PGD
combined with randomization is introduced in [5] and has been used to dismantle
many defenses so far [4]. Another popular way of generating adversarial examples is
by constraining the 0 -norm of the perturbation. Manipulating only few entries, these
types of attacks are known as single pixel attacks [66] and multiple pixel attacks [52].
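To make the FGSM concrete, a minimal sketch is given below. The function grad_loss, standing for the gradient ∇c(x) of the training cost with respect to the input, and the clipping box are placeholders that the reader must supply (for example via automatic differentiation in a deep learning framework); this is an illustration of the formula η = ε sign(∇c(x)), not a reference implementation from [24].

```python
import numpy as np

def fgsm_attack(x, grad_loss, epsilon, lower=0.0, upper=1.0):
    """Fast Gradient Sign Method sketch: eta = epsilon * sign(gradient of the cost at x).
    The perturbation has l_inf-norm epsilon; the result is clipped to the valid input box."""
    eta = epsilon * np.sign(grad_loss(x))
    return np.clip(x + eta, lower, upper)

# usage sketch (grad_loss must return the gradient of the cost c(x) w.r.t. the input x):
# x_adv = fgsm_attack(x, grad_loss=my_input_gradient, epsilon=0.1)
```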
In what follows, to generate adversarial examples, we provide a unifying frame-
work that incorporates the above techniques. The main ingredient of this framework
is perturbation analysis. Given a classifier function, the perturbation analysis of this
function quantifies how much its output is perturbed when a known perturbation is
applied to its input. An approximation of this output error is usually obtained using
a first-order Taylor approximation of the function, under the assumption that the
input perturbations are of small norms. Adversarial examples suitably fall into this
framework, as they are perturbed versions of original inputs, the perturbations are
small and the function at hand comes naturally from the model. Consider, for exam-
ple, the FGSM given in [24]. The proposed attack aims at maximizing the training
loss function, which is approximated by its first-order Taylor expansion. Similarly,
the authors of [24, 47] constructed adversarial examples by maximizing the error of a
relevant function caused by input perturbations. Iterative
methods like DeepFool [47], the BIM [35], the PGD method [43], and the
gradient-based norm-constrained method (GNM) [8] maximize the output perturbation
using successive first-order approximations. A summary of the connections
and differences between these methods is provided in [8]. It is based on this frame-
work that we formulate the problem of generating adversarial examples in this
section.
Let us first fix the terminology used in this section. The input of classifiers is
denoted by x. Then, adversarial examples are constructed by adding an adversarial
perturbation η, of the same dimension as x, to that input. For a multi-class classi-
fication with K classes, a classifier maps inputs to the discrete set of labels [K ].
Classifiers modeled by DNNs usually base their decision on a set of functions,
often differentiable, known as score functions. These functions can replace the non-differentiable
classification function, for which a first-order Taylor approximation is
not possible because the gradients are not properly defined. The score functions and
classification functions are defined below.
Definition 1 (Score functions and classifier functions) A classifier is defined by the
mapping k : R^M → [K] that maps an input x ∈ R^M to its estimated class k(x) ∈ [K].
The mapping k(·) is itself defined by

k(x) := argmax_{l∈[K]} f_l(x),

where f_l(x) : R^M → R represents the probability of belonging to class l. The function
f(x), given by the vector (f_1(x), . . . , f_K(x))^T, is known as the score function and can be
assumed to be differentiable almost everywhere for many classifiers.
Finding adversarial examples amounts to finding a perturbation that changes the
classifier's output. However, to be imperceptible, such adversarial perturbations
should not modify the inputs significantly. The undetectability of adversarial
examples can be better understood using image classification tasks as an example.
For instance, in Fig. 1 we observe that the human eye cannot distinguish between
the original and the adversarial image. A common way to impose this restriction is to
constrain the adversarial perturbation to belong to a certain set of unnoticeable perturbations.
For example, the authors of the FGSM bounded the ℓ∞-norm of their
perturbation, while in the DeepFool method the norm is incrementally increased until
the classifier output changes. Note that DeepFool may produce perceptible perturbations,
while the FGSM may not fool the classifier.
Another way of imposing undetectability of adversarial examples is to impose
on the input perturbation to preserve the outcome of the ground truth classifier [74],
also known as oracle classifier. In many applications, the oracle classifier refers to
the human brain. Similar to Definition 1, denote the score function of the oracle
classifier by g : R^M → R^K, which outputs a vector with entries g_l : R^M → R for
l = 1, . . . , K. The adversarial perturbation η is said to be undetectable if

g_{k(x)}(x + η) − max_{l≠k(x)} g_l(x + η) > 0,

i.e., the oracle still assigns the original class to the perturbed input. Using this notion,
the problem of finding adversarial examples amounts to the following.
Find:  η
s.t.   L_f(x, η) := f_{k(x)}(x + η) − max_{l≠k(x)} f_l(x + η) < 0,
       L_g(x, η) := g_{k(x)}(x + η) − max_{l≠k(x)} g_l(x + η) > 0.        (3)
However, since the oracle classifier is usually unknown, this problem is not
interesting for practical purposes. To overcome this issue, it is shown in forthcom-
ing sections how the solution of this problem can be approximated by tractable
relaxations.
The white-box setting corresponds to the scenario when the classification function
f (·) and input x are both known to the attacker. Thus, adversarial perturbations are
designed with full knowledge of the target system.
Non-iterative Methods. As discussed above, the oracle constraint in (3) cannot be evaluated in practice,
since the oracle classifier is not available. To address this problem, such constraints
are approximated by restricting the set of possible adversarial perturbations
to a known subset. The most common choice is to restrict η to the set of vectors
with bounded ℓp-norm for p ≥ 1. The values of p are restricted to p ≥ 1 so that the set
{η : ‖η‖_p ≤ ε} is convex for any ε > 0. Note that the choice of p determines the
structure of the obtained adversarial examples. The case p = ∞ has been the focus
of research in recent years. Even after replacing the oracle constraint in (3) with
a convex one, the problem remains non-convex. For white-box attacks, a further relaxation
can be carried out by approximating L_f(x, ·) with its first-order Taylor expansion.
This is possible since we assume full knowledge of x and the function f(·).
To that end, the first-order Taylor expansion of L_f(x, ·) around 0 leads to

L_f(x, η) = L_f(x, 0) + η^T ∇η L_f(x, 0) + O(‖η‖_2^2),

where O(‖η‖_2^2) contains higher order terms. Therefore, by replacing the oracle
constraint in (3) with ‖η‖_p ≤ ε, for sufficiently small ε ∈ R_+, we get

Find:  η
s.t.   L_f(x, 0) + η^T ∇η L_f(x, 0) < 0,   ‖η‖_p ≤ ε,        (4)
which is a relaxed version of the problem stated in (3). This formulation can be used
to recover well-known adversarial attacks from the literature, as will be discussed in
detail in Sect. 2.1. Nevertheless, the following
theorem shows that this problem is not always feasible.
Theorem 1 The optimization problem (4) is not feasible if, for q = p/(p − 1),

ε < L_f(x, 0) / ‖∇η L_f(x, 0)‖_q.

The proof can be obtained from the results in [27], as well as in [7].
The theorem points to the insight that there might be no perturbation that is
small enough and yet changes the output label. This matches the intuition
that one should not expect to fool a classifier under an arbitrarily small bound ε on the
perturbation's norm. The result suggests that a feasible problem can be
obtained if we impose only one of the constraints while trying to preserve the other
one as much as possible. To that end, a proper objective function that penalizes the
deviation from the dropped constraint is minimized. This gives rise to the following
two problems, as feasible counterparts of (4).
First, the norm constraint of (4) is kept, resulting in the following optimization
problem, called GNM in [7]:

min_η  L_f(x, 0) + η^T ∇η L_f(x, 0)   s.t.  ‖η‖_p ≤ ε.        (6)

Using this approach we can find the best possible perturbation under the norm
constraint. However, a proper value for ε must be chosen beforehand to guarantee
that the perturbations remain unnoticed. Moreover, this problem has a closed-form
solution which can be computed efficiently, as stated in the following theorem.
Theorem 2 If ∇η L_f(x, η) = ( ∂L_f(x, η)/∂η_1 , . . . , ∂L_f(x, η)/∂η_M )^T, the minimizer of problem (6) is given in closed form by

η = − (ε / ‖∇η L_f(x, 0)‖_q^{q−1}) · sign(∇η L_f(x, 0)) ⊙ |∇η L_f(x, 0)|^{q−1}        (7)

for q = p/(p − 1), where sign(·) and | · |^{q−1} are applied element-wise, and ⊙ denotes the
element-wise (Hadamard) product. In particular, for p = ∞ we have q = 1 and the
solution reduces to

η = −ε sign(∇η L_f(x, 0)).
The proof can be found in [7]. One advantage of using (6), besides having a closed-
form solution, is that additional constraints on the perturbation can be added to the
problem. In addition, the solution shown in (7) can be reused for other choices of
L_f(x, ·), which can be more suitable depending on the scenario. For instance, the
FGSM chooses L_f(x, ·) to be the negative of the loss function used for training,
which is often the cross-entropy loss in classification problems. Then, minimizing
L_f(x, ·) corresponds to maximizing the loss. A caveat is that using problem (6)
ensures perturbations with bounded norms, but such perturbations may not be able
to fool the classifier.
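A direct implementation of the closed-form solution (7) could look as follows; grad stands for ∇η L_f(x, 0) and must be supplied by the reader, and the small constant added to the denominator is only a numerical safeguard of ours.

```python
import numpy as np

def gnm_perturbation(grad, epsilon, p):
    """Closed-form minimizer of the linearized objective under ||eta||_p <= epsilon, cf. (7).
    grad is the gradient of L_f(x, .) at eta = 0; p must be > 1 (or np.inf)."""
    grad = np.asarray(grad, dtype=float)
    if np.isinf(p):                                   # q = 1: scaled sign of the gradient
        return -epsilon * np.sign(grad)
    q = p / (p - 1.0)                                 # dual exponent
    g = np.abs(grad)
    norm_q = np.sum(g ** q) ** (1.0 / q)
    return -epsilon * np.sign(grad) * g ** (q - 1) / (norm_q ** (q - 1) + 1e-12)
```

By construction, the returned vector has ℓp-norm (approximately) equal to ε, so the budget in (6) is met with equality.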
A second approach for relaxing (4) into a feasible problem is to keep the constraint
regarding L_f(x, ·) and minimize over the norm of η. Therefore, problem (4)
is replaced by

min_η  ‖η‖_p   s.t.  L_f(x, 0) + η^T ∇η L_f(x, 0) < 0.        (9)

This approach is used by [47] in every iteration of the DeepFool algorithm (more
details in Sect. 2.1). Similarly to (6), this problem has a closed-form solution as well,
which is given in the following theorem.
Theorem 3 If ∇η L_f(x, η) = ( ∂L_f(x, η)/∂η_1 , . . . , ∂L_f(x, η)/∂η_M )^T, the closed-form solution to problem (9) is given by

η = − (L_f(x, 0) / ‖∇η L_f(x, 0)‖_q^q) · sign(∇η L_f(x, 0)) ⊙ |∇η L_f(x, 0)|^{q−1}        (10)

for q = p/(p − 1).
Observe that the perturbation from Theorem 3, similar to the solution in
Theorem 2, is nothing but an adjusted version of the gradient of the classifier with respect to
a different norm. The perturbation in (10) might grow unboundedly to ensure that the
classifier is misled, which makes it perceptible to the oracle. There are other similar
methods for computing adversarial examples that depend on a first-order approximation
of other performance-related functions; these algorithms are later shown to
be slight variations of the methods presented in this section. Furthermore, using the
present formulation, we can build iterative procedures by repeating the optimization
until the classifier output changes. In Sect. 2.1, we compare different
methods, which are formulated as iterative versions of (6) and (9).
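Such iterative procedures reduce, schematically, to a loop that keeps adding a norm-bounded step until the predicted label changes; classify and perturb_step (e.g., the closed-form step from the previous sketch) are placeholders, so the following is only a schematic of BIM/DeepFool-style iterations rather than a faithful reimplementation of either.

```python
def iterative_attack(x, classify, perturb_step, max_iter=20):
    """Keep adding norm-bounded steps (e.g., the closed-form step from the previous sketch)
    until the predicted label changes or the iteration budget is exhausted."""
    label = classify(x)
    x_adv = x.copy()
    for _ in range(max_iter):
        x_adv = x_adv + perturb_step(x_adv)
        if classify(x_adv) != label:
            break
    return x_adv
```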
There are also methods that rely on adding randomness to the generation process.
The PGD attack from [43] is one well-known example. For the PGD attack, the first-order
approximation is taken not around η = 0 but around a random point
η̃ with its norm bounded by some ε̃, that is, ‖η̃‖_p ≤ ε̃. In short, the objective
function L_f(x, ·) is approximated by its linear counterpart around the point η̃, which
lies within an ε̃-radius of η = 0. The distribution of η̃ can be chosen arbitrarily as
long as the norm constraint is not violated; a common choice is the uniform
distribution over the set of vectors with bounded ℓp-norm. We refer to this technique
as dithering. Moreover, displacing the center of the first-order approximation from
0 to η̃ does not lead to solutions which differ from the ones given so far. This is true
since L_f(x, η) ≈ L_f(x, η̃) + (η − η̃)^T ∇η L_f(x, η̃) leads to the following problem
where 1(·) denotes the indicator function. Hence, the norm ‖η‖_{0,S} counts the number
of subsets altered by an attacker. Moreover, we can guarantee that only one subset
stays active by including this as an additional constraint in (3), yielding problem (12).
As a remark, the mixed norm ‖·‖_{0,S} is extensively used in signal processing and
compressed sensing to promote group sparsity [57]. In a similar manner as in Sect. 2.1,
we employ the approximation L_f(x, η) ≈ L_f(x, η̃) + (η − η̃)^T ∇η L_f(x, η̃), which
yields a linear programming formulation of (12). For each subset s, the optimal perturbation
supported on that subset takes the form

η_s = −ε ∑_{z=1}^{Z} sign( (∇η L_f(x, η̃))_{i_s^z} ) e_{i_s^z},

which implies that ∇η L_f(x, η̃)^T η_s = −ε ∑_{z=1}^{Z} |(∇η L_f(x, η̃))_{i_s^z}|. The problem then
has the closed-form solution

η* = η_{s*},   with   s* = argmax_s ∑_{z=1}^{Z} |(∇η L_f(x, η̃))_{i_s^z}|.        (14)
leads to a targeted attack, that is, an attack in which the objective is to apply perturbations such
that the outcome of classification is always some "target" class l̄.
As we can see, different configurations for Algorithm 1 lead to known adversarial
attacks from the literature. A summary of Algorithm 1 configurations with their
corresponding attacks from the literature is presented in Table 2. In classification,
these methods are usually compared using the fooling ratio, that is, the percentage
of correctly classified inputs that are misclassified once adversarial perturbations
are added. Visualizing the fooling ratio for different values of ε is often used to
empirically assess the performance of an adversarial attack. For example, in Fig. 2 we
observe the fooling ratio of different attacks on standard DNNs (not trained to resist
adversarial attacks).
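The fooling ratio itself is straightforward to estimate; the sketch below assumes a generic attack(x) routine, such as the ones above, and counts how many correctly classified inputs change label after the perturbation is added.

```python
def fooling_ratio(inputs, labels, classify, attack):
    """Fraction of correctly classified inputs that are misclassified after the attack."""
    fooled, correct = 0, 0
    for x, y in zip(inputs, labels):
        if classify(x) != y:
            continue                       # only correctly classified inputs are counted
        correct += 1
        if classify(attack(x)) != y:
            fooled += 1
    return fooled / max(correct, 1)
```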
Fig. 2 Fooling ratio, from [7], of different adversarial attacks on vanilla DNNs on the MNIST
dataset: a 5-layered LeNet architecture from [37]; b DenseNet architecture from [30] with 40 layers

Fig. 3 Adversarial examples for regression [8]: a MNIST autoencoder; b STL-10 colorization
network

Regression Problems and Other Learning Tasks. The objective

L_f(x, η) = −‖f(x + η) − f(x)‖_2

makes extending these attacks to regression problems possible. Two examples, using autoencoders and colorization2 DNNs, are provided
in Fig. 3. In that figure, we observe how adversarial perturbations heavily distort
the outcome of regression. Using the principles explicated in this section, other
algorithms have been developed for attacking other types of learning systems. In
the field of computer vision, [28] constructed an attack on image segmentation,
while [76] designed attacks for object detection. The Houdini attack [12] aims at
distorting speech recognition systems. In addition, [53] tailored an attack for recurrent
neural networks, and [40] for reinforcement learning. Adversarial examples exist for
probabilistic methods as well. For instance, [33] showed the existence of adversarial
examples for generative models. For regression problems, [70] designed an attack
that specifically targets variational autoencoders.
Robustness metrics. Going back to the definitions of Sect. 2.1, Theorem 1 shows
that, given a vector x and a score function f(·), an adversarial perturbation must
have ℓp-norm of at least L_f(x, 0)/‖∇η L_f(x, 0)‖_q in order to fool the linearized version of f(·). In
other words, if the ratio L_f(x, 0)/‖∇η L_f(x, 0)‖_q is small, then it is easier to fool the network
with ℓp-attacks. In that sense, Theorem 1 provides an insight into the stability of
classifiers. Therefore, regularizing the loss function with L_f(x, 0)/‖∇η L_f(x, 0)‖_q may lead to
adversarial robustness. Moreover, one can also include dithering by regularizing
with L_f(x, η̃)/‖∇η L_f(x, η̃)‖_q for some randomly chosen η̃.

2 A colorization model predicts the color values for every pixel in a given gray-scale image.

Table 3 Experiment from [7] showing the robustness measures for different DNNs on the MNIST
and CIFAR-10 datasets. The acronym FCNN denotes a standard fully connected neural network,
while NIN refers to the network-in-network architecture from [39]. LeNet-5 and DenseNet are the
same architectures used in Fig. 2

                        Test error (%)   ρ̂1(f) [47]   ρ̂2(f) [7]   Fooled >99%
FCNN (MNIST)            1.7              0.036         0.034        ε = 0.076
LeNet-5 (MNIST)         0.9              0.077         0.061        ε = 0.164
NIN (CIFAR-10)          13.8             0.012         0.004        ε = 0.018
DenseNet (CIFAR-10)     5.2              0.006         0.002        ε = 0.010
In [47], the authors suggest that the robustness of a classifier can be measured as

ρ̂1(f) = (1/|D|) ∑_{x∈D} ‖r̂(x)‖_p / ‖x‖_p ,

where D denotes the test set and r̂(x) is the minimum perturbation required to change
the classifier's output. Proposition 1 suggests that one can also use the following as
a measure of robustness:

ρ̂2(f) = (1/|D|) ∑_{x∈D} L_f(x, 0) / ‖∇η L_f(x, 0)‖_q .
The lower ρ̂2(f), the easier it is to fool the classifier, and therefore the less robust it
is to adversarial examples. According to the experiments in [7], shown in Table 3,
these two robustness metrics appear coherent when measuring the robustness of
non-adversarially trained DNNs.
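The metric ρ̂2(f) can be estimated with a simple average over the test set; in the sketch below, L_at_zero and grad_L stand for L_f(x, 0) and ∇η L_f(x, 0) and must be provided by the reader's model, while the small constant in the denominator is only a numerical safeguard of ours.

```python
import numpy as np

def rho2_metric(test_set, L_at_zero, grad_L, q):
    """Average of L_f(x, 0) / ||grad_eta L_f(x, 0)||_q over the test set (estimate of rho_2(f))."""
    ratios = []
    for x in test_set:
        g = np.abs(grad_L(x))
        norm_q = np.sum(g ** q) ** (1.0 / q)
        ratios.append(L_at_zero(x) / (norm_q + 1e-12))
    return float(np.mean(ratios))
```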
So far we have assumed that the adversarial attacker has perfect knowledge of the tar-
get classifier function f(·) as well as of the input x. By loosening these requirements
into more realistic assumptions, new types of algorithms arise, namely
– black-box attacks: these methods correspond to the settings where the target
classifier f (·) is unknown but the input x may still be known to the attacker,
– universal adversarial perturbations: these perturbations are designed to work
regardless of the input x, which is assumed to be unknown. Nevertheless, the
classifier f (·) may be available to the attacker.
If both the target model f(·) and the input x are unknown to the attacker, the adversarial
attack needs to be a black-box attack as well as a universal adversarial perturbation.
These types of attacks are still possible by assuming partial or indirect knowl-
edge about the input x and the classifier f (·). For example, the attacker may have
access to a set of independent realizations {x1 , x2 , . . . } of the input, which provides
knowledge about the input distribution Px . Similarly, implicit information about
the classifier can be inferred by observing the independent realizations of the pairs
{(x1 , f (x1 )), (x2 , f (x2 )), . . . }. Finally, we may also have knowledge about the struc-
ture (number of layers, types of connections, activation functions, etc.) of the used
classifier.
It is perhaps unexpected that some adversarial perturbations produce the same
effect over different inputs and different DNN architectures, although they are generated
for a particular model. Such universal adversarial perturbations are reported
in [24, 57], where the authors show the existence of such perturbations for various
datasets and DNNs. This phenomenon suggests that there exist certain common
properties shared by adversarial perturbations that account for most of the success
when attacking a system. This can explain why certain perturbations are able to simultaneously
fool a target DNN on different inputs. Adversarial examples are indeed
transferable. In [72] the authors construct an attack such that adversarial examples
transfer from one random instance of a neural network to another. Surprisingly, these
methods proved to be effective against well-known DNNs. Since no explicit
knowledge about the DNN weights is required to compute these perturbations, they
can be thought of as black-box attacks. Moreover, the authors showed that including
such black-box adversarial examples in the training set significantly enhances the
robustness of neural networks. Finally, the authors in [45] showed that there exist
adversarial examples that are both universal and black-box, that is, perturbations that
are independent of both the target DNN and the input.
Black-Box Attacks. As discussed, in the black-box setting the classifier function f(·)
is unknown, thus we cannot compute the gradient necessary for Algorithm 1. A common
approach to circumvent this issue is to estimate the gradient by choosing a substitute
model f̃ which is hoped to behave similarly to the unknown f(·). This
concept is introduced in [51] under the assumption that the input x is known, as well
as n independent realizations of (x, f(x)), denoted (x1, f(x1)), . . . , (xn, f(xn)).
This method consists of the following two steps:
1. Train a substitute model f̃ to predict f(x), so that it resembles the target classifier.
2. Perform a white-box attack on the substitute model f̃ and hope that it transfers to the
target model f(·).
This concept is later extended in [41], where the authors make use of several substi-
tute models, that is f˜1 , . . . , f˜r for r > 1. In that work, adversarial perturbations are
computed by approximately solving the following optimization for an ensemble of
loss functions:

min_η  − log( ∑_{i=1}^{r} α_i L_{f̃_i}(x, η) ) + λ‖η‖_p,

where λ > 0, p > 1, 0 < α_i < 1, ∑_i α_i = 1, and L_f(x, ·) is some positive loss
function, like the cross-entropy loss of f(·) at the point (x + η). The key idea of this
method is that a perturbation η that is able to fool the substitute classifiers f̃_1, . . . , f̃_r will most
likely fool the unknown classifier f̃_{r+1} ≡ f as well. Note that it is also possible to
generate norm-constrained versions of this method by approximately solving

min_η  − log( ∑_{i=1}^{r} α_i L_{f̃_i}(x, η) )   s.t.  ‖η‖_p ≤ ε.        (15)
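As a sketch, the ensemble objective above can be coded as a plain function of η, to be minimized with any first-order method; the substitute losses, the weights alpha and the trade-off lam are placeholders chosen by the attacker, and the small constant inside the logarithm is only a numerical safeguard of ours.

```python
import numpy as np

def ensemble_objective(eta, x, losses, alpha, lam, p):
    """Objective -log( sum_i alpha_i * L_i(x, eta) ) + lam * ||eta||_p, to be minimized over eta."""
    weighted = sum(a * L(x, eta) for a, L in zip(alpha, losses))
    norm_p = np.sum(np.abs(eta) ** p) ** (1.0 / p)
    return -np.log(weighted + 1e-12) + lam * norm_p
```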
Find:  u
s.t.   ‖u‖_p ≤ ε,
       P_x( k(x + u) ≠ k(x) ) ≥ 1 − δ.
Note that in order to approximately solve this problem one needs information about
the distribution of the input. A common assumption when designing universal pertur-
bations is that the attacker has perfect knowledge of the classifier k(·), but only partial
knowledge about Px in the form of n independent realizations Xn = {x1 , . . . , xn } of
x ∼ Px .
This problem is first approached in [45] by iteratively aggregating the perturbations
that move x1, . . . , xn to their corresponding decision boundaries. Given an input
xi, such perturbations are computed by iteratively solving (9) in the same manner as
in Algorithm 1. Then, in order to preserve the ℓp-norm constraint, these perturbations
are projected onto the ℓp-ball of radius ε. A summary of this method is shown
in Algorithm 2. In addition, some example pictures showing the effectiveness of
this algorithm are shown in Fig. 4. Note that this algorithm does not converge for an
arbitrary choice of δ, thus additional stopping criteria are needed.
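A schematic version of this aggregation-and-projection procedure, in the spirit of Algorithm 2, is sketched below; per_sample_attack stands for the routine that computes the extra perturbation for a single input (e.g., by iteratively solving (9)), the projection is shown only for p = 2 and p = ∞, and the fooling-rate stopping criterion based on δ is omitted for brevity.

```python
import numpy as np

def project_lp(v, epsilon, p):
    """Projection onto the l_p ball of radius epsilon (only p = 2 and p = inf are sketched)."""
    if np.isinf(p):
        return np.clip(v, -epsilon, epsilon)
    norm = np.linalg.norm(v)                      # p = 2
    return v if norm <= epsilon else v * (epsilon / norm)

def universal_perturbation(inputs, classify, per_sample_attack, epsilon, p, max_passes=5):
    """Aggregate per-sample perturbations and keep projecting onto the l_p ball of radius epsilon."""
    u = np.zeros_like(inputs[0])
    for _ in range(max_passes):
        for x in inputs:
            if classify(x + u) == classify(x):     # u does not yet fool this sample
                delta = per_sample_attack(x + u)   # extra perturbation for this sample
                u = project_lp(u + delta, epsilon, p)
    return u
```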
min_η  ∑_{i=1}^{n} L_f(x_i, η)   s.t.  ‖η‖_p ≤ ε,
Among various theories regarding the nature of adversarial examples, two directions
can be singled out. One line of research focuses on local properties of classifiers, for
example, decision boundaries of classifiers and their geometric properties. A notable
example is the linearity hypothesis, proposed by the authors in [24], where the exis-
tence of adversarial images is attributed to the approximate linearity of classifiers.
Another line of research tends to explain the phenomenon by means of global properties
of classifiers, such as the topological dimension of their feature spaces or the
sparsity of weight matrices in DNNs. We present some of the most important results
along these two lines.
There is, however, another theoretical question raised in the literature: several
experimental results have witnessed a seemingly opposing relation between adversarial
robustness and accuracy.
Fig. 4 The authors in [45] add a universal perturbation (center image) that is able to mislead the
classification of several images
3 We call the pair ( p, q) dual if the corresponding norms are dual. In particular 1/ p + 1/q = 1.
Consider a binary classifier f(·) with decision regions R_1 and R_{−1} and decision boundary

B = {x : f(x) = 0}.

The ℓq-curvature of B is defined as

κ_q(B) = 1/r_min ,   where
r_min = inf_{x∈B} min_{i∈{−1,1}} sup_{x_o∈R^M} { ‖x_o − x‖_q : B_q(x_o, ‖x_o − x‖_q) ⊆ R_i }
and B_q(x, ε) denotes the ℓq-ball of radius ε centered at x. In other words, r_min is
obtained by first finding, at each point x on the decision boundary, the largest radius
of ℓq-balls that contain x while being contained in one of the regions R_1 or R_{−1}. The radius is infinite
for all q ≥ 1 when the decision boundary is flat, which means that the local curvature
is equal to zero at this point. The minimum of such radii over all x, that is r_min, points
at the most curved portion of the surface. The global curvature of the surface B is
the inverse of r_min. A linear classifier yields decision boundaries with zero curvature,
and a surface with small curvature is nearly flat at most of its points. It turns
out that the ℓq-curvature κ_q(B) determines the robustness against ℓp-attacks with
(p, q) a dual pair.

4 In this section, we focus mainly on binary classification examples, assuming that the results can be
extended to the multi-class case.
It is based on this notion of curvature, and for ℓ2-attacks, that the authors in [19]
compare the robustness of classifiers to random noise and to adversarial noise and
characterize it according to the curvature of the decision boundaries.5 The random noise
is modeled as a random direction, and the robustness against random noise at x,
denoted by ρ_M(x), is defined as the minimum ℓ2-norm of a random vector required
to change the label of x. The adversarial robustness, denoted by ρ(x), is the minimum
ℓ2-norm of a perturbation specifically designed to change the label.
Theorem 4 ([19, Theorem 2]) Suppose that for a binary classifier the curvature
κ_2(B) satisfies

κ_2(B) ≤ 0.2 / ( ζ_2(δ) M ρ(x) ),

where

ζ_1(δ) = ( 1 + 2√(ln(1/δ)) + 2 ln(1/δ) )^{−1},
ζ_2(δ) = ( max{ (1/e)δ², 1 − √(2(1 − δ²)) } )^{−1}.
5 They consider semi-random noise as well, however, we restrict ourselves to simple random noise.
The linearity hypothesis is not the only available theory; other theories attribute adversarial
robustness to other features of classifiers. A simple intuition already emerged
from our discussion of linear classifiers: the ℓ∞-attacks on linear classifiers with
unit-norm parameters w change the classifier output by −ε‖w‖_1. Therefore, among
all unit-norm w's, the most robust classifiers are those with the smallest ℓ1-norm, which
are also the sparsest possible vectors.
For binary classification problems, the authors in [25] showed theoretically that
the adversarial robustness decreases when the ℓ1-norm of w increases. We introduce
some definitions before stating the theorem. Suppose that the instances and labels
(x, y) follow a distribution P_{x,y}. The adversarial robustness for this probabilistic
model is defined as

ρ_∞ = P_{x,y}[ y = sign(w^T x_adv) ],

where x_adv is the ℓ∞-perturbed instance, given by x_adv = x − εy sign(w). Let us
define μ_k for k ∈ {1, −1} as follows
Theorem 5 ([25, Theorem 3.1]) For a binary classification problem with uniformly
distributed labels, if the accuracy of a classifier is given by t then the adversarial
robustness ρ_∞ against bounded ℓ∞-attacks is upper bounded by an expression whose
denominator contains the ℓ1-norm of w.
Therefore, as the authors in [25] maintain, among those linear classifiers with a similar
discriminatory capability, those with the smallest ℓ1-norm perform better under ℓ∞-attacks:
a small ℓ1-norm implies a larger ρ_∞, which means better robustness.
The theorem, however, provides only an upper bound and, although it can characterize
the negative effect of a large ℓ1-norm on robustness, it cannot guarantee
that a small ℓ1-norm promotes robustness, nor that small ℓ1-norms necessarily
lead to a sparse w. The claim, however, seems to hold, as experimental findings
support the idea that sparsity of the weights promotes adversarial robustness.
Besides, since the difference μ_{+1} − μ_{−1} is independent of the norm of w, the inner
product w^T(μ_{+1} − μ_{−1}) scales with the norm of w. In this light, another reading
of the theorem suggests that, among all unit ℓ1-norm w's, the one with the smallest
w^T(μ_{+1} − μ_{−1}) restricts robustness the least. To summarize, a first step
toward robustness of linear classifiers is to find the smallest ℓ1-norm w for which the
inner product w^T(μ_{+1} − μ_{−1}) is high enough.
Another explanation of adversarial examples, introduced in [71], starts from the
assumption that the data lies on a low-dimensional manifold in higher dimensional
space, and many classifiers exist with similar accuracy. This is shown in Fig. 5 using a
simple example of linear manifolds and linear classifiers. We assume that the data lies
on a linear subspace and the dashed line represents the boundary of an optimal Bayes
linear classifier for the data distribution with zero error. However, the rotated versions
of this linear classifier, for example the one with the solid line as the boundary, yield
the same accuracy. The main difference between these classifiers is their robustness
to adversarial examples. If the linear boundary is tilted so that it lies close to the data
subspace, the smaller ∞ -norm perturbation can fool the classifier. This can be seen in
Fig. 5 as the ∞ -ball touching the tilted classifier is smaller that the original not-tilted
classifier. This is known under boundary tilted hypothesis. Under this hypothesis, the
adversarial vulnerability of classifiers arises from the tilted classification boundary
close to the data manifold. A further exploration of the linear classifier example can
reveal that some of the tilted boundaries can indeed improve the robustness.
Fig. 5 Adversarial robustness of tilted boundaries: a the dashed line is the ground truth linear
classifier for the data supported on X; b the solid line, a tilted boundary, yields the same risk as the
ground truth but is fooled by smaller ℓ∞-perturbations
Many classifier functions can be clearly decomposed into feature extraction and
classification parts. In [74], adversarial robustness is shown to be affected by the
feature selection part of the model. The results of [74] rely on the assumption that
there is an oracle classifier function g(x) that generates the ground truth labels.
For image classification problems, it is simply the human eye. Classifiers f(·), and in
particular g(·), are decomposed into a feature extraction part e_f(·) and a classifier
part c_f(·). The feature space of a classifier is the image of the domain set X under
the feature extraction e_f(·). Feature spaces are assumed to be metric spaces. Denote
the oracle feature space by (X_g, d_g), where d_g is the respective metric, and similarly (X_f, d_f)
for a classifier f(·).

Adversarial perturbations change neither the oracle decision, i.e., g(x) = g(x + η),
nor the extracted features:

d_g(e_g(x), e_g(x + η)) < δ.

However, the classifier is fooled, i.e., f(x) ≠ f(x + η). A classifier is called (ε, δ)-robust
if for all x, y ∈ X for which g(x) = g(y) and d_g(e_g(x), e_g(y)) < ε, it holds with
probability at least 1 − δ that f(x) = f(y).
Theorem 6 ([74, Theorems 3.2–3.4]) Let a classifier f(·) be continuous almost
everywhere and g(·) be the oracle classifier. The classifier f(·) is (ε, δ)-robust to
adversarial examples if and only if the topology of its feature space (X_f, d_f) is finer
than the topology of the oracle feature space (X_g, d_g).
As a direct corollary of the above theorem, when the two feature spaces are
Euclidean spaces of dimensions n_g and n_f, the classifier f(·) is robust if and
only if n_f < n_g. The theorem implies, first, that the selection of features and
feature spaces is crucial for adversarial robustness. Although the assumption of an
oracle function and a unique suitable feature space can be contested, the theorem
applies to any two classifiers and states that if a perturbation does not fool g(·),
then it does not fool f(·). So among classifiers, those with feature spaces of
finer topology, or of lower dimension in the Euclidean case, are favored for adversarial
robustness.
The importance of selecting proper features is addressed in other works such as
[17, 18]. A toy example is used in [17, 18] to show that linear classifiers are unable
to use the more robust features of an image for adversarially robust classification, unlike
quadratic classifiers, which are more robust in that example. For ℓ2-attacks, the authors
point out that the adversarial robustness is directly related to the so-called distinguishability
measure of the classes and to the risk of the classifier. The vulnerability
can thus be attributed to the low flexibility of classifiers compared to the difficulty
of the classification task. We state a simplified version of their theorem for
linear classifiers, using the definition of ρ(x) from Theorem 4.
Theorem 7 ([17, Theorem 4.1]) For a binary classification task with uniformly
distributed labels and ‖x‖_2 ≤ B almost everywhere, the adversarial robustness of a linear classifier
sign(w^T x) with accuracy t satisfies

E(ρ(x)) ≤ (1/2) ‖ E_{P_{x|y=+1}}(x) − E_{P_{x|y=−1}}(x) ‖_2 + 2Bt.
The distinguishability measure ‖E_{P_{x|y=+1}}(x) − E_{P_{x|y=−1}}(x)‖_2 is a property of the
classification problem and does not depend on the classifier. However, an unexpected
conclusion of the theorem is that, if the classification task is difficult in the sense that the
distinguishability measure is small, the risk of the classifier becomes dominant in
the upper bound and is inversely related to the robustness. Therefore, low-risk classifiers
have less adversarial robustness on difficult classification tasks.
The inverse connection between risk and robustness is further explored in [73] through a
binary classification example. An instance of data is given by x = (x_1, . . . , x_n, x_{n+1})^T
and is related to its label y as follows. The first entry is a Bernoulli
random variable with P(x_1 = y) = p, and the other entries x_i are normally distributed
random variables with mean ξy and unit variance. A linear classifier
with w = (0, 1/n, . . . , 1/n) can be shown to achieve more than 0.99 accuracy if
ξ = Θ(1/√M). However, this classifier achieves an adversarial accuracy of at most
0.01 under the ℓ∞-attack with ε = 2ξ. If one instead uses only the first feature
x_1 for classification, both the standard and the adversarial accuracy equal 0.7. The data
consists, on the one hand, of robust features with less accuracy and, on the other
hand, of informative but non-robust features. We might ask whether this tension can be
circumvented by a smart combination of features, so that adversarial robustness
does not come at the price of accuracy. The authors of [73] answer negatively
by stating a no-free-lunch theorem for adversarial robustness.
Theorem 8 ([73, Theorem 2.1]) Any classifier with standard accuracy at least 1 − δ
on the above problem cannot achieve adversarial accuracy greater than (p/(1 − p)) δ against
ℓ∞-bounded perturbations with ‖η‖_∞ ≥ 2ξ.
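The tension in this example is easy to reproduce numerically. The sketch below samples data from the described model with hypothetical values n = 100, p = 0.7 and ξ = 3/√n (chosen by us for illustration) and compares the standard and adversarial accuracies of the averaging classifier and of the classifier that uses only x_1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 100, 0.7, 20000            # hypothetical sizes for the toy model
xi = 3.0 / np.sqrt(n)                # mean of the informative but non-robust features
eps = 2 * xi                         # l_inf budget from the example

y = rng.choice([-1.0, 1.0], size=N)
x1 = np.where(rng.random(N) < p, y, -y)                            # robust, weakly informative
rest = rng.normal(loc=xi * y[:, None], scale=1.0, size=(N, n))     # informative, non-robust

avg_clf = lambda feats: np.sign(feats.mean(axis=1))    # classifier with w = (0, 1/n, ..., 1/n)
first_clf = np.sign                                    # classifier using only x1

print("standard accuracy:", (avg_clf(rest) == y).mean(), (first_clf(x1) == y).mean())

# The worst-case l_inf perturbation of size eps pushes every coordinate against the true label.
rest_adv = rest - eps * y[:, None]
x1_adv = x1 - eps * y
print("adversarial accuracy:", (avg_clf(rest_adv) == y).mean(), (first_clf(x1_adv) == y).mean())
# Expected outcome: roughly 0.999 vs. 0.7 standard accuracy, and roughly 0.001 vs. 0.7 under attack.
```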
There exist several types of defenses against adversarial examples, as well as subsequent
methods for bypassing them. At the time of writing this chapter, it is difficult to point
to a consensus on an effective defense against adversarial examples, with
the possible exception of adversarial training. For instance, the authors in [11] proposed
three attacks that bypass defensive distillation [54], a defense against adversarial
perturbations. Moreover, the attacks from [5] bypassed 7 out of 9 non-certified defenses of
ICLR 2018 that claimed to be white-box secure. Adversarial training, which adds
adversarial examples to the training set, is the most commonly accepted defense
against adversarial attacks. In what follows, we discuss some difficulties of adversarial
training, as well as methods that try to promote robustness merely through
regularization techniques.
to others. Using this idea, the authors in [5] provide the following conditions to
identify models that exhibit this problem.
– One-step attacks perform better than iterative attacks.
– Black-box attacks perform better than white-box attacks.
– Attacks with large do not reach 100% fooling ratio.
– Random sampling finds adversarial examples, while adversarial attacks do not.
If a model satisfies one of the above conditions, it suffers from the obfuscated gra-
dient problem. Using these guidelines, the authors identified that 7 out of 9 defenses
accepted to ICLR 2018, that were deemed to be white-box secure, suffered from this
issue. In addition, the authors fooled those defenses using customized attacks.
So far, the most successful defenses against adversarial attacks consist of adding
adversarial examples to the training set. This is known as adversarial training. Ini-
tial attempts to perform adversarial training using the FGSM proved to suffer from
obfuscated gradients: a DNN trained solely with the FGSM learns
to shatter the gradients in a close vicinity of the data samples, such that
the gradients used for the FGSM point in misleading directions. While this process
may mislead the FGSM, the model is still vulnerable to other perturbations, for instance
black-box attacks. To overcome this issue, the dithering mechanism proposed for
the PGD attack is employed in [43], along with large ε values6 during adversarial
training. This approach provides diverse sets of random adversarial examples, which
prevents DNNs from obtaining low fooling ratios simply by shattering the gradients around
the data samples. In other words, the randomness of the starting point in the PGD
attack prevents the model from overfitting to the perturbations.
From these initial findings it is concluded that diversity in the adversarial examples
used for training is necessary in order to prevent DNNs from overfitting to specific types
of perturbations. To that end, the authors of [72] include black-box perturbations in
the training set. This is carried out using substitute models, as well as ensembles
of such models, in the objective function of white-box attacks (such as PGD), as
described in Sect. 2.2.
6 In that work, the ℓ∞-constraint ‖η‖_∞ ≤ ε = 0.3 is employed to train models whose input
values lie between 0 and 1.
Fig. 6 Reshaped input weight matrix W_1 ∈ R^{20×784} of a DNN, from [36], after natural training
as well as after adversarial training with ε = 0.05. A simultaneously low-rank and sparse structure is
observed in the weights after adversarial training
Adversarial training has been observed to induce low-rank structure in the weight matrices of the DNN, as well as sparsity. As an example, the authors provide a
visualization of the weights in the first layer of a DNN, shown in Fig. 6, where the
simultaneously low-rank and sparse structure of the weights is clearly visible. This
is additionally confirmed by looking at the mutual information between the input
and the layers: increasing adversarial robustness coincides
with decreasing this mutual information, which indicates a stronger compression of the input
in the hidden layers. These results motivate research aimed at finding
the key properties that lead to robustness of DNNs. The idea is to propose a metric
for robustness and promote it during training. A common technique for promoting
specific properties during training is to add a penalty term to the loss function, known
as a regularization term, that penalizes undesired properties of the classifier function.
Here are some examples of robustness-promoting regularizers.
– Sparsity: In [25, 75] the authors argue that sparsity of the weight matrices of a
DNN promotes robustness against adversarial examples. They propose to add a
regularization term given by the sum of the ℓ1-norms of the weight matrices involved,
which is known to promote approximately sparse solutions. In addition, the authors
make use of pruning7 to impose arbitrary sparsity levels.
– Low-Rankness: In [36, 61] it is observed that adversarial training induces low-rank
structures on the weight matrices of DNNs. Motivated by this phenomenon,
low-rank regularization techniques are proposed. In [61] the authors explicitly
7 Pruning consists of setting to zero the smallest weights (in absolute value) of a given weight matrix,
thus enforcing a certain level of sparsity. The number of weights set to zero is chosen
arbitrarily. Usually, pruning requires an extra phase of retraining (fine-tuning of the remaining non-zero
weights) to compensate for the performance degradation caused by the initial manipulation of
the weights.
constrain the rank of the weight matrices in the optimization algorithm used for training.
In [36], on the other hand, the nuclear norm of the weight matrices is employed
as a regularization term in the training loss. The nuclear norm of a matrix can be
written as the ℓ1-norm of its vector of singular values; using it as a regularization
term thus promotes sparsity in the vector of singular values (i.e., low-rankness).
– Norm of the network's Jacobian: In [50], the authors aim at minimizing the
ℓ2-norm of the output perturbation, that is, ‖f(x) − f(x + η)‖_2. Assuming ‖η‖_2 ≤
ε, a first-order approximation of this quantity can be upper bounded as

‖f(x + η) − f(x)‖_2 ≈ ‖J_f(x) η‖_2 ≤ ε ‖J_f(x)‖_F .

Motivated by this result, the authors propose using the Frobenius norm of the
Jacobian, ‖J_f(x)‖_F, as a regularization term to promote robustness (a small sketch of such a
regularizer follows this list). The Frobenius norm is an upper bound on the ℓ2-norm of the
output perturbation; if it is limited during training by proper regularization, it restricts the
effect of ℓ2-perturbations.
– Curvature: In Sect. 3.1 it is argued that low curvature of the decision boundaries,
as well as of the loss function, is a desirable property for robustness. Motivated
by that discussion, the authors of [48] propose penalizing solutions with high
curvature of the loss function around the training data.
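To illustrate the Jacobian-based regularizer from the list above, a minimal framework-agnostic sketch is given below; the Jacobian is estimated by finite differences for simplicity, whereas in practice one would use automatic differentiation, and model, batch and the weight are placeholders. The penalty would simply be added to the training loss.

```python
import numpy as np

def jacobian_fd(model, x, h=1e-4):
    """Finite-difference estimate of the Jacobian J_f(x) of model: R^n -> R^m."""
    f0 = np.asarray(model(x), dtype=float)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(model(x + e), dtype=float) - f0) / h
    return J

def jacobian_penalty(model, batch, weight=1e-2):
    """Regularization term: weighted average squared Frobenius norm of the Jacobian over a batch."""
    return weight * float(np.mean([np.sum(jacobian_fd(model, x) ** 2) for x in batch]))
```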
5 Future Directions
Acknowledgements The authors would like to thank the reviewers for their fruitful feedback.
References
1. Akhtar, N., Mian, A.: Threat of adversarial attacks on deep learning in computer vision: a
survey. IEEE Access 6, 14410–14430 (2018)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge
University Press, Cambridge (2009)
3. Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a
compression approach. In: International Conference on Machine Learning, pp. 254–263 (2018)
4. Athalye, A., Carlini, N.: On the Robustness of the CVPR 2018 white-box adversarial example
defenses (Apr 2018). arXiv:1804.03286
5. Athalye, A., Carlini, N., Wagner, D.: Obfuscated gradients give a false sense of security:
circumventing defenses to adversarial examples. In: International Conference on Machine
Learning (2018)
6. Attias, I., Kontorovich, A., Mansour, Y.: Improved generalization bounds for robust learning.
In: Algorithmic Learning Theory, pp. 162–183 (2019)
7. Balda, E.R., Behboodi, A., Mathar, R.: On generation of adversarial examples using convex
programming. In: 52-th Asilomar Conference on Signals Systems, and Computers, pp. 1–6.
Pacific Grove, California, USA (Oct 2018)
8. Balda, E.R., Behboodi, A., Mathar, R.: Perturbation analysis of learning algorithms: generation
of adversarial examples from classification to regression. IEEE Trans. Signal Process. (2018)
9. Barreno, M., Nelson, B., Joseph, A.D., Tygar, J.D.: The security of machine learning. Mach.
Learn. 81(2), 121–148 (2010)
10. Bartlett, P.L., Foster, D.J., Telgarsky, M.J.: Spectrally-normalized margin bounds for neural
networks. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6240–6249.
Curran Associates, Inc. (2017)
11. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE
Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)
12. Cisse, M., Adi, Y., Neverova, N., Keshet, J.: Houdini: fooling deep structured prediction models
(2017). arXiv:1707.05373
13. Cullina, D., Bhagoji, A.N., Mittal, P.: PAC-learning in the presence of adversaries. In: Advances
in Neural Information Processing Systems, vol. 31, pp. 228–239. Curran Associates, Inc. (2018)
14. Diochnos, D., Mahloujifar, S., Mahmoody, M.: Adversarial risk and robustness: general def-
initions and implications for the uniform distribution. In: Advances in Neural Information
Processing Systems, pp. 10359–10368 (2018)
15. Dohmatob, E.: Limitations of adversarial robustness: strong No Free Lunch Theorem (Oct
2018). arXiv:1810.04065 [cs, stat]
16. Fawzi, A., Moosavi-Dezfooli, S.M., Frossard, P.: The robustness of deep networks: a geomet-
rical perspective. IEEE Signal Process. Mag. 34(6), 50–62 (2017)
17. Fawzi, A., Fawzi, O., Frossard, P.: Fundamental limits on adversarial robustness. In: Proceed-
ings of ICML, Workshop on Deep Learning (2015)
18. Fawzi, A., Fawzi, O., Frossard, P.: Analysis of classifiers’ robustness to adversarial perturba-
tions. Mach. Learn. 107(3), 481–508 (2018)
19. Fawzi, A., Moosavi-Dezfooli, S.M., Frossard, P.: Robustness of classifiers: from adversarial to
random noise. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1632–1640.
Curran Associates, Inc. (2016)
20. Fawzi, A., Moosavi-Dezfooli, S.M., Frossard, P., Soatto, S.: Classification regions of deep
neural networks (May 2017). arXiv:1705.09552
21. Franceschi, J.Y., Fawzi, A., Fawzi, O.: Robustness of classifiers to uniform ℓp and Gaussian
noise. In: 21st International Conference on Artificial Intelligence and Statistics (AISTATS),
vol. 84, p. 9. Lanzarote, Spain (2018)
22. Gilmer, J., Metz, L., Faghri, F., Schoenholz, S., Raghu, M., Wattenberg, M., Goodfellow, I.:
Adversarial spheres. In: ICLR 2018-Workshop Track (2018)
23. Golowich, N., Rakhlin, A., Shamir, O.: Size-independent sample complexity of neural net-
works. In: Bubeck, S., Perchet, V., Rigollet, P. (eds.) Proceedings of the 31st Conference On
Learning Theory. Proceedings of Machine Learning Research, vol. 75, pp. 297–299. PMLR
(Jul 2018)
24. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In:
International Conference on Learning Representations (Dec 2014)
25. Guo, Y., Zhang, C., Zhang, C., Chen, Y.: Sparse DNNs with improved adversarial robustness. In:
Advances in Neural Information Processing Systems, vol. 31, pp. 240–249. Curran Associates,
Inc. (2018)
26. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (Jun
2016)
27. Hein, M., Andriushchenko, M.: Formal guarantees on the robustness of a classifier against
adversarial manipulation. In: NIPS (2017)
28. Hendrik Metzen, J., Chaithanya Kumar, M., Brox, T., Fischer, V.: Universal adversarial per-
turbations against semantic image segmentation. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 2755–2764 (2017)
29. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke,
V., Nguyen, P., Sainath, T.N., Kingsbury, B.: Deep neural networks for acoustic modeling in
speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6),
82–97 (2012)
30. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional
networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2017)
31. Khim, J., Loh, P.L.: Adversarial risk bounds via function transformation (Oct 2018).
arXiv:1810.09519
32. Khrulkov, V., Oseledets, I.V.: Art of singular vectors and universal adversarial perturbations.
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8562–8570
(2018)
33. Kos, J., Fischer, I., Song, D.: Adversarial examples for generative models. In: 2018 IEEE
Security and Privacy Workshops (SPW), pp. 36–42 (May 2018)
34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
35. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world (2016).
arXiv:1607.02533
36. Langenberg, P., Balda, E.R., Behboodi, A., Mathar, R.: On the effect of low-rank weights on
adversarial robustness of neural networks (2019). arXiv:1901.10371
37. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning.
In: Shape, Contour and Grouping in Computer Vision, pp. 319–345. Springer (1999)
38. Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., Zhu, J.: Defense against adversarial attacks
using high-level representation guided denoiser. In: The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (June 2018)
39. Lin, M., Chen, Q., Yan, S.: Network in network (2013). arXiv:1312.4400
40. Lin, Y.C., Hong, Z.W., Liao, Y.H., Shih, M.L., Liu, M.Y., Sun, M.: Tactics of adversarial
attack on deep reinforcement learning agents. In: Proceedings of the 26th International Joint
Conference on Artificial Intelligence, pp. 3756–3762. IJCAI’17, AAAI Press, Melbourne,
Australia (2017)
41. Liu, Y., Chen, X., Liu, C., Song, D.: Delving into transferable adversarial examples and black-
box attacks. In: ICLR 2017 (2017)
42. Luo, Y., Boix, X., Roig, G., Poggio, T., Zhao, Q.: Foveation-based mechanisms alleviate adver-
sarial examples (Nov 2015). arXiv:1511.06292
43. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning mod-
els resistant to adversarial attacks. In: International Conference on Learning Representations
(2018)
44. Mahloujifar, S., Diochnos, D.I., Mahmoody, M.: The curse of concentration in robust learn-
ing: evasion and poisoning attacks from concentration of measure. In: Thirty-Third AAAI
Conference on Artificial Intelligence (AAAI) (2019)
45. Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturba-
tions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1765–1773 (2017)
46. Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P., Soatto, S.: Robustness of classifiers
to universal perturbations: a geometric perspective. In: International Conference on Learning
Representations (2018)
47. Moosavi Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to
fool deep neural networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016)
48. Moosavi-Dezfooli, S.M., Fawzi, A., Uesato, J., Frossard, P.: Robustness via curvature regular-
ization, and vice versa (Nov 2018). arXiv:1811.09716 [cs, stat]
49. Neyshabur, B., Bhojanapalli, S., Srebro, N.: A PAC-Bayesian approach to spectrally-
normalized margin bounds for neural networks. In: International Conference on Learning
Representations (2018)
50. Novak, R., Bahri, Y., Abolafia, D.A., Pennington, J., Sohl-Dickstein, J.: Sensitivity and gen-
eralization in neural networks: an empirical study. In: International Conference on Learning
Representations (2018)
51. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box
attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on
Computer and Communications Security, pp. 506–519. ACM (2017)
52. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations
of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and
Privacy (EuroS&P), pp. 372–387. IEEE (2016)
53. Papernot, N., McDaniel, P., Swami, A., Harang, R.: Crafting adversarial input sequences for
recurrent neural networks. In: Military Communications Conference, MILCOM 2016-2016
IEEE, pp. 49–54. IEEE (2016)
54. Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial
perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy
(SP), pp. 582–597. IEEE (2016)
55. Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., Ganguli, S.: Exponential expressivity in
deep neural networks through transient chaos. In: Advances in Neural Information Processing
Systems, vol. 29, pp. 3360–3368. Curran Associates, Inc. (2016)
56. Raghunathan, A., Steinhardt, J., Liang, P.: Certified defenses against adversarial examples. In:
International Conference on Learning Representations (2018)
57. Rao, N., Recht, B., Nowak, R.: Universal measurement bounds for structured sparse signal
recovery. In: Artificial Intelligence and Statistics, pp. 942–950 (Mar 2012)
58. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with
region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
59. Sabour, S., Cao, Y., Faghri, F., Fleet, D.J.: Adversarial manipulation of deep representations.
In: ICLR 2016 (2016)
60. Samangouei, P., Kabkab, M., Chellappa, R.: Defense-GAN: Protecting classifiers against adver-
sarial attacks using generative models. In: International Conference on Learning Representa-
tions (2018)
61. Sanyal, A., Kanade, V., Torr, P.H.S.: Intriguing properties of learned representations (2018)
62. Sarkar, S., Bansal, A., Mahbub, U., Chellappa, R.: Upset and angri: breaking high performance
image classifiers (2017). arXiv:1707.01159
63. Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., Madry, A.: Adversarially robust gener-
alization requires more data. In: Advances in Neural Information Processing Systems, pp.
5014–5026 (2018)
64. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algo-
rithms. Cambridge University Press, New York, NY, USA (2014)
65. Song, Y., Kim, T., Nowozin, S., Ermon, S., Kushman, N.: Pixeldefend: Leveraging generative
models to understand and defend against adversarial examples. In: International Conference
on Learning Representations (2018)
66. Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE
Trans. Evol. Comput. (2019)
67. Suggala, A.S., Prasad, A., Nagarajan, V., Ravikumar, P.: Revisiting adversarial risk (Jun 2018).
arXiv:1806.02924 [cs, stat]
68. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1–9 (2015)
69. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intrigu-
ing properties of neural networks. In: International Conference on Learning Representations
(2014)
70. Tabacof, P., Tavares, J., Valle, E.: Adversarial images for variational autoencoders (2016).
arXiv:1612.00155
71. Tanay, T., Griffin, L.: A boundary tilting persepective on the phenomenon of adversarial exam-
ples (Aug 2016). arXiv:1608.07690 [cs, stat]
72. Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adver-
sarial training: attacks and defenses. In: International Conference on Learning Representations
(2018)
73. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be at odds
with accuracy. In: International Conference on Learning Representations (2019)
74. Wang, B., Gao, J., Qi, Y.: A theoretical framework for robustness of (Deep) classifiers against
adversarial examples. In: International Conference on Learning Representations (2017)
75. Wang, L., Ding, G.W., Huang, R., Cao, Y., Lui, Y.C.: Adversarial robustness of pruned neural
networks (2018). https://openreview.net/forum?id=SJGrAisIz
76. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.: Adversarial examples for semantic
segmentation and object detection. In: International Conference on Computer Vision. IEEE
(2017)
77. Yin, D., Ramchandran, K., Bartlett, P.: Rademacher complexity for adversarially robust gen-
eralization (Oct 2018). arXiv:1810.11914 [cs, stat]
78. Yuan, X., He, P., Zhu, Q., Li, X.: Adversarial examples: attacks and defenses for deep learning.
IEEE Trans. Neural Netw. Learn. Syst. 1–20 (2019)
79. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires
rethinking generalization. In: ICLR 2018 (2017)
Representation Learning in Power Time
Series Forecasting
Abstract Renewable energy resources have become a fundamental part of the elec-
trical power supply in many countries. In Germany, renewable energy resources
contribute up to 29% to the energy mix. However, the challenges that arise with
the integration of those variable energy resources are various. Some of these tasks
are short-term and long-term power generation forecasts, load forecasts, integration
of multiple numerical weather prediction (NWP) models, simultaneous power fore-
casts for many renewable farms and areas, scenario generation for renewable power
generation, and the list goes on. All these tasks vary in difficulty depending on the
representation of input features. As an example, consider formulas that express laws
of physics and allow cause and effect of otherwise complex problems to be cal-
culated. Similar to the expressiveness of such formulas, deep learning provides a
framework to represent data in such a way that it is suited for the task at hand. Once
the neural network has learned such a representation of the data in a supervised
or semi-supervised manner, it makes it possible to utilize this representation in the
various available tasks for renewable energy. In our chapter, we present different
techniques to obtain appropriate representations for renewable power forecasting
tasks, showing the similarities and differences of deep learning-based techniques to
traditional algorithms such as (kernel) PCA. We support the theoretical foundations
with evaluations of these techniques on publicly available datasets for renewable energy, such as the GEFCOM 2014 data, Europe Wind Farm data, and German
Solar Farm data. Finally, we give a recommendation that assists the reader in building
and selecting representation learning algorithms for domains other than renewable
energy.
1 Introduction
The whole task of selecting and engineering features is tedious and can take up several
selection, evaluation, and engineering cycles. Therefore, the overall process is quite
labor intensive, and requires a great deal of computational capacity.
Representation learning (RL) tries to overcome these disadvantages as a special-
ized field of ML that exploits an automatic data-driven feature engineering and feature
selection approach. During RL we learn latent (or hidden) features. Latent features
describe the data with sufficient accuracy and often provide the distribution under-
lying the original input features, which helps the ML task to perform better. Finding
and constructing the latent features is also referred to as feature extraction [28]. RL
can be seen as an advancement of manual feature engineering and selection, as we
do not need to employ domain knowledge to select essential features, or to come up
with mathematical models describing relations between features. Instead, during RL
we may employ a deep ANN that learns latent features about our input data.
A latent feature model obtained with the help of RL often makes it possible to further improve those tasks that use the latent feature representation as input. These tasks (using the latent features) do not necessarily need to be forecasting a power time series but can also be the classification of images or the prediction of a car's trajectory using different sensory inputs. This chapter aims to present concepts from the field of RL that are applicable to power time series forecasting and other domains.
As this chapter will introduce the general concepts of RL for time series, we do
not include a comparative study with the traditional techniques of feature selection
and feature engineering.
The remainder of this chapter is structured as follows: Sect. 2 defines a forecasting
task based on three examples of power time series. Section 3 explains feature extrac-
tion in more detail, introducing traditional algorithms as well as deep architectures
for RL. Section 4 proposes several evaluation strategies for RL in renewable power forecasts based on the previously mentioned definitions. Section 5 shows several exam-
ples for RL in power time series forecasting. It utilizes the algorithm and evaluation
measures to provide examples of RL in power time series forecasting. Section 6 con-
cludes this chapter and gives some advice on how to apply RL to other ML problems.
It also provides insights on how to design your RL network and select appropriate
parameters.
This section defines time series forecasts in the context of renewable energies. Chal-
lenges for those time series are introduced based on those definitions.
We use the term power time series for three different types of data here: wind, solar,
and load time series. The whole process to forecast the targets of the data is typically
a two-step approach.
The first step involves forecasting the weather features [5] with a typical time
step of up to k = 72 h in the future. Some important input features provided by the
NWP for power time series forecasting are: wind speed, air pressure, wind direction, temperature, humidity, solar irradiance, rainfall, snow coverage, and more. Creating these features with NWP models is computationally expensive; in this chapter, we assume they are given. The second step maps the
set of weather features to the generated power of a wind power plant, the load of a
household, or the output generated by a solar power plant. Forecasting a power time
series in the second step, e.g., using neural networks, a Support Vector Machine,
or a linear regression, involves finding a regression model mapping the set of input
features to the power.
While the second step can be considered as a “classic” regression problem without
modelling a specific time dependency, the overall process including the first step is
referred to as a time series problem [5]. However, in most cases the second step is
also considered as a time series problem, as it includes so-called time-shifted features
(explained later in this section) that improve the forecast quality.
Correspondingly, both kinds of data, the NWP and the generated power, are time
series consisting of an ordered list of tuples. In the case of a linear regression, as in Eq. (2), we want to find a set of weights w that, multiplied with the input NWP data x, results in the current power, allowing for a particular variance of ε [24, p. 19]:
Power(x) = w^T x + ε.    (2)
When doing regression with (deep) neural networks, we try to learn the parameters of a neural network to be able to map our inputs to the appropriate output. Neural networks expand the linear regression task of Eq. (2) into several of these regression models that are hierarchically combined (including non-linear activation functions), with different inputs from different layers. With the help of the neural network, we
are also capable of learning even non-linear relations between the NWP data and the
power time series, as we can allow for non-linear activation functions, such as the
logistic function.
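To make this second step concrete, the following minimal sketch fits both a linear regression as in Eq. (2) and a small MLP; the NWP features and power values are synthetic stand-ins, and scikit-learn is assumed, as in the experiments later in this chapter.

# Second forecasting step: map NWP features to the generated power.
# The features and power values are synthetic stand-ins, not the chapter's datasets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # e.g., wind speed, pressure, temperature, ...
power = 0.6 * X[:, 0] + 0.1 * X[:, 1] ** 2 + 0.05 * rng.normal(size=1000)

linear = LinearRegression().fit(X[:800], power[:800])         # linear model of Eq. (2)
mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500,
                   random_state=0).fit(X[:800], power[:800])  # non-linear mapping

print("linear R^2:", linear.score(X[800:], power[800:]))
print("MLP R^2:   ", mlp.score(X[800:], power[800:]))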
The significant challenges in power time series forecasting arise due to the aforementioned transformation of the energy system. The former centralized power grid is
changing to a more decentralized grid [22]. Those decentralized grids often have
renewable power plants close to where energy is consumed. This introduces signif-
icant challenges for the power grids as a whole but also generates new challenges
regarding the forecasting.
As we have more and more renewable power plants connected to the grid, we
cannot rely on forecasting an aggregated power output, e.g., for all wind power plants
in a region, but we must forecast for each power plant individually. Depending on
their connection to the power grid, each plant influences a specific part of the local
grid [18]. These challenges get even more complicated with home mounted solar
panels, or electric vehicles charged at home.
Another challenge arises due to smart grids and their smart measuring infrastruc-
ture [13]. This kind of measuring infrastructure increases the amount of available
data that needs to be processed by our models, whether they are machine learning
or traditional models. In addition to data complexity, additional (smart) measuring
infrastructure allows us to create more detailed models of our power grid.
With an increased amount of available data and an increased amount of power
plants that need to be forecasted, we have to think about algorithms that allow for easy
integration and processing of this vast amount of data. Additionally, the algorithms
need to be able to adapt the embedded knowledge to different power plants. One solution to these challenges is representation learning and, building on top of it, even multi-task learning.
Fig. 1 A basic overview of the data mining process with examples of different methods. The methods are annotated with the terms feature engineering, selection, extraction, or learning. Note that feature reduction can be part of I–IV
Figure 1 shows a general data mining process annotated with the terms feature
engineering, selection, extraction, and learning. Some items have one or two terms, as
some of the algorithms do both feature selection and feature extraction. We describe
our view on the different terms of feature engineering, selection and extraction, and
how we apply them throughout this chapter.
In feature selection, we reduce the number of given features using approaches
such as filters and wrappers. Both methods select a subset of k features from the set
of all features n [20]. Filters are, e.g., information theory-based methods that allow
for selecting relevant features based on entropy. Filters are independent of the ML
algorithm, while wrapper algorithms are dependent on the ML algorithm and select
features based on the best evaluation of the algorithm and the current set of features.
Even though filters and wrappers allow the reduction of the number of relevant
features, feature engineering is still a crucial concept to obtain good forecast qual-
ity [4]. In feature engineering, we create features in a way that they help the ML
algorithm to improve its performance. This engineering is done either by a human
expert with domain knowledge or automatically by an algorithm. In the latter case,
it is called representation learning or feature extraction.
RL, also called feature learning, is a field of methods to derive features that are most
relevant to an algorithm. Often this process is described as determining latent or hid-
den features that explain the process that underlies the data. By learning these latent
features, ideally, superior forecast performance over manual feature engineering and
filter and wrapper-based methods is achieved. Further, by reducing the number of
features through RL we reduce computational effort at the same time.
In areas such as vision and natural language processing deep learning-based rep-
resentation learning methods improve the forecast quality dramatically compared to
traditional approaches and manual feature engineering based on domain knowledge.
Between 2012 and 2015, the classification accuracy on the ImageNet-2012 dataset improved from an error rate of 16.4% to 3.57% by utilizing deep learning-based representation learning methods [1]. Section 3.2 explains some of those state-of-the-art
deep architectures for various domains.
To compare those deep architectures to traditional dimensionality reduction, we
give a brief overview of PCA, kernel PCA, and explain the concepts of the wrapper
and filter approaches in Sect. 3.1 in detail.
Filter and wrapper methods are feature selection methods, but not feature extraction techniques: they do not determine latent features describing the underlying distribution.
In contrast, PCA (Sect. 3.1) allows latent features to be extracted. These features
are derived based on the assumption that important features are the directions of
the largest variance in the original feature space. For additional details on feature
reduction methods, we refer to [2, 20, 32].
Filter and Wrapper: Filters are approaches that select a set of relevant features based on a given measure, whereas the size of the set can vary. Typically, we differentiate between similarity, information-theoretical, and statistical measures [20]. After selecting a suitable measure,
the features are evaluated and ranked according to the selected measure. Afterward,
the algorithm or the human selects the most relevant features. This process provides
an interpretable selection of features with small computational effort in compari-
son to other dimensionality reduction techniques. Therefore, it scales well with the
number of features. The disadvantages are:
– Filters often select redundant features,
– filters ignore the relation to the ML algorithm, and
– filters only reduce the number of visible features and do not determine latent features.
While filters operate independently from the ML algorithm, wrapper algorithms
depend on it [10]. In particular, they select the features by iteratively training the
ML algorithm on i subsets of the n features, then evaluate the performance of the
ML model and select the feature set that performs best. In sequential feature forward
selection, one starts with an empty set of features. The feature that improves the ML
algorithm the most is added to the set of relevant features iteratively, until a pre-
defined number of k features is selected [9]. Other methods use an iterative approach
starting with the set of all features and then successively removing the least important
feature. As a result, the effort rises quickly with the number of features and extracting
latent features is not possible.
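A minimal sketch of such a wrapper, assuming scikit-learn's SequentialFeatureSelector and synthetic data in place of a real power dataset, is shown below.

# Wrapper approach: sequential forward feature selection around a linear model.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # n = 10 candidate features
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=500)

# Iteratively add the feature that improves the cross-validated score most,
# until k = 3 features are selected.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))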
Principal Component Analysis: PCA is an algorithm that is designed to extract
orthogonal features that are linearly uncorrelated. Therefore, PCA assumes that
important features have a high variance. The individual steps of the algorithm are:
1. Remove the mean from all features.
2. Calculate the covariance matrix of all original features.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort eigenvectors by their eigenvalues (highest eigenvalue corresponds to the
highest variance in the direction of the corresponding eigenvector).
5. Select k highest eigenvalues for dimensionality reduction.
6. Transform the original input data into a new feature space using the eigenvectors.
The transformed features are also called principal components, and they are the
hidden features extracted by the algorithm.
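The listed steps can be written out directly with NumPy; the following didactic sketch (with synthetic data and k = 2 components) computes the same transformation as, e.g., sklearn.decomposition.PCA.

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                    # 1. remove the mean
    C = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    eigval, eigvec = np.linalg.eigh(C)         # 3. eigenvalues and eigenvectors
    order = np.argsort(eigval)[::-1]           # 4. sort by descending eigenvalue
    W = eigvec[:, order[:k]]                   # 5. keep the k leading eigenvectors
    return Xc @ W                              # 6. transform into the new feature space

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Z = pca(X, k=2)                                # principal components z1, z2
print(Z.shape)                                 # (200, 2)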
[Figs. 2–5: scatter plots of the original input features x1 and x2 and of the transformed latent features z1 and z2 for the different PCA variants; graphics not reproduced here]
However, often it is beneficial to apply a non-linear PCA, e.g., a kernel PCA. Non-
linearity is beneficial if the input features do not follow a linear pattern, as in Fig. 2. In
this case, the linear PCA is not capable of finding essential components, as shown in
Fig. 3. Therefore, kernel PCA implicitly calculates the covariance matrix of a higher
dimensional representation of the input. The well-known Vapnik–Chervonenkis theory [35] states that data transformed into a higher-dimensional space often becomes linearly separable.
Therefore, we first transform the input data with the kernel function into a higher
dimensional representation. In the second step, the kernel calculates the dot product
of the transformed data to obtain the covariance matrix [3, 29, 36]. This combination
of non-linear transformation and dot product is referred to as the kernel-trick. Once
the covariance of the transformed data is calculated, the PCA algorithm is applied.
We can interpret the resulting eigenvectors of the kernel PCA as projections from the
higher dimensions onto the principal components. After applying the kernel PCA
we often obtain better features, as seen in Fig. 4.
Each kernel has a different characteristic; e.g., the higher-dimensional features obtained with the radial-basis function (RBF) kernel correspond to an infinite-dimensional space [3, p. 297]. Thus, it is important to utilize the kernel that is most suitable for the data. Figure 2 shows the input to the
different examples of PCAs. The input has two features, x1 and x2 . The circular data
presented here has two color-coded labels to indicate a reasonable and non-reasonable
transformation concerning the class label.
Figure 3 shows a non-reasonable result, where the linear PCA is not capable of
extracting meaningful latent features z1 and z2 . In Fig. 4 we see the results of a
RBF kernel applied to the input data. We observe that a single PCA component (z2 )
is sufficient to separate the different classes, while in Fig. 5 the separability of the
two color-coded classes is even decreased when compared to the original input. For
additional information on kernels and in particular kernel PCA refer to [3].
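The behavior discussed for Figs. 2–4 can be reproduced with the following sketch, assuming scikit-learn and synthetic circular data from make_circles; the RBF kernel width gamma = 10 is an illustrative choice, not a value from the chapter.

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

z_linear = PCA(n_components=2).fit_transform(X)                  # linear PCA
z_rbf = KernelPCA(n_components=2, kernel="rbf",
                  gamma=10).fit_transform(X)                     # kernel PCA (RBF)

# The RBF components typically separate the two classes, while the linear
# projection keeps the circular classes entangled.
for c in (0, 1):
    print("class", c, "mean in kernel PCA space:", z_rbf[y == c].mean(axis=0))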
In this section, we explain deep architectures that allow for latent feature learning.
While in the sections above, traditional algorithms from the field of feature selection
and extraction are explained, this section focuses on modern architectures that permit
the extraction of useful features.
We focus on autoencoders that are capable of learning a representation z of the
original input x by constraining the learning process or the representation z. In partic-
ular, we are interested in methods that allow determining latent features, while reduc-
ing the number of features for further processing and keeping the relevant information
at the same time [2, 8]. As stated earlier, reducing the number of features to reduce
computational effort is an essential concept in feature extraction. Correspondingly,
this section focuses on undercomplete autoencoders to learn a compressed represen-
tation of the data.
An AE is trained by minimizing a reconstruction loss of the form
L(x, h(f(x))),
Fig. 6 An example undercomplete AE topology. The AE reduces the dimensionality in each layer of the encoder. The latent feature representation at the bottleneck contains the extracted hidden features, which are sufficient to reconstruct the original input successively in each layer of the decoder
where L is a loss function, e.g., a squared loss, that penalizes the dissimilarity between
x and the reconstruction h( f (x)).
After training the AE, we cut the network behind the bottleneck, and attach a
conventional ML algorithm. Using the learned encoding as an input to the regression
or classification model is similar to using components of a kernel PCA.
It can even be shown that when the decoder is linear and we use a squared error loss function, the latent features of the AE span a sub-space similar to that of PCA. Moreover, by using singular value decomposition, it is possible to reconstruct the original PCA components [27]. The results from [27] detail the similarities of PCA and AEs. To
be capable of comparing forecast results obtained from AEs with the more advanced
techniques such as the nonlinear PCA, we extend the idea of AE to more complex
structures utilizing the potential of deep architectures for representation learning
even further.
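A minimal undercomplete AE can be sketched in PyTorch, the framework also used in the experiments of Sect. 5; the layer sizes, input data, and number of training steps below are purely illustrative.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=12, n_latent=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(),
                                     nn.Linear(8, n_latent))           # bottleneck
        self.decoder = nn.Sequential(nn.Linear(n_latent, 8), nn.ReLU(),
                                     nn.Linear(8, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(256, 12)                        # stand-in for the NWP input features
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                          # squared loss L(x, h(f(x)))

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)                 # reconstruction loss
    loss.backward()
    optimizer.step()

with torch.no_grad():
    z = model.encoder(x)                        # latent features for the downstream ML model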
Denoising Autoencoder: An undercomplete AE, as described above, learns latent
features by going through a bottleneck. We achieve a similar form of restriction
by adding a noise term to the input features. DAEs are what is known as regular-
ized autoencoders, allowing similar results, even with overcomplete architectures.
In practice, however, an undercomplete DAE is used to minimize the following loss function
L(x, h(f(x̃))),
where x̃ denotes the noise-corrupted version of the input x.
Fig. 7 Example of a one-dimensional CNN with a filter of size 1 × 3 applied to an input time series of size 1 × 10 with additional padding. The padding keeps the dimension of the time series while extracting relevant information
where q(z|x) is the scaled version of the unit Gaussian given the current input x, and the Kullback-Leibler divergence D_KL, see Sect. 4.2, penalizes the deviation of the learned distribution q from a unit Gaussian.
By applying the reparameterization trick, it is possible to extend the original idea
of an AE and achieve the following properties:
– Often the combination of a generative network with an encoder forces the VAE to
learn a representation in a much lower dimensional space, see [8, p. 699] and [16].
– The decoder and the latent vector provide a generative framework.
Fig. 9 An example VAE. The VAE reduces the dimensionality in each layer of the encoder and
learns vectors of μ and σ from a Gaussian distribution. The vectors are used to scale samples
from a Gaussian distribution. The scaled samples are used to reconstruct the original features at the
decoding side
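The reparameterization trick can be sketched in PyTorch as follows; the encoder outputs μ and the log-variance, the latent sample is a scaled draw from a unit Gaussian, and the closed-form KL term assumes a diagonal Gaussian q (all dimensions are illustrative).

import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, n_features=12, n_latent=3):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU())
        self.mu = nn.Linear(8, n_latent)
        self.logvar = nn.Linear(8, n_latent)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                    # sample from a unit Gaussian
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterization trick
        return z, mu, logvar

def kl_to_unit_gaussian(mu, logvar):
    # D_KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian q
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

z, mu, logvar = VAEEncoder()(torch.randn(32, 12))
print(z.shape, kl_to_unit_gaussian(mu, logvar).item())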
The evaluation needs to reveal how much information our learned representation maintains and how well this representation performs in a forecasting task.
MAE(T, T̂) = (1/N) · Σ_{i=1}^{N} |x_i − x̂_i|    (4)
The RMSE, as seen in Eq. (3), and the MAE, as seen in Eq. (4), use the data of
the model output T̂ and the actual time series T . They first calculate the difference
between both data points. The RMSE then squares this difference, averages the
values, and takes the square root of the average. Therefore, RMSE is non-negative
and gives the average distance between the data points and the model output. The
MAE, similar to the RMSE, is non-negative, as it takes the absolute difference of the regressed time series and the original time series, and averages it over all data points. The main difference between those two measures is that the RMSE penalizes substantial differences between the two time series more than the MAE [17].
During the training of the representation learner, we use the RMSE to evaluate the reconstruction loss. Later on, we also employ the RMSE to compare the model quality of different learned representations.
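Assuming the usual definitions of the two measures, RMSE as in Eq. (3) and MAE as in Eq. (4), both can be computed with a few lines of NumPy; the arrays below are placeholders for the actual time series T and the model output T̂.

import numpy as np

def rmse(t, t_hat):
    return np.sqrt(np.mean((t - t_hat) ** 2))    # Eq. (3)

def mae(t, t_hat):
    return np.mean(np.abs(t - t_hat))            # Eq. (4)

t = np.array([0.2, 0.4, 0.6, 0.8])               # actual power time series
t_hat = np.array([0.25, 0.35, 0.65, 0.70])       # model output
print(rmse(t, t_hat), mae(t, t_hat))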
Assessing the influence of the features on the regression model output is another way
to evaluate the input features of a regression model. Such an assessment uses one set
of features, e.g., the learned feature representation, to determine their performance in
comparison to another set of features, e.g., the original input features. Furthermore,
we can gain information about the amount of compression, the contained information, and the features' ability to improve the regression. We explain three different mea-
sures to compare the feature representations, learned or not. These measures are the
compression rate, the Kullback-Leibler Divergence (KLD), and correlation-based
measures.
First, we start by giving information about measuring the compression rate we
achieve. This information allows us to group algorithms with similar compression
rates and to evaluate within these groups.
Compression Rate = Uncompressed Size / Compressed Size    (5)
Compression Rate = Number of input features / Number of latent features
The compression rate as seen in Eq. (5) is the ratio between input data and output
data. In our case, we compare the number of features in the input layer to the number
of features after the encoding. This comparison allows us to compare several metrics
on different datasets grouped by the compression rate.
An essential measure to assess the learned feature representation is the mutual
information which is based on the KLD. The KLD allows us to measure the similarity
of two distributions. The mutual information allows us to measure the influence of each latent feature on the regression model output [3]. For simplicity, we limit the
explanation to the discrete case of distributions.
D_KL(P ∥ Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) )    (6)

MI(X; Y) = D_KL( P(X, Y) ∥ P(X)P(Y) )    (7)
The KLD in Eq. (6) is used to compare two discrete distributions Q and P. It measures the similarity of the two random variables X and Y with x ∈ X and y ∈ Y. The KLD is not symmetric, i.e., D_KL(P ∥ Q) ≠ D_KL(Q ∥ P). If both P and Q are feature representations, we obtain information about their relative entropy. Using
the mutual information (MI), see Eq. (7), with X being the feature representation
and Y the original power time series, we obtain information about how well Y can
be encoded using the current feature representation X [19], allowing us to compare
different learned feature representations regarding their ability to contribute towards
the regression model output.
KLD can be used to calculate the MI, or relative entropy of two distributions, e.g.,
when comparing the distributions in the feature space, or calculating the information
loss of a linear model performed with the original input features to a linear model
performed with the learned features. This approach is similar to the way t-SNE works
[33].
Furthermore, we can apply correlation measures, such as Pearson’s correlation
coefficient, as shown in Eq. (8). The correlation coefficient quantifies how well
our learned features—again, considered as a pair of random variables X and Y —
linearly correlate with the power time series. Therefore, we can identify feature representations that are well suited for a linear regression task.
ρ_{X,Y} = cov(X, Y) / (σ_X · σ_Y)    (8)
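The following sketch illustrates these measures on synthetic data: the discrete KLD of Eq. (6) via scipy.stats.entropy, the mutual information of Eq. (7) via scikit-learn's nearest-neighbor estimator (in the spirit of [19]), and the Pearson correlation of Eq. (8) via NumPy.

import numpy as np
from scipy.stats import entropy
from sklearn.feature_selection import mutual_info_regression

# KL divergence between two discrete distributions P and Q, Eq. (6)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print("D_KL(P||Q):", entropy(p, q))              # not symmetric: entropy(q, p) differs

# Mutual information between latent features z and the power y, Eq. (7)
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))                    # latent feature representation
y = z[:, 0] + 0.1 * rng.normal(size=500)         # power depends mainly on the first feature
print("MI(z_i; y):", mutual_info_regression(z, y))

# Pearson correlation between the first latent feature and the power, Eq. (8)
print("rho:", np.corrcoef(z[:, 0], y)[0, 1])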
In total, we evaluate three different power time series datasets. All of these datasets are
a combination of an NWP model and the measured power production or consumption.
These datasets include:
– Europe Wind Farm dataset,
– German Solar Farm dataset, and
– GEFCOM 2014 dataset.
The Europe Wind Farm and German Solar Farm dataset can be downloaded from
our website2 and the GEFCOM2014 is also publicly available online.3 These datasets
make our data quite diverse, and we cover a broad spectrum of power time series
forecasting.
Europe Wind Farm Dataset: The Europe Wind Farm Dataset consists of the data
from 45 wind power plants scattered across Europe. The dataset provides the NWP
data as well as the corresponding power output normalized according to the installed
capacities. In addition to the available features in the dataset, we augmented the
available features using 1h and 2h time-shifted features for wind speed and wind
direction allowing time-dependent changes of the future and past weather to be
taken into account, see Sect. 2.1.
German Solar Farm Dataset: The German Solar Farm dataset consists of the data
from 21 photovoltaic facilities in Germany. Their installed nominal power ranges between 100 and 8500 kW. The PV facilities range from PV panels installed
on rooftops to full-fledged solar farms. All these facilities are distributed through-
out Germany [6]. Analogous to the Europe Wind Farm dataset, they provide the
corresponding NWP and the power time series which are normalized to the corre-
sponding installed capacities. Again, we augmented the available features using 3h
time-shifted features for sun position, solar height, clear sky, and radiation.
Fig. 11 Overview of RL steps. z refers to the latent features and y is the power generation
the validation set. The last split, the test set, is then used to do a final evaluation of the regression task on unseen data. For preprocessing, we assure that we normalize
the data to have zero mean and a standard deviation of 1. The values of standard-
ization parameters from the training dataset are applied to the validation and test
dataset afterward. Further, we normalize the generated power between 0 and 1 by the
maximum power generation of each farm. We also avoid categorical input features,
because those only roughly describe the weather phenomena but have a considerable
influence on the training process of the AE, especially on the reconstruction. Often,
when we used categorical features, the AE learned to reconstruct those categorical
features and was not capable of reconstructing the nominal weather features from
the hidden representation. Therefore, we avoid categorical features in all experi-
ments. By adding the time-shifted features, we allow features with time dependency
as those are relevant for time series forecasting. For details refer to Sect. 2.1. In the
Europe Wind Farm dataset we use four different time-shifted features.4 Similarly,
four shifted features5 are added for the German Solar Farm dataset.
Applied Machine Learning Models: In the experiments, we always use the same
set of hyperparameters for the support vector regression (SVR) and the MLP, the
ML models.
For both models, we use the standard parameters given by the scikit-learn frame-
work [26]. The SVR uses an RBF kernel and we train it without a hard limit on
the iterations. The MLP uses one hidden layer with 100 neurons and ReLU activation functions. We train it with the Adam optimizer [15] for a maximum of 200 iterations.
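Expressed with scikit-learn, this configuration corresponds to the following sketch; the latent features z and the power y are synthetic placeholders, and the shown parameter values match the defaults described above.

import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
z = rng.normal(size=(300, 4))                    # latent features from the RL model
y = 0.5 * z[:, 0] + 0.05 * rng.normal(size=300)  # stand-in for the generated power

svr = SVR(kernel="rbf", max_iter=-1)             # RBF kernel, no hard iteration limit
mlp = MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                   solver="adam", max_iter=200)

svr.fit(z, y)
mlp.fit(z, y)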
Applied Time Series Measures: For all of our experiments we use the reconstruction
loss and the forecast error, see Sect. 4 for more details. The reconstruction loss allows
us to measure the maintained information within our representation. The forecast
error allows us to determine if the latent feature representation performs well in a
forecasting task. These two measures are the most intuitive and well-known ones in
forecasting tasks.
Guidelines for the Training: This section examines the training process to provide
a guideline for use in other domains.
As previously mentioned, we select the best performing model based on the recon-
struction loss of the validation dataset. After selecting the RL model we encode the
input features x to latent features z. Afterward, z acts as the new input to the ML
model. By using the validation dataset, we minimize the risk that the RL model is
overfitted to the training data and as a result, it does not generalize well on unseen
data. Furthermore, using the validation dataset assures that the ML model learns from data not seen during the training of the RL. We evaluate the final model using the test dataset
and the measures introduced in Sect. 4.2.
In our experiments, one significant difference between the evaluated RL models
concerns the latent features. In the case of (kernel) PCA, the number of latent features corresponds to the number of retained components, where the kernel PCA first transforms the input data into a higher-dimensional space using kernel functions. In the case of the AEs and DAE, see Sect. 3.2, the number of latent features is equal to the number of neurons in the bottleneck. In the case of a VAE, see Sect. 3.2, the number of latent features is equal to the number of learned μ of the normal distribution, as in practice μ is sufficient as the expectation. For the convolutional autoencoder (CAE), see Sect. 3.2, the feature representation includes information about 24 time steps of the NWP. It is important to note that in none of the deep
learning-based architectures have we transformed the data into a higher-dimensional
feature space. Also note that including the time-shifted features in the input forces
the RL model to determine latent features that include the time dependency of input
features.
In future applications, it might be necessary to select the RL model with a specific
compression rate, see, e.g., Fig. 12, depending on the reconstruction loss. Selecting
an appropriate model with a specific compression rate can reduce the computational
effort of certain ML models, as their computational effort increases with the number
of input features.
In the following sections, we describe our experimental results and give details
on the training procedure and the hyperparameters for RL in power time series fore-
casting. We show how the different proposed RL approaches perform compared with
traditional approaches. Therefore, we apply four different types of AEs, as well as
linear and non-linear PCA on the dataset to learn and extract new features. We use
the latent features as input in a regression forecast model. This model maps the latent
features from the NWP data to the power time series. In particular, we are interested
in showing the advantages and disadvantages of the evaluated RL methods.
We try to explain everything in a manner that allows for the easy repetition of the
experiment in other domains. Therefore, we use the state-of-the-art machine learning
framework scikit-learn [26] in connection with pytorch [25].
In the following section, we first evaluate the traditional feature extraction tech-
nique, PCA, in Sect. 5.3. Afterward, we evaluate the RL methods for feature extraction
in Sect. 5.4. In Sect. 5.6, we discuss the results achieved by both methods. By separat-
ing the evaluation for traditional and RL methods, we aim to derive recommendations
on how to apply RL to power time series forecasting.
In this section, we highlight the results obtained by PCA. We use PCA as a reference
because it extracts hidden features and can reduce their number at the same time,
see Sect. 3.1. This extraction and selection process permits comparisons to the deep
architecture based RL methods. In contrast, filter and wrapper-based methods do not
allow for the extraction of new hidden features, see Sect. 3.1. Further, by evaluating
different kernels, we show their characteristics concerning the compression rate and
the regression task. Assessing these values allows a wide number of representations
to be compared with the same algorithm, similar to the different deep learning-based
representations.
Figures 12, 13, and 14 show the reconstruction loss of three different kernel PCAs applied to the German Solar Farm dataset. In all cases, we observe that when
increasing the compression rate the reconstruction loss increases as well. This obser-
vation is to be expected, as the decreasing number of hidden features limits the
available information when performing a reconstruction. Alternatively, in terms of
PCA, the number of components is not sufficient to reconstruct the full variance of
the data.
However, the figures presented here illustrate the different behaviors of the applied
kernel. In case of the linear kernel in Fig. 12 we observe an almost constant recon-
struction loss which then increases quickly. In contrast, the reconstruction loss of the
rbf kernel increases rapidly after a compression rate larger than 8. The reconstruction
loss of the cosine kernel is roughly constant at a median reconstruction loss between
0.41 or 0.43 until a compression rate of 11.2. The loss then increases up to an RMSE
of 0.49. We also note, that for the cosine kernel, we have at least two outliers for
every compression rate. Comparing all techniques, we observe that the rbf kernel
has the lowest reconstruction error, followed by the cosine and the linear kernels.
Figures 15 and 16 summarize the results of the MLP and the SVR. We achieve
these results by training the ML model on the extracted features from each kernel
PCA for all compression rates. Correspondingly, these figures show the relationship
between the forecast error and the compression rate.
The results show that the linear kernel has the most substantial RMSE deviations.
The median RMSE of the linear kernel for the MLP increases with the compression
rate. The RBF is the best performing kernel for the MLP, as it has the lowest median
RMSE when compared to the other kernels or at least a similar median RMSE. The
RMSE for both non-linear kernels behaves similarly to the MLP model, with only
slight changes in lower compression rates. The SVR shows a similar RMSE behavior
but with more variations throughout the different compression rates.
The results for the Europe Wind Farm dataset are shown in Figs. 17 and 18. The
compression rates vary between 2.67 and 12.0. The linear kernel has the lowest
median forecast error for both ML models on all compression rates up to a compres-
sion rate of 8. From a compression rate of 8, the cosine kernel seems to be performing
better for all ML models in comparison to the other kernels. It is worth noting that
all kernels show some outliers in forecast error.
Figures 19 and 20 show the results of the GEFCOM2014 Wind dataset. Due to
the amount of input features, the compression rates vary between 1.44 and 6.5. It
can be seen that for most compression rates the cosine kernel performs well for both
ML models. The median error of the cosine kernel varies between 0.225 and 0.26
for the MLP model and between 0.21 and 0.26 for the SVR model.
– We vary the number of latent features between 9 and 2. These numbers provide
a broad range of compression rates depending on the dataset showing the effects
concerning input features.
– For AE, DAE, and VAE, learning rates of 0.001, 0.0005, and 0.0001 are evaluated.
For CAE we test the learning rates of 0.01, 0.001, and 0.0001.
– Leaky ReLUs are used as activation functions.
– The Adam optimizer is used to train the network.
– In each layer, the number of neurons is reduced by a factor of 0.8 making it possible
to create deep nets that successively reduce the number of features to the required
number of latent features. Note that another common possibility is first to increase
the number of neurons compared to the original input features. In a sense, this
would be similar to the transformation of the nonlinear kernel PCA, but we do not
consider it.
– Utilizing Xavier as initialization, as a state-of-the-art method to initialize weights,
minimizes the risk of exploding gradients [7].
– Similar advantages are achieved by normalizing the input (e.g., avoiding exploding
gradients). Therefore, we use batch normalization in each layer [12].
Preliminary examinations showed that using batch normalization for AE, DAE, and VAE achieves at least similarly good results as without it.
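A sketch of how an encoder following these settings could be constructed in PyTorch is given below; only the encoder half is built, and the input and latent dimensions are illustrative.

import torch
import torch.nn as nn

def build_encoder(n_features, n_latent, shrink=0.8):
    layers, width = [], n_features
    while int(width * shrink) > n_latent:        # reduce the width by a factor of 0.8
        next_width = int(width * shrink)
        layers += [nn.Linear(width, next_width),
                   nn.BatchNorm1d(next_width),   # batch normalization in each layer
                   nn.LeakyReLU()]               # Leaky ReLU activation
        width = next_width
    layers.append(nn.Linear(width, n_latent))    # down to the number of latent features
    encoder = nn.Sequential(*layers)
    for m in encoder.modules():                  # Xavier initialization of the weights
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    return encoder

encoder = build_encoder(n_features=30, n_latent=4)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)   # Adam optimizer
print(encoder)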
Summary of the Evaluation Results for the AE, DAE, VAE and CAE Architec-
tures: The results of the deep architectures for the German Solar Farm dataset are
shown in Figs. 21 and 22. In all cases the AE and DAE have a predominantly similar
median RMSE and forecast error deviation. Compared to the other RL models, the
CAE has the highest median RMSE values for both ML models and the VAE has
the smallest median RMSE for both ML models, except for a few compression rates.
For the MLP, the VAE obtains a median RMSE comparable to the PCA experiment.
Figures 23 and 24 show the results for the Europe Wind Farm dataset. The AE and
the DAE have the smallest median forecast error, where the DAE has a slightly smaller
standard deviation. For smaller compression rates the two models have similar results
to those of the PCA, but slightly improve for higher compression rates. The VAE
has a similar forecast error for all compression rates and forecast models. The CAE
performs a bit worse than the AE and DAE. All RL models produce some outliers
regarding the forecast error.
The results of the GEFCOM2014 Wind dataset are shown in the Figs. 25 and
26. The AE and DAE perform similarly on both forecast models. Both of these RL
models also have the best performance for smaller compression rates. For the larger compression rates, the VAE performs best. The CAE is the worst performing model with a high overall RMSE.
In the previous section, we use the learned feature representation directly and train
ML models based on those. However, fine-tuning provides a more sophisticated
approach towards forecasting power time series with previously learned AEs. The
problem with the previously mentioned approach is that the autoencoder's weights are not optimal for the forecasting problem: we optimized the weights to reconstruct the input features from a smaller feature representation. Due to this unsupervised learning process for the autoencoder, the learned representation might not be ideal for forecasting power time series. Fine-tuning
tries to overcome this problem by partly updating the weights of the trained AE for
the supervised task of power time series forecasting.
For our scenario, this means that we partially re-train the previously learned
encoder. First, we add a linear layer, equal to the MLP from the previous experiment,
to the bottlenecks of the AEs. Correspondingly, we have a hidden layer with 100
neurons connected to the number of latent features of the encoder. Furthermore, we
add an output layer to map the 100 neurons to the power time series. Second, in the
process of fine-tuning, we restrict the adaptation of weights to the last four layers:
The output layer of the MLP, the hidden layer of the MLP, and the last two layers
of the already trained AE. This restriction minimizes the training effort and makes
the best use of the previously learned representation.
Figure 27 illustrates the results of the procedure described above. In contrast to
the default parameters of scikit-learn, we use a weight decay of 0.2 as described
in [21] for AE and DAE. Furthermore, we train for 2000 epochs with a batch size of
2048.
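The fine-tuning setup can be sketched in PyTorch as follows; the encoder stands in for a pre-trained AE encoder, the decoupled weight decay of 0.2 [21] is realized here with AdamW, and the batch size of 2048 follows the description above, while the layer sizes are illustrative.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(30, 24), nn.LeakyReLU(),    # pre-trained encoder (stand-in)
                        nn.Linear(24, 12), nn.LeakyReLU(),
                        nn.Linear(12, 4))                      # bottleneck
head = nn.Sequential(nn.Linear(4, 100), nn.ReLU(),             # MLP hidden layer
                     nn.Linear(100, 1))                        # output layer for the power
model = nn.Sequential(encoder, head)

for p in model.parameters():                                   # freeze everything ...
    p.requires_grad = False
for p in list(encoder[-3:].parameters()) + list(head.parameters()):
    p.requires_grad = True                  # ... except the last encoder layers and the head

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad],
                              lr=1e-3, weight_decay=0.2)       # decoupled weight decay [21]

x, y = torch.randn(2048, 30), torch.randn(2048, 1)             # one batch of size 2048
loss = nn.MSELoss()(model(x), y)
loss.backward()
optimizer.step()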
The fine-tuning improves the previous results shown in Fig. 21, even though we
train the same number of neurons compared to the previous experiment on the MLP
model. By using fine-tuning, the median RMSE improves to values between 0.09
Fig. 28 Forecast error of fine-tuned MLP model based on AE, DAE, VAE, and CAE. Evaluated
for the different compression rates for the Europe Wind dataset
Fig. 29 Forecast error of fine-tuned MLP model based on AE, DAE, VAE, and CAE. Evaluated
for the different compression rates for the GEFCOM2014 Wind dataset
and 0.1 for all compression rates for the encoder of the AE and DAE. The median
RMSE of the VAE is about 0.01 higher in comparison to the other fine-tuned models.
Compared to the PCA results in Fig. 15, we can see that for compression rates higher than 6.2 the fine-tuned AE and DAE have smaller median RMSE values. Further, in all cases, the fine-tuned models have an at least similarly small standard deviation of the forecast error.
The Europe Wind Farm dataset achieves a more substantial improvement, as seen
in Fig. 28. The median RMSE decreases to 0.15 up to a compression rate of 6 and
then increases to 0.19 for the best models. These results are an improvement over the
best PCA, with the smallest median RMSE of 0.1755 for lower compression rates.
We obtain similar improvements with the GEFCOM2014 Wind dataset, as shown
in Fig. 29. For almost all compression rates the median RMSE is around 0.175. These
results are an improvement over the best PCA, with the smallest median RMSE of
0.225.
6 Conclusion
This chapter introduced RL in the context of power time series forecasting.
We did this by introducing traditional pre-processing methods such as feature selec-
tion and feature engineering. Instead of manually finding useful input features for
our ML task, we applied RL algorithms, especially ANNs with deep architectures, to
learn latent features. We additionally showed how to evaluate the representation with
and without successive ML algorithms to find good RL models. In most cases, we
differentiated between distribution-based measurements and measurements applied
to the output of ML models trained on the feature representation. In the end, we
showed various examples of RL in the field of renewable power forecasting.
Representation learning can be seen as the starting point for every ML task, as it
obviates the necessity of domain knowledge and permits machine learning models
Acknowledgements This work was supported within the project Prophesy (0324104A) and c/sells
RegioFlexMarkt Nordhessen (03SIN119) funded by the BMWi (Deutsches Bundesministerium für
Wirtschaft und Energie / German Federal Ministry for Economic Affairs and Energy).
References
1. Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M.S., Van Esesn, B.C.,
Awwal, A.A.S., Asari, V.K.: The history began from AlexNet: a comprehensive survey on deep
learning approaches. CoRR arXiv:1803.01164 (2018)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives.
CoRR arXiv:1206.5538 (2012)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
4. Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10),
78 (2012)
5. Gensler, A.: Wind power ensemble forecasting. Ph.D. thesis, University of Kassel (2018)
6. Gensler, A., Henze, J., Sick, B., Raabe, N.: Deep learning for solar power forecasting—an
approach using AutoEncoder and LSTM neural networks. In: 2016 IEEE International Con-
ference on Systems, Man, and Cybernetics, pp. 002858–002865 (2016)
7. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural net-
works. AIStats 9, 249–256 (2010)
8. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
9. Freedman, D.A.: Statistical Models: Theory and Practice, 2nd edn. Cambridge University Press
(2005)
10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, vol. 27. Springer (2017)
11. Hong, T., Pinson, P., Fan, S., Zareipour, H., Troccoli, A., Hyndman, R.J.: Probabilistic energy
forecasting: global energy forecasting competition 2014 and beyond. Int. J. Forecast. 32(3),
896–913 (2016)
12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing
internal covariate shift. CoRR arXiv:1502.03167 (2015)
13. Joy, J., Jasmin, E.A., John, V.: Challenges of smart grid. Int. J. Adv. Res. Electric. Electron.
Instrum. Eng. 2(3), 2320–3765 (2013)
14. Khalid, S., Khalil, T., Nasreen, S.: A survey of feature selection and feature extraction tech-
niques in machine learning. In: Science and Information Conference, pp. 372–378. IEEE (2014)
15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR arXiv:1412.6980
(2014)
16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR arXiv:1312.6114 (2013)
17. Kirchgässner, G., Wolters, J.: Introduction to Modern Time Series Analysis. Springer, Berlin
(2007)
18. Kohler, S., Agricola, A.C., Seidl, H.: dena-Netzstudie II. Technical report, Deutsche Energie-
Agentur GmbH (dena), Berlin (2010)
19. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E Stat.
Phys. Plasmas Fluids Related Interdiscip. Topics 69(6), 16 (2004)
20. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R., Tang, J., Liu, H.: Feature selection: a
data perspective. CoRR arXiv:1601.07996 (2016)
21. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. CoRR arXiv:1711.05101
(2017)
22. Lund, H., Østergaard, P.A.: Electric grid and heat planning scenarios with centralised and
distributed sources of conventional CHP and wind generation. Energy 25(4), 299–312 (2000)
23. Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for
hierarchical feature extraction. In: International Conference on Artificial Neural Networks, pp.
52–59. Springer, Berlin (2011)
24. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press (2012)
25. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A.,
Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: Neural Information Processing
Systems (2017)
26. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res.
12, 2825–2830 (2011)
27. Plaut, E.: From principal subspaces to principal components with linear autoencoders. CoRR
arXiv:1804.10253 (2018)
28. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit
invariance during feature extraction. In: Proceedings of The 28th International Conference on
Machine Learning (ICML-11), vol. 1, pp. 833–840 (2011)
29. Schölkopf, B., Smola, A., Müller, K.R.: Kernel principal component analysis. In: International
Conference on Artificial Neural Networks, pp. 583–588 (1997)
30. Seide, F., Li, G., Chen, X., Yu, D.: Feature engineering in context-dependent deep neural net-
works for conversational speech transcription. In: Workshop on Automatic Speech Recognition
& Understanding, pp. 24–29. IEEE (2011)
31. Serrà, J., Arcos, J.L.: An empirical evaluation of similarity measures for time series classifica-
tion. CoRR arXiv:1401.3973 (2014)
32. Stańczyk, U., Lakhmi, J.C. (eds.) Feature Selection for Data and Pattern Recognition, 1st edn.
Springer-Verlag, Berlin (2015)
33. Van Der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–
2605 (2008)
34. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbren-
ner, N., Senior, A., Kavukcuoglu, K.: WaveNet: a generative model for raw audio. CoRR
arXiv:1609.03499 (2016)
35. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (2000)
36. Wang, Q.: Kernel principal component analysis and its applications in face recognition and
active shape models. CoRR arXiv:1207.3538 (2012)
37. Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L.: Time Series Classification Using Multi-
Channels Deep Convolutional Neural Networks. Lecture Notes in Computer Science, vol.
8485 (2014)
Deep Learning Application: Load
Forecasting in Big Data of Smart Grids
Abstract Load forecasting in smart grids is still exploratory; despite the increase of
smart grids technologies and energy conservation research, many challenges remain
for accurate load forecasting using big data or large-scale datasets. This chapter
addresses the problem of how to improve the forecasting results of loads in smart
grids, using deep learning methods that have shown significant progress in various
disciplines in recent years. The deep learning methods have the potential to
extract problem-relevant features and capture complex large-scale data distributions.
Existing research in load forecasting tends to focus on finding predicted loads using
small historical datasets and the behavior of the load’s consumers in smart grids.
Moreover, current research which applies the conventional deep learning methods
for load forecasting has shown better performance than conventional load forecasting
methods. However, there is little evidence that researchers have addressed the issue of
hybridizing different deep learning methods for complex large-scale load forecasting
in smart grids, with the intent of building a robust predictive model in smart grids
and understanding the relationships that exist between different predictive models
and deep learning methods. Consequently, the purpose of this chapter is to provide
an overview of how the load forecasting performances using deep learning methods
in smart grids can be improved.
A. Almalaq (B)
The Department of Electrical Engineering, University of Hail, Hail, Saudi Arabia
e-mail: [email protected]
J. J. Zhang
The Department of Power Engineering, Wuhan University, Wuhan, China
1 Introduction
Today’s power system infrastructure has been developed and improved using new
technologies in different aspects. The new concept of power system grids, “smart
grids”, is a modern power system infrastructure that aims to build robust, reliable,
efficient grids and minimize the cost of production. Enhancing the grids with renew-
able energy resources, automated control, and communication technologies provides
possible means of efficiency, reliability, and safety for smart grids. The objective of
the smart grids is to advance the use of technology and communication dramatically
by investing in the bidirectional flow of power and data. The smart grid infrastruc-
ture is full of advanced sensing, communicating, and computing abilities that work in an interoperable way across different power system parts, generation, and distribution [1].
The infrastructure scheme is illustrated in Fig. 1.
The effectiveness of smart grids relies on three primary roles that can help maintain
and manage the grids as follows:
– Dynamic pricing.
– Demand-side management.
– Load forecasting.
The implications from these roles highlight the need to consider the planning and
operation of the power system. The dynamic pricing provides a real-time pricing and
control [1]. An application of demand-side management is the demand response that
can be categorized into these three aspects [2]:
– Peak clipping: reducing peak loads to avoid exceeding the capacity of substations.
– Valley filling: promoting energy storage devices during off-peak loads.
Fig. 1 An overview of smart grids scheme. The physical part of the smart grids includes generation,
transmission, distribution and electric loads. The power flow is stepped-up after the generation and
stepped-down after transmission. The distribution power is measured with smart meters installed
at the end-user’s side. The information flow is bidirectional from the generator side to the end-user
– Load shifting: shifting the energy consumption, e.g., shifting the energy demand
from peak load time to off-peak load time using energy storage devices.
Load forecasting is an essential task that predicts future energy consumption in
order to meet the primary roles of smart grids at any time. In planning, operation, and
control of the power system, load forecasting is a crucial primary element to define
the distribution system capabilities that need to be obtained by the future system.
If load forecasting is inaccurate, all subsequent steps in planning future loads are affected, and the entire planning and operation are at risk. Accurate load forecasting not only helps in optimizing the future generating units, it also reduces the investment in future power facilities and helps to define the risk factors in planning,
operation, and control tasks. Moreover, electricity price forecasting provides useful
information for power suppliers and customers using a developed bidding system.
Both suppliers and customers need accurate price predictions in order to establish
their bidding strategies to maximize their profits and benefits. Therefore, to achieve
the smart grids’ goals, accurate and efficient load and price forecasting has become
a crucial technique.
Although extensive research has been done using different physical and statistical
models, accurate electric load forecasting remains a challenge in smart grids. Var-
ious artificial intelligence techniques and machine learning algorithms used in the
load forecasting problem are still insufficient to predict the load in the desired form
accurately. Moreover, most of these models are based on small datasets, and their
prediction errors are relatively high. Enhancing the smart grids with deep learning
methods to forecast loads will provide accurate predictions and efficient predictive
modeling as illustrated in Fig. 2.
In this chapter, we will explore the importance of load forecasting in the energy
industry and power systems; in particular, how energy consumption and electrical
loads are reflecting the critical decisions in smart grids. We will research the tradi-
tional deep learning methods used for load forecasting in smart grids and investigate
the hybrid deep learning methods applied for load forecasting using a real big dataset.
Fig. 2 An overview of the load forecasting scheme. The aggregated electric loads are represented as
load profiles using the smart meters. The CPU-based computer is used for data preprocessing, e.g.,
data cleaning and data normalization. The GPU-based computer is used to run the deep learning
methods. The electric loads and load profiles are discussed in Sect. 2. The data preprocessing and
forecasting model are described in Sect. 4
In the next section, we will demonstrate how load forecasting is becoming a significant
contributor to energy expansion, how its objectives differ across time horizons, how
it is affected by various influential factors, and why stochastic time series make it
challenging. In the third section, we will extensively review the existing research on
load forecasting using conventional methods, machine learning methods, and deep
learning methods, highlight open issues, and narrow the gaps in existing research
using significant deep learning approaches. In the fourth section, we will elaborate
different promising load forecasting models using hybrid deep learning methods and
compare their results with existing load forecasting approaches. In the last section,
we will conclude with a summary, a balanced assessment of the contribution of load
forecasting in smart grids, and a roadmap for future research directions.
The content of this chapter should be useful to researchers interested in electricity
market forecasting as well as graduate students working on electrical engineering
problems, especially load forecasting and energy consumption prediction.
2 Background
The evolution of smart systems such as smart grids around the world has raised new
challenges and opportunities for utility providers as well as households and enterprises.
Before this development, energy suppliers and integrated utilities carried fewer
financial risks and energy management ventures; in addition, end users had no option
but to buy electricity through cost-based contracts from local providers. Under this
arrangement, providers managed all tariffs and costs and passed them on to their
customers.
Later, developed electricity markets faced a new challenge in competitive markets
that allow any energy supplier to buy electricity and natural gas. Subsequently, utility
costs changed from cost-based to market-based tariffs that provide end customers
multiple options for the same utility at different rates. On the suppliers' side, this
competitive market introduced a variety of risks such as the fluctuation of fuel and
electricity prices and the uncertainties of renewable energy resources. Moreover, on
the end user's side, energy consumption is the main risk because of the modernization
of customers' lifestyles, which brings considerable uncertainty in customer loads and
peak demands.
With these risk factors and the large uncertainties of fuel prices, renewable energy,
and demands, accurate load forecasting has become an essential technique for energy
market participants such as providers and end users. In addition to these risk factors,
the importance of accurate predictions rests on several other considerations. For
example, addressing the electrical demand and determining the peak time are essential
for providers, while avoiding high electricity prices and reducing energy consumption
are crucial for end customers.
The electrical load at the distribution level is oscillatory and subject to change because
human activities follow daily, weekly, and monthly cycles. For instance, the load is
generally higher in the daytime and early evening, and lower in the late evening and
early morning. Every electrical appliance or light bulb that is switched on or off by
customers directly affects the electrical load seen on the distribution feeder. In general,
customers buy electricity from providers to power their end uses. The distribution
system therefore exists to deliver the demanded energy to customers in the form of
electrical appliances and equipment, lighting, heating, and cooling, as well as other
demands in the commercial and industrial sectors. The distribution system of the
smart grid must satisfy customer needs in order to deliver a high quality of service.
At the same time, electricity capacity is the maximum electric power that a specific
energy resource can generate under ideal conditions. The capacity represents the
resources that must be adequate to satisfy the load peaks at all times. Generation
capacity is commonly measured in kilowatts (kW) or megawatts (MW). For example,
if a wind farm provides 6% (2 MW) of the local generation capacity, this does not
mean that it contributes 2 MW to the utility under all conditions; it does so only under
ideal conditions, and the actual amount is generally lower. Electricity generation, in
contrast, is the amount of energy produced over a specific period of time, and it is
commonly measured in kilowatt hours (kWh) or megawatt hours (MWh). For example,
if the wind farm runs at its maximum capacity for three consecutive hours, it will
produce 6 MWh of energy. If it runs at only half of its maximum capacity for those
three hours, it will produce 3 MWh. Many energy resources do not operate at their
maximum capacity all the time, so the energy produced varies with the conditions at
the power plants.
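To make the distinction between capacity (kW, MW) and generated energy (kWh, MWh) concrete, the short Python snippet below reproduces the wind-farm arithmetic from the paragraph above; the numbers are only the illustrative values already quoted.

```python
# Illustrative capacity-versus-energy calculation for the wind-farm example above.
capacity_mw = 2.0        # rated (ideal) capacity of the wind farm in MW
hours = 3                # length of the operating period in hours

energy_full_mwh = capacity_mw * hours          # full capacity: 2 MW * 3 h = 6 MWh
energy_half_mwh = 0.5 * capacity_mw * hours    # half capacity: 1 MW * 3 h = 3 MWh

print(energy_full_mwh, energy_half_mwh)        # -> 6.0 3.0
```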
Accordingly, the electrical load demand is a trade-off between a high quality of service
and the electricity generation capacity. Besides, the uncertainties of fuel, renewable
energy, and actual load demands are substantial risk factors on the suppliers' side.
Therefore, energy suppliers need adequate planning models and efficient forecasting
models to determine the actual loads and satisfy customer demands.
Fig. 3 Different types of load forecasting. The time horizon of each category is illustrated together
with the purpose of the application. The typical time interval of the STLF is days, the MTLF months,
and the LTLF years
Forecasting techniques are used to predict load, electricity price, fossil fuels, wind
power, and solar power. In this chapter, we focus on load forecasting in both the
literature review and our case study, and we elaborate on the deep learning methods
applied to it.
The categories of load forecasting differ in their time horizons, which reflect the
purpose of the prediction, as illustrated in Fig. 3. We define the three main categories
and their objectives as follows [3, 4]:
– Long-term load forecasting (LTLF): The time interval of this type of forecasting
ranges from five years to decades in the future. The LTLF is used mainly for generation
and transmission systems, where the aim is to plan the future electricity capacity or
grid in a size- and cost-efficient manner.
– Medium-term load forecasting (MTLF): The forecasting time interval of this type
ranges from a month to five years. The purpose of the MTLF is essentially to plan
near-future power plants and to capture the dynamics of the smart grid.
– Short-term load forecasting (STLF): This type handles time horizons from a single
hour up to a couple of weeks. The STLF is necessary for the scheduling of power
plants. Besides, the applications of this type of forecasting include real-time control,
energy transfer scheduling, economic dispatch, and demand response.
The elemental purpose of load forecasting is to predict future load patterns for cost
saving and better planning and operation. Prior knowledge of the influential factors
that affect the load patterns is key to accurate load predictions. The different factors
influencing the electrical load and energy demand have been identified, researched,
and utilized in several papers in the literature [5–7]. These factors are difficult to
separate with certainty because they affect the STLF, MTLF, and LTLF models
differently. The important factors that should be considered when modeling load
forecasting are classified as follows:
– Time factor: The electrical load varies with customers' activities. In a daily load
pattern, it is worth noticing that energy demand is higher at certain times. In general,
the load demand is higher in the daytime than at night. For instance, industrial and
commercial energy consumption is higher during working hours, while residential
energy consumption is higher in the evening. Working hours and working days are
crucial because of the variation in load patterns: the early working hours consume
less than the middle working hours, and, similarly, weekends consume less energy
than working days. Energy consumption on holidays is more difficult to forecast
because of infrequent activities. The load curve at each time resolution, such as daily,
weekly, monthly, or yearly, is periodic but variable and inconsistent; it always has a
peak hour of the day, day of the week, week of the month, and month of the year.
– Weather factor: Significant weather conditions are influential factors in load fore-
casting. The weather conditions include temperature, humidity, wind speed, and cloud
cover, and they are considered mostly for STLF modeling. High temperatures in the
summer season affect customers' comfort, so they consume more energy for cooling.
Likewise, low temperatures in the winter season affect customers' comfort, so they use
more energy for heating. Therefore, there is a strong positive correlation between high
temperature and energy consumption in the summer season and a strong negative
correlation between low temperature and energy consumption in the winter. Humidity
intensifies the perceived severity of high and low temperatures, so customers increase
their energy consumption under significant humidity and temperature conditions.
Therefore, humidity is an important component for load forecasting.
– Customer factor: Different kinds of customers consume energy for different pur-
poses, such as residential, commercial, and industrial customers. Energy consumption
activities differ from one kind of customer to another, although the load curves are
broadly similar within one kind. The customer factor mainly depends on the size of
the property, the type of property, the number of occupants, and the amount of
electrical equipment. However, electrical equipment usage and energy consumption
may still vary from one consumer to another within one kind.
Table 1 Influential factors on load forecasting and their typical use

Factor  | Examples                                                                          | Typically used in
Time    | Load patterns in: minutely, hourly, daily, weekly, monthly, seasonally and yearly | STLF and MTLF
Weather | Significant conditions in: temperature, humidity, wind speed and cloud cover      | STLF and MTLF
Economy | Increase in: fossil fuel price, electricity price, industrial and commercial development and population growth | LTLF
Other   | Large social events, sport events and industrial experiments                      | LTLF
– Economy factor: The economy plays an important role in load forecasting, espe-
cially for LTLF models. The economic factors include fossil fuel prices, electricity
prices, industrial and commercial development, and population growth. Fuel prices
can influence load curves by increasing the electricity price, which affects customers'
consumption. Likewise, low electricity prices increase energy consumption, and
hence the load demand increases. Industrial and commercial development in a
particular area increases energy consumption, as does population growth in that area.
– Other factors: Other factors that can affect the load demand are mainly non-periodic
occasions and events with large energy consumption, such as large social events,
sports events, and industrial experiments. These types of high energy consumption
are difficult to predict and lead to higher average prediction errors in the forecasting
model.
In short, these factors may not influence each load forecasting model in the same
way, but they are essential to consider. The most critical factor is time, which directly
shapes end customers' activities. In addition, temperature and humidity are relevant
influential factors for load forecasting because human comfort and activities respond
directly to weather conditions through heating and cooling. Accordingly, Table 1
summarizes the different influential factors and their use in each load forecasting
category.
In this section, we will highlight the existing issues of load forecasting, extensively
review the existing research on load forecasting using conventional statistical methods,
machine learning methods, and deep learning methods, and narrow the gaps of existing
research using significant deep learning approaches. First, we will look at some
current general issues of load forecasting modeling. Then, we will give a short
description of the most commonly encountered methods and highlight the advantages
and disadvantages of each. Finally, we will focus on deep
learning methods applied for load forecasting, demonstrate their key conceptual and
algorithmic facets and narrow the gaps of existing methods. We will give an overview
of their prediction results, the field of studies, locations, scale, the dataset used, the
model used and the year of publication.
In general, good load forecasting models and accurate predictions are vital for
appropriate planning, operation, and control. All load forecasting categories are
difficult to model over a planning period because of the many challenges and
influential factors mentioned above. Thus, accurate prediction remains challenging
due to the following difficulties:
– The strong correlation with weather conditions, which are sometimes unpredictable
and can shift to unexpected states.
– The large variation of energy consumption between customers due to unpredictable
events and activities, together with the limited deployment of smart meters to record
energy consumption efficiently.
– Single-customer load forecasting is more difficult than forecasting the grid load,
because of the lack of long historical data for a single customer and the stochastic
effects of customer activities.
– The non-stationary nature of electric load time series, which do not have a constant
mean and variance.
– The high volatility of the electrical load due to seasonality and time factor effects;
sometimes the same season differs from one year to another.
Statistical-based models
So far, various statistical-based techniques applied for load forecasting have been
investigated, all of them with differing degrees of success. There are conventional
forecasting models such as similar-day method, exponential smoothing, linear regres-
sion, multiple regression, Autoregressive Moving-Average (ARMA), and Autore-
gressive Integrated Moving-Average (ARIMA). Since the scope of this work is
limited to deep learning-based methods, we will give a short description of some
statistical-based methods below.
– Similar-day method: This is one of the simplest methods for load forecasting
because it relies on searching for similar days in the past. For instance, we search the
historical load data for days with similar characteristics and average them to obtain
the forecast. This method is fast and easily captures the overall load behavior;
however, it does not account for grid expansion and structural changes.
– Exponential smoothing: This approach smooths the time series by taking an
exponentially weighted moving average of past load observations. It is robust and
accurate; however, it does not accommodate more than one seasonal pattern. This
approach was applied to load forecasting in [8] (a brief sketch follows this list).
– Linear and multiple regressions: Regression is a statistical tool for estimating the
relationship between a dependent variable and independent variables. It helps to
understand how the dependent variable responds to changes in the independent
variables and which of them is most related to it. Linear regression is a simple
regression method that accommodates one dependent variable and one independent
variable and predicts the dependent variable as a function of the independent variable;
it finds the best-fitting straight line, called the regression line, through the data points.
Multiple regression extends simple regression to one dependent variable and more
than one independent variable. The dependent variable could be the measured load
data within a certain period of time, and the independent variables could be any
influential factor, e.g., the day of the week, temperature, population size, etc.
Regression approaches are easy to model and compute; however, they are sensitive
to outliers and to the linearity assumption.
– ARMA and ARIMA: The autoregressive (AR) model and the moving average (MA)
model help to capture the correlation between dependent and independent variables
and to predict future values of the dependent variable. The AR part uses the asso-
ciation between an observation and its own lagged values, and the MA part models
the residual errors as a moving average of past errors. The more advanced ARIMA
model adds an integrated (I) component that differences the current observations with
past observations to make the time series stationary. These models are usually
referred to by their orders, such as ARMA(p, q) and ARIMA(p, d, q): the lag order p
is the number of lag observations included in the model, the degree of differencing d
is the number of times the raw observations are differenced, and the moving average
order q is the size of the moving average window. The success of these models
therefore depends on the developer's experience and skill in choosing the right orders
(see the sketch after this list). This approach has been widely used for load
forecasting, especially for the STLF, in [9, 10].
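As a rough illustration of the statistical baselines above, the sketch below computes an exponentially weighted moving-average forecast and fits an ARIMA(p, d, q) model with the widely used pandas and statsmodels libraries; the file name, column name, smoothing factor, and orders (2, 1, 2) are illustrative assumptions, not values taken from this chapter.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical hourly load series; "hourly_load.csv" and "load_kw" are placeholders.
load = pd.read_csv("hourly_load.csv", index_col="timestamp", parse_dates=True)["load_kw"]

# Exponential smoothing: the next-step forecast is the exponentially weighted
# moving average of past observations (alpha chosen arbitrarily here).
ewma_forecast = load.ewm(alpha=0.3).mean().iloc[-1]

# ARIMA(p, d, q): p lag observations, d differencing steps, q moving-average terms.
result = ARIMA(load, order=(2, 1, 2)).fit()
day_ahead = result.forecast(steps=24)   # 24 hourly steps ahead

print(ewma_forecast)
print(day_ahead.head())
```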
Machine learning-based models
On a similar note, machine learning-based techniques are widely known for their
ability to accommodate complex systems, non-linear models, and non-stationary time
series. These advantages give machine learning-based models an edge over traditional
statistical-based models, which require prior knowledge of influential factors and
modeling experience to achieve accurate load forecasting. Machine learning-based
techniques are self-learning methods that classify and predict from the input and
output data automatically through their algorithms; no particular experience with or
knowledge of the forecasting model is necessary to achieve accurate load forecasting.
Relevant findings on machine learning-based techniques for load forecasting problems
in the literature show acceptable prediction errors; however, these attempts did not
provide better performance than deep learning-based methods. Since this work
concentrates on deep learning-based techniques applied to load forecasting, we give a
short overview of some machine learning-based techniques, namely decision tree
regression, support vector regression, and artificial neural networks, below.
– Decision tree regression: This approach builds a predictive regression model by
partitioning the data into subsets that form a tree structure, splitting the data into
smaller and smaller subsets as the tree grows. The tree structure has decision nodes,
each with two or more branches, and leaf nodes, each holding one numerical target.
The branches in the tree represent the attribute values of the observations, and the
target variable of decision tree regression is continuous. This approach was imple-
mented for the LTLF in [11] and for energy demand modeling in buildings in [12].
– Support vector regression: This is a supervised learning method that maps the data
observations to points in a feature space. The mapped points are separated by a
hyperplane between the categories, and ideally the margin around this hyperplane
should be as large as possible. The method was applied to load forecasting in [13–15]
(see the sketch after this list).
– Artificial neural networks: This is one of the most widely used techniques in
machine learning. It is a brain-inspired approach that mimics human learning. The
architecture consists of one input layer, one hidden layer, and one output layer; when
it has more than one hidden layer, it is considered a deep neural network, i.e., deep
learning. The connections between the artificial neurons are called edges and carry
connection weights, and learning is carried out through these weights and non-linear
activation functions. This method has been widely used for load forecasting in
[16–20].
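All three machine learning baselines above are available in scikit-learn, the framework also used later in this chapter's case study. The sketch below trains them on lagged load values; the file name, lag length, and hyperparameters are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

def make_lagged(series, n_lags=24):
    """Turn a 1-D load series into (samples, n_lags) inputs and next-step targets."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

load = np.loadtxt("hourly_load.txt")          # hypothetical 1-D array of hourly loads
X, y = make_lagged(load)
split = int(0.8 * len(X))                     # simple chronological train/test split

for model in (DecisionTreeRegressor(max_depth=8),
              SVR(kernel="rbf", C=10.0),
              MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)):
    model.fit(X[:split], y[:split])
    print(type(model).__name__, model.score(X[split:], y[split:]))   # R^2 on the held-out tail
```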
Deep learning-based models
Advanced machine learning techniques with deeper neural networks, which model
more complex systems using multiple layers of non-linear functions, are called deep
learning. The advantages of deep learning models over machine learning are richer
feature extraction, less hand-crafted modeling, and more accurate predictions; how-
ever, their computational cost is higher than that of machine learning and statistical
models. Deep learning-based models hold top accuracy records in many important
problems such as face detection, image processing, recommender systems, natural
language processing, and time series prediction. Although some efforts have applied
deep learning-based models, for example, multilayer perceptrons, convolutional neural
networks, recurrent neural networks, long short-term memory, and gated recurrent
units, to load forecasting and produced more accurate predictions, most of these
attempts were based on conventional implementations. Since this chapter concentrates
on deep learning-based techniques applied to load forecasting, we will give a short
review of some deep learning-based methods and their mathematical representations.
The input to the forecasting models is the sequence of historical observations

X = {X(0), X(1), . . . , X(T)}     (1)

where X(t) is the historical data at time t and t ∈ {1, 2, . . . , T}. For a recurrent neural
network, the load forecasting output L(t) is computed as follows:

h(t) = f(W_X X(t) + W_h h(t − 1) + b_h),
L(t) = f(W_L h(t) + b_L),

where f(·) denotes the activation function, X(t) denotes the inputs, L(t) denotes the
load forecasting outputs, h(t) denotes the hidden state, h(t − 1) denotes the previous
hidden state, W_X denotes the input-to-hidden weights, W_h denotes the hidden-to-
hidden weights, W_L denotes the hidden-to-output weights, b_h is the hidden bias
vector, and b_L is the output bias vector.
– Long short-term memory: Long short-term memory (LSTM) works in essentially
the same way as the RNN, but it employs additional gates for the recurrent neurons,
the forget gate, update gate, and output gate, and an additional internal processing
unit called the cell. Each gate has a specific function in the cell. For example, the
forget gate discards unwanted information from the previous state, the update gate
updates the state with new candidate values, the cell keeps the current state and
separates wanted from unwanted information, and the output gate selects the necessary
information from the cell output. This approach has received attention due to its
superior modeling accuracy and has been used widely in load forecasting [25, 32–35].
Since the RNN method employs only one non-linear
function, the LSTM technique applies five different non-linear functions within the
same cell. In the context of load forecasting, the mathematical representation is
defined as follows:
i_t = g1(W_{i,n} × X(t) + W_{i,m} × L(t − 1) + b_i),     (7)
f_t = g1(W_{f,n} × X(t) + W_{f,m} × L(t − 1) + b_f),     (8)
o_t = g1(W_{o,n} × X(t) + W_{o,m} × L(t − 1) + b_o),     (9)
U = g2(W_{U,n} × X(t) + W_{U,m} × L(t − 1) + b_U),       (10)
C(t) = f_t ⊙ C(t − 1) + i_t ⊙ U,                         (11)
L(t) = o_t ⊙ g2(C(t)),                                   (12)
where g1 denotes the sigmoid activation function, g2 denotes the hyperbolic tangent
activation function, X(t) is the input vector, i_t is the input gate signal (the subscript i
stands for input), f_t is the forget gate signal (f for forget), o_t is the output gate signal
(o for output), U is the update signal, C(t) is the cell state at time t, and L(t) is the
output of the cell, i.e., the load forecast. W_(·) and b_(·) are the weight matrices and
bias vectors, respectively. The weights applied to the current input of a particular
gate are denoted W_(·),n and those applied to the previous output signal W_(·),m.
– Gated recurrent unit: Similarly, the gated recurrent unit (GRU) is another recent and
popular gated RNN architecture that adaptively captures dependencies and features
of time series, and it also mitigates the vanishing gradient problem. The main
difference from the LSTM is that it has a single update gate and a reset gate. The
update gate z_t combines the forget gate and the input gate of the LSTM method to
control the unwanted and wanted information, while the reset gate r_t controls how
much of the previous output is combined with the next processed input. This approach
outperformed the LSTM in [36] and was applied to load forecasting problems in
[37, 38]. The mathematical representation in the load forecasting context is defined
as follows:
z_t = g1(W_{z,n} × X(t) + W_{z,m} × L(t − 1) + b_z),              (13)
r_t = g1(W_{r,n} × X(t) + W_{r,m} × L(t − 1) + b_r),              (14)
U = g2(W_{U,n} × X(t) + W_{U,m} × [r_t ⊙ L(t − 1)] + b_U),        (16)
where g1 denotes the sigmoid activation function, g2 denotes the hyperbolic tangent
activation function, X(t) is the input vector, L(t) is the output vector of the load
forecast, U is the update signal, and ⊙ is element-wise multiplication. W_(·) and b_(·)
are the weight matrices and bias vectors, respectively. The weights applied to the
current input of a particular gate are denoted W_(·),n and those applied to the previous
output signal W_(·),m.
Since these approaches are subcategories of the RNN, they are appropriate tools for
sequential problems such as time series prediction and load forecasting. Additionally,
they alleviate the vanishing gradient problem of the plain RNN and avoid biasing the
model toward only the most recent observations.
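To make the gate equations (7)–(12) concrete, the following numpy sketch carries out a single LSTM cell step. The weight shapes, random initialization, and toy dimensions are purely illustrative; in practice these cells come ready-made in frameworks such as Keras.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, L_prev, C_prev, W_n, W_m, b):
    """One LSTM step following Eqs. (7)-(12): W_n / W_m hold the current-input and
    previous-output weights for the input (i), forget (f), output (o) and update (U) signals."""
    i_t = sigmoid(W_n["i"] @ x_t + W_m["i"] @ L_prev + b["i"])   # input gate, Eq. (7)
    f_t = sigmoid(W_n["f"] @ x_t + W_m["f"] @ L_prev + b["f"])   # forget gate, Eq. (8)
    o_t = sigmoid(W_n["o"] @ x_t + W_m["o"] @ L_prev + b["o"])   # output gate, Eq. (9)
    U = np.tanh(W_n["U"] @ x_t + W_m["U"] @ L_prev + b["U"])     # candidate update, Eq. (10)
    C_t = f_t * C_prev + i_t * U                                 # new cell state, Eq. (11)
    L_t = o_t * np.tanh(C_t)                                     # cell output, Eq. (12)
    return L_t, C_t

# Toy dimensions: 4 input features, 8 hidden units (purely illustrative).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W_n = {k: rng.normal(size=(n_hid, n_in)) for k in "ifoU"}
W_m = {k: rng.normal(size=(n_hid, n_hid)) for k in "ifoU"}
b = {k: np.zeros(n_hid) for k in "ifoU"}
L_t, C_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W_n, W_m, b)
```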
Modeling a deep learning-based paradigm for load forecasting is not an easy task. A
large number of choices and parameters have to be made adequately to achieve
appropriate modeling and accurate predictions. However, a few guidelines can help
developers overcome these challenges. We will go over these guidelines in the next
section, but first we classify the challenges as follows:
– Data scale: The scale of the historical load data is a major factor that affects deep
learning-based modeling. It can influence the model predictions for the following
reasons:
• Outliers and missing values: If the dataset is small, even a few outliers or missing
values will significantly alter the model.
• Train and test data: To evaluate the model properly, the data is split into training
and testing sets, and each must contain enough observations to support proper
training and testing. If the original dataset is small, each split may not have enough
observations.
– Data preprocessing: Data preprocessing is an important step that has to be carried
out before the data is ready as input to the deep learning-based model. This step helps
to manage the forecasting model problems and avoid excessive volatility of the data.
– Designing the deep learning-based model: Selecting an appropriate deep learning
method is the first step in designing an adequate model. There are many architec-
tures of deep learning techniques that were utilized for time series predictions and
electricity load forecasting.
– The appropriate number of hidden layers and neurons: Determining the number of
hidden layers and neurons may be the most challenging task in designing deep
learning-based models, because the model needs to fit the data while keeping the
computational cost low. Too few hidden layers and neurons may make the model too
inflexible to fit the data, while too many may increase the chance of overfitting; in
addition, larger hidden layers mean higher computational complexity.
– Model overfitting and validation: Overfitting arises when the model learns the
details of the training data well but performs poorly on new, unseen test data.
Although examining the model for overfitting is a good strategy for identifying an
excellent forecasting model, validating the model with additional tests is necessary
to obtain a sufficient deep learning-based model for load forecasting.
– Offline modeling: Generally, deep learning-based models are designed for offline
training and testing: the model learns from the entire dataset at once and is evaluated
on a held-out portion, the testing data. Online modeling, in contrast, is a dynamic
approach that learns at each time step from newly arriving data and updates the
predictive model according to the latest data.
Since most of the previous research on load forecasting focuses on small historical
datasets and uses conventional modeling approaches, little emphasis has been placed
on using big data, hybridizing different significant deep learning-based models, or
finding optimal deep learning parameters with techniques such as evolutionary
computation algorithms. An initial analysis found evidence that deep learning methods
are powerful techniques for precise time series prediction. We hypothesize that
hybridizing two or more deep learning-based methods for load forecasting in smart
grids could produce more accurate predictions and form the groundwork for explicitly
broad load forecasting models. Besides, finding optimal deep learning parameters
using different evolutionary computation methods could form the preliminary search
space of deep learning parameters.
In this section, we will present some guidelines and solutions for the issues of
modeling deep learning-based paradigms. We will elaborate a case study of different
promising load forecasting models using hybrid deep learning methods and compare
their results with existing load forecasting approaches.
Since deep learning-based models are sensitive to a large number of choices and
parameters, finding the appropriate model is the crux of the issue. A key element in
finding an adequate model is to follow some of the guidelines
and successful solutions that have been utilized in the literature. A few guidelines and
recommendations can help model designers achieve their goal quickly, but not every
guideline will work equally well for every load forecasting model. These guidelines
and recommendations are listed according to the modeling issues mentioned earlier:
– Data scale: Because of the big data of electrical loads and prices in smart grids,
load forecasting designers nowadays need new approaches and technologies to
achieve their goals. In general, big data, or a large-scale dataset, has a large volume
of information and complex data structures that cannot be processed with traditional
load forecasting models. A large-scale dataset helps utility providers and energy
management operators analyze their systems comprehensively. Besides, they can
design their forecasting models with high-performance computational techniques
such as deep learning models, which perform well on big data by training on batches.
The primary objective of this chapter is to design a deep learning-based forecasting
model using a large-scale dataset.
– Data preprocessing: Data preprocessing refers to all processing applied to the raw
data before it is fed to the deep learning model. Some preprocessing techniques must
be applied before the model learns from the dataset; they help the model perform
better and consume less computation time. Some of the common data preprocessing
techniques used with deep learning methods are listed below:
• Data cleaning: Deep learning-based models are sensitive to defective samples in
the dataset, so data cleaning is essential for good performance. This technique may
include removing or fixing missing data and outliers.
• Data normalization: Normalizing the dataset features avoids the problem of
attributes with large numeric ranges dominating the others and helps the model
perform accurately. Most electrical load datasets contain values on different scales
and in various quantities, for example, load profiles, weather data, and fuel prices, so
normalizing these values before feeding them to the deep learning model makes
learning easier and reduces the computational cost. The mathematical representation
of data normalization is as follows:

X̃(t) = (X(t) − min) / (max − min)     (17)

where X(t) is the original value of the input dataset, X̃(t) is the normalized value
scaled to the range [0, 1], max is the maximum value of the feature, and min is its
minimum value.
– Designing the deep learning-based model: Selecting an appropriate deep learning
architecture is the first step for the load forecasting model. Since various research
papers applied deep learning methods for load forecasting, reviewing these papers,
analyzing their techniques and comparing their results is an important step in
[Fig. 4 flowchart: collected data (e.g., electric load data, date and time data, weather data, economic data) → data preprocessing on a CPU-based computer (fixing or removing outliers and missing data, normalization, splitting the dataset into training and testing sets) → deep learning-based model on a GPU-based computer (e.g., MLP, CNN, RNN, LSTM, GRU, CNN-LSTM) → forecasting results shown with data visualization tools (the model fits the data well, the prediction error is reasonable, the predicted load profile follows the original load profile)]
Fig. 4 An overview of the deep learning-based modeling procedure. The procedure consists of four
main segments. The first segment is selecting an electricity load dataset, which may include
influential factor data. The second segment is the data preprocessing using a CPU-based computer,
since the preprocessing does not need high computational power. The third segment is the predictive
model, which is one of the deep learning algorithms; this step needs a high-performance
computational tool such as a GPU-based computer. The last step shows the forecasting results and
prediction errors and needs visualization tools for outputs such as prediction graphs, training and
testing performance, and comparative charts
prediction errors are not reasonable, the deep learning model could be modified or
changed to improve the prediction accuracy. Besides, the model can be compared with
other deep learning methods. Figure 4 demonstrates the general modeling procedures
of load forecasting.
In this case study, we consider a big dataset of the power consumption of a commercial
building. First, we perform some analysis and preprocessing in order to understand
the nature of the time series dataset and make it ready for the forecasting model.
Then, we set up a hybrid deep learning-based model for hour-ahead, day-ahead, and
week-ahead STLF. We compare the forecasting results with traditional statistical-
based models, machine learning-based models, and deep learning-based models.
Commercial building data
The large-scale dataset of power consumption in a commercial building is publicly
available [41]. The time series covers the year 2010 at fifteen-minute resolution and
includes the power consumption in kW and the outdoor temperature in °F. The
building chosen for this study is building 1, a retail building in Fremont, CA. Figure 5
shows the variation of the average power consumption over 2010.
Hybrid deep learning-based models
Referring to the modeling procedure shown in Fig. 4, our modeling consists of four
main parts, including the preprocessing and the hybrid deep learning-based model.
Fig. 5 The load profile in kilowatts (kW) of the averaged daily power consumption of a commercial
building for one year
The data preprocessing segment prepares the input features collected in the dataset
for the hybrid deep learning-based model.
There are three main steps in the preprocessing segment: the first normalizes the
original dataset as in (17), the second prepares the input data for the supervised
learning technique, and the third splits the normalized supervised dataset into three
parts, the training, validation, and testing sets. To evaluate the performance of the
proposed model accurately, the training data is used for the training process, the
validation data is used to validate the model performance, and the testing data is used
only for testing the forecasting process on unseen data.
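A minimal sketch of these three preprocessing steps (normalization as in (17), framing the series for supervised learning, and the chronological split) is shown below. The file name and window length are illustrative assumptions; the 70/20/10 split ratios follow the evaluation described later in this section.

```python
import numpy as np

load = np.loadtxt("building1_load.txt")   # hypothetical 15-minute power readings in kW

# Step 1: min-max normalization as in Eq. (17), scaling the series to [0, 1].
load_norm = (load - load.min()) / (load.max() - load.min())

# Step 2: frame the series for supervised learning: a window of past observations
# is the input and the value that follows the window is the target.
window = 96                                # e.g., one day of 15-minute samples
X = np.array([load_norm[i:i + window] for i in range(len(load_norm) - window)])
y = load_norm[window:]

# Step 3: chronological split into training (70%), validation (20%) and test (10%) sets.
n = len(X)
i_train, i_val = int(0.7 * n), int(0.9 * n)
X_train, y_train = X[:i_train], y[:i_train]
X_val, y_val = X[i_train:i_val], y[i_train:i_val]
X_test, y_test = X[i_val:], y[i_val:]
```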
The hybrid deep learning-based model in the third step is based on an encoder and a
decoder, which are a CNN model and an LSTM model, respectively. The input of the
CNN-LSTM is the power consumption record of the commercial building after the
preprocessing step, and the output is the power consumption forecast for the next day
and the next week. Unlike traditional CNN or LSTM models, it hybridizes these two
successful methods to improve the learning process. The first half is the CNN, which
extracts the input features and encodes them as in (5) and (6), and the second half is
the LSTM, which analyzes the extracted features as in (7)–(12) and decodes them to
predict the power consumption for the next period of time. The approach includes
two one-dimensional convolutional layers to improve the extraction of the input
features, one one-dimensional pooling layer to collect the extracted features, and two
LSTM layers to analyze the collected features and predict the output, as shown in
Fig. 6.
Fig. 6 The architecture of the hybrid deep learning-based model. Circles in the input layer and
output layer represent input X (t) and output L(t), respectively. Circles in the other layers represent
the cells of each layer. The activation function in the convolutional layers is ReLU and in the LSTM
layers is sigmoid
This CNN-LSTM model is implemented using Python 2.7, the Keras deep learning
framework [42], and the scikit-learn framework [43]. We configured the network with
the parameters and activation functions shown in Fig. 6. Because the CNN model
leaves few choices for the number of neurons, we selected 64 hidden neurons in the
first convolutional layer, 32 in the second convolutional layer, and 16 for the pooling
layer; the decoder segment then reverses the number of hidden neurons with 32 and
64 hidden neurons in the LSTM layers. The optimizer is Adam, the loss function is
the mean squared error, and the total number of training epochs is 1000.
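A Keras sketch of a network in the spirit of the architecture described above (two one-dimensional convolutional layers with 64 and 32 units, a one-dimensional pooling layer, and two LSTM layers with 32 and 64 units) is shown below. The kernel sizes, input shape, dense output layer, and the modern tensorflow.keras import path are assumptions for illustration rather than the authors' exact configuration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

window, n_features = 96, 1     # illustrative input shape: 96 time steps, one feature

model = Sequential([
    # Encoder: 1D convolutions extract features from the input load window.
    Conv1D(64, kernel_size=3, activation="relu", input_shape=(window, n_features)),
    Conv1D(32, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),                               # collect the extracted features
    # Decoder: LSTM layers analyze the feature sequence and predict the next value.
    LSTM(32, activation="sigmoid", return_sequences=True),
    LSTM(64, activation="sigmoid"),
    Dense(1),                                                # next-step power consumption
])
model.compile(optimizer="adam", loss="mse")                  # Adam optimizer, mean squared error
# Example fit call, reusing the arrays from the preprocessing sketch (adds a channel axis):
# model.fit(X_train[..., None], y_train, validation_data=(X_val[..., None], y_val), epochs=1000)
```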
Results and discussions
To evaluate the forecasting performance, we used 70% of the original dataset to train
the model, 20% to validate its performance, and the remaining unseen 10% to test its
predictions. The conventional metrics used to evaluate predictive models are used to
evaluate the forecasts in our experiments. These metrics, the root-mean-squared error
(RMSE), the coefficient of variation of the RMSE, known as the normalized root-
mean-squared error (NRMSE), and the mean absolute percentage error (MAPE), are
defined as follows:
RMSE = √( (1/T) ∑_{t=1}^{T} (L(t) − X(t))² ),     (18)
Fig. 7 The energy consumption forecasting graph results of the CNN-LSTM model. The forecast-
ing curves of one hour-ahead are represented in dashed lines that follow the original load profile
line curves
NRMSE = (RMSE / X̄) × 100%,     (19)

MAPE = ( (1/T) ∑_{t=1}^{T} |X(t) − L(t)| / |X(t)| ) × 100%,     (20)
where T is the total number of time steps in the time series dataset, L(t) is the
predicted output, X(t) is the actual measured time series in the dataset, and X̄ is the
average of the actual power consumption values.
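The three error metrics in (18)–(20) can be computed directly with numpy, as in the short sketch below, where X holds the measured loads and L the corresponding predictions.

```python
import numpy as np

def forecasting_metrics(X, L):
    """RMSE, NRMSE (percentage of the mean actual load) and MAPE (%) as in Eqs. (18)-(20)."""
    X, L = np.asarray(X, dtype=float), np.asarray(L, dtype=float)
    rmse = np.sqrt(np.mean((L - X) ** 2))
    nrmse = rmse / X.mean() * 100.0
    mape = np.mean(np.abs(X - L) / np.abs(X)) * 100.0
    return rmse, nrmse, mape

# Toy example with three actual values and their predictions.
print(forecasting_metrics([100.0, 120.0, 90.0], [104.0, 118.0, 95.0]))
```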
As shown in Fig. 7, the forecasting results are plotted as dashed curves with triangles
and the original data as solid curves. It is worth noticing that the forecasting curves
are almost consistent with the original curves except for several abrupt deviation
points, which demonstrates the effectiveness of the CNN-LSTM forecasting model.
Applying cross-validation to the CNN-LSTM model produces a robust averaged
estimate of the forecasting performance, since each observation in the dataset is used
for training and testing across the folds. We used 10-fold cross-validation in our
forecasting model with a time series cross-validator [43]. By applying this method,
we avoided overfitting and validated the prediction model by testing on unseen data
at each fold. Besides, we compared our model with traditional statistical, machine
learning, and deep learning models, as shown in Table 2. It is worth noticing that the
best forecasting performance was achieved by the CNN-LSTM for both the one-hour-
ahead and the one-day-ahead forecasts. The one-dimensional CNN model also
performed better than the other baseline models. The LSTM and GRU models
performed similarly for both prediction resolutions, with the GRU slightly better than
the LSTM in both forecasting settings. The decision tree model had
the worst forecasting performance in our case study. Therefore, the CNN-LSTM
model demonstrated its forecasting superiority, since it hybridizes two successful
deep learning-based models.
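The 10-fold time series cross-validation mentioned above can be reproduced with scikit-learn's time series cross-validator. The sketch below only illustrates the fold structure with placeholder arrays; each fold trains on an expanding window of past observations and tests on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1000).reshape(-1, 1)   # placeholder features (e.g., lagged load windows)
y = np.arange(1000, dtype=float)     # placeholder targets

tscv = TimeSeriesSplit(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # model.fit(X_train, y_train); evaluate the unseen block (X_test, y_test) here.
    print(fold, len(train_idx), len(test_idx))
```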
The infrastructure of the energy market has changed dramatically in recent years.
With the development of smart technologies implemented in the grid and the intro-
duction of renewable and distributed energy resources, energy market participants
need to update their methodologies for planning, operating, and controlling electrical
loads and energy consumption. This chapter focused on deep learning applied to load
forecasting in smart grids; we gave a snapshot of the background of smart grids and
electrical load patterns, and discussed the importance of load forecasting in the energy
market and the factors influencing load forecasting modeling.
In this overview, we reviewed traditional load forecasting methods, including statis-
tical methods, machine learning methods, and deep learning methods. We explored
the key conceptual and algorithmic facets of deep learning methods applied to load
forecasting, and discussed the general issues of deep learning modeling. We also
performed a case study of big data and a hybrid deep learning-based model for
commercial building load forecasting, and found that the CNN-LSTM model out-
performed the other traditional deep learning models.
From the literature review, we conclude that no specific deep learning model out-
performs the others for every forecasting problem; the best architecture depends on
the forecasting task and its challenges. The LSTM and GRU models are close to each
other in performance because they are subcategories of the RNN and suitable for
sequential problems. They usually accomplish
References
1. Gungor, V.C., et al.: Smart grid technologies: communication technologies and standards. IEEE
Trans. Ind. Inf. 7(4), 529–539 (2011)
2. Deng, R., Yang, Z., Chow, M., Chen, J.: A survey on demand response in smart grids: mathe-
matical models and approaches. IEEE Trans. Ind. Inf. 11(3), 570–582 (2015)
3. Almalaq, A., Edwards, G.: A review of deep learning methods applied on load forecasting. In:
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA),
pp. 511–516 (2017)
4. Raza, M.Q., Khosravi, A.: A review on artificial intelligence based load demand forecasting
techniques for smart grid and buildings. Renew. Sustain. Energy Rev. 50, 1352–1372 (2015)
5. Khatoon, S., Ibraheem, Singh, A.K., Priti: Effects of various factors on electric load forecasting:
an overview. In: 2014 6th IEEE Power India International Conference (PIICON), pp. 1–5 (2014)
6. Fahad, M.U., Arbab, N.: Factor affecting short term load forecasting. J. Clean Energy Technol.
2(4), 305–309 (2014)
7. Feinberg, E.A., Genethliou, D.: Load Forecasting. In: Chow, J.H., Wu, F.F., Momoh, J. (eds.)
Applied Mathematics for Restructured Electric Power Systems: Optimization, Control, and
Computational Intelligence, pp. 269–285. Springer US, Boston, MA (2005)
8. Ji, P., Xiong, D., Wang, P., Chen, J.: A study on exponential smoothing model for load fore-
casting. In: 2012 Asia-Pacific Power and Energy Engineering Conference, pp. 1–4 (2012)
9. Amjady, N.: Short-term hourly load forecasting using time-series modeling with peak load
estimation capability. IEEE Trans. Power Syst. 16(3), 498–505 (2001)
10. Hagan, M.T., Behr, S.M.: The time series approach to short term load forecasting. IEEE Trans.
Power Syst. 2(3), 785–791 (1987)
11. Ding, Q.: Long-term load forecast using decision tree method. In: 2006 IEEE PES Power
Systems Conference and Exposition, pp. 1541–1543 (2006)
12. Yu, Z., Haghighat, F., Fung, B.C.M., Yoshino, H.: A decision tree method for building energy
demand modeling. Energy Build. 42(10), 1637–1646 (2010)
13. Chen, B.-J., Chang, M.-W., et al.: Load forecasting using support vector machines: a study on
EUNITE competition 2001. IEEE Trans. Power Syst. 19(4), 1821–1830 (2004)
14. Pai, P.-F., Hong, W.-C.: Support vector machines with simulated annealing algorithms in elec-
tricity load forecasting. Energy Convers. Manag. 46(17), 2669–2688 (2005)
15. Zhu, Z., Sun, Y., Li, H.: Hybrid of EMD and SVMs for short-term load forecasting. In: 2007
IEEE International Conference on Control and Automation (ICCA 2007), pp. 1044–1047 (2007)
16. Park, D.C., El-Sharkawi, M.A., Marks, R.J., Atlas, L.E., Damborg, M.J.: Electric load fore-
casting using an artificial neural network. IEEE Trans. Power Syst. 6(2), 442–449 (1991)
17. Hayati, M., Shirvany, Y.: Artificial neural network approach for short term load forecasting for
Illam region. World Acad. Sci. Eng. Technol. 28, 280–284 (2007)
18. Kandil, N., Wamkeue, R., Saad, M., Georges, S.: An efficient approach for short term load
forecasting using artificial neural networks. Int. J. Electr. Power Energy Syst. 28(8), 525–530
(2006)
19. Zhang, G., Patuwo, B.E., Hu, M.Y.: Forecasting with artificial neural networks: the state of the
art. Int. J. Forecast. 14(1), 35–62 (1998)
20. González, P.A., Zamarreño, J.M.: Prediction of hourly energy consumption in buildings based
on a feedback artificial neural network. Energy Build. 37(6), 595–601 (2005)
21. Tsakoumis, A.C., Vladov, S.S., Mladenov, V.M.: Electric load forecasting with multilayer
perceptron and Elman neural network. In: 6th Seminar on Neural Network Applications in
Electrical Engineering, pp. 87–90 (2002)
22. Dudek, G.: Multilayer perceptron for GEFCom2014 probabilistic electricity price forecasting.
Int. J. Forecast. 32(3), 1057–1060 (2016)
23. Kuo, P.-H., Huang, C.-J.: An electricity price forecasting model by hybrid structured deep
neural networks. Sustainability 10(4), 1280 (2018)
24. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
25. Amarasinghe, K., Marino, D.L., Manic, M.: Deep neural networks for energy load forecasting.
In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), pp. 1483–1488
(2017)
26. Khan, S., Javaid, N., Chand, A., Khan, A.B.M., Rashid, F., Afridi, I.U.: Electricity load fore-
casting for each day of week using deep CNN. In: Kalbitzer, U., Jack, K.M. (eds.) Primate
Life Histories, Sex Roles, and Adaptability, pp. 1107–1119. Springer International Publishing,
Cham (2019)
27. Kollia, I., Kollias, S.: A deep learning approach for load demand forecasting of power systems.
In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India,
pp. 912–919 (2018)
28. Dong, X., Qian, L., Huang, L.: A CNN based bagging learning approach to short-
term load forecasting in smart grid. In: 2017 IEEE SmartWorld, Ubiquitous Intelli-
gence Computing, Advanced Trusted Computed, Scalable Computing Communications,
Cloud Big Data Computing, Internet of People and Smart City Innovation (Smart-
World/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1–6 (2017)
29. Shi, H., Xu, M., Li, R.: Deep learning for household load forecasting—a novel pooling deep
RNN. IEEE Trans. Smart Grid 9(5), 5271–5280 (2018)
30. Yu, Z., Niu, Z., Tang, W., Wu, Q.: Deep learning for daily peak load forecasting–a novel gated
recurrent neural network combining dynamic time warping. IEEE Access 7, 17184–17194
(2019)
31. Bedi, J., Toshniwal, D.: Deep learning framework to forecast electricity demand. Appl. Energy
238, 1312–1326 (2019)
32. Kong, W., Dong, Z.Y., Hill, D.J., Luo, F., Xu, Y.: Short-Term residential load forecasting based
on resident behaviour learning. IEEE Trans. Power Syst. 33(1), 1087–1088 (2018)
33. Marino, D.L., Amarasinghe, K., Manic, M.: Building energy load forecasting using deep neural
networks. In: IECON 2016-42nd Annual Conference of the IEEE Industrial Electronics Society,
pp. 7046–7051 (2016)
34. Gan, D., Wang, Y., Zhang, N., Zhu, W.: Enhancing short-term probabilistic residential load
forecasting with quantile long–short-term memory. J. Eng. 2017(14), 2622–2627 (2017)
35. Zheng, J., Xu, C., Zhang, Z., Li, X.: Electric load forecasting in smart grids using long-short-
term-memory based recurrent neural network. In: 2017 51st Annual Conference on Information
Sciences and Systems (CISS), pp. 1–6 (2017)
36. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural
networks on sequence modeling. In: CoRR (2014). http://arxiv.org/abs/1412.3555
37. Kumar, S., Hussain, L., Banarjee, S., Reza, M.: Energy load forecasting using deep learn-
ing approach-LSTM and GRU in spark cluster. In: 2018 Fifth International Conference on
Emerging Applications of Information Technology (EAIT), pp. 1–4 (2018)
38. Gao, X., Li, X., Zhao, B., Ji, W., Jing, X., He, Y.: Short-term electricity load forecasting model
based on EMD-GRU with feature selection. Energies 12(6), 1140 (2019)
39. Almalaq, A., Zhang, J.J.: Evolutionary deep learning-based energy consumption prediction for
buildings. IEEE Access 7, 1520–1531 (2019)
40. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Optimal deep learning LSTM model for elec-
tric load forecasting using feature selection and genetic algorithm: comparison with machine
learning approaches. Energies 11(7) (2018)
41. Long-term energy consumption & outdoor air temperature for 11 commercial buildings.
OpenEI datasets, openei.org (2019)
42. Chollet, F. et al.: Keras. GitHub (2015)
43. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12,
2825–2830 (2011)
Fast and Accurate Seismic Tomography
via Deep Learning
1 Introduction
The main workflow of hydrocarbon exploration starts with acquiring field data, which
consist of recordings of the response of the subsurface to artificial perturbations. Fol-
lowing data acquisition, several disciplines [1], geology, geophysics, petrophysics,
etc., combine efforts to produce a model of the earth (see Fig. 1, top right) which may
or may not have clear indications of hydrocarbon presence. In areas such as the Gulf
of Mexico, hydrocarbons tend to accumulate near salt bodies making them a key
geological structure in earth model building [2]. This earth model is a critical part
of the decision making process and is given utmost importance during exploration
projects. The average success ratio of the industry is low, thus avoiding unnecessary
expenses, such as drilling wells, translates into saving millions of dollars. Therefore,
techniques to accelerate the decision time and increase the success ratio are crucial.
What we propose in this chapter goes beyond what is currently making inroads in
exploration geosciences, which is machine learning (ML) techniques being applied to
specific well-known steps of the standard hydrocarbon exploration workflow (Fig. 1,
red arrow). Most of the advances happen on the interpretation [3, 4] of the models
rather than in their generation. In contrast, our method is an end-to-end solution,
producing earth models directly from unmanipulated seismic data. Our method
differs from current velocity building methods, seismic tomography [5] (similar to
medical tomography but the penetrating wave is seismic) or wave equation-based
modeling/inversion, in that our method is automatic and without human interven-
tion. The deep learning (DL) technique employed follows recent work [6, 7] that
demonstrates this new approach, which uses a deep neural network (DNN) statisti-
cal model to transform raw input seismic data directly to the final mapping in 2D
or 3D model space. The computational cost of the proposed approach is mostly due
to the training phase, which occurs only once and offline. After training, velocity
model reconstruction computational costs are negligible, thus making the overall
computing requirements a fraction of those needed for traditional techniques, in
particular the ones involving partial differential equations (PDE)-based simulations.
As a preliminary step, velocity semblance [8] is used as the input feature space, which
provides apparent seismic velocity (the main attribute of an earth model) information
for the training process. While we do perform feature extraction rather than use the
raw data, this feature extraction step is automated and not subject to human bias.
Later, we extend the approach to work directly on the raw recordings, thus freeing it
from feature extraction and using the fully accepted, unmanipulated seismic data as
input.
The main design concern relates to the generalization capability of the DL-based
solution, which basically indicates how much a trained model can accurately predict
unseen data. To address that concern, we foresee models being trained with specific
data belonging to different major exploration areas such as pre-salt (Brazil offshore)
or subsalt (Gulf of Mexico or West Africa). Regarding future hydrocarbon exploration
workflows, one can imagine this technique being used just after data acquisition (field
recording), then trained models are loaded up to the cloud from which interpreters
can access realizations, thus performing online scenarios testing when feeding back
Fig. 1 Overall vision of the new exploration geophysics workflow (green arrow); the classical way
of approaching the problem is depicted at the bottom, following the red arrow
their model modifications. This anticipated workflow is fully ML-based, flexible and
with the domain experts at the center of the critical decision making process. Finally,
we envision that this technique can also be applied to other tomography problems
that arise in the geosciences such as global seismology, shallow hazards, etc.
This chapter is organized as follows: Sect. 2 introduces the seismic tomography
problem and the principles of seismic data acquisition. Section 3 explains the DL
approach and the semblance geophysical feature. Section 4 presents experimental
results with 2D synthetic seismic data. Section 5 compares our DL results against the
state-of-the-art results obtained with industry’s tool of choice. Section 6 introduces
our preliminary results without extracting features from the data. Finally, conclusions
and future research are provided in Sect. 7.
To provide a complete context of the earth model building problem, before delving
into our proposed DL-based solutions, this section explains the data to be used
through the chapter and, in a succinct manner, reviews the scientific problem at
hand.
Seismic data are acquired, for the onshore case, via sources positioned on the earth’s
surface and arrays of receivers (geophones). In the offshore case, the sources and
receivers (hydrophones) are towed by a ship, as illustrated in Fig. 2.
Fig. 2 Offshore seismic data acquisition (Source Houston Chronicle, BP, Schlumberger, Fairfield
Nodal)
When energy is emitted from the source, it propagates through a highly het-
erogeneous medium (i.e., subsurface) which in turn creates reflections, refractions
and diffraction effects that are recorded at the receiver (sensors) locations. As these
recorded events are created due to changes in subsurface rock properties, inherently
they contain information about the subsurface from whence they originated. The
goal of seismic tomography and seismic imaging in general is to reconstruct the
subsurface (earth model) that created the recorded seismic data.
With only one source firing and a finite number of receivers, only a limited portion
of the subsurface target of interest can be sampled. Therefore, in order to adequately
illuminate the subsurface, it is required that the source and array of receivers be
positioned at multiple spatial locations. In reflection seismic terminology, the data
obtained from the source firing at a single position xsi into an array of receivers
xri , i = 1, . . . , Nr where Nr is the total number of receivers recording during a
source firing is known as a “shot gather”. Modern reflection seismic data acquired for
industrial purposes are composed of hundreds of thousands of shot gathers. Figure 3
depicts the ray paths (discrete approximation of a wavefront) associated with a shot
gather for a single layer subsurface model and synthetic data recorded as a result of
finite-difference modeling with a point source located at the position xs . Note that
as the source moves along the surface with a dense array of receivers, subsurface
Fig. 3 (left) Raypaths for a seismic shot gather acquired over a flat layer earth. (right) Simulated
data using finite-difference modeling. The linear event corresponds to the wave that travels directly
from the source to the receivers along the surface. The hyperbolic event corresponds to the reflection
of the wave off of the layer interface
points will be illuminated multiple times. To take advantage of this data redundancy,
seismic data are typically transformed into midpoint and half-offset coordinates via
the following relations
$$x_{m_i} = \frac{x_{s_i} + x_{r_i}}{2}, \qquad x_{h_i} = \frac{x_{s_i} - x_{r_i}}{2},$$
where xm i and xh i are the midpoint and half-offset coordinates respectively. Figure 4
shows the resulting raypaths and data that arise from sorting the data in Fig. 3 into the
midpoint and half-offset domain. As this collection of records is for a fixed midpoint
and several offsets, this type of data is known as a common-midpoint gather. The
processing of seismic data for velocity model building in general is performed with
the data transformed into the midpoint and half-offset coordinates.
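As a small illustration of this change of coordinates, the sketch below applies the two relations above to a single shot; the helper name and the example geometry (one shot at 520 m into receivers spaced every 10 m, as used later in Sect. 5) are only illustrative.

```python
# Hypothetical helper converting one shot's source/receiver x-coordinates into
# midpoint and half-offset coordinates, as in the two relations above.
import numpy as np

def to_midpoint_halfoffset(x_src, x_rcv):
    x_src = np.asarray(x_src, dtype=float)
    x_rcv = np.asarray(x_rcv, dtype=float)
    x_mid = (x_src + x_rcv) / 2.0    # midpoint coordinate
    x_half = (x_src - x_rcv) / 2.0   # half-offset coordinate
    return x_mid, x_half

# Example: a shot at 520 m recorded by 144 receivers every 10 m starting at 180 m.
x_m, x_h = to_midpoint_halfoffset(520.0, 180.0 + 10.0 * np.arange(144))
```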
In Fig. 5, a selected group of traces from a more complex subsurface recording is presented. The field recordings, depending on the origin, are like the ones depicted above or more complex; therefore, direct interpretation of the subsurface structure is ruled out, and this creates the need for advanced techniques to transform these data into usable models.
The study of seismic tomography has spanned the past several decades and continues
to be part of ongoing research [9]. While there exist many ways to formulate this
Fig. 4 (left) Raypaths for a seismic common midpoint gather acquired over a flat layer earth.
(right) The synthetic data from Fig. 3 sorted into midpoint and half-offset coordinates and with a
mute applied to the direct wave
Fig. 5 (left) Windows in time and space on a shot gather from a complex subsurface model simulated
with finite-difference modeling, therefore very high signal-to-noise ratio. (right) Selected traces
from the shot gather of the left, traces presented as wiggles, where characteristics of the signals are
shown
problem, it is commonly cast as the optimization problem
$$m^{*} = \arg\min_{m} \; L\big(f(m), d\big), \qquad (1)$$
where m represents the earth model that we desire to recover, d represents the
recorded seismic data, f (m) is a physics-based modeling that generates synthetic
data from a prescribed earth model, L is a loss function that measures the mis-
fit between the recorded data and the simulated data, and m∗ is the optimal earth
model that minimizes the loss L. While a highly complex m that informs us of many
different earth properties (elastic moduli, density, viscoelastic parameters, etc.) is
generally desired, m commonly represents a three-dimensional acoustic wavespeed
model. This choice of m generally leads to the scalar acoustic wave equation as the
choice for our physics-based simulation f (m). Further simplifications can be made
in taking the high-frequency limit of the scalar acoustic wave equation which results
in the eikonal equation [13]. While the wave equation describes the propagation of
waves and calculates synthetic seismograms (waveforms), the Eikonal equation is
based on ray theory and calculates traveltimes. Regardless of the model parameter-
ization and physics-based forward model used to fit the recorded geophysical data,
the relationship between the data and the desired earth model is nonlinear. There-
fore, a nonlinear optimization algorithm is required in order to minimize the loss function in Eq. (1). Additionally, because f (m) is in general very computationally
expensive to evaluate, local/gradient-based methods for optimization must be used
as opposed to global optimization methods. Using only the gradient information of
the loss function can result in convergence to a local minimum and therefore unsat-
isfactory solutions. Additionally, because for reflection seismic surveys the data are
recorded at the earth’s surface, the data do not contain all of the necessary information
to define a velocity model that varies arbitrarily in space. This therefore implies that
Eq. (1) defines a non-linear ill-posed optimization problem. In using a deep-learning
approach, while we still face this issue of nonlinearity and ill-posedness, we do not
rely on an accurate solution of the wave-equation, but rather directly learn a tomo-
graphic operator from many training examples that consist of the seismic data as
feature and the velocity model as label.
The relation between the recorded data and the earth model can be written as
$$d = f(m) + \varepsilon, \qquad (2)$$
where d is the observed seismic data, m is the unknown earth model, f (·) is a mapping operator and ε is noise. While inverse imaging problems can be solved using
analytic models, recent works [14–16] (and references within), argue that state-of-
the-art results for a variety of inverse imaging problems can be obtained using deep
learning methods. Following this line of work, we have proposed a novel approach
[7] that implements the tomography operator using a Convolutional Neural Network
(CNN), whose coefficients are learned in a data-driven approach [17]. The tomog-
raphy process is depicted in Fig. 6, and it performs reconstruction of the velocity
model from raw seismic traces, or from features computed from raw seismic traces.
In a real-life application, the ground-truth model is unavailable, and the tomography
operator is designed to minimize the difference between the reconstructed velocity
model and the (unavailable) ground-truth one. The input to the tomography operator
T(d; θ) is a set of seismic traces (or their features) d, and it is parameterized by a
coefficients vector θ. The tomography operator approximates the inverse mapping
operator f −1 (), and its output is the predicted velocity model m̂. In the statistical
learning framework, the tomography operator is learned using a collection of N train-
ing example pairs {d_i, m_i}, i = 1, ..., N, where the data d_i denotes the set of seismic traces (i.e.
seismic gather) or their features, as generated by wave propagation simulation using
the i-th velocity model mi (the i-th label). The average misfit between the ground
truth models and their predicted versions, also known as the empirical risk, is defined
by:
$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} L(m_i, \hat{m}_i), \qquad (3)$$
where L(mi , m̂i ) is the loss function that measures the misfit between the ground
truth velocity model and its prediction m̂i = T(di ; θ). The tomography operator is
learned by minimizing the empirical risk:
$$\hat{\theta} = \arg\min_{\theta} \; J(\theta). \qquad (4)$$
The loss function employed in this work is the squared L2-norm of the pixel-wise difference m̂ − m, given by L(m_i, m̂_i) = ||m_i − m̂_i||_2^2, which is frequently used in regression problems, and leads to the following risk minimization problem:
$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{N}\sum_{i=1}^{N} \left\| m_i - T(d_i; \theta) \right\|_2^2. \qquad (5)$$
In practice, a regularization term is often added to this minimization to prevent overfitting:
$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{N}\sum_{i=1}^{N} \left\| m_i - T(d_i; \theta) \right\|_2^2 + \lambda R(\theta), \qquad (6)$$
Fig. 7 Convolutional Neural Network (CNN): a a CNN with two convolutional layers and one
fully-connected layer; and b zoom into the first convolutional layer (Source [20])
where λ ≥ 0 controls the weight of the regularization term R(θ), which is often defined as in Ridge regression, R(θ) = ||θ||_2^2, or Lasso regression, R(θ) = ||θ||_1.
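A minimal sketch of how the regularized empirical risk of Eqs. (5)-(6) could be minimized in practice, written with TensorFlow/Keras (the frameworks used later in this chapter). The function names, the Ridge choice for R(θ) and the value of λ are assumptions for illustration, not the exact training code used in the experiments.

```python
# Illustrative TensorFlow sketch of Eq. (6): mean squared misfit over a
# mini-batch plus a Ridge (L2) penalty on the network coefficients theta.
import tensorflow as tf

def regularized_risk(model, d_batch, m_batch, lam=1e-4):
    m_pred = model(d_batch, training=True)                    # T(d; theta)
    misfit = tf.reduce_mean(tf.square(m_batch - m_pred))      # empirical risk term
    penalty = tf.add_n([tf.nn.l2_loss(w) for w in model.trainable_variables])
    return misfit + lam * penalty                             # lambda * R(theta)

optimizer = tf.keras.optimizers.Nadam()

@tf.function
def train_step(model, d_batch, m_batch):
    with tf.GradientTape() as tape:
        loss = regularized_risk(model, d_batch, m_batch)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```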
The tomography operator is implemented by a CNN, and thus can be represented
as a hierarchical composition of k non-linear functions, each representing one of the
k layers of the network:
$$T(d; \theta) = h_k\big(h_{k-1}(\cdots h_1(d; \theta_1) \cdots ; \theta_{k-1}); \theta_k\big), \qquad (7)$$
where h_j denotes the non-linear function of the j-th layer and θ_j its coefficients.
Feature extraction is an optional step in our workflow as it can accelerate the training
of the CNN by providing it with the most relevant data for learning. Our ML platform
is capable of handling diverse network architectures and data, but given the focus
on learning a tomographic operator from the data, we perform what is known as
velocity (main subsurface model attribute) analysis and use its output as the input
feature space.
To perform velocity analysis, we first transform the data into the midpoint half-
offset coordinates as discussed previously. Then, we perform a time shift to each
offset h of the common-midpoint gather in order to flatten the reflection (which has a
hyperbolic shape) along the offset direction. This time-shift is a function of the half-
offset h and the velocity in the medium V and can be calculated via the following
relationship
$$t^{2}(h, V) = t_0^{2} + \frac{h^{2}}{V^{2}}, \qquad (8)$$
where t is the travel time of the hyperbolic event and t0 is the zero-offset (h = 0) travel time. Note that Eq. 8 describes the shape of a hyperbola, which is exactly the
shape of the recorded reflection shown in Fig. 4. Performing this time shift requires
that the medium velocity be known a priori (which in the case of VMB is not).
Therefore, trial velocities are prescribed in order to flatten the reflection event and
then the following coherency measure is used in order to measure the flatness of the
time-shifted event:
$$s[i] = \frac{\displaystyle\sum_{j=i-M}^{i+M}\left(\sum_{k=0}^{N-1} q[j,k]\right)^{2}}{\displaystyle N \sum_{j=i-M}^{i+M}\sum_{k=0}^{N-1} q[j,k]^{2}}, \qquad (9)$$
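A short sketch of the semblance computation in Eq. (9), assuming q is an NMO-corrected common-midpoint gather stored as a (time samples × offsets) NumPy array and M is the half-width of the time window; the names are illustrative.

```python
# Illustrative semblance of Eq. (9): q has shape (time samples, N offsets),
# i is the centre time sample and M the window half-width.
import numpy as np

def semblance(q, i, M):
    window = q[max(i - M, 0): i + M + 1, :]          # time window around sample i
    num = np.sum(np.sum(window, axis=1) ** 2)        # stack over offsets, then square
    den = window.shape[1] * np.sum(window ** 2)      # N times the window energy
    return num / den if den > 0 else 0.0
```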
Fig. 8 (left) 2D synthetic earth model, layers of sediments in a simple depositional system; velocity ranges between 2000 and 4500 m/s. The horizontal coordinate represents lateral position and the vertical represents depth. (right) Example of a calculated semblance cube for the model on the left. In our case, during training, models like the one on the left are the labels and semblance cubes are the input data
In this section we first describe the experimental setup, including network architecture, datasets, hardware and software, and the metrics used for quantitative analysis.
Second, we present the qualitative and quantitative results and discussion.
The network architecture is composed of four 3D convolutional layers (64 filters with
kernel size of 6 × 6 × 6) and two fully connected layers. Each layer employs a ReLU
activation function. In addition, max-pooling, batch normalization and dropout with
probability of 0.25 are deployed after each convolutional layer, as depicted in Fig. 9.
The loss function is mean squared error and the Nesterov ADAM [21, 22] optimizer is used. The network is implemented in Python using TensorFlow [23] and Keras [24] as supporting DL frameworks.
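A compact Keras sketch of the architecture described above (four 3D convolutional layers with 64 filters of size 6 × 6 × 6, ReLU, max-pooling, batch normalization and dropout of 0.25, followed by two fully connected layers, MSE loss and the Nesterov ADAM optimizer). The input cube shape and the width of the first dense layer are assumptions, since they are not specified in the text.

```python
# Illustrative Keras sketch of the semblance-based 3D CNN; the input cube shape
# (here 100 x 100 x 40) and the dense-layer width (1024) are assumed values.
import tensorflow as tf
from tensorflow.keras import layers

def build_semblance_cnn(input_shape=(100, 100, 40, 1), output_points=100 * 100):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for _ in range(4):                                        # four 3D conv blocks
        x = layers.Conv3D(64, (6, 6, 6), padding="same", activation="relu")(x)
        x = layers.MaxPooling3D((2, 2, 2), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)              # first fully connected layer
    outputs = layers.Dense(output_points)(x)                  # 100 x 100 velocity grid
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Nadam(),      # Nesterov ADAM
                  loss="mse")
    return model
```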
The training reaches early stopping at around 250 epochs, in about 6 hours running on one high performance computing (HPC) node with four NVIDIA K80 [25] general purpose graphics processing units (GPGPUs) in a data parallelism fashion. In this parallel execution mode, the model is copied to all computing units and
Fig. 9 Semblance-based 3D CNN architecture: the semblance cube is the input feature to the net-
work, which includes four 3D convolutional layers and two fully connected layers. Each convolu-
tional layer is composed of 3D kernels, ReLU activation per kernel, Maxpool, Batch Normalization
and Dropout layer
Fig. 10 (left) The plot shows the metric values across training time. The vertical axis represents the metric value and the horizontal axis represents time in epoch units, where one epoch is a complete
sweep through the training dataset. (right) A detailed view of the left plot around an area of interest.
Plots share color codes and mb stands for minibatch
then the training data are evenly split and distributed among the computing units to be processed. Inference per model takes a matter of seconds, which is extremely appealing when large amounts of data are predicted or multiple velocity scenarios are under investigation.
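A minimal sketch of this data-parallel execution mode using tf.distribute.MirroredStrategy, which replicates the model on every visible GPU and splits each mini-batch among them; it reuses the build_semblance_cnn sketch above, and the batch size, patience and dataset names are illustrative placeholders.

```python
# Illustrative data-parallel training; the dataset variables are placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()            # one replica per visible GPU
with strategy.scope():
    model = build_semblance_cnn()                      # variables are mirrored

# Keras splits each global mini-batch evenly across the replicas:
# model.fit(train_cubes, train_models, validation_data=(val_cubes, val_models),
#           batch_size=32, epochs=250,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)])
```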
Two datasets are prepared for both training and testing our model. In the first
dataset, the velocity models only contain layers with velocities that increase with
depth. Additionally, the layers exhibit both undulation and dip (Fig. 11). The second
dataset consists of velocity models similar to those of the first dataset, but now a portion has been augmented with salt bodies. To add a factor of realism to these models, the shapes of the salt bodies were extracted from earth models that were the end result of real-life exploration projects in the Gulf of Mexico. Moreover, this dataset also contains
velocity models without salt bodies. Each dataset consists of 6400 semblance cubes
and the corresponding velocity model labels of size 100 × 100 grid points (the size of
the output layer). For validation and testing purposes, we separated 1600 data/label
pairings.
The prediction accuracy metrics on the testing set for the first dataset (earth models
only containing layers) are 0.812 for the R 2 score and 0.919 for the SSIM. R 2 score for
Fig. 11 (top, left) model 1 of the test dataset, includes salt bodies, which have high velocity and
tend to distort classical modeling. Salt bodies are key in offshore hydrocarbon exploration. (top,
right) prediction, where vertical axis represents depth and horizontal axis represents lateral offset.
(bottom, left) comparison of the velocity profile for x = 400, where the vertical axis represents depth
in meters. (bottom, center) absolute error between the ground-truth and prediction and (bottom,
right) the corresponding error distribution
Fig. 12 Improving model reconstruction (for one model in the testing dataset) as the learning
process sweeps through the training dataset
the test set with the second dataset is 0.741 and the SSIM is 0.892. The convergence
of these metrics with respect to epoch can be observed in Fig. 10. The convergence curves show the influence of different batch sizes on the performance for both metrics, although the effect is most noticeable on the R 2 metric, which converges
later than the SSIM metric. As expected, the task of predicting a model with salt
bodies is more difficult and therefore the performance is lower for dataset two than
dataset one. It is difficult to learn the size, shape, and location of these salt bodies
from the input data space. Furthermore, the datasets are relatively small for the task at
hand. The main impediment to obtaining more training data is the computationally
expensive step of generating features via finite difference wave propagation and
calculating the semblance feature.
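For reference, the two metrics reported above can be computed per predicted model along the following lines, using scikit-learn and scikit-image; the helper name and the assumption that the models are 100 × 100 NumPy arrays are only for illustration.

```python
# Illustrative metric computation for one predicted velocity model.
import numpy as np
from sklearn.metrics import r2_score
from skimage.metrics import structural_similarity

def evaluate_prediction(m_true, m_pred):
    r2 = r2_score(m_true.ravel(), m_pred.ravel())
    span = float(m_true.max() - m_true.min())          # velocity range of the label
    ssim = structural_similarity(m_true, m_pred, data_range=span)
    return r2, ssim
```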
Qualitatively, the overall performance trend is positive: the salt bodies are mostly located properly and the surrounding formation resembles the labels in structure and velocity value (see Fig. 11), thus making the predictive model valid.
The main structural elements of the predicted model match the ground-truth.
The predicted expression and location of the salt body in Fig. 11 is remarkable. In
particular, the velocity profile shows that the velocity trend is perfectly recovered,
only missing the sharp interfaces between layers. In Fig. 12 we observe how a model
from the validation set is learned as the training of the network progresses (by epochs).
The first prediction (Fig. 12, top row) shows a model with low velocity in a gradient-
based background, with many unresolved samples (blue dots). After a few epochs (Fig. 12, center row) the predicted model corrects the deeper sections towards higher velocity and the salt body is clearly reconstructed. Finally, the model prediction is complete (Fig. 12, bottom row) and even fine-grained details of the salt body are satisfactorily resolved.
FWI uses the entire seismic wavefield recording, that being all recorded frequencies
and locations, to invert for earth model parameters beneath the surface. The goal of
FWI is to find some earth model that minimizes the distance between modeled seis-
mic data, which is a function of the earth model, and recorded seismic data, which
was gathered in the field. When we have changed the earth model in such a way that
the modeled data very closely resembles the recorded data, we assume we have found
an earth model that is representative of the true earth model. Albert Tarantola [30]
was the first to propose solving for earth parameters with such an inverse solution. In
exploration geophysics, FWI is a topic of intense study and is at the forefront of earth
model building from seismic data [32]. That being said, it is plagued with numer-
ous limitations including high computational cost, extreme sensitivity to the choice
of starting model, and unwanted convergence to incorrect earth model solutions.
Nevertheless, when these limitations are properly accounted for and addressed, FWI is regarded as an area of development that may bridge the gap between low- and high-wavenumber earth model building and represent an all-inclusive solution to seismic exploration. For this reason we have chosen it as a baseline method to compare the
velocity model prediction results of the ML approach defined previously. If ML can
compete with the current cutting edge industry techniques, it will surely make waves
in the exploration community.
More verbosely, consider the i-th shot of a seismic survey, d_i^obs, where i = 1, 2, ..., M. Further, consider some modeled data, d_i^mod, which is the synthetic recreation of the i-th observed experiment. We can define the distance between the observed and modeled data as the L2 norm of the difference of the two vectors,
To create the modeled data we use some wave equation operator, f i , which rep-
resents a single seismic experiment. f i is a function of the earth model, m, and
maps from the earth model space into the data space, f_i(m) = d_i^mod. In our case m represents 1/v^2, the inverse of the squared pressure wave velocity, at each point in
the subsurface. Many wave equation formulations can model how seismic energy
propagates through the earth. Generally speaking, the more complex and accurate
the wave equation, the more computationally expensive the wave modeling becomes.
For our purposes we use the acoustic, constant density, isotropic wave equation [33].
(A − MD2 ) p = f, (11)
shear modulus. We can represent this wave equation operator with a matrix, H (m),
and solve for the wavefield pi :
$$H(m)\, p_i = f_i, \qquad (12)$$
$$p_i = H(m)^{-1} f_i, \qquad (13)$$
where pi represents the wavefield resulting from the ith seismic experiment in the
entire domain. We can use an operator, K , to extract the wavefield at the point receiver
locations to arrive at the modeled data, d_i^mod:
$$d_i^{mod} = K\, p_i. \qquad (14)$$
Using this wave equation operator we can define a scalar function, J (m), which
sums the L 2 difference between modeled and observed seismic data over all experi-
ments:
$$J(m) = \sum_{i=1}^{N} \left\| f_i(m) - d_i^{obs} \right\|_2^{2}. \qquad (15)$$
Here we have arrived at what is referred to as the FWI objective function. The
model that reaches the minimum of this objective function is the solution to the FWI
problem and the model that is our best estimate of the velocity profile beneath the
surface.
Solving this inverse problem, that is finding the model that minimizes J (m),
is notoriously difficult for a variety of reasons. Primarily, the objective function is
nonlinear with respect to m, which means a perturbation in the earth model is not
linearly mapped into the modeled data. It follows that the numerous, well studied
strategies to solve linear least squares inverse problems are useless to us. Instead we
must resort to nonlinear regression techniques for which there is no general theory for
finding the optimal model parameters [34]. Iterative methods are a popular choice for
solving nonlinear inverse problems and rely on the gradient of the objective function
at the current model iteration, m j , to update the model parameters to find the next
model iteration, m j+1 .
$$m_{j+1} = m_j + \alpha_j s_j. \qquad (16)$$
The next model, m_{j+1}, is found by adding to the current model, m_j, a search direction, s_j, scaled by a step length, α_j. There are many ways to compute the search
direction, s j . We use the nonlinear conjugate gradient method in which:
$$s_j = s_{j-1} + \beta \nabla J(m_j), \qquad (17)$$
where s j−1 is the previous search direction and β is the conjugate direction coef-
ficient, and ∇ J (m j ) is the gradient of the objective function at the current model.
Furthermore, B(m j )∗ is the adjoint of the wave equation operator linearized around
the current model iteration applied to the difference between the modeled and
observed data:
$$\nabla J(m_j) = -\sum_{i=1}^{N} B(m_j)^{*} \left( f_i(m) - d_i^{obs} \right) \qquad (18)$$
To put it concisely, at each iteration of FWI we use the gradient of the objective
function to update the earth model in order to reduce the value of the objective func-
tion. We stop iterating when the objective function reaches zero or, more realistically,
once it stops reducing.
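A schematic sketch of such an iterative scheme, using the standard Polak-Ribiere form of nonlinear conjugate gradient with a fixed step length; the forward-modeling and adjoint operators are placeholders standing in for the PDE-based solvers, so this is an illustration of the loop structure rather than the actual FWI code used in Sect. 5.

```python
# Schematic nonlinear conjugate-gradient FWI loop (Polak-Ribiere variant).
# `forward` and `adjoint` are stand-ins for the wave-equation modeling f(m) and
# the adjoint of its linearization; alpha is a fixed step length for simplicity.
import numpy as np

def fwi(m0, d_obs, forward, adjoint, n_iter=1000, alpha=1e-3, tol=1e-8):
    m = m0.copy()
    s_prev, g_prev = None, None
    for _ in range(n_iter):
        residual = forward(m) - d_obs                      # modeled minus observed data
        grad = adjoint(m, residual)                        # gradient of the misfit
        if s_prev is None:
            s = -grad                                      # steepest descent first
        else:
            beta = np.sum(grad * (grad - g_prev)) / np.sum(g_prev * g_prev)
            s = -grad + max(beta, 0.0) * s_prev            # conjugate search direction
        m = m + alpha * s                                  # model update, as in Eq. (16)
        s_prev, g_prev = s, grad
        if np.linalg.norm(grad) < tol:                     # stop once progress stalls
            break
    return m
```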
However, the nonlinearity of the FWI objective function means it is not convex
with respect to the earth model. Gradient descent methods, like the one described
above, will fall into a local minimum, that is, find an earth model at which the objective function stops reducing but does not represent the global minimum of the objective function. In order to avoid local minima, the initial model used in the inversion scheme, m0, must be fairly close to the true model. Herein lies one of the largest restrictions of FWI, that being we must start from an earth model that is fairly close to the true model in order for the gradient-based optimization algorithm to converge
to the true solution.
Many methods exist and extensive research continues to find ways to avoid these
convergence issues. A highly effective and widely accepted method is that of [35]
which is referred to as multiscale FWI. This technique decomposes the FWI problem
by scale and performs conventional FWI with progressively higher bandpasses of
the source wavelet and observed data.
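A compact sketch of that multiscale strategy: FWI is run over progressively higher low-pass bands of the data, warm-starting each stage from the previous result. The fwi_stage and bandpass callables are placeholders (the FWI sketch above could play that role), and the cutoff frequencies follow the experiment described below.

```python
# Schematic multiscale FWI: each bandpass stage warm-starts from the previous
# result; `fwi_stage` and `bandpass` are caller-supplied placeholders.
def multiscale_fwi(m0, d_obs, fwi_stage, bandpass,
                   cutoffs=(4.0, 8.0, 16.0, 32.0, None), iters_per_band=150):
    m = m0.copy()
    for fc in cutoffs:                                  # None means "all frequencies"
        d_band = d_obs if fc is None else bandpass(d_obs, fc)
        m = fwi_stage(m, d_band, n_iter=iters_per_band)
    return m
```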
The data itself is generated from 19 shots at the surface with 40 m spacing in the x
direction beginning at 520 m. The shot wavelet is a 15 Hz peak Ricker. 144 receivers
located at the surface record pressure data. They begin at 180 m in x with 10 m spacing.
The wave propagation modeling assumes an acoustic, constant density earth and uses a second order approximation in time and eighth order in space. Figure 13 illustrates the four models used to compare each method. Note, the data was generated on a 1.8 × 1.4 km model but the velocity predictions were made on a 1.0 × 1.0 km subset of the original models.
5.4 Experiments
The first experiment is conducted with conventional FWI; 1000 iterations of non-
linear conjugate gradient are performed using all frequencies of all modeled shots.
The starting model was a linear velocity gradient from 2.0 to 4.5 km/s. A variation
of this experiment is also conducted in which 200 conjugate gradient iterations are
performed using the predicted model from the CNN as the starting model for FWI.
The second experiment is multiscale FWI which performed 150 conjugate gradient inversions over 5 bandpasses of all modeled shots. The first 4 bandpasses of the
data were smoothly tapered at 4, 8, 16, and 32 Hz. The fifth inversion used all frequen-
cies. The starting model for the 4 Hz inversion was a linear velocity gradient from
2.0 to 4.5 km/s. Each progressively higher bandpass inversion uses the final model
from the previous bandpass inversion. The third experiment results are obtained by
exposing the trained neural network to unseen data, in our case, to unseen semblance
cubes from velocity models created by our pseudo-random velocity model generator.
5.5 Results
We perform the comparative analysis on four seismic datasets generated from the
velocity models in Fig. 13. The comparison is limited to four datasets because of
the high computational cost of FWI. In fact, retrieving one multiscale FWI result
takes more time than training the CNN used for the ML approach. After the upfront
cost of creating the trained CNN, a single model prediction can be made almost
instantaneously. This speaks to the computational cost of ML compared to FWI.
Figures 14, 15, and 16 depict the results of the three VMB methods on the four
models both visually and numerically. In Figs. 14 and 15, rows correspond to various
models and columns to VMB methods. Since salt diapirs are of large interest in the
oil and gas community, comparisons are also made over windowed portions of the
earth models that contain such bodies. Figure 16 gives a more in-depth look into the
results on model 0 by computing difference plots between the true models and each
of the VMB method results. This gives intuition on where each method is over or
under-performing relative to the others. It also shows the error histograms to illustrate
the distribution of velocity errors.
5.6 Discussion
Fig. 14 Velocity Model Building results comparison: (1st row) Model 0, (2nd row) zoom into the salt body in Model 0, (3rd row) Model 1; and (4th row) zoom into the salt body in Model 1
we needed to use the multiscale scheme. There are dozens of other regularization
methods that may or may not work depending on the specific experiment. It is left
up to the geophysicist to decide. Furthermore, the sensitivity to the starting model
means a priori information on the structure of the earth must be known. In real
world scenarios, the starting model used in these experiments, the linearly increasing
velocity profile with depth, will not suffice. A fairly detailed starting model must be
constructed by the geophysicist beforehand. Alternatively, the ML approach did not need any handpicked regularization of the input data and it requires no starting model. ML retrieved competitive results without any human bias. Furthermore, FWI has been in development for 20 years while our ML method is still in its infancy.
If we can recover sharper velocity model results with ML, and thus beat the FWI
Fig. 15 Velocity Model Building results comparison: (1st row) Model 2; and (2nd row) Model 3, (3rd row) zoom into the salt body in model 0
results, there is nothing stopping ML from replacing FWI. Beyond comparing the
velocity model results, we must address an equally important aspect, computational
efficiency. The ML and FWI results were computed at different high performance
computer cluster facilities, making a direct computational comparison difficult. But,
we will find that examining precise clock cycles is not necessary because, by rough
estimation, ML is orders of magnitude more efficient. Consider that to perform 1000
iterations of nonlinear conjugate gradient to recover the multiscale FWI results took
about two days on a busy Stanford University computer cluster. Now, of course, the
modeling and inversion codes used were for academic purposes and were therefore
not fully optimized. But, the earth models used are also fairly small, 1 km × 1 km,
by industry standards. If more efficient code were used on larger models, the compute time would most likely remain on the same order of magnitude: days. Now, consider
that training the CNN model to map from semblance cubes to velocity models took
about a day to finish. If larger models are used this may increase. Regardless, one
may conclude that both methods are about equally efficient as both take on the order
of days to finish. But, herein lies a critical difference between the two approaches;
the CNN model is reusable. Once the training is finished, mapping from a single
new dataset to a velocity model is nearly instant. Whereas, mapping a new dataset
using multiscale FWI would take days to finish. The cost of the ML approach is all
Fig. 16 Comparison of tomography results from the DL and FWI for model 0. Leftmost column
shows ground-truth (label), second from left shows the prediction from the DL (top), the multiscale
(MS) FWI result (middle) and the standard FWI result (bottom). Third column from left shows the
difference between the ground truth and the prediction as a percentage of the velocity error. The last
column shows the percentage of velocity errors for each sample binned and plotted in a histogram
form. When comparing the prediction of the DNN to the MS FWI result, we observe that the DNN
has difficulty in resolving sharp interfaces. Also note that a MS FWI approach was necessary to
avoid cycle skipping that is apparent with the conventional FWI result
upfront and can be reused an unlimited number of times to make instant predictions.
Nothing about FWI is reused and each application is equally expensive.
We show a new way of doing tomography with ML that leaves human biases and recurring high computational costs behind. While the ML results are competitive, they are still beaten by a regularized FWI method. But our ML method is still in its infancy while FWI has been in development for over 20 years. Further progress may yield an ML method that can outperform FWI on all fronts, including model quality. A synergistic approach that utilizes both techniques is also an interesting, and a more realistic, proposition. Using the unbiased results of ML as a starting model, FWI could fill in the remaining sharp contrasts with fewer required iterations. This could quickly produce high quality models completely devoid of human bias. The broader
case we make here is for the revolution of workflows in industry exploration. We see
potential for many intermediate steps to be absorbed by ML-driven approaches, and
seismic tomography is a stepping stone towards that.
Human-biased feature extraction is not desired when truly following the deep learning paradigm, which encourages an end-to-end learning process that maps from the relevant elements of the raw data to the ground-truth. After the initial success of the
semblance-based approach, experiments were conducted with a modified version of
the network (as depicted in Fig. 9) that accepts seismic gathers without manipulation
as inputs (Fig. 17, right). The label dataset (Fig. 17, left) is composed of the velocity
models as described in previous sections. The main change in the network, compared
to the one presented in Sect. 4, is in the input layer. Now the input is the raw seismic
shot gathers which are of the dimension (number of shots × number of receivers ×
time samples). Each data/label pairing is composed of the 3D seismic gather described later and the corresponding velocity model as label. Furthermore, this network’s training used the Nesterov ADAM optimizer with a learning rate of 1e-03, a batch size of 20 (per GPGPU), and the experiments were executed for 250 epochs using
the MSE loss function. Training takes less than two hours and inference takes only
seconds.
It can be observed in Fig. 18 that the velocity model used as label is larger than the ones used in Sect. 4 and much richer in features (velocity variation); this is because these velocity models belong to datasets used in exploration in the Gulf of Mexico (for confidentiality reasons, actual geographical locations cannot be shared). Consequently, the generated seismic data are much closer to what actual field records look like. The decision to use these data is not arbitrary; the final purpose is to expose the ML approach to real field data and therefore cross the threshold from research
Fig. 17 (left) Example velocity model from the training dataset, (right) examples of corresponding
seismic traces (only three selected sources), obtained by wave propagation and without first arrival
removal, where vertical axis represents time and horizontal is offset from source location
Fig. 18 Results examples for a trained model without pre-computed features. (top, left) shows the
ground-truth and (top, right) the corresponding prediction. (bottom, left) comparison of the velocity
profile for x = 500, vertical axis represents depth in meters
into an industrially-tested tool. One step in data preparation is to downsample the label and data; in particular, the data were downsampled to fit in GPGPU memory, which for these experiments consists of three NVIDIA V100s with 16 GB of internal memory each. The input data dimension used is 31 × 256 × 300 (where the dimensions were described in the previous paragraph). The label and predicted model dimension is 100 × 100 samples, where the first dimension represents the horizontal axis and the second dimension represents depth or the vertical dimension. The total size of the training dataset is 960 samples, of which 80 samples were separated for validation and another 80 samples for testing.
The results of training with no pre-computed features (Fig. 18) are at least comparable if not superior to the ones presented in Sect. 4. Comparable in the sense that all major features of the expected reconstructed models are present and the error ranges are similar. In quantitative terms, the test dataset SSIM metric is 0.8181 and the R 2 metric is 0.8272. These two figures are slightly less impressive than the ones reported in Sect. 4; three main factors are the culprits: more complex velocity models, a smaller dataset and forced downsampling of the data. In qualitative terms, the largest errors appear around the fine-grained contours of salt bodies, which is also the case for the traditional
techniques, as is shown in Sect. 5. Nonetheless, the results are superior in the sense that the expected reconstructed models are more detailed and therefore harder to predict; moreover, these velocity models are essentially what is used in exploration geophysics within seismic imaging workflows.
7 Conclusions
This chapter presents a novel DL approach to a key geoscience problem [5, 37]. By
utilizing DL, it is possible to predict earth models directly from the recorded seismic
data. Essentially, we are replacing a nonlinear inverse problem with a data-driven learning process. Results with synthetic data achieve high visual accuracy, both with the structural similarity image metric (SSIM) and PSNR. This solution enables fast turnaround of exploration workflows that nowadays take weeks to complete, therefore empowering domain experts by allowing them to focus on the most complex prospects within the data. The proposed approach can be extended to other relevant geoscience problems where accurate earth models are also required. Future work is
twofold: extension to 3D tomography, namely reconstruction of three dimensional
subsurface models, and validation with field recorded seismic data.
Acknowledgements The authors would like to thank Shell International Exploration and Produc-
tions for supporting and allowing us to share this material.
References
1. Alpak, F.O., Araya-Polo, M.: Big loop in the machine learning era. In: 81st EAGE Conference
and Exhibition (2019)
2. Farmer, P., Miller, D., Pieprzak, A., Rutledge, J., Woods, R.: Exploring the subsalt. Oilfield
Rev. 50–64 (1996)
3. AlRegib, G., Deriche, M., Long, Z., Di, H., Wang, Z., Alaudah, Y., Shafiq, M.A., Alfarraj, M.:
Subsurface structure analysis using computational interpretation and learning: a visual signal
processing perspective. IEEE Signal Process. Mag. 35(2), 82–98 (2018)
4. Di, H., Wang, Z., AlRegib, G.: Why using CNN for seismic interpretation? An investigation.
In: SEG, pp. 2216–2220 (2018)
5. Rawlinson, N., Pozgay, S., Fishwick, S.: Seismic tomography: a window into deep earth. Phys.
Earth Plan. Inter. 178(3), 101–135 (2010)
6. Araya-Polo, M., Dahlke, T., Frogner, C., Zhang, C., Poggio, T., Hohl, D.: Automated fault
detection without seismic processing. Lead. Edge 36(3), 208–214 (2017)
7. Araya-Polo, M., Jennings, J., Adler, A., Dahlke, T.: Deep-learning tomography. Lead. Edge
37(1), 58–66 (2018)
8. Claerbout, J.F.: Straightedge Determination of Interval Velocity (1978)
9. Virieux, J., Brossier, R., Métivier, L., Etienne, V., Operto, S.: Challenges in the full waveform
inversion regarding data, model and optimisation. In: 74th EAGE Conference and Exhibition-
Workshops (2012)
10. Stork, C., Clayton, R.W.: Linear aspects of tomographic velocity analysis. Geophysics 56(4),
483–495 (1991)
11. Sava, P., Biondi, B.: Wave-equation migration velocity analysis. I. theory. Geophys. Prospect.
52(6), 593–606 (2004)
12. Tromp, J., Tape, C., Liu, Q.: Seismic tomography, adjoint methods, time reversal and banana-
doughnut kernels. Geophys. J. Int. 160(1), 195–216 (2005)
13. Fomel, S.: Traveltime computation with the linearized Eikonal equation. Stanf. Explor. Project
Rep. 94, 123–131 (1997)
14. Jin, K.H., McCann, M.T., Froustey, E., Unser, M.: Deep convolutional neural network for
inverse problems in imaging. IEEE Trans. Image Process. 26(9), 4509–4522 (2017)
15. Lucas, A., Iliadis, M., Molina, R., Katsaggelos, A.K.: Using deep neural networks for inverse
problems in imaging: beyond analytical methods. IEEE Signal Process. Mag. 35(1), 20–36
(2018)
16. McCann, M.T., Jin, K.H., Unser, M.: Convolutional neural networks for inverse problems in
imaging: a review. IEEE Signal Process. Mag. 34(6), 85–95 (2017)
17. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http://www.
deeplearningbook.org (2016)
18. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics. Springer, New York (2009)
19. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. In:
The Handbook Of Brain Theory and Neural Networks, vol. 3361, no. 10 (1995)
20. Couchot, J.F., Couturier, R., Guyeux, C., Salomon, M.: Steganalysis via a convolutional neural
network using large convolution filters (2016). CoRR, arXiv:1605.07946
21. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on
Learning Representations (ICLR), vol. 5 (2015)
22. Dozat, T.: Incorporating nesterov momentum into adam. In: Proceedings of ICLR Workshop
(2016)
23. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia,
Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S.,
Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker,
P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke,
M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous systems
(2015). Software available from www.tensorflow.org
24. Chollet, F., et al.: Keras. https://keras.io (2015)
25. nVIDIA, Tesla K40 and K80 GPU accelerators for servers. http://www.nvidia.com/object/
tesla-servers.html (2014)
26. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error
visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
27. Červený, V.: Seismic Ray Method. Cambridge University Press (2001)
28. Vidale, J.: Finite-difference calculations of travel times. Bull. Seismol. Soc. Am. 78(6), 2062–
2076 (1988)
29. Ellefsen, K.J.: A comparison of phase inversion and traveltime tomography for processing
near-surface refraction traveltimes. Geophysics 74(6), WCB11–WCB24 (2009)
30. Tarantola, A.: Inversion of seismic reflection data in the acoustic approximation. Geophysics
49(8), 1259–1266 (1984)
31. Romdhane, A., Grandjean, G., Brossier, R., Réjiba, F., Operto, S., Virieux, J.: Shallow-structure
characterization by 2D elastic full-waveform inversion. Geophysics 76(3), R81–R93 (2011)
32. Virieux, J., Operto, S.: An overview of full-waveform inversion in exploration geophysics.
Geophysics 74(6), WCC1–WCC26 (2009)
33. Aki, K., Richards, P.G.: Quantitative seismology
34. Aster, R.C., Borchers, B., Thurber, C.H.: Parameter Estimation and Inverse Problems. Elsevier
(2005)
35. Bunks, C., Saleck, F.M., Zaleski, S., Chavent, G.: Multiscale seismic waveform inversion.
Geophysics 60(5), 1457–1473 (1995)
36. Farris, S.: Tomography: a deep learning vs full-waveform inversion comparison. In: First EAGE
Workshop on High Performance Computing for Upstream in Latin America (2018)
37. Taner, M.T., Koehler, F.: Velocity spectra—digital computer derivation applications of velocity
functions. Geophysics 34(6), 859–881 (1969)
Traffic Light and Vehicle Signal
Recognition with High Dynamic Range
Imaging and Deep Learning
This work was supported by the A*STAR Grant for Autonomous Systems Project, Singapore. Both
authors contributed equally to this work. This work was done when Lu-Bing Zhou was with Institute
for Infocomm Research. He is now with nuTonomy, Singapore 139954.
have shown that our dual-channel approach outperforms the state of the art which
uses only bright images. Encouraged by the promising performance of the TLR,
we extend the dual-channel approach to vehicle signal recognition. The algorithm reported in this chapter has been integrated into our autonomous vehicle via Data Distribution Service (DDS) and works robustly on real roads.
1 Introduction
Traffic Light Recognition (TLR) locates the traffic light from an image and then esti-
mates the status of the light signal. Vehicle Signal Recognition (VSR) estimates the
signal of the vehicles ahead from an image. Automatic recognition of traffic light and
vehicle signal are two of perception (functionalities) for Advanced Driver Assistance
Systems (ADAS) or Autonomous Vehicle (AV) because failure of following traffic
light or vehicle signal could lead to a fatal accident. There have been lots of studies
on TLR. However, not much attention has been paid to practical TLR problems. The
most challenging issues of TLR include computation time, day/night lighting condi-
tions, confusion of tail lights or other kinds of ambient light, low image resolution and
vehicle occlusion. Most of existing TLR approaches are sensitive to lighting condi-
tions because only bright images are used. In this chapter, in the premise of ensuring
real-time conditions, we are interested in TLR problem under varying lighting con-
ditions and confusion of tail lights. A two-stage approach is proposed to solve the
problems: detect traffic light candidates in low exposure/dark image which is less
sensitive to lighting conditions and then recognize their traffic light state in high
exposure/bright image which has rich texture. Deep learning is adopted to improve
the recognition accuracy significantly.
Some good surveys on TLR can be found in [1–3]. The existing TLR methods
can be roughly divided into three categories: (1) template matching; (2) circular
extraction; and (3) color distribution. In the first category, templates of red or green
light are matched with the extracted regions. The circular shape is detected from
images by using Hough transform in the second category. The third category is
mainly color segmentation. One of the major disadvantages of these three categories of approaches is the high sensitivity to lighting conditions. Color and shape information
are used to detect TL candidates [4–7]. Some image preprocessing is applied to prune
the candidates before being fed to the classifier. Before pruning candidates using
shape, temporal, edge and symmetry, image segmentation in HSV [8] or RGB [9]
space is adopted. In order to recognize the traffic light states robustly, an adaptive
template matching method is proposed [10].
In order to improve detection accuracy, region of interest (ROI) is used to reduce
the search region. Map and GPS (annotated) are used to generate ROI [11, 12]. Some
range. To speed up the recognition processing, the candidates are pruned by saliency
map and region of interest (ROI). The ROI in our approach is determined by: (1)
calibrating the camera with respect to the ground world coordinate; (2) the knowledge
about the physical heights of the TLs. Finally, based on temporal trajectory analysis,
we develop a tracking technology. In doing so, the robustness and accuracy of the
TLR have been improved.
This chapter is a technical summary of three papers [20, 22, 23] which respectively use related technologies for TLR and VSR. For the original methodologies, the reader can refer to these papers.
The remainder of the chapter is organized as follows. The HDR imaging based
traffic light detection will be discussed in Sect. 2. The CNN traffic light recognition is
discussed in Sect. 3. In Sect. 4, tracking technology is discussed. The experimental
results are given in Sect. 5. The extension of the dual-channel method to vehicle
signal recognition is discussed in Sect. 6. Conclusion and future work are discussed
in Sect. 7.
device [16]. HDR technology has been successfully applied in both photo and TV
to make images/frames have a greater contrast between bright and dark. Different
from the existing applications of HDR, which enhance visual quality by combining bright and dark images, we use the dual channels separately and combine them using the
association between the two channels.
The motivation for us to adopt HDR is the higher detection ability of the dark images and the higher recognition ability of the bright images. Although the dark and bright images are not captured exactly simultaneously, the relative time difference
between them is short enough to be neglected. In other words, we can easily find the
corresponding regions between bright and successive dark images, and vice versa.
This helps us associate the detected traffic light candidates in the dark image with
their location in the bright image.
As mentioned, the dark images are used for traffic light candidate detection and
bright images are used for recognition. As for vehicle signal recognition, the bright
images are used for vehicle detection and dark images are used for vehicle signal
recognition.
It is an essential step to detect TL candidates from images for a successful traffic light
state classification/tracking system. Currently, most of TLR systems detect traffic
light using only bright images. However, similar to other detection problems, their performance is very sensitive to the environmental lighting conditions and to confusion with the tail lights of vehicles ahead or other similar ambient lights, for example, traffic signs, temporary roadblocks, or pedestrians. How to robustly detect traffic light candidates
under varying illumination is still an open problem. In this chapter, instead of using a
single image, we propose a novel method to use the dual channels (low and high exposure, respectively) provided by an HDR camera. Unlike previous HDR imaging in which
a single image is synthesized from bright and dark channels, in our approach these
channels are used separately to detect and recognize traffic lights. As mentioned in
Sect. 2.1, a HDR camera, with more dynamic ranges than a normal camera, can be
used in such a way that the successive two channels can be set as high exposure
and low exposure, respectively. The traffic light candidates detected in a dark image
can be located easily in bright channel as they are captured within a very short
time interval, about 40 ms for our camera having 25 fps with high-definition serial
digital interface (HD-SDI). The association between the candidate regions on dark and bright images is not largely affected by high-speed motion. Nevertheless, a way to re-locate the TL candidate detection results on the bright image is proposed in this
chapter, and will be discussed in Sect. 3.1.
The way we use the HDR imaging makes the traffic light candidate detection more robust than others because the lights appear against a clean dark background in low-exposure images. By using this HDR dual-channel mechanism, undistorted color and
Fig. 2 Traffic light detection and recognition with dual-channel mechanism [22]. a High expo-
sure/bright image; b Low exposure/dark image; c Dark image with saliency map of the ROI;
d Traffic lights candidate detection and recognition results. The traffic light state result is displayed
in the upper right
shape information on a dark image and rich context information on a bright image
are fully used.
Figure 2 shows an example of the dual-channel TLR. We can see that the lights,
including traffic lights and vehicles’ tail lights, are prominent in the dark image. The
rich context can be seen from the bright image.
Low lighting conditions are a challenging issue in using HDR to recognize traffic lights. Traffic light candidates are detected from the dark image by a simple color thresholding segmentation in [17]. The detection performance could be unreliable as it is hard to adapt to the varying illumination conditions with a threshold. A saliency map filtering, which aims to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze [24], is adopted in this chapter to handle the low lighting problem.
Most of the existing traffic light recognition methods detect traffic lights (color blobs)
by tuning thresholds. The color information is used for locating and identifying traffic
light states. YCbCr [25], instead of RGB, is considered for this purpose because the
color and intensity are mixed in three channels of the RGB color space.
Usually, the parameters used to identify the traffic light states (red, green and
amber) are very sensitive to environmental lighting conditions. As verification needs to be done for each pixel in order to determine the state, the time consumption increases linearly with the number of colors. In order to speed up the process, a non-parametric model is proposed in this chapter to extract blobs of various colors simultaneously. The RGB color space is used in this chapter in order to illustrate the robustness of our method, although the performance could be better when HSV or other spaces are adopted [17].
Our method contains the following steps. Firstly, the 3D RGB color space is divided into grids, M × M × M; M is set to be 32 (without fine tuning) in this chapter. Secondly, the histograms for each state, including red, green and amber colors, are calculated from samples. Let us define the normalized histograms in [0, 1] for red, green and amber as H r, H g and H a, respectively. Those values above 0.1 in H r, H g and H a are truncated to prevent extreme dominance of a single color bin. The resulting histograms are renormalized to [0, 1]. Given an input image, I, the saliency score of
a pixel (i, j) in the red channel is computed as
$$S_r(i, j) = \sum_{(i', j') \in N_d(i, j)} H_r(i', j') \qquad (1)$$
H = max(Hr , Hg , Ha ) (2)
If the saliency value of a pixel is found to be above the threshold, then the channel saliency scores will be re-computed using the three channel histogram models. The pixel is assigned to the type with the maximum channel saliency score.
Fig. 3 Saliency map [22]. a High exposure/bright image; b Low exposure/dark image; c Saliency
map of (b); d Saliency map with color label
Following the approach mentioned above, most of the pixels can be filtered out by the final saliency score, and the types of the remaining pixels can be determined by the individual saliency models. Figure 3 shows an example of the proposed saliency model.
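A rough sketch of this non-parametric color model, assuming training pixels for each traffic-light color are available as uint8 RGB arrays; the 32 × 32 × 32 binning and the 0.1 truncation follow the text, while the neighborhood size and the use of a box filter for the summation in Eq. (1) are illustrative choices.

```python
# Illustrative color-histogram model and per-pixel saliency score (Eq. (1)).
import numpy as np
from scipy.ndimage import uniform_filter

M = 32                                          # bins per RGB channel

def build_histogram(samples):
    """samples: (n, 3) uint8 RGB pixels from one traffic-light color class."""
    bins = (samples.astype(int) * M) // 256
    hist = np.zeros((M, M, M), dtype=float)
    np.add.at(hist, (bins[:, 0], bins[:, 1], bins[:, 2]), 1.0)
    hist /= hist.max()                           # normalize to [0, 1]
    hist = np.minimum(hist, 0.1)                 # truncate dominant bins
    return hist / hist.max()                     # renormalize to [0, 1]

def saliency_map(image, hist, d=2):
    """Per-pixel score summed over the (2d+1) x (2d+1) neighborhood N_d(i, j)."""
    bins = (image.astype(int) * M) // 256
    score = hist[bins[..., 0], bins[..., 1], bins[..., 2]]
    return uniform_filter(score, size=2 * d + 1) * (2 * d + 1) ** 2
```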
The function findContours() in OpenCV [26] is used to extract contours of the blobs from the resulting binary image. Some obviously incorrect blobs can optionally be removed based on shape analysis, e.g. area or circularity.
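A short sketch of this blob-extraction step, assuming binary_mask is the thresholded saliency image (uint8, 0/255); the area and circularity limits are illustrative.

```python
# Illustrative blob extraction with OpenCV (version 4 return signature).
import cv2
import numpy as np

def extract_candidates(binary_mask, min_area=4.0, min_circularity=0.5):
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        area = cv2.contourArea(c)
        perim = cv2.arcLength(c, True)
        if area < min_area or perim == 0:
            continue
        circularity = 4.0 * np.pi * area / (perim ** 2)   # 1.0 for a perfect circle
        if circularity >= min_circularity:
            boxes.append(cv2.boundingRect(c))              # (x, y, w, h)
    return boxes
```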
Auto exposure has been investigated in the literature [28–30]. However, we need a fast auto exposure approach for our autonomous vehicle application. In this chapter, we
propose a real-time auto exposure approach. The exposure is adjusted by observing
the difference between the mean intensity of an image mask and a reference value.
Let I_t represent an expected mean image intensity and I_c represent the mean intensity of the current frame; a factor is defined as
$$f = \frac{I_t}{I_c} \qquad (4)$$
Our objective is to let f in Eq. (4) tend to 1, i.e. I c tends to I t , by updating gain or
shutter.
To obtain the expected f, the shutter and gain are jointly adjusted within their respective ranges [s_min, s_max] and [g_min, g_max]. In the actual implementation, the shutter is adjusted before the gain, as noise could come together with a large gain value. The adjustment of the shutter will result in a factor:
$$f_s = \frac{s_t}{s_c} \qquad (5)$$
where s_c represents the current shutter value and s_t represents the updated shutter value.
It is known that the shutter value is directly proportional to intensity. If the desired factor f can be achieved by only adjusting the shutter within its range, i.e. f_s = f, then no gain adjustment is needed. However, if f_s cannot lead to the targeted image intensity, then the shutter will be updated to its extreme within the range, and the gain will be adjusted to cover the remaining portion of the factor, i.e. f = f_s f_g, where f_g is a gain factor. The remaining factor can be achieved easily by adjusting the gain based on a common observation: when the gain is adjusted (increased or decreased) by approximately 6 dB, the intensity doubles or halves.
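A sketch of this joint shutter/gain update; Eqs. (4) and (5) and the 6 dB-per-doubling observation follow the text, while the function name, the parameter ranges and the clipping behaviour are assumptions for illustration.

```python
# Illustrative joint shutter/gain update for the fast auto-exposure scheme.
import numpy as np

def update_exposure(i_target, i_current, shutter, gain,
                    s_range=(10.0, 20000.0), g_range=(0.0, 30.0)):
    f = i_target / max(i_current, 1e-6)              # Eq. (4)
    s_new = float(np.clip(shutter * f, *s_range))    # adjust the shutter first
    f_s = s_new / shutter                            # Eq. (5)
    f_g = f / f_s                                    # remaining factor for the gain
    g_new = float(np.clip(gain + 6.0 * np.log2(f_g), *g_range))  # +6 dB doubles intensity
    return s_new, g_new
```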
$$\begin{bmatrix} ut \\ vt \\ t \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (6)$$
where (x, y, z) and (u, v) represent world coordinates and image coordinates, respectively. In this chapter, the world coordinate system is defined such that the XY plane lies on the ground and the Z-axis is upward and perpendicular to XY. The origin of XYZ corresponds to the frontal middle point of the host vehicle; the X-axis points towards the front of the host vehicle, and the Y-axis points to the left so that XYZ follows the right-hand rule. The calibration is a process to estimate the eleven unknown parameters, a, in the 3 × 4 matrix of Eq. (6). At least four groups of 3D and 2D coordinates are needed for this purpose. In practice, more than four groups of 3D and 2D coordinates can be used to calibrate the camera. The eleven parameters are estimated from these groups of data by solving a least-squares fitting problem. In this chapter, these groups of data are obtained by reading the coordinates of a few known-size calibration objects.
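A sketch of this least-squares estimation of the eleven parameters in Eq. (6) from n ≥ 4 world/image point correspondences; the direct-linear-transform style rearrangement and the function name are illustrative.

```python
# Illustrative least-squares calibration: build two linear equations per
# correspondence from Eq. (6) (with a34 fixed to 1) and solve for the 11 unknowns.
import numpy as np

def calibrate(world_pts, image_pts):
    """world_pts: (n, 3) array of (x, y, z); image_pts: (n, 2) array of (u, v)."""
    rows, rhs = [], []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        rows.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z]); rhs.append(u)
        rows.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z]); rhs.append(v)
    a, *_ = np.linalg.lstsq(np.asarray(rows, float), np.asarray(rhs, float),
                            rcond=None)
    return np.append(a, 1.0).reshape(3, 4)          # the 3 x 4 projection matrix
```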
The location of the TLs on an image could be estimated when the vehicle’s localization and the 3D localization of TLs on the map are given. However, a high-accuracy map and localization estimation are needed for this kind of method, which makes it hard to adopt in real practical applications. In this chapter, instead of relying on an accurate map and vehicle localization, a region of interest (ROI) method is adopted to speed up the TL detection.
In our experiments, we have not made any assumptions about the map, the traffic lights’ locations or the host vehicle’s pose. When no localization information is provided, rough ranges in the x, y, z directions together with a very rough 2D GPS position are still useful. Based on these 3D ranges, a dense 3D grid can be made and a 2D image ROI can be correspondingly generated. In other words, a ROI, corresponding to the longest distance, is adopted where the traffic light candidates can be found.
In this chapter, the detection range in XYZ for a vertically hung traffic light
is defined as follows: X (longitudinal): [0 m, 70 m]; Y (lateral): [−8 m, 8 m]; and
Z (upward): [2.5 m, 4 m]. These parameters are set based on typical traffic light
configurations in Singapore. To estimate the possible ROI, (x, y, z) is varied within these ranges.
Figure 4g and h show two detection masks, or ROIs, for horizontally hung TLs,
obtained by varying Z within [4.5 m, 7 m]. If either the vehicle pose or the
TL locations are accessible, the ROI can be shrunk further. More examples of
ROIs corresponding to different ranges are shown in Fig. 4a–f. Figure 4g is used as
the ROI.
With the help of the ROI, the computational cost of detecting traffic light candidates
can be reduced significantly, as shown by the experimental results
discussed in Sect. 5. Low computational cost is very important for real-time
applications such as autonomous vehicles, which need to run several models, e.g. perception
and navigation, simultaneously. Besides saving time, the traffic light recognition
accuracy is improved because false positives located outside the ROI are
prevented.
Fig. 4 Eight ROIs for different x, y and z [22]. The ROI in (g) is used in our real experiments
Thanks to the large amount of available data and progress in hardware, deep learning, as a
state-of-the-art machine learning technology, has achieved very promising results in computer
vision (e.g. object detection, data augmentation), speech recognition, natural
language processing, etc. [31]. A deep architecture learns hierarchical representations of the
training data rather than relying on hand-crafted features.
In this chapter, we adopt deep learning to recognize the TL status from images. Similar
to other deep learning applications, the idea is that a convolutional
neural network (CNN) can efficiently classify a TL candidate into a TL state. We show
that it is possible to develop a real-time, high-accuracy
TLR system when the deep model and its parameters are designed carefully.
As discussed in Sect. 2, the locations of the traffic light candidates on the bright
image can be determined from their locations on the dark images because the interval
between successive bright/dark frames is very short and can be neglected.
First of all, we discuss the correspondence between the bright and dark channels
in the next section.
As we know, the two images captured via the low-exposure and high-exposure
channels are not synchronized, i.e. they are not captured simultaneously, although
the interval between the two channels' timestamps is very short. The vehicle's motion
makes it hard to align a TL candidate detected in the dark image with the bright image,
especially when the vehicle's vibration (due to movement) cannot be ignored. Hence,
a way is needed to re-locate the TL candidate detections on the bright image.
Given the detected candidates on a dark frame, we aim to find the corresponding
regions on the next bright frame, which has richer texture. Considering the time
interval between consecutive frames, a new region centre on the bright image may
be needed to ensure that the regions cropped from the bright image correspond
to the TL candidates. In this chapter, the centre position, p, and radius, r, of a
TL candidate are used to estimate the new centre on the following bright
frame. In detail, the new centre is searched within a window of size 12r × 12r,
centred at p. The centre of a TL candidate normally has the highest
brightness and colour variance among the pixels within the window. For the RGB
space, with I denoting the brightness image, the colour variance V is computed as

$$ V = |R - I| + |G - I| + |B - I| \tag{8} $$

The new centre is found as the highest response in a weighted sum image
[32, 33]:

$$ \alpha V + (1 - \alpha) I \tag{9} $$

where α is a weight. As the brightness changes significantly when the lighting
conditions change, α is set to 0.7 in this chapter. As mentioned above, a 12r ×
12r window centred at each new centre is cropped from the bright frame and used as a
candidate region for TL state classification.
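The sketch below illustrates this re-location step in Python. The window size 12r × 12r and α = 0.7 follow the text; taking the brightness image I as the per-pixel mean of R, G and B, as well as the function name, are illustrative assumptions.

```python
import numpy as np

def relocate_candidate(bright_bgr, center, radius, alpha=0.7):
    """Re-locate a TL candidate (detected on the dark frame) on the bright
    frame by searching a 12r x 12r window centred at p for the highest
    response of alpha*V + (1-alpha)*I (Eq. 9)."""
    img = bright_bgr.astype(np.float32)
    b, g, r = img[..., 0], img[..., 1], img[..., 2]
    intensity = (r + g + b) / 3.0                   # assumed brightness image I
    variance = (np.abs(r - intensity) + np.abs(g - intensity)
                + np.abs(b - intensity))            # Eq. (8)
    response = alpha * variance + (1.0 - alpha) * intensity   # Eq. (9)

    cx, cy = center
    half = int(6 * radius)                          # window is 12r x 12r
    h, w = response.shape
    x0, x1 = max(cx - half, 0), min(cx + half + 1, w)
    y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
    window = response[y0:y1, x0:x1]
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return x0 + dx, y0 + dy                         # new centre (x, y)
```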
False positives are still possible, e.g. caused by the brake lights of vehicles ahead
or other objects with colours similar to a TL, although most of them are removed
during the TL candidate detection stage. To further improve robustness, a
CNN classifier is applied to separate true positives from false positives.
Accuracy and speed are the two considerations when selecting a CNN classifier.
As one of the perception models running in an autonomous vehicle, the running speed is
an important issue when selecting the deep learning model because it shares limited
resources with other models, e.g. object detection, lane detection, etc. CaffeNet [35],
see Fig. 6, is a 1-GPU version of AlexNet [34] in which the two paths of AlexNet
are combined into one. A customized CNN model, similar to
CaffeNet [35], is adopted in this chapter. The number of outputs in the last layer is
set to 13, i.e. twelve positive classes and one background class. The positive
classes are defined based on the possible traffic light types.
(1) HARL Horizontally Aligned Red Light
(2) VARL Vertically Aligned Red Light
respectively. For the modified layers, i.e. the first convolutional layer and the output
layer, the learning-rate multipliers are set to 10 for the first 2000 iterations, and to 1 for the
other layers. In total, 50,000 iterations are used in the training procedure.
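As an illustration of this fine-tuning schedule, the following hedged PyTorch sketch uses torchvision's AlexNet as a stand-in for the CaffeNet-style model: a 13-class output layer, a 10× learning-rate multiplier on the first convolutional layer and the output layer for the first 2000 iterations, and 1× afterwards. The base learning rate and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the CaffeNet-style model: AlexNet with a 13-way output layer
# (twelve TL classes plus one background class).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 13)

base_lr = 1e-3                                  # illustrative value
boosted = list(model.features[0].parameters()) \
        + list(model.classifier[6].parameters())
boosted_ids = {id(p) for p in boosted}
others = [p for p in model.parameters() if id(p) not in boosted_ids]

optimizer = torch.optim.SGD(
    [{"params": boosted, "lr": base_lr * 10},   # first conv + output layer
     {"params": others, "lr": base_lr}],
    momentum=0.9)

def drop_multiplier_after_warmup(iteration):
    """Reduce the 10x multiplier to 1x after the first 2000 iterations."""
    if iteration == 2000:
        optimizer.param_groups[0]["lr"] = base_lr
```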
In this chapter, instead of using standard tracking technologies, such as the Kalman filter
or particle filter, we propose a simple but efficient temporal trajectory analysis to
improve the accuracy and robustness of traffic light recognition.
As mentioned in Sect. 2.4, a HDR camera, Zebra2 [27], is used in this chapter.
This high-speed camera ensures that the targets found in the images change only gradually
from frame to frame. In this scenario, temporal-spatial analysis, a process that examines
whether a target detected in the current frame has been found in nearly the same area
in the previous frames, can be used to track the targets. Traffic is controlled by keeping
the traffic light status constant for a certain period of time. As a result, the regions of the lights are
spatially continuous across the image sequence, whether the vehicle is moving or
stationary. Based on these observations, traffic light recognition can benefit
from proper temporal-spatial tracking in two ways: (1) smoothness is improved, as
missing or low-confidence traffic light states can be filled in; (2) isolated false
positives can be removed.
We define a trajectory as the history of a traffic light instance. A trajectory consists
of several components (a minimal data-structure sketch is given after this list):
(1) type;
(2) history locations;
(3) lifetime;
(4) discontinuity.
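A minimal sketch of such a trajectory record, with illustrative field and method names, could look as follows:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Trajectory:
    """Sketch of a traffic-light trajectory as described above; the authors'
    implementation and field names may differ."""
    light_type: str                               # e.g. "HARL", "VARL", ...
    history: List[Tuple[int, int]] = field(default_factory=list)  # centre per frame
    lifetime: int = 0                             # frames since first detection
    discontinuity: int = 0                        # consecutive frames without a match

    def update(self, centre=None):
        """Add one frame of evidence; a missed frame increases discontinuity."""
        self.lifetime += 1
        if centre is None:
            self.discontinuity += 1
        else:
            self.history.append(centre)
            self.discontinuity = 0
```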
Trajectories are grouped according to light status and stability. Hence, six types
of trajectories are defined:
5 Experimental Results
In this section, a quantitative analysis (in terms of precision and recall) of the proposed
method on a large database is conducted. A comparison with the state of the
art is provided to show the advantages of our approach. Experiments on both the
database and real roads show that our TLR method satisfies the speed and
accuracy requirements of an autonomous vehicle.
Precision and recall are standard performance evaluation measures. They are defined in
Eqs. (10) and (11):

$$ \text{Precision} = \frac{TP}{TP + FP} \tag{10} $$

$$ \text{Recall} = \frac{TP}{TP + FN} \tag{11} $$

where TP is the number of True Positive samples, FP the number of False Positive samples
and FN the number of False Negative samples.
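For completeness, Eqs. (10) and (11) translate directly into a small helper function; the counts in the comment are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall as in Eqs. (10) and (11)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Example with hypothetical counts:
# precision_recall(tp=980, fp=20, fn=15) -> (0.98, 0.9849...)
```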
In this section, the quantitative analysis of our method in terms of precision and
recall is conducted. For this purpose, a large database has been collected using our
autonomous vehicle. The numbers of true positives, false positives and false negatives
are counted to compute precision and recall with Eqs. (10) and (11).
The database contains 4,142 images. The images are selected such that each class
contains nearly the same number of samples. A total of 21,070 boxes were manually
annotated on these images, about 1,750 boxes per class. The training
set consists of 3,722 images (about 90%). The evaluation set contains 420 images.
In order to train the network, we generate about one million samples from the above seed
samples. Scaling and translation are used to generate new samples
from the seed samples. Although there is no special requirement on how the training
samples are generated, we found during the experiments that the performance is
affected by the balance among the numbers of samples per class. The generation
of the new samples is described as follows.
In this chapter, the resolution of the original images is 1600 × 1200. A uniform
random distribution is used for shifting and scaling the original TL region to generate
new training samples. The region centre of a traffic light candidate is shifted by
−0.2 to 0.2 times the candidate rectangle's width or height, and the region is then
rescaled by a factor of 1 to 1.2. The new samples are finally resized to 111 × 111.
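A hedged Python sketch of this sample-generation step is shown below; the shift range [−0.2, 0.2], scale range [1.0, 1.2] and output size 111 × 111 follow the text, while the function name and the random-number handling are illustrative.

```python
import cv2
import numpy as np

def generate_sample(image, box, rng=None):
    """Generate one training sample from a seed TL annotation: shift the
    region centre by a uniform amount in [-0.2, 0.2] of the box width/height,
    rescale the region by a factor in [1.0, 1.2], and resize to 111 x 111."""
    if rng is None:
        rng = np.random.default_rng()
    x, y, w, h = box                                  # seed annotation (pixels)
    cx, cy = x + w / 2.0, y + h / 2.0

    cx += rng.uniform(-0.2, 0.2) * w                  # random shift
    cy += rng.uniform(-0.2, 0.2) * h
    scale = rng.uniform(1.0, 1.2)                     # random scale
    new_w, new_h = w * scale, h * scale

    x0 = int(max(cx - new_w / 2.0, 0))
    y0 = int(max(cy - new_h / 2.0, 0))
    x1 = int(min(cx + new_w / 2.0, image.shape[1]))
    y1 = int(min(cy + new_h / 2.0, image.shape[0]))
    crop = image[y0:y1, x0:x1]
    return cv2.resize(crop, (111, 111))
```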
To evaluate our system, the algorithm was run over 63 new video sequences (each
about 4 min long). The test videos contain samples under different
conditions: daytime, various weather, expressways and urban roads. 1,800 images, sampled
at an interval of 80 frames, were selected from these video sequences. The ground truth
from these images contains 5,229 samples. Considering the ROI, there are 3,099
samples.
Table 1 gives the experimental results where the test results with the ROI are
recorded in brackets.
We can see from Table 1 that the vehicle signal light recognition accuracy is worse
than that of the other classes. One reason could be that there are not enough vehicle signal
light samples compared with the number of traffic light samples. Another reason could be
that there are many more types of vehicle lights than of the other classes. We would have
to collect more training samples covering more kinds of vehicle lights to improve the
vehicle signal light recognition accuracy. Nevertheless, by applying the ROI
discussed in Sect. 2.5, most of the vehicle lights are removed from the results
because they are located in the lower part of the images (see Table 1).
Based on the results given in Table 1, the precision and recall are obtained and
reported in Table 2. From Table 2, we can see that the average recall and precision
are improved from 98.04% to 99.03% and from 97.45% to 98.91%, respectively.
The detection rate of traffic light candidates is computed as

$$ D = (TP + FN)/G \tag{12} $$

where G is the number of ground-truth samples.
To show the advantages of our approach, its performance is compared with that of the
state of the art. To the best of our knowledge, no publicly available HDR TLR benchmark
database exists for such a comparison. Most published TLR systems evaluate their method
on their own databases collected using a single colour camera. One exception found in the
literature is [17], which uses multiple exposure images. They conducted experiments on
several urban scenes, but no accuracy was reported.
Nevertheless, the performance achieved by our approach and by a state-of-the-art deep
learning object detection approach are compared in this chapter. For this purpose, the results
obtained using only the high exposure images of our test data are compared. To make the
comparison fair, we use the same training database as our TLR to re-train the state-of-the-art
deep learning detector.
Table 1 HDR: Confusion matrix without/with ROI; the results with the ROI are recorded in brackets [22]
HARL VARL HAGL VAGL LVL RVL GAL RAL AL GPL RPL OFRL
HARL 441(441)
VARL 783(783)
HAGL 568(468)
VAGL 9(9) 549(549) 9(9)
LVL 18(0) 1152(66) 72(0)
RVL 18(0) 864(33)
GAL 6(6) 180(180)
RAL 90(90)
AL 72 (72)
GPL 6(6) 117 (117)
RPL 3(3) 162 (162)
OFRL 3(0) 33 (39)
Table 2 The precision and recall without/with ROI; the results with ROI are recorded in brackets [22]
Classes: HARL, VARL, HAGL, VAGL, LVL, RVL, GAL, RAL, AL, GPL, RPL, OFRL, Average
Recall (%) without ROI/(with ROI): 100 (100), 97.4 (100), 98.1 (98.1), 97.3 (97.3), 98.5 (100), 92.3 (100), 100 (100), 100 (100), 100 (100), 92.9 (92.9), 100 (100), 100 (100), 98 (99)
Precision (%) without ROI/(with ROI): 100 (100), 100 (100), 100 (100), 96.8 (96.8), 92.8 (100), 98 (100), 96.8 (96.8), 100 (100), 100 (100), 95.1 (95.1), 98.2 (98.2), 91.7 (100), 97.5 (98.9)
Table 3 HDR: Detection rate without/with ROI; the test results with ROI are recorded in brackets [22]
Classes: HARL, VARL, HAGL, VAGL, LVL, RVL, GAL, RAL, AL, GPL, RPL, OFRL, Average
Ground truth without ROI/(with ROI): 447 (447), 804 (804), 477 (477), 567 (567), 1311 (66), 921 (36), 192 (192), 93 (93), 72 (72), 132 (132), 174 (174), 39 (39)
Detection rate (%) without ROI/(with ROI): 98.7 (98.7), 97.4 (97.4), 98.1 (98.1), 100 (100), 94.7 (100), 95.8 (100), 96.9 (96.9), 96.8 (96.8), 100 (100), 93.2 (93.2), 94.8 (94.8), 92.3 (100), 96 (97.9)
Fig. 8 The architecture of YOLO [37]. YOLO has 24 convolutional layers followed by 2 fully
connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding
layers. The convolutional layers are pretrained on the ImageNet classification task at half the
resolution (224 × 224 input images) and the resolution is then doubled for detection. The final
output of YOLO is a 7 × 7 × 30 tensor of predictions
Table 5 YOLOv2: The precision and recall without/with ROI; the results with ROI are recorded in brackets [22]
Classes: HARL, VARL, HAGL, VAGL, LVL, RVL, GAL, RAL, AL, GPL, RPL, OFRL, Average
Recall (%) without ROI/(with ROI): 97.9 (97.9), 97.3 (97.3), 97.4 (97.4), 96.3 (96.3), 95.6 (95.5), 93.2 (100), 100 (100), 100 (100), 100 (100), 90.2 (90.2), 100 (100), 100 (100), 97.3 (97.9)
Precision (%) without ROI/(with ROI): 97.9 (97.9), 98.8 (98.8), 98.7 (98.7), 96.3 (96.3), 94.6 (95.5), 94.4 (100), 96.7 (96.7), 92.9 (92.9), 100 (100), 92.5 (92.5), 98.2 (98.2), 83.3 (84.6), 95.4 (95.5)
Table 6 YOLOv2: Detection rate without/with ROI; the test results with ROI are recorded in brackets [22]
Classes: HARL, VARL, HAGL, VAGL, LVL, RVL, GAL, RAL, AL, GPL, RPL, OFRL, Average
Ground truth without ROI/(with ROI): 447 (447), 804 (804), 477 (477), 567 (567), 1311 (66), 921 (36), 192 (192), 93 (93), 72 (72), 132 (132), 174 (174), 39 (39)
Detection rate (%) without ROI/(with ROI): 96.6 (96.6), 97.0 (97.0), 96.9 (96.9), 99.5 (99.5), 93.6 (95.6), 99.3 (100), 95.3 (95.3), 90.3 (90.3), 95.8 (95.8), 90.9 (90.9), 94.8 (94.8), 92.3 (100), 95.2 (96.1)
Table 7 Comparison of the precision and recall between HDR and YOLOv2; the results with ROI
are recorded in brackets [22]
YOLOv2 HDR
Recall (%) 92.6 (94.3) 94.1 (96.9)
Precision (%) 90.8 (92.5) 93.6 (96.8)
In detail, with the ROI, the precision improves from 92.5% to 96.8% and the recall from
94.3% to 96.9%.
Using the dark channel to detect traffic light candidates is an efficient way
to prevent many false positives caused by traffic signs, sunlight, pedestrians' clothes,
etc., because the corresponding regions on the dark image are barely visible.
Figures 10, 11 and 12 give a few examples showing that false
positives detected when only a single colour camera is used can be prevented by our
dual-channel approach. In Fig. 10, the YOLOv2 result contains two false positives: the
reflection of the traffic light on the bus body and sunlight on the building. Our dual-channel
approach prevents both false positives, as no response for them can be found in the dark
image in Fig. 10.
Fig. 10 YOLOv2 versus HDR [22]. Left: YOLOv2, two false positives (in red circles) caused by
the reflection of the traffic light on the bus body and the sunlight on the building, respectively;
Middle: HDR, the two false positives in YOLOv2 are prevented because there is no response for
these two false positives in the dark image (right)
Fig. 11 YOLOv2 versus HDR [22]. Left: YOLOv2, one false positive (in red circle); Middle:
HDR, the false positive in YOLOv2 is prevented in HDR because there is no response for this false
positive in the dark image (right)
Fig. 12 YOLOv2 (top) versus HDR (bottom) [22]. The false positives in the top caused by traffic
sign and pedestrian can be prevented in the bottom
Table 8 Comparison of computation cost (for one frame) between HDR and YOLOv2 [22]
          With ROI (ms)   Without ROI (ms)   Time saving
HDR       35              130                77%
YOLOv2    40              40                 0
The results obtained on the same dataset show that the approach proposed in this chapter
is better than the state-of-the-art technique in terms of both speed and accuracy.
The dual-channel algorithms presented in this chapter are implemented in C++ on
a Mini-PC (GIGABYTE, NVIDIA GeForce GTX 760) and run at about 30–40
fps depending on the number of traffic lights in an image. The processing time
is reduced significantly by using the ROI, see Table 8. In contrast, YOLOv2 does not save
time even when using the ROI, because the network requires an input image of fixed size.
Our method has been demonstrated on real roads using the A*STAR IIR AV [40] via the
Data Distribution Service (DDS) [41]. One of the test results is shown in Fig. 13,
where a few frames of a video at ten-frame intervals are provided. Interested readers can
refer to [22] or the following link for a video demo:
https://www.youtube.com/watch?v=HQqaAvuJI_I
Vehicle signal recognition has been explored with various approaches. The existing
approaches can be roughly divided into two categories: those using temporal information
[43–48] and those using a single image [49–52]. Most of them detect signal lights using
red colour features and then try to pair the taillights using their symmetry. Cui et al. [49]
proposed a hierarchical approach to detect vehicles and tail-lights in the daytime. They
adopted the Deformable Part Model (DPM) [53] to detect vehicles in images. Red light
candidates are then found by clustering pixels in HSV colour space within the bounding
boxes of the candidates. After pairing the taillights based on prior knowledge about
vehicle appearance, a sparse dictionary is learned to classify the signal lights. This
approach is hard to apply to autonomous vehicles because of its slow processing
speed, occlusion and possible false positives. Besides the slow DPM detection,
a serious problem of this approach is that the tail-lights can be occluded,
which causes taillight pairing, and consequently recognition, to fail. In addition, noise
from the urban road environment, e.g. traffic lights and streetlights, can affect the
detection of the tail-lights.
Our dual-channel mechanism makes it possible to separate detection and
recognition into two stages which run on images with different exposures. Similar to the TLR
described above, we split VSR into two stages: vehicle detection and signal
light recognition. The first stage is executed on the high-exposure/bright image
and the second stage on the low-exposure/dark image.
Although deep learning object detection has achieved very promising results, only
a few detectors can run in real time. Based on the very deep VGG-16 model [54], Faster
R-CNN [36], one of the state-of-the-art object detectors, can only reach a frame rate of
5 frames per second, which is far from the AV requirements (at least 10 frames per second is
required because several perception modules share the computing resources).
YOLO [37] and SSD [39] are examples of object detectors that can run in real time. In
YOLO, detection is formulated as a regression problem; as it accesses the image only
once, a fast frame rate is achieved. In this chapter, consistent with the comparison
made in Sect. 5.1, YOLOv2 [38] is adopted as the state-of-the-art object detector to
detect vehicles from a single image. As mentioned in Sect. 5.2, YOLOv2 [38]
achieves better performance than others, such as Faster R-CNN [36] and SSD [39].
We retrain the YOLOv2 detector with our database collected on real roads using
our autonomous vehicle. 21,000 images were annotated, resulting in more than 100,000
vehicle samples (about four to five vehicles per image). We define eight classes
of objects: (1) car; (2) truck; (3) lorry; (4) van; (5) bus; (6) motorcycle; (7) bicycle;
(8) pedestrian. Two examples of vehicle detection results are shown in Fig. 14.
Unlike existing brake-light recognition (BLR) methods, which require explicit extraction
of the left and right tail-lights, an appearance-based deep learning approach is proposed in
this chapter to recognize brake lights. In other words, the regions within the bounding boxes
produced by the YOLOv2 detector, which we call Brake-Light Patterns (BLPs) in this chapter,
are used directly to recognize the brake lights.
State-of-the-art performance has been achieved by deep learning on a number
of image recognition benchmark databases, e.g. ILSVRC-2012 [55]. Similar to the
TLR presented in the previous sections, we argue that brake lights can be learned
better from dark images than from bright images. Besides the clean background of
dark images making light recognition robust, the occlusion problem can be
overcome to some extent by using the BLPs proposed in this chapter (see Fig. 15)
rather than a pair of tail-lights. Furthermore, the middle brake light included in the BLP,
usually located at the rear window of the vehicle, makes the recognition more
reliable than using only the left and right tail-lights. Previous approaches do not
use this middle light because it is hard to extract this relatively darker light compared
with the left and right tail-lights.
An example is shown in Fig. 15. The BLPs of a vehicle, corresponding to its
bounding boxes in the left (bright) images, are shown on the right (dark images). The
brake light can be recognized accurately from the dark images because the difference
between “braking” and “normal” in a dark image is much larger than in its bright
counterpart.
Fig. 15 The Brake-Light Pattern (BLP) of a vehicle (right, dark image) corresponding to the vehicle
detection bounding box shown in the left (bright image)
Fig. 16 Some training samples generated from seed images for “normal” (top row) and “braking”
(bottom row)
Ten-fold cross-validation is adopted to test the accuracy. The results are listed in
Table 9.
The average accuracy of our method is 97.5%, much better than the 89% of
previous approaches obtained using bright images. The vehicle detection
rate is 99.5%.
Figures 17 and 18 show two examples of the brake light recognition experiments. The
bounding box of a vehicle is marked in green or red when it is identified as “normal”
or “braking”, respectively. The method can handle partial occlusion (Fig. 18)
because a pattern rather than a light pair is used.
Table 9 Comparison of the previous approaches (bright image) and our approach (dual channel)
Previous approaches (bright image only) Our approach (dual channel)
Accuracy (%) 89 97.5
Fig. 17 Brake light recognition results [20]. Left: “normal” (green); right: “braking” (red)
Similar to the TLR presented in Sect. 5, the algorithms developed for VSR in this
chapter have been integrated into our autonomous vehicle, the A*STAR IIR AV [40].
Demonstrations on real roads, including vehicle following, obstacle avoidance, etc.,
have shown that both the accuracy and the speed satisfy the autonomous vehicle
requirements. Running the VSR and TLR together on the same PC described in
Sect. 5.2, i.e. a Mini-PC (GIGABYTE, 2.5 GHz CPU, GTX 760), we achieve 25–35 fps
depending on the number of traffic lights and vehicle signal lights in an image.
A real-time TLR system has been proposed in this chapter to detect and recognize TLs
based on high dynamic range imaging and deep learning. The advantages of an HDR
camera, i.e. multiple exposure images, are fully exploited. The drawback of the state of
the art, which uses only bright images and therefore suffers from false positives, is
overcome by our approach because the low-exposure image has a clean (dark) background
that allows the TLs to be detected reliably. Furthermore, the candidates
on the high-exposure image, corresponding to those on the dark image, can be
recognized with high accuracy because rich context is available. The number of
TL candidates to be identified by the CNN is significantly reduced by using the saliency
map and the ROI. This makes the system fast as well as robust to noise, e.g. vehicles' tail lights.
Finally, the accuracy and reliability are further improved by a tracking
technique. By running the method on a large database collected from real roads,
we have shown that the performance of our method is better than the state of the
art. Encouraged by the good performance of the TLR, we extended our dual-channel
method to VSR. Vehicles are detected from bright images and the vehicle signal
lights are recognized from the counterpart dark images. Similar to TLR, good VSR
performance has been achieved. The online tests on our autonomous vehicle were completed
successfully. It has been verified that our method satisfies the speed and accuracy
requirements of an autonomous vehicle.
Using both dark and bright images as input to the CNN could be investigated in the near
future. The quantitative performance at night could also be evaluated. During tests on real
roads, we have observed that our dual-channel method is feasible at night, because the
effects of darkness are small when traffic lights are detected from dark images; properly
adjusted camera parameters and a CNN model re-trained with night data would be sufficient
for night-time performance. Lastly, the RNDF (Route Network Definition
File) could be adopted in the future to locate traffic lights; false
positives could be reduced significantly by fusing with RNDF information.
Acknowledgements We have benefited enormously from ideas and discussions with our
ex-colleagues: Yu Pan, Serin Lee, Zhi-Wei Song, Boon-Siew Han and Vincensius-Billy Saputra.
References
1. Jensen, M.B., Philipsen, M.P., Trivedi, M., Mogelmose, A., Moeslund, T.: Vision for look-
ing at traffic lights: issues, survey, and perspectives. IEEE Trans. Intell. Transp. Syst. 17(7),
1800–1815 (2016)
2. Diaz, M., Pirlo, G., Ferrer, M.A., Impedovo, D.: A survey on traffic light detection. In: Pro-
ceedings of ICIAP 2015 workshops on New Trends in Image Analysis and Processing, Lecture
Notes in Computer Science, vol. 9281, pp. 201–208 (2015)
3. Philipsen, M.P., Jensen, M.B., Mogelmose, T., Moeslund, T.B., Trivedi, M.M.: Ongoing work
on traffic lights: detection and evaluation. In: Proceedings of 12th IEEE International Confer-
ence on Advanced Video and Signal Based Surveillance (AVSS) (2015)
4. Gong, J., Jiang, Y., Xiong, G., Guan, C., Tao, G., Chen, H.: The recognition and tracking
of traffic lights based on color segmentation and CAMSHIFT for intelligent vehicles. In:
Proceedings of IEEE Intelligent Vehicle Symposium (2010)
5. Siogkas, G., Skodras, E., Dermatas, E.: Traffic lights detection in adverse conditions using
color, symmetry and spatiotemporal information. In: Proceedings of International Conference
on Computer Vision Theory and Applications, pp. 620–627 (2012)
6. Charette, R., Nashashibi, F.: Traffic light recognition using image processing compared to
learning processes. In: Proceedings of IEEE/RSJ International Conference on Robots and
Systems, pp. 333–338 (2009)
7. Diaz-Cabrera, M., Cerri, P., Sanchez-Medina, J.: Suspended traffic lights detection and distance
estimation using color features. In: Proceedings IEEE International Conference on Intelligent
Transportation Systems, pp. 1315–1320 (2012)
8. Levinson, J., Askeland, J., Dolson, J., Thrun, S.: Traffic light mapping, localization, and
state detection for autonomous vehicles. In: Proceedings of International IEEE Conference
on Robotics and Automation (ICRA), pp. 5784–5791 (2011)
9. Haltakov, V., Mayr, J., Unger, C., Ilic, S.: Semantic segmentation based traffic light detection
at day and at night. In: Proceedings of German Conference on Pattern Recognition, Lecture
Notes in Computer Science, vol. 9358, pp. 446–457 (2015)
10. Charette, R., Nashashibi, F.: Real time visual traffic lights recognition based on spot light
detection and adaptive traffic lights templates. In: Proceedings of IEEE Intelligent Vehicles
Symposium (2009)
11. Fairfield, N., Urmson, C.: Traffic light mapping and detection. In: Proceedings of International
IEEE Conference on Robotics and Automation (ICRA), pp. 5421–5426 (2011)
12. John, V., Yoneda, K., Qi, B., Liu, Z., Mita, S.: Traffic light recognition in varying illumination
using deep learning and saliency map. In: Proceedings of International IEEE Conference on
Intelligent Transportation System (ITSC) (2014)
13. Gradinescu, V., Gorgorin, C., Diaconescu, R., Cristea, V., Iftode, L.: Adaptive traffic lights using
car-to-car communication. In: Proceedings of 65th IEEE Vehicular Technology Conference,
pp. 21–25 (2007)
14. Kumar, N., Lourenco, N., Terra, D., Alves, L.N., Aguiar, R.L.: Visible light communication
in intelligent transportation systems. In: Proceedings of IEEE Intelligent Vehicle Symposium,
pp. 748–753 (2012)
15. Dresner, K., Stone, P.: A multiagent approach to autonomous intersection management. Artif.
Intell. Res. 31, 591–656 (2008)
16. High Dynamic Range. https://en.wikipedia.org/wiki/High-dynamic-range_imaging
17. Jang, C., Kim, C., Kim, D., Lee, M., Sunwoo, M.: Multiple exposure images based traffic light
recognition. In: Proceedings of IEEE Intelligent Vehicle Symposium, pp. 1313–1318 (2014)
18. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings
of International IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893
(2005)
19. Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learn-
ing fine-grained image similarity with deep ranking. In: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1386–1393 (2014)
20. Wang, J.-G., Zhou, L.-B., Pan, Y., Lee, S., Han, B.-S., Billy, V.: Appearance based brake-lights
recognition. In: Proceedings of IEEE Intelligent Vehicle Symposium (2016)
21. Casares, M., Almagambetov, A., Velipasalar, S.: A robust algorithm for the detection of vehicle
turn signals and brake lights. In: Proceedings of International IEEE Conference on Advanced
Video and Signal-Based Surveillance, pp. 386–391 (2012)
22. Wang, J.-G., Zhou, L.-B.: Traffic light recognition with high dynamic range imaging and deep
learning. IEEE Trans. Intell. Transp. Syst. 20(4), 1341–1352 (2019)
23. Wang, J.-G., Zhou, L.-B., Song, Z.-W., Yuan, M.-L.: Real-time vehicle signal lights recognition
with HDR camera. In: Proceedings of IEEE International Conference on Internet of Things
(iThings) (2016)
24. Saliency Map. https://en.wikipedia.org/wiki/Saliency_map
25. Kim, H.-K., Park, J.H., June, H.-Y.: Effective traffic lights recognition method for real time
driving assistance system in the daytime. Int. J. Electr. Comput. Eng. 5(11), 1429–1432 (2011)
26. Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
27. Zebra2 camera. https://www.ptgrey.com/zebra2-28-mp-color-gige-hd-sdi-sony-icx687-
camera
28. Lu, H., Zhang, H., Yang, S., Zheng, Z.: Camera parameters auto-adjusting technique for robust
robot vision. In: Proceedings of International IEEE Conference on Robotics and Automation,
pp. 1518–1523 (2010)
29. Agarwal, V., Abidi, B.R., Koschan, A., Abidi, M.A.: An overview of color constancy algo-
rithms. J. Pattern Recogn. Res. 1(1), 42–54 (2006)
30. Shim, I., Lee, J.-Y., Kweon, I.S.: Auto-adjusting camera exposure for outdoor robotics using
gradient information. In: Proceedings of IEEE/RSJ International Conference on Intelligent
Robotics and Systems, pp. 1011–1017 (2014)
31. Deep learning. Wiki. https://en.wikipedia.org/wiki/Deep_learning
32. Hu, Y., Xie, X., Ma, W.-Y., Chia, L.-T., Rajan, D.: Salient region detection using weighted
feature maps based on the human visual attention model. In: Proceedings of Pacific Rim
Conference on Multimedia (2004)
33. Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection.
In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Proceedings of NIPS (2012)
35. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell,
T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of ACM
International Conference on Multimedia, pp. 675–678 (2014)
36. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with
region proposal networks. https://arxiv.org/abs/1506.01497
37. YOLO: real-time object detection. https://pjreddie.com/darknet/yolov1/
38. Redmon, J. Farhadi, A.: YOLO9000: better, faster, stronger. https://arxiv.org/abs/1612.08242
39. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single
shot multibox detector. https://arxiv.org/abs/1512.02325
40. IIRAV. https://www.a-star.edu.sg/i2r/RESEARCH/AUTONOMOUS-SYSTEMS
41. Data Distribution Service. https://en.wikipedia.org/wiki/Data_Distribution_Service
42. Long short-term memory. https://en.wikipedia.org/wiki/Long_short-term_memory
43. Koller, D., Weber, J., Malik, J.: Robust multiple car tracking with occlusion reasoning. Springer,
Berlin (1994)
44. She, K., Bebis, G., Gu, H., Miller, R.: Vehicle tracking using online fusion of color and shape
features. In: Proceedings of 7th IEEE International IEEE Conference on Intelligent Transporta-
tion Systems, pp. 731–736 (2004)
45. Chan, Y.-M., Huang, S.-S., Fu, L.-C., Hsiao, P.-Y.: Vehicle detection under various lighting
conditions by incorporating particle filter. In: Proceedings of IEEE Intelligent Transportation
Systems Conference, pp. 534–539 (2007)
46. Malley, R., Jones, E., Glavin, M.: Rear-lamp vehicle detection and tracking in low-exposure
color video for night conditions. IEEE Trans. Intell. Transp. Syst. 11(2), 453–462 (2010)
47. Casares, M., Almagambetov, A., Velipasalar, S.: A robust algorithm for the detection of vehicle
turn signals and brake lights. In: Proceedings of IEEE Ninth International Conference on
Advanced Video and Signal-Based Surveillance, pp. 386–391 (2012)
48. Almagambetov, A., Casares, M., Velipasalar, S.: Autonomous tracking of vehicle rear lights and
detection of brakes and turn signals. In: Proceedings of IEEE Symposium on Computational
Intelligence for Security and Defence Applications (CISDA), pp. 1–7 (2012)
49. Cui, Z.-Y., Yang, S.-W., Tsai, H.-M.: A vision-based hierarchical framework for autonomous
front-vehicle taillights detection and signal recognition. In: Proceedings of IEEE 18th Interna-
tional Conference on Intelligent Transportation Systems, pp. 931–937 (2015)
50. Thammakaroon, P., Tangamchit, P.: Predictive brake warning at night using taillight character-
istic. In: Proceedings of IEEE International Symposium on Industrial Electronics, pp. 217–221
(2009)
51. Ming, Q., Jo, K.-H.: Vehicle detection using tail light segmentation. In: Proceedings of 6th
IEEE International Forum on Strategic Technology (IFOST), vol. 2, pp. 729–732 (2011)
52. Nagumo, S., Hasegawa, H., Okamoto, N.: Extraction of forward vehicles by front-mounted
camera using brightness information. In: Proceedings of IEEE Canadian Conference on Elec-
trical and Computer Engineering, vol. 2, pp. 1243–1246 (2003)
53. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Cascade object detection with deformable
part models. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2241–2248 (2010)
54. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition. arXiv:1409.1556
55. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein M.: Imagenet large scale visual recognition challenge. arXiv:1409.0575
(2014)
The Application of Deep Learning
in Marine Sciences
Abstract Ecological studies are increasingly using video image data to study the
distribution and behaviour of organisms. Particularly in marine sciences cameras
are utilised to access underwater environments. Until now, image data has mostly been
processed by human observers, which is costly and often involves repetitive, mundane
work. Deep learning techniques that can automatically classify objects can
increase the speed and the amount of data that can be processed. This ultimately
will make image processing in ecological studies more cost effective, allowing stud-
ies to invest in larger, more robust sampling designs. As such, deep learning will be
a game changer for ecological research helping to improve the quality and quantity
of the data that can be collected. Within this chapter we introduce two case stud-
ies to demonstrate the application of deep learning techniques in marine ecological
studies. The first example demonstrates the use of deep learning in the detection and
classification of an important underwater ecosystem in the Mediterranean (Posido-
nia oceanica seagrass meadows), the other showcases the automatic identification
of several jellyfish species in coastal areas. Both applications showed high levels of
accuracy in the detection and identification of the study organisms, which represents
encouraging results for the applicability of these methodologies in marine ecolog-
ical studies. Despite its potential, deep learning has yet not been widely adopted
in ecological studies. Information technologists and natural scientists alike need to
more actively collaborate to move forward in this field of science. Cost-effective data
collection solutions are desperately needed in a time when large amounts of data are
required to detect and adapt to global environmental change.
1 Introduction
between ecology and new information technologies (but see e.g. [15–19]). This lack
of uptake may however be overcome in the future as end-user-based interfaces for
this technology become more user friendly.
Automated image classifications and segmentation through deep learning
techniques represent a game changer for ecological studies using image-based data
collection. It opens the possibility to increase data collection and processing to a
completely new level with the potential of delivering more robust and statistically
sound data at a highly reduced cost. The reduced cost may also provide the solution
for the maintenance or establishment of highly important long-term data collection
against the backdrop of anthropogenic change [20]. The collection of long-term data
at the appropriate spatial and temporal scale has thus far lacked commitment by gov-
ernments and scientists alike due to their high cost and initially low scientific returns
respectively [21].
In this chapter we present two case studies that use deep learning to automati-
cally process underwater images with the aim of showcasing the potential of these
methodologies in improving ecological data collection and processing. The studies
presented represent promising solutions for the data collection and processing of two
marine organisms highly relevant for society.
The first case study demonstrates the identification of seagrass meadows, Posi-
donia oceanica, from video sequences recorded from an Autonomous Underwater
Vehicle (AUV) using semantic segmentation. Seagrass meadows provide a wide
range of benefits for society such as the attenuation of wave energy thus contribut-
ing to the maintenance of sandy beaches as well as providing a habitat for many
commercial and non-commercial species. With the help of deep learning, larger and
more precise habitat maps can be produced for long term monitoring, vital for the
management and protection of this habitat.
The second case study shows how different species of jellyfish, some of them with
negative impacts for society, can be identified and classified using object detection
deep learning algorithms. This detection and assessment of jellyfish has relevance
with respect to increasing our understanding of jellyfish ecology and also provides the
potential for coastal monitoring systems to mitigate impacts of jellyfish on humans.
2 Methodology
layers. The hidden layers of a CNN consist of diverse convolutional layers, ReLU
activation layers, pooling layers, fully connected layers and normalisation layers.
The wide range of algorithms and applications of CNN in computer vision can
be classified into four main different types:
• Classification. Given a raw image the task is to identify the class which the image
belongs to.
• Classification and Localisation. Given a raw image, with only one object in it, the
task is to find the location of the object within the image.
• Object Detection. The task is to identify the location of several objects within an
image. Objects might be of the same class or different classes altogether.
• Image Segmentation. Each pixel composing an image is classified and assigned to
a particular class. Image segmentation is also known as semantic segmentation.
training the network X times, each time using one subset to test the network
and the remaining X − 1 subsets to train it. This method reduces the variability of the
results, yielding a more accurate performance estimate in the validation process.
From each cross-validation training applied to a set H of hyperparameters,
X models are generated, M_H^i, where H = 1, 2, ..., h is the hyperparameter
set number and i = 1, 2, ..., X the model index. Subsequently, the X models are
run on their corresponding test subsets, producing the predictions P_H^i. From
these predictions, each model is evaluated, assessing its performance R_H^i. Note that
depending on the model trained and its output (classification, localisation, object
detection or image segmentation), the metrics used and the evaluation process
may differ. Finally, the performance R_H of each hyperparameter set H
is obtained by computing the mean of the performances R_H^i of its X models.
The workflow for assessing the performance of each hyperparameter set is repre-
sented in Fig. 1.
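A compact Python sketch of this workflow, using scikit-learn's KFold and placeholder routines, could look as follows; train_fn, predict_fn and metric_fn are assumptions standing in for the actual training, prediction and evaluation steps.

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate_hyperparameter_set(hyperparams, images, labels,
                                train_fn, predict_fn, metric_fn, n_splits=5):
    """For one hyperparameter set H: train X models via X-fold cross-validation,
    evaluate each on its held-out fold and return the mean performance R_H."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    per_model_scores = []
    for train_idx, test_idx in kfold.split(images):
        model = train_fn(images[train_idx], labels[train_idx], hyperparams)  # M_H^i
        predictions = predict_fn(model, images[test_idx])                    # P_H^i
        per_model_scores.append(metric_fn(predictions, labels[test_idx]))    # R_H^i
    return float(np.mean(per_model_scores))                                  # R_H
```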
The next sections show the training and validation process of two different deep
ConvNets for automatic Posidonia oceanica segmentation and for jellyfish detection
and classification, respectively.
3 Seagrass Segmentation
Fig. 1 Hyperparameter set “H” workflow. The network is trained X times making use of the training dataset and the k-fold cross-validation method, outputting X models (here X = 5). The models are then evaluated on the test dataset. Finally, the evaluation metrics are calculated from the mean performance of the X models
allows extraction of low-level, coarse information from the image in the first sections;
then, as more convolutional and max-pooling layers are applied, the feature maps
shrink to 1/32 of the original image size, incorporating more complex high-level
information. Finally, the convolutional layers of the last section carry the spatial
information into the decoder and generate a low-resolution segmentation, while the
dropout layers help to reduce overfitting.
The decoder's purpose is to take the low-resolution segmentation output of the
encoder and up-sample it to the original image size, obtaining a high-resolution
segmentation. To accomplish this, a series of transposed convolutional layers is used.
These layers apply an inverse convolution over the input, up-sampling each pixel to the
convolutional kernel size. The decoder also contains skip layers [41], which integrate the
encoder's low-level features with the higher-level, coarse information from the transposed
convolutional layers. Lastly, an activation layer produces the final semantic segmentation.
The selected architecture makes use of the FCN8 decoder [42], which contains
three of the aforementioned transposed convolutional layers interleaved with three skip
layers. By adjusting the kernel sizes and strides of the transposed convolutional layers,
the shrunken feature maps are up-sampled to the original image size. Lastly, a softmax
activation layer produces the final probabilistic segmentation map. The architecture is
shown in Fig. 2.
This architecture has already been used for other segmentation tasks, such as road
segmentation for autonomous driving in [43] and class segmentation of the PASCAL
VOC 2011-2 dataset in [42], always with very good results.
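For readers who prefer code, the following is a hedged PyTorch sketch of a VGG16 encoder with an FCN8-style decoder for two classes. The original work was not necessarily implemented in PyTorch; the layer indices follow torchvision's VGG16, the skip connections follow the standard FCN-8s formulation, and the input height and width are assumed to be multiples of 32.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16FCN8(nn.Module):
    """Sketch of a VGG16 encoder + FCN8 decoder (P. oceanica vs. background)."""

    def __init__(self, num_classes=2):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        features = list(vgg.features)
        self.to_pool3 = nn.Sequential(*features[:17])    # 1/8 resolution, 256 maps
        self.to_pool4 = nn.Sequential(*features[17:24])  # 1/16 resolution, 512 maps
        self.to_pool5 = nn.Sequential(*features[24:])    # 1/32 resolution, 512 maps

        # Fully connected part re-expressed as 1x1 convolutions, with dropout
        # layers interleaved to reduce overfitting.
        self.fc = nn.Sequential(
            nn.Conv2d(512, 4096, 1), nn.ReLU(inplace=True), nn.Dropout2d(),
            nn.Conv2d(4096, 4096, 1), nn.ReLU(inplace=True), nn.Dropout2d(),
            nn.Conv2d(4096, num_classes, 1))

        # Skip layers: 1x1 convolutions scoring pool3 and pool4.
        self.score_pool3 = nn.Conv2d(256, num_classes, 1)
        self.score_pool4 = nn.Conv2d(512, num_classes, 1)

        # Transposed convolutions up-sample back to the input resolution.
        self.up2_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, x):
        pool3 = self.to_pool3(x)
        pool4 = self.to_pool4(pool3)
        pool5 = self.to_pool5(pool4)
        score = self.fc(pool5)                                  # 1/32
        score = self.up2_a(score) + self.score_pool4(pool4)     # 1/16
        score = self.up2_b(score) + self.score_pool3(pool3)     # 1/8
        score = self.up8(score)                                 # full resolution
        return torch.softmax(score, dim=1)                      # probabilistic map
```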
Training Details
To train the VGG16-FCN8 architecture, both the encoder and the decoder must be trained.
Training is conducted by readjusting the kernel values in the convolutional and transposed
convolutional layers, respectively. The architecture allows both encoder and decoder to be
trained with the same backpropagation functions, so training requires a single forward and
backward pass per iteration.
The training process uses images containing P. oceanica and their corresponding label
maps, in which each class is marked in a different colour.
To train the network, a backpropagation loss function is needed, indicating the direction
and magnitude of change. In this case, a cross-entropy loss function is used [44]; its
loss increases as the predicted probability diverges from the ground-truth label. The Adam
optimization algorithm is used to help the training reach the global minimum error [45].
Finally, to help prevent overfitting [46], two dropout layers are interleaved between the
fully connected layers of the encoder.
To benefit from transfer learning, the encoder layers are initialized with the pretrained
weights of a VGG network trained on ImageNet [47]. The transposed convolution layers of
the decoder are initialized with bilinear upsampling, and a truncated Gaussian initialization
is applied to the skip connections. These initialization parameters have already
presented great results for this network in [43].
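A minimal training-step sketch consistent with this description (cross-entropy loss on the softmax output, Adam optimizer) is given below; the learning rate and tensor shapes are illustrative assumptions, and VGG16FCN8 refers to the sketch above.

```python
import torch
import torch.nn.functional as F

model = VGG16FCN8(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train_step(images, label_maps):
    """images: (N, 3, H, W) float tensor; label_maps: (N, H, W) long tensor
    with 0 = background, 1 = P. oceanica."""
    model.train()
    optimizer.zero_grad()
    probs = model(images)                                  # softmax probabilities
    # Cross-entropy on the probabilistic output (NLL of the log-probabilities).
    loss = F.nll_loss(torch.log(probs.clamp_min(1e-8)), label_maps)
    loss.backward()                                        # single backward pass
    optimizer.step()                                       # updates encoder and decoder
    return loss.item()
```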
Fig. 2 VGG16-FCN8 neural network architecture. The encoder is composed of convolutional layers (blue), pooling layers (red) and dropout layers (black). The decoder is composed of skip layers (purple), transposed convolutional layers (green) and softmax layers (orange). For each layer, the number of feature maps (below) and their shape (above) are indicated
The trainings were performed on a computer equipped with an Intel Core i7-7700
processor, a GeForce GTX 1080 graphics card and 16 GB of RAM.
Fig. 3 Map of the study area showing the island of Mallorca in the Western Mediterranean. Sam-
pling points are indicated with arrows
Fig. 4 Posidonia oceanica images presenting a variety of P. oceanica and environmental conditions
Hyperparameter Combination
To find the hyperparameters that offer the best performance, the network was
trained with the different values and combinations shown in Table 2.
First, the network was trained with and without data augmentation; this technique
consists of applying contrast, brightness, colour and morphological transformations to
the training images in order to train on more diverse data, helping to reduce overfitting [48].
Secondly, two different learning rates were set, modifying the training step size when
minimizing the loss [49]. Finally, two numbers of iterations were used, setting how many
times the network backpropagates and updates its weights [49].
Experiments
Following the methodology explained in Sect. 2, eight experiments were
conducted, K = 1, 2, ..., 8, each assessing the performance of one hyperparameter
combination using its corresponding hyperparameters and a 5-fold cross-validation,
i = 1, 2, ..., 5. In each cross-validation, 4 subsets of the mix dataset (80%
of the data) were used to train the network and the remaining one (20% of the data)
was used to test it. In addition, the entire extra dataset was used to test the network. This
process is described in Fig. 6.
The evaluation of each model starts by binarizing its probabilistic outputs;
we decided to perform this binarization at nine equally distributed threshold values,
j = 1, 2, ..., 9 (Fig. 7).
Then, we performed a comparison between each binarized output and its corresponding
label map, acting as ground truth.
From this comparison, we generated a confusion matrix, which indicates the number
of P. oceanica pixels identified correctly (True Positives, TP) and missed (False
Negatives, FN), as well as the number of background pixels identified correctly (True
Negatives, TN) and wrongly labelled as P. oceanica (False Positives, FP). From these values,
the accuracy, precision, recall and fall-out of the model are computed.
Finally, a Receiver Operating Characteristic (ROC) curve is generated [50], representing
the recall against the fall-out of the classifier at the various thresholds. The analysis of the
Area Under the Curve (AUC) of the ROC curve offers a measure of the classifier performance.
Fig. 6 Experiment “K” validation process. For each study case, the network was trained five times using the k-fold cross-validation method, generating five models. Each model was evaluated on the mix test data and on the whole extra dataset. The final experiment results were obtained as the mean performance of its five models
Fig. 7 Probabilistic network output of an image (a) and one of its corresponding binarizations (b)
Fig. 8 Evaluation process for the model “i” of experiment “K”. The network prediction is binarized at j = 1, 2, ..., 9 threshold values, generating a confusion matrix for each one. Finally, the evaluation metrics are calculated
Figure 8 represents the process followed to evaluate a model.
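The evaluation of one model can be sketched in a few lines of NumPy: binarize the probabilistic output at nine thresholds, build the confusion matrix against the ground-truth label map, derive the metrics, and integrate recall over fall-out to obtain the AUC. Function and variable names are illustrative assumptions.

```python
import numpy as np

def evaluate_model(prob_map, ground_truth, thresholds=np.linspace(0.1, 0.9, 9)):
    """prob_map: (H, W) P. oceanica probabilities; ground_truth: (H, W) booleans."""
    gt = ground_truth.astype(bool)
    results = []
    for th in thresholds:
        pred = prob_map >= th
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        fn = np.sum(~pred & gt)
        tn = np.sum(~pred & ~gt)
        results.append({
            "threshold": float(th),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,      # ROC y-axis
            "fall_out": fp / (fp + tn) if fp + tn else 0.0,    # ROC x-axis
        })
    # AUC of the ROC: trapezoidal integration of recall over fall-out,
    # adding the (0, 0) and (1, 1) end points of the curve.
    fo = np.array([1.0] + [r["fall_out"] for r in results] + [0.0])
    rc = np.array([1.0] + [r["recall"] for r in results] + [0.0])
    auc = float(abs(np.sum((fo[1:] - fo[:-1]) * (rc[1:] + rc[:-1]) / 2.0)))
    return results, auc
```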
This section presents the results obtained for each experiment, along with the
hyperparameter selection process.
In this section we use a three-digit notation to refer to each experiment, indicating
its hyperparameters. The first digit indicates whether data augmentation was used (1) or not
(0). The second indicates whether the learning rate was 1e–05 (1) or 5e–04 (5). The
last digit indicates whether the number of iterations was 8000 (8) or 16,000 (16).
Experiment Performance
Mix Dataset Results
Figure 9 shows the results of evaluating the mix test set. In Fig. 9a, the ROC curve
and its AUC value for each experiment are presented, while the precision
and accuracy values at the optimal binarization threshold are shown as bar charts in Fig. 9b.
The optimal binarization threshold is selected as the one presenting the highest trade-off
between recall and fall-out, calculated as

$$ \text{Trade-off} = \frac{\text{Recall} + (1 - \text{Fall-out})}{2} \tag{1} $$
The ROC curves for all experiments showed AUC values over 95%, reaching a
maximum of 98.7% for the 1_1_16 experiment. According to the criteria established
in [51] to determine how good a classifier is based on its AUC value, these values
correspond to excellent classifiers.
Precision and accuracy values were greater than 90% for all the experiments.
The maximum precision achieved was 97.5%, for experiment 1_1_8, while the
lowest was 92.2%. For the accuracy, the maximum achieved was 96.5%, for
experiment 1_1_16.
The comparison of the different experiments on a hyperparameter basis showed
that:
• Experiments with lower learning rates presented better precision, accuracy and
AUC values than experiments with higher rates.
• The effect of the number of iterations is almost negligible, with the metrics slightly
better when training for 16,000 iterations.
• The application of data augmentation had a similarly slight effect to the number
of iterations, presenting a small benefit when applied.
These almost negligible effects may be due to specific conditions of our application,
namely that the network is already trained after 8,000 iterations and that the training set
is already diverse on its own, respectively.
Figure 10 shows qualitative results over images of the mix test set.
Extra Dataset Results
The results obtained on the mix dataset were promising but, as mentioned in Sect. 3.2,
the test images were extracted from the same immersions used to train the network and
therefore contain similar environmental conditions. To assess the performance of
each model on unseen conditions, we evaluated them on the extra dataset; the
results are presented in Fig. 11.
The AUC values of experiments that used a learning rate of 5e–04 were lower
than those achieved on the mix test set, reaching values around
92%. On the other hand, experiments that used a learning rate of 1e–05 maintained
the good results obtained on the mix dataset, achieving AUC values around
97.7% when the network was trained for 16,000 iterations and 97.0% when trained for 8,000.
Fig. 9 Mix test set results. a ROC curves and corresponding AUC. b Precision and accuracy metrics obtained at the optimal binarization threshold
Fig. 10 Qualitative results obtained for images from the mix test set. On the first row, two original
images are shown. The second row of images illustrate the original images with their corresponding
ground truth superimposed in red. Finally, the last set of images show the results of the segmentation
superimposed in green to the original images
These results show that the models do not overfit the training images, being able to
generalize their training to images taken with a different camera, containing different,
unseen environmental and P. oceanica conditions.
The same trend can be seen for the precision and accuracy, where experiments
with higher learning rates achieved values around 85% for both metrics, while experiments
that used lower learning rates achieved values around 96% and 95%, respectively.
Fig. 11 Extra test set results. a ROC curves and corresponding AUC. b Precision and accuracy metrics obtained at the optimal binarization threshold
212 M. Martin-Abadal et al.
Fig. 12 Qualitative results obtained for images from the extra dataset. On the first row, two original
images are shown. The second row of images illustrate the original images with their corresponding
ground truth superimposed in red. Finally, the last set of images show the results of the segmentation
superimposed in green to the original images
respectively. It also can be seen, experiments where the number of iterations was set
to 16,000 presented slightly higher metrics.
Figure 12 shows qualitative results over test images of the extra dataset.
Hyperparameters Evaluation
We conducted an overall comparison on a hyperparameter basis from the evaluation
results of all experiments, finding the hyperparameters that offer the best performance.
The results clearly indicate that experiments that used lower learning rates
obtained better AUC, precision and accuracy results. The best learning rate was
identified at 1e–05. Also, it can be seen that experiments conducted using a large
number of steps tend to have a slightly better performance. The best number of
iterations was identified at 16,000. Finally, we decided to apply data augmentation,
as it helps to generalize the training to new, unseen conditions in future immersions.
Over the past decades, social and scientific concern about increasing jellyfish
numbers has risen. This can be noticed in the number of reports on jellyfish: over the
past two decades, the number of media and news reports has increased dramatically,
by over 500% [52], often with alarmist headlines [53].
Parallel to this, there is an ongoing scientific debate on whether jellyfish numbers
are on the rise. On the one hand, some scientists argue that populations are increasing
due to a range of natural and man-made causes [54, 55], while on the other, some
scientists maintain that jellyfish populations have remained constant over time [53].
The lack of baseline data to endorse conclusions makes it difficult to support either
argument.
Regardless of the outcome of the debate, coastal populations are increasing, with
40% of the global population living within 100 km of the coast [56] and many more
spending their holidays and free time in coastal areas. The increase in the use of the
coast and its associated resources and benefits is leading to a higher rate of encounters
between humans and jellyfish with all the associated socioeconomic consequences
[57]. Among others, jellyfish aggregations are known to negatively affect coastal
tourism with associated impacts on tourism revenues and the tourism industry [58].
Large aggregations of jellyfish can interfere with fishing operations by presenting
a health hazard to fishermen when pulling the fishing gear on board, splitting the
fishing nets due to the weight of the jellyfish in the nets, or ruining the catch [59]. In
aquaculture, large aggregations of jellyfish have reportedly killed fish in pens [60,
61]. Water desalination and power plants have also suffered the consequences of
the presence of high numbers of jellyfish, which can clog seawater intake screens
causing power reductions and shutdowns [62, 63].
There is, therefore, a need to develop new technologies that enable the automatic
detection of these organisms to facilitate the design of adaptive management strate-
gies in order to mitigate jellyfish associated impacts. Furthermore, the development
of such technology will greatly facilitate the collection of long term monitoring data
in a cost-effective way.
So far, most studies aimed at monitoring and assessing the presence of jellyfish
have relied on manual methods, such as visual counts from boats [64] or small
aircraft [65], or on a combination of video recording with subsequent human-based
manual counting [13]. Manual methods, however, greatly limit the scope of the
studies both from a spatial and a temporal perspective.
In this application an object detection architecture is used for the detection and
classification of the different jellyfish species. In the following section, the network
architecture and the training details are presented.
Fig. 13 Inception module, showing how the input is convolved by three different kernel sizes:
1 × 1, 3 × 3, and 5 × 5. To limit the number of input channels, an extra 1 × 1 convolution is added
before the 3 × 3 and 5 × 5 convolutions
Network Architecture
The architecture used is the Inception-ResNet V2 [68], a very deep convolutional
neural network with over 450 layers that can efficiently learn to identify objects
in images, outputting instance bounding boxes and classifying them into one of the
specified classes with a confidence percentage.
When detecting objects in an image, one of the main problems is selecting the
kernel sizes for the convolutional layers, as the same object may appear with huge
size and shape variations from one instance to another. A larger kernel is preferred
for bigger, more global instances, and a smaller kernel is preferred for smaller ones.
To tackle this issue, the architecture performs multiple parallel convolutions using
different kernel sizes, making the network “wider” rather than “deeper”. The blocks
of layers containing these convolutions are called inception modules [69], represented
in Fig. 13.
Another characteristic of the network is the use of Residual Connections [70],
which add the output of the convolution operation of the inception module to its
input. This introduces shortcuts in the model, translating into a more easily optimized
and accurate network. Figure 14 shows the structure of a Residual Connection.
This architecture combines the inception modules with Residual Connections,
obtaining the so-called Inception-ResNet modules. Figure 15 shows an example of
these modules.
With these Inception-ResNet Modules, the main body of the architecture is built.
Figure 16 shows a compressed view of the architecture.
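To make the structure of these modules concrete, the following is a minimal PyTorch sketch of an inception-style block with a residual connection. The channel counts and branch layout are illustrative assumptions and do not reproduce the exact Inception-ResNet V2 configuration of [68].

```python
import torch
import torch.nn as nn

class InceptionResidualBlock(nn.Module):
    """Illustrative inception-style block with a residual (shortcut) connection."""

    def __init__(self, channels):
        super().__init__()
        # Parallel branches with different receptive fields
        self.branch1 = nn.Conv2d(channels, 32, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),        # 1x1 bottleneck
            nn.Conv2d(32, 32, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),        # 1x1 bottleneck
            nn.Conv2d(32, 32, kernel_size=3, padding=1),   # two 3x3 convs approximate one 5x5
            nn.Conv2d(32, 32, kernel_size=3, padding=1))
        # 1x1 projection so the concatenated output matches the input depth
        self.project = nn.Conv2d(3 * 32, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        out = self.project(out)
        return self.relu(x + out)   # residual sum with the block input

x = torch.randn(1, 64, 35, 35)
print(InceptionResidualBlock(64)(x).shape)  # torch.Size([1, 64, 35, 35])
```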
Fig. 14 Structure of a Residual Connection (convolutions followed by a ReLU activation, with the result added back to the block input)
Training Details
The Inception-ResNet V2 architecture is trained by means of readjusting the values
of the kernels in the convolutional layers, backpropagating the loss computed over
the predictions obtained on the softmax layers.
Due to the high number of layers, the backpropagated gradient of the loss becomes small and
insufficient to update the kernel values properly. To prevent the middle part of the network from
“dying out” during the backpropagation process, an auxiliary classifier is applied at
the output of the second block of Inception-ResNet modules. In this way, an auxiliary
loss is computed and added to the prior one as shown in Eq. 2.
In order to train the network and adjust the kernel weights, a loss function to
backpropagate is needed. In this case, a smooth L1 localization loss is used; its value
increases as the predicted bounding box location diverges from the one specified
in the ground truth. Also, the Momentum optimizer along with gradient
clipping strategies [71] is implemented in order to help the training process reach
the global minimum error.
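As an illustration of how the localization loss behaves, the sketch below implements the smooth L1 loss in its common form (quadratic for small errors, linear for large ones) and combines it with an auxiliary loss. The 0.3 auxiliary weight is an assumption for illustration only, since Eq. 2 is not reproduced here.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic for errors below 1, linear above."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def total_loss(main_loss, aux_loss, aux_weight=0.3):
    """Main loss plus a down-weighted auxiliary loss; the weight is illustrative."""
    return main_loss + aux_weight * aux_loss

pred = np.array([0.2, 0.4, 0.9, 0.7])   # predicted box parameters
gt = np.array([0.25, 0.38, 1.5, 0.7])   # ground-truth box parameters
print(total_loss(smooth_l1(pred, gt), aux_loss=0.1))
```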
The architecture used for this application had already been trained on the COCO
dataset [72]. To retrain the network with the desired classes, a set of images containing
different jellyfish species and their corresponding ground truth is needed. The ground
truth in this case is a text file for each image, where the bounding box and class for
each jellyfish instance present in the image are indicated.
The trainings were performed on the same computer mentioned in Sect. 3.1.
Fig. 15 Inception-ResNet-A module. The max-pooling branch from the Inception module is substituted by the Residual Connection. The 5 × 5 convolution is split into two equivalent 3 × 3 convolutions, improving computational and accuracy performance (neural networks perform better when convolutions do not alter the dimensions of the input drastically). Finally, for the residual sum to work, the input and the output after convolution must have the same dimensions; hence, a 1 × 1 convolution is applied after the original convolutions to match the depth sizes
This section describes the experimental framework followed. First, the image acqui-
sition, organisation and labelling processes are described. Subsequently, the different
case studies and hyperparameters used are presented. Finally, we describe the vali-
dation and evaluation details.
Datasets
Acquisition
Training and testing images were extracted from underwater video sequences of the
three species under consideration. The objective was to construct a dataset contain-
ing the three species under different conditions, such as water coloration, turbidity,
illumination and different jellyfish positions and sizes, assuring robustness in the
training process.
A dataset of 842 images was generated; 80% of the dataset was used to train the
network (674 images), while the remaining 20% was used for testing purposes (168
images).
Figure 17 shows sample images from the dataset showcasing different conditions.
Labelling
For every image of the dataset, an annotation file was generated using the LabelImg
tool [73], which produces an “.xml” file containing the position and classification
of each instance present in the image. Figure 18 shows an original image along with
its ground truth “.xml” text file.
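Since the annotations follow the LabelImg “.xml” layout shown in Fig. 18, they can be read with Python's standard ElementTree module, as in the short sketch below; the file name used is hypothetical.

```python
import xml.etree.ElementTree as ET

def read_labelimg_annotation(xml_path):
    """Parse a LabelImg .xml file into a list of (class, xmin, ymin, xmax, ymax)."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text                     # e.g. "tuberculata"
        bb = obj.find("bndbox")
        coords = [int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *coords))
    return boxes

# For the annotation shown in Fig. 18 this would return
# [("tuberculata", 616, 127, 973, 525)]
print(read_labelimg_annotation("IMG_00012.xml"))
```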
Case Studies
Following the same procedure used in the previous application in Sect. 3.2, the net-
work was trained using different sets of hyperparameters. The network was first
trained with and without data augmentation; secondly, two different
learning rates were set; and finally, the network was trained using three values for the
number of iterations. The resulting combinations of hyperparameters
are shown in Table 3.
Experiments
Following the methodology described in Sect. 2 and implemented in the previ-
ous application in Sect. 3.2, twelve different experiments were conducted (K =
1, 2, ..., 12), each one assessing the performance of a case study, using its corre-
sponding hyperparameters and applying a 5-fold cross-validation (i = 1, 2, ..., 5),
as shown in Fig. 19.
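A sketch of the 5-fold splitting used for each experiment is given below, here with scikit-learn's KFold; the image list and the training and evaluation steps are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import KFold

image_paths = np.array([f"img_{i:03d}.jpg" for i in range(674)])  # hypothetical training images

# Five folds, as used for every experiment K
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kfold.split(image_paths), start=1):
    # train_model(...) and evaluate(...) would go here
    print(f"model {i}: {len(train_idx)} training images, {len(test_idx)} held-out images")
```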
In order to evaluate the performance of each model, the Intersection over Union
(IoU) method along with the average precision (AP) metric [74] were used; these are
the most common evaluation methods for object detection, used in
competitions such as PASCAL VOC [75], ImageNet [76] or COCO [72].
Fig. 17 Images from the dataset showing the three jellyfish species under different environmental
conditions. Top: P. noctiluca, centre: R. pulmo, bottom: C. tuberculata
<annotation>
<folder>Tuberculata</folder>
<filename>IMG_00012.jpg</filename>
<path>D:\Jellyfish\Tuberculata\IMG_00012.jpg</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>1280</width>
<height>720</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>tuberculata</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>616</xmin>
<ymin>127</ymin>
<xmax>973</xmax>
<ymax>525</ymax>
</bndbox>
</object>
</annotation>
Fig. 18 a Original image. b Corresponding ground truth “.xml” file, specifying the jellyfish location
and class
Table 3 Case studies. When applying data augmentation, random rotations and horizontal and vertical flips are applied. The decay learning rate consists of applying a learning rate of 5e–04 until 50% of the training and then dropping it to 5e–05

Case   Data aug.   Learning rate   Iterations (k)
1      0           5e–04           10
2      0           5e–04           20
3      0           5e–04           40
4      0           Decay           10
5      0           Decay           20
6      0           Decay           40
7      1           5e–04           10
8      1           5e–04           20
9      1           5e–04           40
10     1           Decay           10
11     1           Decay           20
12     1           Decay           40
The IoU measure gives the similarity between the predicted and the ground-truth
bounding boxes, and is defined as the area of the intersection between the
bounding boxes divided by the area of their union (see Eq. 3).
Figure 20 illustrates how the IoU is calculated for a prediction.
IoU = A_intersection / A_union    (3)
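In code, the IoU of two axis-aligned boxes can be computed directly from their corner coordinates, as in the minimal Python sketch below; coordinates follow the (xmin, ymin, xmax, ymax) convention of the annotations, and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```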
Fig. 19 Experiment “K” validation process. For each one of the twelve study cases, the network was trained five times using the k-fold cross-validation method. The output models were run and evaluated over the test data. The final experiment results were obtained as the mean performance of its five models
Once the IoU is calculated for a prediction, a threshold value over the IoU is
established in order to determine whether that prediction is a TP or an FP. Following the
criterion applied in the PASCAL VOC challenge, this threshold is set at thr_iou = 0.5.
A prediction is classified as a TP if its IoU value with any ground truth bounding box
is greater than thr_iou and the predicted class matches the corresponding ground
truth class; otherwise, the detection is marked as an FP. The following equation
represents this criterion:
Detection = TP, if IoU >= thr_iou and C_pred = C_gt
            FP, otherwise                               (4)
Also, ground truth instances which do not have an IoU > thr_iou with any prediction
are marked as FN.
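One common way to implement the criterion of Eq. 4 is a greedy matching of predictions against ground truth boxes, as sketched below. It reuses the iou() helper from the previous sketch; the data layout and function name are illustrative assumptions, not the authors' implementation.

```python
def match_detections(predictions, ground_truths, thr_iou=0.5):
    """Classify predictions as TP or FP (Eq. 4) and count unmatched ground truths as FN.

    Both arguments are lists of (class_label, box) tuples.
    """
    tp, fp = 0, 0
    matched = set()                          # ground-truth indices already assigned
    for cls_pred, box_pred in predictions:
        best_iou, best_j = 0.0, None
        for j, (cls_gt, box_gt) in enumerate(ground_truths):
            if j in matched:
                continue
            overlap = iou(box_pred, box_gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= thr_iou and cls_pred == ground_truths[best_j][0]:
            tp += 1
            matched.add(best_j)              # a ground truth can only be matched once
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)   # ground truths with no matching prediction
    return tp, fp, fn
```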
From the TP, FP and FN values, the precision and recall metrics are calculated
for all classes. Finally, from these metrics, the AP of each class and mean AP (mAP)
between classes are obtained. The AP can be understood as the average of the max-
imum precision at different recall values, or the area under the max(Precision)-Recall
curve. Figure 21 exemplifies the calculation of the AP for a series of detections. More
information about this evaluation metric can be found in [74].
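The AP itself can be computed from the ordered list of detections. The sketch below assumes the detections are already sorted by descending confidence and flagged as TP (1) or FP (0); the flags for detections 4–10 follow the values visible in Fig. 21, while the first three flags (TP, TP, FP) are an assumption consistent with the recall values shown there.

```python
import numpy as np

def average_precision(tp_flags, n_ground_truths):
    """AP as the area under the max(Precision)-Recall curve."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(1 - tp_flags)
    recall = tp_cum / n_ground_truths
    precision = tp_cum / (tp_cum + fp_cum)
    # Replace the precision curve by its running maximum from the right
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate the max-precision curve over the recall steps
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

flags = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]       # ten detections sorted by confidence
print(average_precision(flags, n_ground_truths=5))
```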
The workflow followed to determine the detection performance of each model is
shown in Fig. 22.
Fig. 21 Example calculation of the AP for a series of detections (detections 4–10: 4 FP, precision 0.5, recall 0.4; 5 FP, 0.4, 0.4; 6 TP, 0.5, 0.6; 7 TP, 0.57, 0.8; 8 FP, 0.5, 0.8; 9 FP, 0.44, 0.8; 10 TP, 0.5, 1.0). The blue line represents the Precision-Recall curve. The orange line represents the max(Precision)-Recall curve. The AP value equals the area under the max(Precision)-Recall curve (orange area)
Fig. 22 Model “i” of experiment “K” evaluation process. The detection output is compared with its corresponding ground truth, obtaining the FP, FN and TP values. From these, the Precision, Recall and mAP values are calculated
4.3 Results
This section shows the results for each of the experiments and the assessment of the
hyperparameters.
Experiment Performance
The mean results obtained when evaluating the five models of each experiment over
their corresponding test set are shown in Table 4, showing the AP obtained for each
class and the mAP value.
Table 4 Results obtained for all experiments from evaluating the test set, showing the AP for each class and the mAP over all classes

Exp.   Data aug.   Learning rate   Iterations (k)   AP noctiluca (%)   AP pulmo (%)   AP tuberculata (%)   mAP (%)
1      0           0.0005          10               74.2               96.5           96.5                 89.1
2      0           0.0005          20               75.3               95.7           96.8                 89.3
3      0           0.0005          40               74.4               96.5           96.8                 89.3
4      0           Decay           10               75.6               97.3           96.4                 89.7
5      0           Decay           20               76.5               97.7           96.3                 90.2
6      0           Decay           40               77.1               98.1           96.3                 90.5
7      1           0.0005          10               72.9               98.1           96.6                 89.2
8      1           0.0005          20               76.4               97.7           96.7                 90.3
9      1           0.0005          40               77.3               98.0           96.6                 90.6
10     1           Decay           10               73.7               98.5           96.9                 89.7
11     1           Decay           20               74.8               98.6           97.1                 90.1
12     1           Decay           40               76.2               98.5           96.8                 90.5
All experiments show mAP values around 90%, reaching a maximum of 90.6%
for experiment 9 and a minimum value of 89.1% for experiment 1. Looking at the
AP values for the three species, it can be seen that both R. pulmo and C. tuberculata
have much higher AP values than P. noctiluca. This might be due to the fact that R.
pulmo and C. tuberculata are bigger specimens and the shape of their bodies remains
relatively unchanged while swimming, and therefore they might be easier to identify.
On the contrary, in P. noctiluca the relative position of the tentacles in relation to the
main body (umbrella) changes to a greater extent with the movement of the animal,
adopting therefore a multitude of shapes and thus making it more difficult to identify.
Experiments where data augmentation is applied tend to have a slightly better
performance. The same occurs with the number of iterations: experiments that are
trained for 20 k or 40 k iterations show a small increase in performance. The
application of the decay technique to the learning rate does not seem to have a
significant impact on the performance. Qualitative results for the jellyfish detection
process over the test dataset are shown in Fig. 23.
Fig. 23 Visualization of the jellyfish detection obtained from images of the test set. The green
bounding boxes represent P. noctiluca detections, the blue boxes correspond to R. pulmo and the
orange ones to C. tuberculata
5 Conclusions
The two applications presented in this chapter clearly demonstrate the potential
of deep learning in aiding the classification and processing of large image data sets
collected in ecological studies. In the first study example, we showcase the application
of a deep semantic segmentation neural network architecture to automatically detect
the habitat-forming seagrass species P. oceanica.
Diverse hyperparameter configurations were tested in order to find those that
provided the best metrics. The evaluation results of the models showcased that the
best metrics were achieved when data augmentation was applied and the network
was trained for 16,000 iterations with a learning rate of 1e–05. A video presenting
the network semantic segmentation can be seen at [77].
The results of this study are encouraging and show that deep learning techniques
can be a useful tool for the automatic classification of underwater habitats. Future
research should extend on this capability and build networks that can detect and
classify multiple habitat types of coastal areas.
In the second example study, an object detection deep network has been used to
automatically identify three commonly occurring species of jellyfish in the Mediter-
ranean.
Once again, diverse hyperparameter configurations were tested in order to find
those that provided the best metrics. In this case, the evaluation results of the models
showcased that the best metrics were achieved when data augmentation was applied
and the network was trained for 40,000 iterations with a decaying learning rate. A
video presenting the network's object detection can be seen at [78].
These results show the potential of object detection in the identification of marine
species in image data, not only for jellyfish but for many other species that can be
filmed in underwater environments. With respect to the detection of jellyfish, these
results will be used to train the network to recognise more species. The demon-
strated automatic detection methods will have direct applications for the monitoring
of jellyfish in the proximity of beaches.
To conclude, deep learning techniques have a huge potential in supporting ecolog-
ical studies. Once neural networks are functioning to a high precision in the detection
of habitats or species, they can be applied to other datasets originating from other
locations, thus providing help in image data processing for a much wider, potentially
global, audience of scientists. Equally, with the help of experts in providing classified
images, existing neural networks can be extended to include more habitats or species
in the future. However, to further this development, information technologists and
natural scientists alike need to engage more actively with each other's fields and search
for collaborations. Deep learning techniques have an essential part to play in moving
ecological studies to a new level, providing more cost-effective data collection solu-
tions at a time when large amounts of data are needed to detect and adapt to global
environmental change.
supported by the Ramón y Cajal Fellowship (grant by the Ministerio de Economía y Competitivi-
dad de España and the Conselleria d’Educació, Cultura i Universitats Comunidad Autonoma de las
Islas Baleares). We would like to thank Charlotte Jennings for her help in identifying and labelling
jellyfish in underwater images.
References
1. Caughlan, L.: Cost considerations for long-term ecological monitoring. Ecol. Indic. 1(2), 123–
134 (2001)
2. Barrio Froján, C.R.S., Cooper, K.M., Bolam, S.G.: Towards an integrated approach to marine
benthic monitoring. Mar. Pollut. Bull. 104(1–2), 20–28 (2016)
3. Del Vecchio, S., Fantinato, E., Silan, G., Buffa, G.: Trade-offs between sampling effort and
data quality in habitat monitoring. Biodivers. Conserv. 28(1), 55–73 (2018)
4. Bennett, M., Acott, C., Richardson, K., Bowen, S., Smart, D., Smith, P., Sharp, F., Bryson, P.,
Goble, S.: Recreational technical diving part 1: an introduction to technical diving methods and
activities. J. S. Pac. Underw. Med. Soc. Eur. Underw. Baromedical Soc. 43(4), 86–93 (2013)
5. Hissmann, K., Schauer, J.: Manned submersible JAGO. J. Large-Scale Res. Facil. JLSRF 3,
A110 (2017)
6. Sagalevich, A.M.: 30 years experience of Mir submersibles for the ocean operations. Deep Sea
Res. Part II Top. Stud. Ocean. 155(2017), 83–95 (2017)
7. Rosenkranz, G.E., Byersdorfer, S.C.: Video scallop survey in the eastern Gulf of Alaska, USA.
Fish. Res. 69(1), 131–140 (2004)
8. Lambert, G.I., Jennings, S., Hinz, H., Murray, L.G., Parrott, L., Kaiser, M.J., Hiddink, J.G.: A
comparison of two techniques for the rapid assessment of marine habitat complexity. Methods
Ecol. Evol. 4(3), 226–235 (2013)
9. Santana-Garcon, J., Newman, S.J., Harvey, E.S.: Development and validation of a mid-water
baited stereo-video technique for investigating pelagic fish assemblages. J. Exp. Mar. Biol.
Ecol. 452, 82–90 (2014)
10. Gallo, N.D., Cameron, J., Hardy, K., Fryer, P., Bartlett Douglas, H., Levin Lisa, A.: Submersible-
and lander-observed community patterns in the Mariana and New Britain trenches: influence
of productivity and depth on epibenthic and scavenging communities. Deep Sea Res. Part I
Ocean. Res. Pap. 99, 119–133 (2015)
11. Sheehan, E., Vaz, S., Pettifer, E., Foster, N., Nancollas, S., Cousens, S., Holmes, L., Facq,
J.-V., Germain, G., Attrill, M.: An experimental comparison of three towed underwater video
systems using species metrics, benthic impact and performance. Methods Ecol. Evol. 7(07)
(2016)
12. Langlois, T.J., Harvey, E.S., Fitzpatrick, B., Meeuwig, J.J., Shedrawi, G., Watson, D.L.: Cost-
efficient sampling of fish assemblages: comparison of baited video stations and diver video
transects. Aquat. Biol. 9(2), 155–168 (2010)
13. Holmes, T.H., Wilson, S.K., Travers, M.J., Langlois, T.J., Evans, R.D., Moore, G.I., Dou-
glas, R.A., Shedrawi, G., Harvey, E.S., Hickey, K.: A comparison of visual- and stereo-video
based fish community assessment methods in tropical and temperate marine waters of Western
Australia. Limnol. Ocean. Methods 11(7), 337–350 (2013)
14. Martin-Abadal, M., Guerrero-Font, E., Bonin-Font, F., Gonzalez-Cid, Y.: Deep semantic seg-
mentation in an AUV for online posidonia oceanica meadows identification. IEEE Access 6,
60956–60967 (2018)
15. Gray, P.C., Fleishman, A.B., Klein, D.J., McKown, M.W., Bézy, V.S., Lohmann, K.J., Johnston,
D.W.: A convolutional neural network for detecting sea turtles in drone imagery. Methods Ecol.
Evol., 1–11 (2018). In Review
16. Kotta, J., Valdivia, N., Kutser, T., Toming, K., Rätsep, M., Orav-Kotta, H.: Predicting the cover
and richness of intertidal macroalgae in remote areas: a case study in the Antarctic Peninsula.
Ecol. Evol. 8(17), 9086–9094 (2018)
17. Wäldchen, J., Mäder, P.: Machine learning for image based species identification. Methods
Ecol. Evol. 9(11), 2216–2225 (2018)
18. Weinstein, B.G.: Scene-specific convolutional neural networks for video-based biodiversity
detection. Methods Ecol. Evol. 9(6), 1435–1441 (2018)
19. Willi, M., Pitman, R.T., Cardoso, A.W., Locke, C., Swanson, A., Boyer, A., Veldthuis, M.,
Fortson, L.: Identifying animal species in camera trap images using deep learning and citizen
science. Methods Ecol. Evol., 1–12 (2018)
20. Mieszkowska, N., Sugden, H., Firth, L.B., Hawkins, S.J.: The role of sustained observations
in tracking impacts of environmental change on marine biodiversity and ecosystems. Philos.
Trans. R. Soc. A Math. Phys. Eng. Sci. 372(2025) (2014)
21. Gouraguine, A., Moranta, J., Ruiz-Frau, A., Hinz, H, Reñones, O., Ferse, S.C.A., Jompa, J.,
Smith, D.J.: Citizen science in data and resource-limited areas: a tool to detect long-term
ecosystem changes. Plos One 14(1), e0210007 (2019)
22. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel,
L.D.: Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process.
Syst. 2, 396–404 (1990)
23. Russakovsky, O., Deng, J., Hao, S., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition
challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition. ArXiv:1409.1556, Sept 2014
25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
ArXiv:1512.03385, Dec 2015
26. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional
Networks, Aug 2016
27. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
Rabinovich, A.: Going deeper with convolutions. ArXiv:1409.4842, Sept 2014
28. Chollet, F.: Xception: deep learning with depthwise separable convolutions.
ArXiv:1610.02357, Oct 2016
29. Geisser, S.: The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70(350),
320–328 (1975)
30. Diaz-Almela, E., Duarte, C.: Management of Natura 2000 Habitats 1120, (Posidonia Oceani-
cae). Technical report, European Commission (2008)
31. Ruiz-Frau, A., Gelcich, S., Hendriks, I.E., Duarte, C.M., Marbà, N.: Current state of seagrass
ecosystem services: research and policy integration. Ocean. Coast. Manag. 149, 107–115 (2017)
32. Marba, N., Duarte, C.: Mediterranean warming triggers seagrass (posidonia oceanica) shoot
mortality. Glob. Chang. Biol. 16(8), 2366–2375 (2010)
33. Telesca, L., Belluscio, A., Criscoli, A., Ardizzone, G., Apostolaki, E.T., Fraschetti, S., Gristina,
M., Knittweis, L., Martin, C.S., Pergent, G., Alagna, A., Badalamenti, F., Garofalo, G., Ger-
akaris, V., Pace, M.L., Pergent-Martini, C., Salomidi, M.: Seagrass meadows (posidonia ocean-
ica) distribution and trajectories of change. Scientific reports (2015)
34. y Royo, C.L., Pergent, G., Pergent-Martini, C., Casazza, G.: Seagrass (posidonia oceanica)
monitoring in western mediterranean: implications for management and conservation. Environ.
Monit. Assess. 171, 365–380 (2010)
35. Sagawa, T., Komatsu, T.: Simulation of seagrass bed mapping by satellite images based on the
radiative transfer model. Ocean. Sci. J. 50(2), 335–342 (2015)
36. Montefalcone, M., Rovere, A., Parravicini, V., Albertelli, G., Morri, C., Bianchi, C.N.: Evalu-
ating change in seagrass meadows: a time-framed comparison of side scan sonar maps. Aquat.
Bot. 104, 204–212 (2013)
37. Vasilijevic, A., Miskovic, N., Vukic, Z., Mandic, F.: Monitoring of seagrass by lightweight
AUV: a posidonia oceanica case study surrounding Murter island of Croatia. In: Mediterranean
Conference on Control and Automation, pp. 758–763, June 2014
38. Rende, F.S., Irving, A.D., Lagudi, A., Bruno, F., Scalise, S., Cappa, P., Montefalcone, M., Bacci,
T., Penna, M., Trabucco, B., Di Mento, R., Cicero, A.M.: Pilot application of 3D underwater
imaging techniques for mapping Posidonia oceanica (L.) delile meadows. ISPRS - Int. Arch.
Photogramm. Remote. Sens. Spat. Inf. Sci., 177–181 (2015)
39. Bonin-Font, F., Burguera, A., Lisani, J.-L.: Visual discrimination and large area mapping of
posidonia oceanica using a lightweight AUV. IEEE Access 5, 24479–24494 (2017)
40. Gonzalez-Cid, Y., Burguera, A., Bonin-Font, F., Matamoros, A.: Machine learning and deep
learning strategies to identify posidonia meadows in underwater images. IEEE Oceans, 1–5
(2017)
41. Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning. arXiv:1603.07285,
Mar 2016
42. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440,
June 2015
43. Teichmann, M., Weber, M., Zoellner, M., Cipolla, R., Urtasun, R.: MultiNet: real-time joint
semantic reasoning for autonomous driving. arXiv:1612.07695, Dec 2016
44. Buja, A., Stuetzle, W., Shen, Y.: Loss functions for binary class probability estimation and clas-
sification: structure and applications. Technical report, Department of Statistics of University
of Pennsylvania, Jan 2005
45. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980, Dec 2014
46. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
47. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Li, F.-F.: Imagenet: a large-scale hierarchical
image database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–
255, June 2009
48. Taylor, L., Nitschke, G.: Improving deep learning using generic data augmentation.
arXiv:1708.06020, Aug 2017
49. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In:
Neural Networks: Tricks of the Trade (2012)
50. Provost, F., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparing induc-
tion algorithms. In: Proceedings of the Fifteenth International Conference on Machine Learn-
ing, Apr 2001
51. Powers, D.M.W.: Evaluation: from precision, recall and F-measure to ROC, informedness,
markedness and correlation. Int. J. Mach. Learn. Technol. 2, 37–63 (2011)
52. European Commission: Assessment of jellyfish socioeconomic impacts in the mediterranean:
implications for management, horizon 2020, May 2017
53. Condon, R.H., Graham, W.M., Duarte, C.M., Pitt, K.A., Lucas, C.H., Haddock, S.H.D., Suther-
land, K.R., Robinson, K.L., Dawson, M.N., Decker, M.B., Mills, C.E., Purcell, J.E., Malej, A.,
Mianzan, H., Uye, S.-I., Gelcich, S., Madin, L.P.: Questioning the rise of gelatinous zooplank-
ton in the world’s oceans. Bioscience 62(2), 160–169 (2012)
54. Richardson, A.J., Bakun, A., Hays, G.C., Gibbons, M.J.: The jellyfish joyride: causes, con-
sequences and management responses to a more gelatinous future. Trends Ecol. Evol. 24(6),
312–322 (2009)
55. Condon, R.H., Duarte, C.M., Pitt, K.A., Robinson, K.L., Lucas, C.H., Sutherland, K.R., Mian-
zan, H.W., Bogeberg, M., Purcell, J.E., Decker, M.B., Uye, S.-I., Madin, L.P., Brodeur, R.D.,
Haddock, S.H.D., Malej, A., Parry, G.D., Eriksen, E., Quinones, J., Acha, M., Harvey, M.,
Arthur, J.M., Graham, W.M.: Recurrent jellyfish blooms are a consequence of global oscilla-
tions. Proc. Natl. Acad. Sci. USA 110(3), 1000–1005 (2013)
56. United Nations: Factsheet: people and oceans, pp. 1–2. https://www.un.org/
sustainabledevelopment/wp-content/uploads/2017/05/Ocean-fact-sheet-package.pdf (2017)
57. Macrokanis, C., Hall, N., Mein, J.: Irukandji syndrome in Northern Western Australia: an
emerging health problem. Med. J. Aust. 181, 699–702 (2004)
58. Fenner, P.J., Lippmann, J., Gershwin, L.A.: Fatal and nonfatal severe jellyfish stings in Thai
waters. J. Travel. Med. 17(2), 133–138 (2010)
59. Purcell, J.E., Uye, S.-I., Lo, W.-T.: Anthropogenic causes of jellyfish blooms and
their direct consequences for humans: a review. Mar. Ecol. Prog. Ser. 350, 153–174 (2007)
60. Purcell, J.E., Baxter, E.J., Fuentes, V.L.: Jellyfish as products and problems of aquaculture. In:
Advances in Aquaculture Hatchery Technology, pp. 404–430. Elsevier (2013)
61. Merceron, M., Le Fevre-Lehoerff, G., Bizouarn, Y., Kempf, M.: Fish and jellyfish in Brittany
(France). Equinoxe 56, 6–8 (1995)
62. Lee, J.H., Choi, H.W., Chae, J., Kim, D.S., Lee, S.B.: Performance analysis of intake screens
in power plants on mass impingement of marine organisms. Ocean. Polar Res. 28, 385–393
(2006)
63. Matsumura, K., Kamiya, K., Yamashita, K., Hayashi, F., Watanabe, I., Murao, Y., Miyasaka, H.,
Kamimura, N., Nogami, M.: Genetic polymorphism of the adult medusae invading an electric
power station and wild polyps of Aurelia aurita in Wakasa Bay, Japan. J. Mar. Biol. Assoc. UK
85(3), 563–568 (2005)
64. Ferraris, M., Berline, L., Lombard, F., Guidi, L., Elineau, A., Mendoza-Vera, J.M., Lilley,
M.K.S., Taillandier, V., Gorsky, G.: Distribution of Pelagia noctiluca (Cnidaria, Scyphozoa) in
the Ligurian Sea (NW Mediterranean Sea). J. Plankton Res. 34(10), 874–885 (2012)
65. Barrado, C., Fuentes, J.A., Salamí, E., Royo, P., Olariaga, A.D., López, J., Fuentes, V.L.,
Gili, J.M., Pastor, E.: Jellyfish monitoring on coastlines using remote piloted aircraft. In: IOP
Conference Series: Earth and Environmental Science, vol. 17, p. 12195 (2014)
66. Kim, Donghoon, Shin, J.U., Kim, H., Kim, H., Lee, D., Lee, S.M., Myung, H.: Development
and experimental testing of an autonomous jellyfish detection and removal robot system. Int.
J. Control Autom. Syst. 14(1), 312–322 (2016)
67. Matsuura, F., Fujisawa, N., Ishikawa: Detection and removal of jellyfish using underwater
image analysis. J. Vis. 10(3), 259–260 (2007)
68. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual
connections on learning. In: AAAI Conference on Artificial Intelligence, Feb 2016
69. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015
70. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
71. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks.
In: Proceedings of the 30th International Conference on International Conference on Machine
Learning, vol. 28, pp. III–1310–III–1318 (2013)
72. Lin, T.-Y., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., Lawrence Zitnick, C.: Microsoft coco: common objects in context.
In: ECCV (2014)
73. Tzutalin, D.: Labelimg. https://github.com/tzutalin/labelImg (2018)
74. Zhu, M.: Recall, precision and average precision. Technical report, Department of Statistics
and Actuarial Science, University of Waterloo, Waterloo, Canada, Sept 2004
75. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual
object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
76. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical
image database. In: CVPR09 (2009)
77. Martin-Abadal, M.: Video: online posidonia oceanica segmentation. http://srv.uib.es/po-
identification/, May 2018
78. Martin-Abadal, M.: Video: jellyfish object detection. http://srv.uib.es/jellyfish-object-
detection/, Dec 2018
Deep Learning Case Study
on Imbalanced Training Data
for Automatic Bird Identification
Abstract Collisions between birds and wind turbines can be a significant problem in
wind farms. Practical deterrent methods are required to prevent these collisions. How-
ever, it is improbable that a single deterrent method would work for all bird species
in a given area. An automatic bird identification system is needed in order to develop
species-level deterrent methods. This system is the first and necessary part of an
entirety that will eventually be able to monitor bird movements, identify bird species, and
launch deterrent measures. The system consists of a radar system for detection of the
birds, a digital single-lens reflex camera with a telephoto lens for capturing images,
a motorized video head for steering the camera, and convolutional neural networks
trained on the images with a deep learning algorithm for image classification. We
utilized imbalanced data because the distribution of the captured images is naturally
imbalanced. We applied the distribution of the training data set to estimate the actual
distribution of the bird species in the test area. Species identification is based on an
image classifier that is a hybrid of hierarchical and cascade models. The main idea is
to train classifiers on groups of bird species in which the species resemble each
other more than any species outside the group in terms of morphology (coloration
and shape). The results of this study show that the developed image classifier model
has sufficient performance to identify bird species in the test area. The proposed system
produced very good results when the hybrid hierarchical model was applied to the
imbalanced data sets.
1 Introduction
Demand for automatic bird identification systems for wind farms has increased
recently. This kind of system is especially required for offshore wind farms. The
objective of this application is twofold: first, to detect two key bird species,
which are specifically required to be monitored under the environmental license, and sec-
ondly to classify the maximum number of other bird species while the first requirement still
holds. The two key species are the white-tailed eagle (Haliaeetus albicilla) and
the lesser black-backed gull (Larus fuscus fuscus). An automatic identification
system is in development that consists of a separate commercial radar system to
detect the birds, a digital single-lens reflex camera with telephoto lens for capturing
images, a motorized video head for steering the camera, and a convolutional neu-
ral network (CNN) trained on the images with a deep learning algorithm for image
classification. The conventional approach to this image classification problem is to
presume that equally distributed data are fed into the classifier. However, this is a
real-world application, in which it is difficult and time-consuming to collect a large
number of images for each class. Due to the nature of this application, it is conceiv-
able that imbalanced data are utilized because the distribution of the captured images
is naturally imbalanced, i.e., there are common and scarce bird species in the test
area. It is also possible to include scarcer classes into the classification process with
this approach. Researchers have proposed a class-imbalance aware loss function for
the problem of class imbalance. This loss function adds an extra class-imbalance
aware regularization term to the normal softmax loss [1]. However, we have applied
the distribution of the training data set to estimate the actual distribution of the bird
species in the test area. Training data set and test data set both share this distribution.
Species identification is based on an image classifier that is a hybrid of hierarchical
and cascaded models. The main idea is to train classifiers on groups of bird species in
which the species resemble each other more than any species outside the group
in terms of morphology. The first classifier is hierarchical, determining the group of
the test image, and the subsequent classifiers within the groups are in cascades. We
have also applied our data augmentation method, which rotates and converts the images
in accordance with the desired color temperatures. The hybrid hierarchical and cas-
cade model is compared to two single classifiers. One of the classifiers is trained on
a balanced data set and the other is trained on an imbalanced data set without grouping.
CNN has been successfully applied to image classification problems [2]. The
number of training examples in image classification is typically large. This may
cause problems when dealing with real-world applications, as the collection of a large
number of images is not always possible. As a result, some data augmentation is
usually needed [3, 4]. Cascade CNNs have been successfully applied to face detection
and road-sign classification systems [5, 6].
The remainder of this chapter is organized as follows. In Sect. 2 we present the sys-
tem and its components for collecting images automatically. In Sect. 3 related work
is discussed. We describe our data, its grouping, and our data augmentation algorithm
in Sect. 4. Classification algorithms, applied CNN models, and feature extraction
are described in Sect. 5. Results for the hybrid hierarchical and cascade CNN model,
trained on an imbalanced data set and compared to a conventional CNN model, are pre-
sented in Sect. 6. We then offer conclusions in Sect. 7.
2 The System
The proposed system consists of several hardware as well as software modules. See
Fig. 1 for an illustration. At first, there is the radar system, which is connected to a
local area network (LAN), and thus it is able to communicate with the servers, in
which the various programs are running.
Fig. 1 The hardware of the system and the principle of catching a flying bird in the frame area of
the camera
We use a radar system supplied by Robin Radar Systems B.V. because they provide
an avian radar system that is able to detect birds. They also have algorithms for
tracking a detected object over time (between the blips). The model we use is the
ROBIN 3D FLEX v1.6.3 and it is actually a combination of two radars and a software
package for implementation of various algorithms such as the tracker algorithms [7].
The role of the radar is to detect flying birds and pass the WGS84 coordinates
of the target bird to the video head control software. The system includes the PT-
1020 Medium Duty motorized video head from 2B Security Systems. The video
head is operated by the Pelco-D control protocol [8], and the control software for it was
developed by us. The system uses a Canon EOS 7D II camera with a 20.2-megapixel
sensor and the Canon EF 500/f4 IS lens. Correct focusing of the images relies on the
autofocus system of the lens and the camera. Automatic exposure is also applied. The
camera is controlled through the application programming interface (API) of the camera
manufacturer, and the software for controlling the camera has been developed by us.
In addition, the radar system provides parameters, which can be applied to increase
the performance of classifiers. These parameters are the distance in 3D of a target
(m), velocity of a target (m/s), and trajectory of a target (WGS84 coordinates). For
the details of the system hardware, see [9].
3 Related Work
targets as birds, and thereby separating them from other, non-biological targets. The
ability of these algorithms to correctly discriminate between bird species functional
groups was much weaker [11].
Time-lapse photography is a method in which the frame rate at which a sequence
of images is taken is lower than the frame rate used to view the sequence. Time-lapse images
can make subtle time-related processes distinct; the process that is analyzed can
be too fast or too slow for the human eye. Time-lapse images have been used to detect
birds around a wind farm. Image-based detection using cameras has been applied
to build a bird monitoring system. This system utilized an open-access time-lapse
image data set collected around the wind farm. The system applied the following algorithms:
AdaBoost, Haar-like features, histogram of oriented gradients (HOG), and CNN. AdaBoost
is a two-class classifier based on feature selection and weighted majority
voting. A strong classifier is made as a weighted sum of many weak classifiers, and
the resulting classifier is shallow but robust [12]. Haar-like is an image feature that
utilizes contrasts in images. It extracts the light and the shade of objects by using
black-and-white patterns [13]. HOG is a feature used for grasping the approximate
shape of objects. At first, it computes the spatial gradient of the image and makes
a histogram of the quantized direction of the gradient in each local region, called
a cell in the image. Subsequently, it concatenates the histograms of the cells in the
neighboring groups of the cells (the blocks) and normalizes them by dividing by their
Euclidean norms in each block [14]. The best method for detection was Haar-like,
and the best method for classification was CNN. The system was tested on only two
bird functional groups, hawks and crows, and it achieved only moderate performance
[15].
4 Data
Input data of this application consist of digital images. All images for training the
CNN have been taken manually at the test location in various weather conditions. The
location is the same where the camera will be installed for taking images automati-
cally. The collected image set was divided into two data sets: an original data set for
training classifiers, and a test set for measuring generalization of the classifiers, and
thus the classifiers will not see these test images during training. Both data sets are
divided into 14 classes. It became clear during image collection that there would be
a low number of images of the scarcest bird species, resulting in classes with very few
data examples. Therefore, in order to be able to classify the scarcest bird
species, all the collected images are included, accepting that the resulting
data set will be imbalanced. The distribution of the number of images for each class
is used as an estimate of the actual distribution of bird species in the test area. This is
justified by the fact that images were collected in all four seasons and at all hours dur-
ing daylight. The estimate is not necessarily reliable in terms of a bird species census,
because only the species that usually fly at approximately the same height as the wind
turbines are taken into account, but it is sufficient in the context of this application.
The total number of images in the original data set is 24631, and the number of images
in the test data set is 439. The test data set was created by randomly choosing images
from all classes. The number of images in the test data set follows the distribution
of the original data set, thus reflecting the actual distribution at the test site. Class
labels and the number of images in the original data set for each class are presented in
Table 1. In this table, three classes are not defined at species level: LNSP, SWSP,
and CATE. The first two cases are because there is no need to distinguish between
loon species or swan species any further in this context, regardless of the fact that two
common and two rare species of loons occur in the test area, and analogously two
common and one rare species of swans. The same applies to the third
case, the common/arctic tern. In addition, it is generally very difficult to tell the
difference between these two tern species [16], and thus the number of required data
examples (images) might be too large, considering the time needed to collect them.
The number of images for each class in the test data set is also given in Table 1.
No preprocessing, other than cropping, is applied to the images before feeding
them into the classifiers. The cropping is based on a segmentation, and it is motivated
by being able to dispose of most of the pixels representing only sky. The resolution
of the camera sensor, measured by the total number of pixels, and the focal length of
the lens are important qualities because of the long range at which images are to
be taken. The effective number of pixels (ENP) is defined as the number of pixels
representing a bird. The remaining pixels are considered noise; thus ENP
has a significant effect on the performance of the image classification model, as birds
will be very small (they consist of only a small number of pixels) in the images.
Fig. 2 Data example of the common goldeneye, the black-headed gull and the lesser black-backed
gull, respectively
ENP depends on the sensor resolution of the camera and the focal length of the lens,
and if the sensor resolution is fixed, ENP can be increased by choosing a long (in
terms of focal length) telephoto lens. An additional advantage is that there is no need
to feed classifiers with large (in terms of the number of pixels) images. For more
details about the segmentation, see [9]. Examples of the original data set images are
presented in Fig. 2. The first image in this figure illustrates that there can be more
than one bird in the image. There are species in the test area that have a habit of
flying in tight flocks, and in these cases, the result (in terms of data examples) is an
image of several birds. Moreover, there might be more than one bird species in
these flocks. The habit of flying in tight flocks is an important feature in terms of
identification for certain bird species [17]. As a result of the segmentation, an image
has only one bird left when there is a sparse flock of birds in the image. In the sparse
flock case, the bird closest to the center of the image is retained; thus, when a sparse
flock has more than one bird species, the retained bird species is chosen randomly.
In the tight flock case, the identification is based on the whole flock, and thus it is
biased toward the most numerous bird species in the flock.
Data augmentation is applied to the original data set. We have used our own method,
in which the images are converted into various color temperatures according to a step
size s. The lower and upper limits of the color temperature are 2000 K and 15000 K,
respectively. For example, if s = 50 (in K), each original image yields
(15000 − 2000)/50 + 1 = 261 color-converted versions; together with the original image,
this gives 262 × 24631 = 6453322 data examples in the augmented data set. For more details,
see [9]. In addition to the color conversion, the images are also rotated by a random
angle between −20° and 20° drawn from the uniform distribution. The motivation for this
is that a CNN is invariant to small translations, but not to image rotation [18]. The number of
images for each species (class) in the augmented data set when s = 50 and s = 200,
respectively, are presented in Table 2. Figure 3 presents data examples of the output
of the augmentation algorithm. The original image, fed into the data augmentation
algorithm, has a color temperature of 5800 K, and the two augmented images have
color temperatures of 3800 K and 7800 K, respectively.
Fig. 3 Data example of the common eider. The image on the left is an augmented image with a color
temperature of 3800 K. The original image is in the middle, with a color temperature of 5800 K. The
image on the right is an augmented image with a color temperature of 7800 K
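The exact color-temperature conversion is described in [9]; as a rough illustration of the idea, the sketch below shifts an image toward a target color temperature by scaling its RGB channels and applies a random rotation in the ±20° range. The channel gains and file names are illustrative assumptions, not the authors' method.

```python
import random
from PIL import Image

# Illustrative RGB gains for a few color temperatures (assumed values).
RGB_GAINS = {3800: (1.00, 0.84, 0.66), 5800: (1.00, 0.97, 0.94), 7800: (0.93, 0.95, 1.00)}

def augment(path, temperature_k):
    """Shift an image toward a given color temperature and rotate it slightly."""
    img = Image.open(path).convert("RGB")
    r_gain, g_gain, b_gain = RGB_GAINS[temperature_k]
    r, g, b = img.split()
    img = Image.merge("RGB", (
        r.point(lambda v: min(255, int(v * r_gain))),
        g.point(lambda v: min(255, int(v * g_gain))),
        b.point(lambda v: min(255, int(v * b_gain)))))
    # Random rotation between -20 and 20 degrees, as described in the text
    return img.rotate(random.uniform(-20, 20), expand=False)

augment("eider.jpg", 3800).save("eider_3800K.jpg")  # hypothetical file names
```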
We trained our first classifier models on the entire data set, which was divided into
the same number of classes as the data had species. However, some classes are more
easily separable than others (as assessed by the human eye), and that led to the idea of
grouping together those species that look similar to the human eye; thus our proposal
in this respect is hierarchical [19]. In this approach, the number of classes decreases
on the top level of the classifier hierarchy, resulting in better separability of
the data set. Classification inside the groups is dealt with by the subsequent levels in
cascades [20]. Figure 4 illustrates examples of clearly and weakly separable classes.
The white-tailed eagle and the mute swan are examples of clearly separable classes,
and the herring gull and the common gull are examples of weakly separable classes.
There are four groups on the top level of the classification hierarchy. Two of these
groups are actually single, clearly separable species: swans (treated as a single species
here) and the white-tailed eagle, respectively. Gulls-and-terns and waterfowl (including
loons and the common cormorant) form the other two groups. More groups
are defined below the top level in order to obtain species-level classification. The division
of the classes into the groups is given in Table 3. The number of images for each
group formed from the original data set and from the two augmented data sets with s =
50 and s = 200, respectively, is given in Table 4.
Fig. 4 From left to right: the white-tailed eagle and the mute swan are examples of clearly separable
classes. The herring gull and the common gull are examples of weakly separable classes
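A minimal sketch of how such a hierarchical and cascaded classification can be wired together is given below; the classifier objects, their predict interface, and the group names are illustrative assumptions rather than the actual implementation.

```python
def classify_bird(image, top_level_clf, group_classifiers):
    """Hybrid hierarchical/cascade classification (illustrative sketch).

    top_level_clf assigns one of the four top-level groups. For single-species
    groups (swans, white-tailed eagle) the group label is already the final
    answer; otherwise the group's cascade of classifiers refines the prediction.
    """
    group = top_level_clf.predict(image)          # e.g. "gulls-and-terns"
    if group not in group_classifiers:            # clearly separable single species
        return group
    species = group
    for clf in group_classifiers[group]:          # cascade of classifiers within the group
        species = clf.predict(image)
    return species
```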
5 Classification
All classifiers in this application share the same CNN model, which is shown in
Fig. 5. Only the number of neurons in the output layer changes according to the
number of classes. This model has three convolution layers, each of which is fol-
lowed by a rectified linear unit (ReLU) layer and the first two are followed by a
cross-channel normalization layer (Local Response Normalization, LRN). The use
of LRN is motivated by its ability to aid the generalization as its function may be
seen as brightness normalization [2]. There are two max-pooling layers, the first is
before the third convolution layer and the second is before the first fully connected
layer. There is no max-pooling layer before the second convolution layer. The reason
for this is the small ENP, and thus by omitting a max-pooling layer, all of the finest
edges detected by the first convolution layer are transferred to the second convolution
layer. The architecture is completed by three fully connected layers. The first two of
them are followed by dropout layers, and each dropout layer is followed by ReLUs.
The dropout was implemented by randomly setting the output neurons of the layer
to zero with a probability of 0.5. The architecture is finally terminated by a softmax
activation, which produces a distribution over the class labels with a cross-entropy
loss function [21]. The input image is normalized and zero-centered before feeding
it to the network. The CNN is trained in supervised mode with mini-batches, using
stochastic gradient descent with momentum [21–24]. The L2 regularization (weight
decay) method for reducing over-fitting is also applied [21, 24, 25]. We kept the
network size, in terms of free parameters, small due to the limited capacity of our
computing resources. This results in a total of 92 feature maps, which are extracted
by convolution layers with kernel sizes [12 × 12 × 3] × 12, [3 × 3 × 12] × 16 and
[3 × 3 × 16] × 64, respectively. The total number of weights is about 9.47 × 10^6.
Images of a size of 200 × 200 pixels are fed to each classifier. In the first convolution
layer, this image size produces square feature maps with a side of
(200 − 12 + 2 × 1)/2 + 1 = 96 neurons, i.e., there are 96 × 96 = 9216 neurons in each
feature map. The filter size, number of feature maps, feature map size in neurons, stride,
and padding for each convolution layer and max-pooling layer are given in Table 5.
For each filter, Fig. 5 displays the number of feature maps as the triplet [a, b, c].

Table 5 Parameters for the convolution layers and the max-pooling layers of the CNN model

Layer           Filter    # Feature maps   Feature map size   Stride   Padding
Convolution 1   12 × 12   12               96 × 96            [2 2]    [1 1]
Convolution 2   3 × 3     16               96 × 96            [1 1]    [1 1]
Max-pooling 1   2 × 2     16               48 × 48            [2 2]    [0 0]
Convolution 3   3 × 3     64               48 × 48            [1 1]    [1 1]
Max-pooling 2   2 × 2     64               24 × 24            [2 2]    [0 0]
We split the data set into a training set and a validation set as 70% and 30%, respec-
tively. We used manual tuning for choosing the number of epochs. Initial weights for
all layers are drawn from the Gaussian distribution with mean 0 and standard devia-
tion 0.01. Initial biases are set to zero. The L2 value is set to 0.0005 and mini-batch
size is set to 128.
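For reference, the sketch below assembles the layers of Table 5 in PyTorch. The convolution and pooling parameters follow the table, and the weight decay (0.0005) and mini-batch size (128) follow the values given above; the sizes of the fully connected layers, the learning rate, and the momentum value are assumptions, since they are not stated in the text.

```python
import torch
import torch.nn as nn

def make_bird_cnn(num_classes):
    """Sketch of the CNN of Fig. 5 / Table 5; fully connected sizes are assumed."""
    return nn.Sequential(
        nn.Conv2d(3, 12, kernel_size=12, stride=2, padding=1), nn.ReLU(),
        nn.LocalResponseNorm(size=5),                          # cross-channel LRN
        nn.Conv2d(12, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.LocalResponseNorm(size=5),
        nn.MaxPool2d(kernel_size=2, stride=2),                 # 96 -> 48
        nn.Conv2d(16, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),                 # 48 -> 24
        nn.Flatten(),
        nn.Linear(64 * 24 * 24, 256), nn.Dropout(0.5), nn.ReLU(),
        nn.Linear(256, 64), nn.Dropout(0.5), nn.ReLU(),
        nn.Linear(64, num_classes),                            # softmax applied inside the loss
    )

model = make_bird_cnn(num_classes=14)
out = model(torch.randn(2, 3, 200, 200))    # normalized, zero-centered input assumed
print(out.shape)                            # torch.Size([2, 14])

# SGD with momentum and weight decay, cross-entropy loss; lr and momentum are assumptions.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()
```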
The three convolution layers are designed to detect spatially distributed features from
the training images. Typical discriminative features are the shape and general coloration of the
bird. The ReLU (to introduce non-linearity) layer and the max-pooling (to increase
spatial invariance) layer after the second and third convolution layers, respectively,
may be seen as a refinement for the detected features due to the rectifying and down
242 J. Niemi and J. T. Tanttu
sampling properties of these layers. Figures 6 and 7 illustrate the features, extracted
by the CNN, for the classes LBBG and GBBG, respectively. These feature maps are
from the second convolution layer. There is one frame for each of the 16 feature maps in
the figure. The images are normalized so that the minimum weight is 0 and the
maximum is 1, i.e., the most negative weight has turned into zero (black). The mid-
gray color (0.5) shows those areas in the image that have the minimum contribution
to the features, and the most blackish or the most whitish areas denote maximum
contribution to the features. The plain gray, or almost so, feature maps indicate that
no significant features have been found in these maps. These feature maps show that
the CNN is capable of giving large weights to those areas of the bird plumage that are
relevant for species identification. These areas are mainly the wing tips, the feet, and
the bill for these two pairs of gull species.

Fig. 6 Visualization of the feature extraction by the CNN model for the class LBBG. There are 16
feature maps in the figure, extracted by the second convolution layer

Fig. 7 Visualization of the feature extraction by the CNN model for the class GBBG. There are 16
feature maps in the figure, extracted by the second convolution layer

As flying gulls usually have their feet concealed
by feathers, and their underside is not always visible in the images, the usage of this
feature is minor. This leaves us the bill and the wing tip, and because the differences
in the bill color and structure are only subtle, the most significant identification point
is the wing tip. The great black-backed gull and the lesser black-backed gull also
have a slight difference in the hue of their upper wing color, but this does not always
seem to result in larger weights produced by the CNN for those areas, at least not
large enough, because images of the great black-backed gull are even misclassified
as the herring gull. Yet, the upper wing color is the key feature to distinguish between
the gray-backed gulls and black-backed gulls [26].
244 J. Niemi and J. T. Tanttu
Table 6 Filter sizes for the second modified model of the original CNN model
Layer | Filter | # Feature maps | Feature map size | Stride | Padding
Convolution 1 | 12 × 12 | 12 | 96 × 96 | [2 2] | [1 1]
Convolution 2 | 7 × 7 | 16 | 90 × 90 | [1 1] | [0 0]
Convolution 3 | 5 × 5 | 32 | 86 × 86 | [1 1] | [0 0]
Convolution 4 | 4 × 4 | 64 | 40 × 40 | [1 1] | [0 0]
Convolution 5 | 3 × 3 | 128 | 18 × 18 | [1 1] | [0 0]
It became clear during the development of this algorithm that the challenge, in terms
of classification, lies in the group of gulls-and-terns-1, especially in the groups of
gray-backed gull and black-backed gull. Considering the CNN model, the first option
for a better performance should be a deeper model, i.e., more convolution layers. We
modified the original model by adding a fourth convolution layer, followed by
ReLU and max-pooling layers. This model had 128 filters with a filter size of [5 ×
5] in its fourth convolution layer. The first modified model was tested on the group
of black-backed gull, but it failed to increase the performance of the original CNN
model. Then we tested an even deeper model by adding a fifth convolution layer,
again followed by ReLU and max-pooling layers. In this case, the max-pooling layer
before the third convolution layer of the original model was removed in order to leave
a sufficient number of neurons at the output of the architecture. We also modified
the filter sizes of the second modified model; the modified filter sizes are given in
Table 6. The two new max-pooling layers at the end of the second modified model
each have a filter size of [2 2]. When this model was tested on the group of
black-backed gull, the result was the same as with the first modified model, i.e., it did not
achieve a better performance than the original CNN model in terms of true positive
rate (TPR). Both test classifiers were trained on the augmented set, with s = 50, of
only the images from the group of black-backed gull.
If we want to identify (classify) all the species that occur in the test area, we must
accept that the training data set will be imbalanced, because there will be low num-
bers of training examples of the scarcest species. However, there are methods that
can be used for imbalanced data sets. Naturally, the first option would be to collect
more data into the training data set, but this is not a very realistic option in our case.
Resampling is a method that is easy to implement, and fast to run. This means that
copies of data examples are added into the under-represented class, i.e., over-
sampling, or data examples are deleted from the over-represented class, i.e., under-
sampling [11]. However, we have augmented the original data set (resampling is
not used) with s = 50, and trained a reference classifier on the augmented data set.
The results, in terms of performance, of the hybrid model (hierarchical and cascade
model combined) trained on the grouped data set are compared with this reference
classifier. The grouped data set is also augmented with s = 50, and both data sets are
imbalanced. Class imbalance ratios (i.e., ratio of the number of images in a class to
the class with the largest number of images) of the original data set for 13 classes,
rounded to the nearest integer, are given in Table 7. The class with the largest number
of images is GRCO, and it is omitted from the table. It can be seen from the table
that there is severe imbalance between several classes and the class GRCO.
Another reference classifier is trained on a balanced data set. This data set is
created by the under-sampling method, so that the original data set is augmented with
s = 50, and then 236 × 262 = 61832 images are randomly chosen from each class,
except for the class VESC, from which all of the images are chosen, because this
class has the lowest number of data examples.
It is important to choose a suitable performance metric for classifiers trained
on imbalanced data sets. We have used the confusion matrix as a tool to compare the
classifiers. Precision (a measure for classifier exactness) and recall (a measure for
classifier completeness, a.k.a. TPR) are metrics that have been calculated from con-
fusion matrices. Receiver operating characteristic (ROC) curves and histograms of
predictions are the tools that have been applied to determine thresholds for various
classifiers trained on the grouped data set. Histograms present the predictions of a
classifier fed by a test data set, which the classifier has never seen before. Thus, a
histogram shows the distribution of the predictions of a classifier over a class by
presenting the number of predicted probabilities that fall into each bin. There are always
only two classes in the histograms: the positive class (in red) and the negative class
(in blue). If it is necessary to use histograms for more than two classes, then one
of the classes is treated as the positive class, and the other classes are combined to
form the negative class. For all histograms, the x-axis is probability and the y-axis is
the number of hits in each bin. The y-axis ranges from zero to the largest number of hits
in a single bin of the histogram. The number of bins is always set to 10, and thus
the bin width is 0.1.
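As a small illustration of how precision and recall (TPR) are read off a confusion matrix, the following sketch assumes rows are true classes and columns are predicted classes (the toy matrix is not one of the matrices reported later):

```python
import numpy as np

def per_class_metrics(confusion):
    """Precision and recall (TPR) per class from a square confusion matrix
    whose rows are true classes and columns are predicted classes."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    precision = tp / confusion.sum(axis=0)   # column sums: all predictions of a class
    recall = tp / confusion.sum(axis=1)      # row sums: all true members of a class
    return precision, recall

# Toy example with three classes.
cm = [[43, 0, 2],
      [1, 19, 0],
      [2, 1, 97]]
p, r = per_class_metrics(cm)
print(p.round(3), r.round(3))
```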
The result of a classification is a vector of probabilities

P = [p1, p2, . . ., pn],   (1)

where pi is the probability for class i as a result of the classification, and n is the number
of classes. Classes are alphabetically ordered by their class labels. Thresholds are
applied as follows:

ci = 1 if pi > thresholdi, and ci = 0 otherwise,   (2)

C = [c1, c2, . . ., cn],   (3)

where pi is as in the P-vector (1), and thresholdi is the threshold for class i. As a result
of Eqs. (2) and (3), there will be exactly one element, ci, turned to one in the C-vector,
and the rest of the elements are turned to zeros. The class label is found according
to the index of the element that is turned to one.
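A minimal sketch of Eqs. (1)–(3) in Python, assuming the threshold values other than those reported in the text are placeholders:

```python
import numpy as np

def threshold_predictions(p, thresholds):
    """Implements Eqs. (1)-(3): p is the probability vector P over the
    alphabetically ordered classes; thresholds holds one value per class."""
    p = np.asarray(p)
    c = (p > np.asarray(thresholds)).astype(int)   # Eq. (2) applied element-wise
    return c                                        # the C-vector of Eq. (3)

C = threshold_predictions([0.02, 0.97, 0.01], [0.5, 0.9643, 0.5])  # 0.5 values are placeholders
label_index = int(np.argmax(C))   # index of the element turned to one
```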
[Figure: hierarchical classification model. The primary (top-level) classifier outputs four groups (1. white-tailed eagle, 2. swans, 3. waterfowl-1, 4. gulls-and-terns-1). Cascaded sub-classifiers (e.g., Classifier 4.1.1 and Classifier 4.2.1) then separate the great cormorant from the other five waterfowl species, the common/herring gull from the rest of gulls-and-terns-2, and finally the greater/lesser black-backed gulls and the black-headed gull/terns.]
The top-level classifier is the most important in terms of TPR, because a possible
misclassification will propagate through the subsequent levels of the hierarchy. This classifier deals with the
groups: swans, waterfowl-1, white-tailed eagle, and gulls-and-terns-1. Class imbal-
ance ratios of the top-level groups, rounded to the nearest integer, are given in Table 9.
Considering the environmental license requirements, it is crucial to keep the number
of false negatives (FN, a data example from the positive class that is misclassified
as the negative class) for the group of white-tailed eagle and the group of gulls-and-
terns-1 as low as possible, preferably at zero. Figures 9 and 10 illustrate the choice
of possible threshold for the group of white-tailed eagle, and for the group of gulls-
and-terns-1, respectively. These histograms are formed from the predictions of the
top-level classifier (a.k.a. primary classifier), so that the histograms for the positive class
and the negative class are plotted in the same graph. The corresponding ROC
curves are computed based on the histograms, from which the TPRs and false posi-
tive rates (FPR) are calculated. ROC curves for the group white-tailed eagle and for
the group gulls-and-terns-1 are shown in Figs. 11 and 12, respectively. Both figures,
the histogram and the ROC curve, for the group of white-tailed eagle show that this
group is clearly separable, and thus it is easy to choose a suitable threshold for perfect
classification of the group. Generally, two values of probability can be read from the
histogram and used as a threshold: the lowest probability value of the positive class
(LPPC), and the highest probability value of the negative class (HPNC).
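The following sketch shows how the LPPC and HPNC could be extracted from a set of predicted probabilities; the toy numbers are chosen only to mimic the overlapping case described below and are not data from the study:

```python
import numpy as np

def lppc_hpnc(scores, is_positive):
    """Candidate thresholds read from the prediction histogram: the lowest
    probability of the positive class (LPPC) and the highest probability of
    the negative class (HPNC)."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    lppc = scores[is_positive].min()
    hpnc = scores[~is_positive].max()
    separable = lppc > hpnc          # no overlap -> any threshold in (hpnc, lppc) is perfect
    return lppc, hpnc, separable

lppc, hpnc, ok = lppc_hpnc([0.96, 0.92, 0.99, 0.30, 0.9643],
                           [True, True, True, False, False])
print(lppc, hpnc, ok)   # 0.92 0.9643 False -> the classes overlap
```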
In the case of the group of white-tailed eagle, these two probability values do not
overlap, and thus this class is clearly separable. The threshold can be set anywhere
between 0.8 and 0.9 in order to classify this class perfectly. All true positives (TP,
a data example from the positive class that is correctly classified) will be classified
correctly and there will be no false positives (FP, a data example from the negative
class that is misclassified as the positive class), nor FNs. As the group of white-
tailed eagle actually consists of only that single bird species, this also means that
this classifier is capable of classifying the white-tailed eagle in accordance with the
environmental license.
In the case of the group of gulls-and-terns-1, the LPPC and the HPNC overlap.
There are two data examples from the negative class with a probability
between 0.9 and 1, and all of the probabilities from the positive class fall into the same
bin. Probabilities inside the bins cannot be read from the histograms, but the plotting
software (MATLAB) also prints the exact values of the probabilities. The LPPC and
the HPNC for the group of gulls-and-terns-1 are 0.9000 and 0.9643, respectively. If
we choose 0.9 for the threshold, there will be two FPs, but if we choose 0.9643 for
the threshold, there will be no FPs, nor FNs. In this case the number of FNs is the
most important, because the lesser black-backed gull belongs to the group of gulls-
and-terns-1, and it is particularly taken into account in the environmental license,
so we cannot take the risk of misclassifying a gull at the top level of the hierarchy.
Therefore, we must choose 0.9643 for the threshold. Table 10 shows the applied
threshold for the top-level group. The threshold for the group of white-tailed eagle
is set to 0.7415, because this is the exact value printed by the plotting software. One
image of the great cormorant in the test data set is misclassified as the white-tailed
eagle, and this causes the number of FPs to be one for the white-tailed eagle and
the number of FNs to be one for the group waterfowl-1. This is an acceptable error
rate, because no white-tailed eagles are missed. Algorithm 1 describes the top-level
classification process. This algorithm also defines a new pseudo-class, which means
that this class does not exist in the data set, but it is used when the primary classifier
fails to classify a test image correctly. Thus, it enables a definition of an unidentified
bird (UNBI) class without explicitly including it in the real-world classes.
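A sketch of the top-level decision logic in the spirit of Algorithm 1 is given below; the group names, the callable sub-classifiers and the thresholds not stated in the text (the 0.5 values) are assumptions:

```python
# Thresholds for gulls-and-terns-1 (0.9643) and the white-tailed eagle (0.7415)
# follow the text; the remaining values are placeholders.
TOP_LEVEL = {
    "white-tailed-eagle": 0.7415,
    "swans": 0.5,
    "waterfowl-1": 0.5,
    "gulls-and-terns-1": 0.9643,
}

def classify_top_level(probabilities, sub_classifiers, image):
    """probabilities: dict group -> predicted probability for this image;
    sub_classifiers: dict group -> callable that classifies within the group."""
    fired = [g for g, thr in TOP_LEVEL.items() if probabilities[g] > thr]
    if len(fired) != 1:
        return "UNBI"                      # pseudo-class: unidentified bird
    group = fired[0]
    if group == "white-tailed-eagle":
        return group                       # single-species group, no further step
    return sub_classifiers[group](image)   # cascade into the next hierarchy level
```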
Waterfowl are classified in the second level of the hierarchy, so that two classifiers
are cascaded. The first one filters out the commonest class, GRCO, and all the other
waterfowl are classified in the second classifier. Thresholds for the first classifier are
given in Table 11. There is one FN, and accordingly, one FP as these thresholds are
applied. The misclassified class is GRCO, which is the only class in the group of
cormorants. Thresholds for the group of waterfowl-2 is given in Table 12. All other
classes have one FN, respectively, except for the class LOSP, which is clearly sepa-
rable. Algorithm 2 shows the classification process for both waterfowl groups. Two
new pseudo-classes are defined in the algorithm: unidentified waterfowl (UNWF),
and unidentified small waterfowl (UNSW).
Gulls and terns are classified in the second and third level of the hierarchy in cascade
classifiers. We have used a larger test data set with more images for the groups of
gulls-and-terns, which is enabled by the fact that the scarcest classes in the original
data set are not included in these groups. In this way, we gain more robust thresholds,
while the original distribution is retained, and the test data set still has only images
that the classifiers have never seen before. The number of images in these data sets
is given in Table 13. In this table, the pair of groups or classes is in the first column
from the left, so that the positive class is mentioned first. The following two columns
are the number of images of the positive class and the negative class, respectively.
We can calculate from the table that most of the pairs at the fourth level of the
hierarchy (the species level) are almost balanced. The pair {LBBG, GBBG} is the
only significant exception, with a class imbalance ratio of 1:3, and because this pair
also has the weakest separability, a poor classification result is expected in terms of TPR.
At the second level the commonest group, gray-backed-gulls is filtered out first
from the group of gulls-and-terns-2, and then subsequently the group blackheaded-
tern. Finally, the group black-backed-gulls is the only one left. Figure 13 shows the
histogram for the group of gray-backed-gulls. It becomes clear from the histogram
that the distributions of the positive class (gray-backed-gulls) and the negative class
(gulls-and-terns-2) overlap, and given this, we must choose a suitable
threshold while keeping in mind the terms of the environmental license. There are
two choices for the threshold: 0.6000 with one FP, and 0.7590 with two FNs. We must
choose 0.7590 even though it means weaker general performance for the hierarchical
classifier. This is because we do not want any member of the class LBBG misclassified
on the second level. As a result of applying this threshold, there will be two images of
the group gray-backed-gulls misclassified as gulls-and-terns-2.
Species level classification is reached on the fourth (third for gray-backed gulls)
level of the hierarchical classifiers. This includes pairs of classes with the weakest
separability: {HEGU, COGU} and {LBBG, GBBG}. The overlap of the distributions
of the positive and the negative classes is illustrated in Figs. 14 and 15. The
class HEGU is the positive class in Fig. 14, and the class COGU is the negative
class. The class LBBG is the positive class in Fig. 15, and the class GBBG is the
negative class. The best value for a threshold for separating the HEGU from COGU,
in terms of classifier performance, is 0.2134. This means zero FP, but eleven FNs,
i.e., eleven images of herring gulls will be misclassified as common gulls. Classi-
fication of the pair {BHGU, CATE} is straightforward owing to the fact that with
the chosen threshold it has the number of FNs and FPs equal to zero. See Table 14
for thresholds for the groups of gulls and terns. The best option for a threshold of
the class LBBG is 0.9993 when the number of FNs is 12. Algorithm 3 shows the
classification process for the group of gulls-and-terns-1. Five new pseudo-classes
are defined in this algorithm: gray-backed gull (GBGU, either the herring gull or the
common gull), black-headed (BHTE, either the black-headed gull or tern species),
black-backed gull (BBGU, either the lesser black-backed or the great black-backed
gull), non-gray-backed-gull (NGGU, either BHTE or BBGU), and unidentified gull
or tern (UNGU).
Table 14 Thresholds applied to the pairs of groups of gull and tern species
Class label | Threshold | # FNs | # FPs
{gray-backed-gulls, gulls-and-terns-2} | 0.7590 | 2 | 0
{blackheaded-tern, black-backed-gulls} | 0.2524 | 0 | 0
{HEGU, COGU} | 0.2134 | 11 | 0
{BHGU, CATE} | 0.8124 | 0 | 0
{LBBG, GBBG} | 0.9993 | 12 | 0
6 Results
The classifiers are compared in terms of generalization performance. The hybrid
of hierarchical and cascade models achieved an average TPR of 0.9460. The reference
classifiers have the following average TPRs: 0.8195 for the imbalanced reference
classifier (IMBRC) and 0.8307 for the balanced reference classifier (BRC). The total
number of misclassifications for the hybrid model was 16; for the reference classifiers
it was 45 for the IMBRC and 71 for the BRC. Average precision for the hybrid model
was 0.9619; average precision for the IMBRC was 0.8809 and for the BRC 0.7919.
TPRs for the top-level groups and the class LBBG are given in Table 15. The reference
classifiers were trained on ungrouped classes, therefore the group TPRs have been
obtained by averaging over the classes that each group consists of.
Confusion matrix for the top-level groups is given in Table 16. This confusion
matrix also includes the pseudo-class UNBI. Naturally, the number of TPs is zero
for pseudo-classes, because these classes are only defined for failures of the classifiers.
The confusion matrix for the classes is given in two parts, because it is too big to fit
on one page. Table 17 presents the first part of the confusion matrix, including the group
of waterfowl-1, i.e., the classes: LOSP, GRCO, COEI, COGO, VESC, and RBME.
This confusion matrix also includes the pseudo-classes: UNSW, and UNWF. One
test image of GRCO is presented in the top-level confusion matrix, therefore the
number of test images for the class GRCO is 99 in the waterfowl confusion matrix.
Table 18 presents the second part of the confusion matrix including the group gulls-
and-terns-1 (the classes: GBBG, HEGU, LBBG, COGU, BHGU, and CATE). The
five pseudo-classes defined in Algorithm 3 for gulls and terns are omitted in order to
save space, and because no image of any of the subgroups of the gulls-and-terns-1
was misclassified as any of these pseudo-classes.
Table 15 TPRs for the hybrid model and the reference classifiers
Classifier | WTEA | SWSP | Waterfowl-1 | Gulls-and-terns-1 | LBBG
Hybrid | 1 | 1 | 0.9935 | 1 | 0.9231
Imbalanced reference | 0.9773 | 0.4000 | 0.7629 | 0.7691 | 0.6923
Balanced reference | 1 | 0.8000 | 0.8621 | 0.7762 | 0.8846
As the number of images in the test data set is 439, we must split this number
between the three confusion matrices. The first confusion matrix is for all 439 images,
but because the classes WTEA and SWSP are only presented in this confusion matrix,
the sum of the number of test images in the other two confusion matrices is 439 –
50 = 389. The second confusion matrix presents results for 153 test images and the
third for 236 test images, so that the sum is 153 + 236 = 389 test images.
The confusion matrix for the IMBRC is given in two tables: Tables 19 and 20.
The class WTEA is included in both tables, because there are FPs and/or FNs
for it in the two tables. However, the number of TPs for the class WTEA is only
Table 19 Confusion matrix for the imbalanced reference classifier, part one
 | WTEA | LOSP | GRCO | COEI | COGO | VESC | RBME
WTEA | 43 | 0 | 0 | 0 | 0 | 0 | 0
LOSP | 0 | 6 | 0 | 0 | 1 | 0 | 0
GRCO | 2 | 0 | 97 | 0 | 1 | 0 | 0
COEI | 0 | 0 | 0 | 17 | 0 | 0 | 0
COGO | 1 | 0 | 0 | 1 | 19 | 0 | 0
VESC | 0 | 0 | 0 | 1 | 0 | 3 | 0
RBME | 0 | 0 | 0 | 0 | 0 | 0 | 5
given in the first table. The same test data set, as with the hybrid model, has been
used when the reference classifiers were tested. The total number of images, 439, is
again divided into two tables. The first table covers 197 test images and the second
table covers 242 test images.
Confusion matrix for the BRC is also given in two tables: Tables 21 and 22,
respectively. There are no FPs or FNs for the class of WTEA in the second confusion
matrix, so the class can be omitted from this table.
The results for the modified CNN models compared to the original CNN model
are given in Table 23. All three models were trained on the same augmented data
Table 20 Confusion matrix for the imbalanced reference classifier, part two
 | WTEA | SWSP | GBBG | HEGU | LBBG | COGU | BHGU | CATE
WTEA | – | 0 | 0 | 0 | 0 | 1 | 0 | 0
SWSP | 0 | 2 | 0 | 2 | 0 | 1 | 0 | 0
GBBG | 1 | 0 | 3 | 2 | 2 | 1 | 0 | 0
HEGU | 1 | 0 | 0 | 67 | 0 | 4 | 0 | 0
LBBG | 0 | 0 | 4 | 2 | 18 | 1 | 1 | 0
COGU | 0 | 1 | 0 | 8 | 0 | 56 | 0 | 0
BHGU | 0 | 0 | 0 | 0 | 1 | 0 | 30 | 1
CATE | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 27
Table 21 Confusion matrix for the balanced reference classifier, part one
 | WTEA | LOSP | GRCO | COEI | COGO | VESC | RBME
WTEA | 44 | 0 | 0 | 0 | 0 | 0 | 0
LOSP | 0 | 6 | 0 | 0 | 1 | 0 | 0
GRCO | 4 | 1 | 91 | 1 | 1 | 2 | 0
COEI | 0 | 0 | 0 | 16 | 0 | 1 | 0
COGO | 0 | 0 | 2 | 1 | 15 | 2 | 1
VESC | 0 | 0 | 0 | 1 | 0 | 3 | 0
RBME | 0 | 0 | 0 | 0 | 0 | 0 | 5
Table 22 Confusion matrix for the balanced reference classifier, part two
 | SWSP | GBBG | HEGU | LBBG | COGU | BHGU | CATE
SWSP | 4 | 0 | 0 | 0 | 1 | 0 | 0
GBBG | 0 | 5 | 0 | 4 | 0 | 0 | 0
HEGU | 1 | 0 | 48 | 4 | 18 | 0 | 1
LBBG | 0 | 2 | 0 | 23 | 0 | 1 | 0
COGU | 0 | 1 | 10 | 1 | 52 | 1 | 0
BHGU | 0 | 0 | 0 | 1 | 2 | 28 | 1
CATE | 0 | 0 | 1 | 0 | 3 | 0 | 28
set, which only consisted of the images from the group of black-backed gull. These
models were tested as single classifiers. TPRs for both training and generalization
(tested on images that the classifier has never seen before) are given in the table. The
first modified model had four convolution layers, and the second had five convolution
layers. The models were tested only on the group of black-backed gull in these tests.
7 Discussion
The tests showed that the hybrid model is significantly better, in terms of perfor-
mance, than the reference classifiers. The only problematic class, in terms of the
environmental license, is the LBBG. Even though it had zero FNs in the test of the
hybrid model, the number of FNs was 12 in the test of the gulls and terns only.
The number of test images in the latter test was larger, and this gives
insight into real-world implementation. The number of possible FPs is not signifi-
cant in this context, because it would just mean that other gull species, more likely
great black-backed gulls, are misclassified as LBBG. Therefore, it is advisable to
combine the classes LBBG and GBBG into a single class, i.e., not classify the group
black-backed-gulls any further.
The BRC performed better than the IMBRC, in terms of TPR. However, the
number of misclassifications is 71 for the BRC and 45 for the IMBRC. The difference
is explained by the better average precision of the IMBRC. Precision increases as the
number of FPs decreases, and TPR increases as the number of FNs decreases. This
means that TPR is a more significant metric than precision in our context, and thus
the BRC would be the second choice after the hybrid model. The IMBRC showed
poor performance even though it was trained on a larger data set (6.45 × 10^6 versus
8.66 × 10^5 images) than the BRC. This implies that straightforward use of a single
classifier on an imbalanced data set gives poor performance in terms of TPR. This
result is based on a relatively low number of data examples, which is often the case
in real-world applications, but this method could perform better when trained on a
significantly larger training data set. However, if precision is an important criterion, then this method
may be considered for real-world usage.
The top-level group has one FP in its confusion matrix (Table 16). This FP is a GRCO
misclassified as WTEA and is, of course, a FN for the class GRCO. However, this is
acceptable as no WTEA is misclassified, and thus the number of FNs for the class
WTEA is zero. The group waterfowl-1 also shows good results, with only five
misclassifications. It seems that grouping the original classes is a useful approach to
this kind of real-world classification problem. By grouping, the most difficult
classification problem can be confined to one group or even just to one subgroup.
This approach indicates where the challenge lies. In this context the challenges are
the groups of gray-backed-gulls and black-backed-gulls. The bird species that these
groups consist of are very similar in terms of morphology. This leads to the conclusion
(assessed by the human eye) that the overlap area around the classification boundary
is clearly wide for both groups.
References
1. Li, F., Li, S., Zhu, C., Lan, X., Chang, H.: Class-imbalance aware CNN extension for high
resolution aerial image based vehicle localization and categorization. In: 2017 2nd International
Conference on Image, Vision and Computing (ICIVC), Chengdu (2017)
2. Krizhevsky, A., Sutskever, I., Hinton, G. E.: Imagenet classification with deep convolutional
neural networks. Commun. ACM, 84–90 (2017)
3. Mao, R., Lin, Q., Allebach, J.: Robust convolutional neural network cascade for facial landmark
localization exploiting training data augmentation. In: Imaging and Multimedia Analytics in a
Web and Mobile World 2018, pp. 374-1-374-5(5) (2018)
4. Jia, S., Wang, P., Jia, P., Hu, S.: Research on data augmentation for image classification based
on convolution neural networks. In: 2017 Chinese Automation Congress (CAC), Jinan (2017)
5. Li, H., Lin, Z., Shen, X., Brandt, J., Hua, G.: A convolutional neural network cascade for face
detection. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Boston (2015)
6. Rachmadi, R., Uchimura, K., Koutagi, G., Komokata, Y.: Japan road sign classification using
cascade convolutional neural network. In: ITS (Intelligent Transport System) World Congress,
Tokyo, pp. 1–12 (2016)
7. Robin radar models. In: Robin Radar Systems B.V. (Accessed 2019). https://www.robinradar.
com/
8. pelco-D protocol. In: Bruxy REGNET. (Accessed 2019). http://bruxy.regnet.cz/programming/
rs485/pelco-d.pdf
9. Niemi, J., Tanttu, J.: Automatic bird identification for offshore wind farms. In: Bispo, R.,
Bernardino, J.C., Costa, J.L. (eds.), Wind Energy and Wildlife Impacts, Cham, pp. 135–151
(2019)
10. Mirzaei, G., Jamali, M., Ross, J., Gorsevski, P., Bingman, V.: Data fusion of acoustics, infrared,
and marine radar for avian study. IEEE Sensors J. 15(11) (2016)
11. Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing
machine learning training data. ACM SIGKDD Explor. Newsl., 20–29 (2004)
12. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an appli-
cation to boosting. Comput. Learn. Theory 904, 23–37 (1995)
13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition. CVPR 2001, Kauai, pp. 511–518 (2001)
14. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San
Diego, pp. 886–893 (2005)
15. Yoshihashi, R., Kawakami, R., Iida, M., Naemura, T.: Evaluation of bird detection using time-
lapse images around a wind farm. Wind Energy 20(12), 1983–1995 (2017)
16. Malling Olsen, K., Larsson, H.: Terns of Europe and North America. Helm, London (1995)
17. Madge, S., Burn, H.: Wildfowl, an Identification Guide to the Ducks, Geese and Swans of the
World. Helm, London (1988)
18. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture
for object recognition. In: International Conference on Computer Vision, Kyoto, pp. 2146–2153
(2009)
19. Silla, C., Freitas, A.: A survey of hierarchical classification across different application domains.
Data Min Knowl Disc 22(31) (2011)
20. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detec-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 3476–3483 (2013)
21. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
22. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recog-
nition. In: Proceedings of the IEEE 86, 11, New York, pp. 2278–2324 (1998)
23. Li, M., Zhang, T., Chen, Y., Smola, A. J.: Efficient mini-batch training for stochastic optimiza-
tion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge, New
York, pp. 661–670 (2014)
24. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge,
USA (2012)
25. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall/Pearson,
New York (1994)
26. Malling Olsen, K., Larsson, H.: Gulls of Europe, Asia and North America. Helm, London
(2003)
Deep Learning for Person
Re-identification in Surveillance Videos
Abstract In recent years, Closed Circuit Television (CCTV) has been viewed as the
basis for providing security. One of the most important aspects of the security mechanism
of CCTV surveillance systems is to re-identify a person captured by one camera
across different surveillance cameras. Re-identification has a major role in several
applications like automated surveillance of universities, offices, malls and homes, and
restricted environments like embassies or laboratories with strong security restrictions.
Traditionally, identifying a person in a video was practiced under the same set of
external conditions (like the same illumination, viewpoint, background conditions,
etc.). But when it comes to automated re-identification in a CCTV surveillance sys-
tem, several challenges emerge as the environment is uncontrolled and keeps varying;
further, the poses of the person and the angles of the cameras capturing the videos
incur additional challenges for the task considered. When a person disappears from
one camera view for a period of time, he should be recognized in another camera
view at a different location even when there are environmental disturbances like variation
in illumination, crowded scene, partial occlusions, physical appearance variations,
full occlusions, view point variations, background clutter, shadows and reflections,
etc. In this chapter, the major focus is on the techniques of deep learning used to
develop an end-to-end re-identification system highlighting the methods to handle
the uncontrolled environment challenges mentioned. An end-to-end re-identification
task consists of a sequence of steps, namely pedestrian detection and person tracking
followed by person re-identification. Given a video sequence or an image as input,
first the humans are detected from the video sequence in the pedestrian detection step.
Person tracking within a camera is then conducted to find the different poses of the
probe if needed. Finally, the re-identification process is conducted, where deep learning
models are used to re-identify the person with the help of a gallery set of videos,
evaluating the similarities between the gallery set and the person of interest using
deep learning metrics. The re-identification results end as a retrieval process where
all similar images of the person of interest are retrieved. Several benchmark datasets
considered in the literature for re-identification systems are VIPeR, ETHZ,
PRID, CAVIAR, CUHK01, CUHK02, CUHK03, i-LIDS, RAiD, MARS, etc.
1 Introduction
The most important aspect of any intelligent closed-circuit television (CCTV) surveil-
lance system is to accomplish the task of re-identification of humans which is popu-
larly termed as Person Re-identification [1, 2]. The objective of such system is to find
out whether a person showing up in one camera is coming again in another camera
i.e. to determine whether a pair of humans appearing in various cameras with non-
overlapping views [3] has the same identity or different identity. Engaging or hiring
human operators to track the person of interest would be highly time-consuming
process as they need to spend more time and most of the time it ends as an exhaus-
tive task for the operators. To overcome this situation, automated computer vision
system with less human intervention is more suitable to assist the human operators
in identifying a person over a set of non-overlapping or disjoint cameras. The advent
of conducting research work in this area is to increase demand of public safety with
the help of widespread large camera networks placed in and around the public places
like theme parks, shopping malls, universities, airports, etc. To achieve the above-
mentioned goals, it is very costly to depend entirely on human workers in order to
accurately recognize or track a person of interest across several cameras.
In the early days, person re-identification was considered a multi-camera tracking
problem where appearance-based models were used together with geometry calibration
across disjoint cameras in the surrounding environment. In the year 2005, the term
re-identification was coined by Zajdel et al. [4] from the University of Amsterdam,
where they tried to re-identify a person when he departed from the camera view and
appeared once again [4]. In the year 2006, Gheissari et al. [5] applied a spatio-temporal
segmentation algorithm and used visual cues of the persons as input for foreground
detection. The problem was solved as image-based Re-id rather than as a video-based
one. This was the first work to isolate person Re-id from multi-camera tracking.
Henceforth, the problem of Re-id has been considered an independent computer vision
task. Further, in the year 2010, there were two major works which proved that using
multiple frames per person would effectively improve over the single-frame version [6].
In the year 2014, Yi et al. [7] and Li et al. [8] successfully employed Siamese neural
networks for the person re-identification problem, where a pair of input images was
correctly determined to have the same identity using the network. This network helped
to overcome a major issue of the re-id problem, namely the lack of training samples.
In the same year, 2014, Xu et al. proposed an end-to-end image-based re-identification
model [9] where they combined detection and re-identification scores. The detection
was used to find the commonness and re-identification was performed to find the
uniqueness of the persons. The problem of person re-identification can be solved using
either of two kinds of systems, namely handcrafted systems and deep learning systems.
A re-identification system includes two components, namely pedestrian detection and
distance metrics. In handcrafted systems the features are extracted and passed on to the
re-identification system, whereas in deep learning systems feature learning is inherent
in the deep learning architecture and provides improved results compared to handcrafted
systems. This chapter focuses on the preliminaries of deep learning algorithms, followed
by re-identification datasets and the different architectures, activation functions, loss
functions and evaluation protocols used in re-identification applications.
This section briefly discusses the basic deep learning models used in computer vision
tasks. The models discussed are the Convolutional Neural Network, LeNet-5, AlexNet,
ZFNet, VGGNet, GoogLeNet, ResNet, the Recurrent Neural Network and the Siamese
Neural Network. All these networks are based on the CNN as a basic model, and they
vary in their architecture with respect to the number of hidden layers, activation
functions, loss functions and training mechanisms.
Convolutional Neural Network
In the domain of deep learning, most of the works on Convolutional Neural Networks
(CNN) were performed to analyze visual images [10]. The network limits the need for
a preprocessing step since it learns the features automatically using filters, and hence
avoids the manual feature design process. A convolutional neural network is composed
of an input layer, an output layer and multiple hidden layers. The hidden part further
consists of convolution layers, activation functions, pooling layers, fully connected
layers and normalization layers. The convolution layer applies the convolution operation
to the input and forwards the result to the subsequent layer, where each neuron handles
the data only for its receptive field. This avoids using a greater number of weights and
allows the network to be deeper with fewer parameters. The activation functions
commonly used in CNNs are ReLU, Tanh and the Softmax activation function.
The pooling layers aim to continuously decrease
and Softmax activation function. The pooling layers aim to continuously decrease
the spatial size of the features in order to help in reducing the number of parameters
and computations in the network. The pooling layer operates on each feature map
independently. The commonly used pooling operations are max pooling and average
pooling. The fully connected layer is to connect each neuron in one layer to every
neuron in another layer. The receptive field (input area of the neuron) of each layer
varies. Unlike a fully connected layer, a convolution layer does not take input from
every element of the prior layer; instead, it takes input from a restricted subarea of
the previous layer. Each neuron has weights and a bias, and these weight and bias
vectors form a filter representing some feature of the input. The main strength
of the CNN is that many neurons share the same filter, which eliminates the memory
cost of each receptive field keeping its own bias and weight vectors.
One distinguishing feature of the CNN is that its data form 3D volumes in terms
of width, height and depth. A second feature is that it has layers of different types,
connected locally or completely, which are stacked to build the CNN architecture.
The architecture guarantees that the trained filters respond to spatially local
input patterns, and as the layers get stacked up this leads to nonlinear
filters that become gradually more global. The third distinct feature is shared weights,
i.e., each filter being replicated across the layer. The basic CNN architecture is
given in Fig. 1.
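To illustrate the weight-sharing argument above, the following sketch compares the parameter count of a single convolution layer with that of a fully connected layer producing the same output volume (the 16 × 16 input size is illustrative only):

```python
import torch.nn as nn

# Weight sharing in a convolution vs. a fully connected layer on a 3 x 16 x 16 input.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(3 * 16 * 16, 16 * 16 * 16)   # same output volume, but no sharing

conv_params = sum(p.numel() for p in conv.parameters())   # 3*3*3*16 + 16 = 448
fc_params = sum(p.numel() for p in fc.parameters())        # 768*4096 + 4096 = 3,149,824
print(conv_params, fc_params)
```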
LeNet-5
LeNet-5 is a kind of convolutional network which was mainly intended for perform-
ing handwritten and machine-printed character recognition [11]. The network has a
total of 7 layers which includes two convolutional layers, two pooling layers, two
fully connected layers and one output layer. LeNet-5 uses 5 × 5 kernels of stride 1
and 2 × 2 subsampling of stride 2. It is considered as the base model for various
other successful deep CNN architectures. Figure 2 represents the LeNet-5 architecture:
Fig. 2a demonstrates the detailed architecture with the subsampling (max-pooling)
layers, which are not a major focus of representation in other architectures like AlexNet,
and the brief form is shown in Fig. 2b. In current architectural representations, max
pooling is used in place of the subsampling layer, and these layers also occur less
frequently than convolution layers.

Fig. 2 a Detailed architectural representation of LeNet-5 [12], b LeNet-5 brief representation [12]

LeNet-5 is very narrow by recent standards. The architecture retains the basic principles,
and the most commonly used activation function is the sigmoid. It accommodates RBF
units in the final layer, with the prototype of every unit relating to the input vector, and
the output produced is the squared Euclidean distance between them. In recent standards,
the practice of using RBF units is avoided and instead softmax units with a log-likelihood
loss on multinomial label outputs are used. The major applications of LeNet-5 are in
character recognition, and it is widely used to read the characters on bank cheques.
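A modernized LeNet-5 sketch in PyTorch, following the description above (two convolution layers with 5 × 5 kernels, 2 × 2 max pooling in place of subsampling, and a softmax output via cross-entropy instead of RBF units); the feature-map counts (6, 16) and fully connected widths (120, 84) are the standard LeNet-5 values, which the text does not spell out:

```python
import torch.nn as nn

# Assumes 1 x 32 x 32 grayscale inputs, as in the original character recognition task.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Sigmoid(),   # C1: 32x32 -> 28x28
    nn.MaxPool2d(2, stride=2),                      # S2: 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),  # C3: 14x14 -> 10x10
    nn.MaxPool2d(2, stride=2),                      # S4: 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),       # F5
    nn.Linear(120, 84), nn.Sigmoid(),               # F6
    nn.Linear(84, 10),                              # output layer (10 digit classes)
)
```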
AlexNet
AlexNet is an 8-layer CNN architecture that won the ImageNet challenge in 2012 and
brought widespread popularity to CNN architectures in the area of computer vision [13].
Figure 3 demonstrates the AlexNet architecture.
In Fig. 3, each convolution layer is followed by a ReLU activation function, which is
not explicitly shown, and the max-pooling layers, denoted MP, follow only a subset of
the convolution-ReLU combinations. The architecture is composed of 5 convolutional
layers and 3 fully connected layers. The first convolution layer comprises 96 filters of
size 11 × 11 at stride 4 and the second convolution layer consists of 256 filters of size
5 × 5. The third, fourth, and fifth convolution layers consist of 384, 384, and 256 filters,
respectively, of size 3 × 3 at stride 1. The first, second, and final FC layers consist of
4096, 4096, and 1000 neurons. The most significant characteristics of AlexNet are the
use of the non-linear ReLU activation function and heavy data augmentation. The role
of the ReLU activation function in increasing the training speed of a CNN was first
exhibited in the AlexNet architecture; this proved ReLU to be far better than saturating
activation functions like sigmoid or tanh. The hyper-parameters (batch size, SGD
momentum and learning rate) of AlexNet were set to 128, 0.9 and 0.01.
ZFNet
The ZFNet architecture, the winner of ILSVRC 2013, is very similar to AlexNet and is
shown in Fig. 4.
The key differences lie in a few hyper-parameter settings; the architectural changes
made were to the first layer's filter size and stride. The filter size of 11 × 11 was reduced
to 7 × 7 and a convolution stride of 2 was used instead of stride 4. In convolutional
layers 3, 4 and 5 the numbers of filters were changed from 384, 384, 256 to 512, 1024,
512, respectively.
VGGNet
VGGNet [15], presented at ILSVRC 2014, was the runner-up in the competition and is
composed of 16 weight layers (13 convolutional and 3 fully connected). The architecture
is interesting due to its uniform architectural style; the same is shown in Fig. 5. The
architecture avoids the complexity of the large filter sizes and strides used in AlexNet;
instead, throughout the entire network, VGGNet uses small 3 × 3 filters with stride 1.
The model was trained for two to three weeks using 4 GPUs. This architecture is
considered one of the best models for extracting features from images. It consistently
uses 3 × 3 as the filter size and 2 × 2 as the pooling size. The convolution is performed
with stride 1 and padding 1, and the pooling with stride 2. It was observed that the
spatial footprint of the output volume is always preserved when a 3 × 3 filter is applied
with a padding of 1, whereas the pooling process always compresses the spatial footprint.
The pooling is performed on non-overlapping spatial regions and always reduces the
spatial footprint by a factor of 2. This architecture is widely used as a source feature
extractor in various applications. The hyper-parameter setting of this architecture is
publicly available, but it is still considered a challenging architecture due to its 138
million parameters.
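A sketch of one VGG-style stage, reflecting the uniform 3 × 3 convolutions with stride 1 and padding 1 followed by 2 × 2 max pooling with stride 2 described above (channel widths are illustrative):

```python
import torch.nn as nn

def vgg_stage(in_channels, out_channels, num_convs):
    # Stacked 3 x 3 convolutions preserve the spatial size; the closing
    # 2 x 2 max pooling with stride 2 halves it.
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_channels if i == 0 else out_channels,
                             out_channels, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The first two stages of a VGG-16-like feature extractor (widths 64 and 128).
features = nn.Sequential(vgg_stage(3, 64, 2), vgg_stage(64, 128, 2))
```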
GoogLeNet
The winner of the ILSVRC 2014 competition was the GoogLeNet architecture [16],
which achieved a top-5 error rate of 6.67%. The architecture is inspired by the LeNet
architecture. The novel element included in GoogLeNet is the inception module.
Concepts like image distortions, batch normalization, and RMSprop are used in this
architecture [17]. The inception module is built from a number of very small convolutions
to drastically reduce the number of parameters used. A total of 22 deep layers are used
in the CNN, and the number of parameters is reduced from 60 million to 4 million. The
inception module is considered the central part of the architecture. Figure 6 depicts an
example of the inception module and the design of a good local network topology,
where the inception modules are stacked on top of each other.
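A simplified inception module with dimension reduction is sketched below; the branch widths are illustrative and not taken from the GoogLeNet paper:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Four parallel branches; 1 x 1 convolutions shrink the channel count before
    the more expensive 3 x 3 and 5 x 5 branches, and the branch outputs are
    concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```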
ResNet
ResNet, the winner of ILSVRC 2015 [18], trained a network with 152 layers and
proved to have lower complexity than VGGNet. The architecture is unique in its use of
"skip connections" and also features substantial batch normalization. It achieved a
top-5 error rate of 3.57%, which was considered superior to human-level predictions on
the ImageNet data set [19]. The basic unit of this architecture is the residual block,
which plays a major role: the whole network is developed by assembling many such
residual blocks (Fig. 7).
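A minimal residual block sketch showing the skip connection and batch normalization mentioned above:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3 x 3 convolutions with batch normalization; the block input is
    added back to the output through the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # the "skip connection"
```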
Recurrent Neural Network
Recurrent Neural Network (RNN) is commonly used along with CNN to employ
the concept of recurrence, which is basically using the information from a previous
forward pass of the neural network. Figure 8 depicts a simple RNN having a single,
self-connected hidden layer. RNNs are most applicable to applications whose input is a
sequence. Corresponding to the input sequence, an RNN produces either a sequence of
outputs or just one output for the entire input sequence. The key concept of the RNN
[20] lies in the recurrent connections, which allow the memory of the previous inputs to
be carried forward in the network's internal state, thus influencing the network's output.
There are several variants of the recurrence relationship.
In the first variant, the hidden state for an entity is computed using its corresponding
input entity and the previous hidden state. The output of the network is computed
using the previous hidden states. Activation functions like tanh are used for the
computation of the hidden state, and softmax functions are used to compute the output
of the network. In the second variant of the RNN, the hidden state for an entity in the
sequence is computed using its corresponding input entity and the previous output,
whereas in the first case the previous hidden state was used. In the case of an RNN
producing a single output, the hidden state is computed for each entity in the input
sequence and the output is computed using the last hidden state.
In another variant, named the Bidirectional RNN [22], the computation of the hidden
state considers the information of the previous entities along with the entities that lie
further ahead in the sequence, which is not the procedure followed in unidirectional
RNNs. Hence, Bidirectional RNNs [22] have both forward and backward hidden states.
The training of an RNN is generally performed by applying a simple unroll operation
on the RNN for a given input size and then training the
RNN by computing the gradients and using a stochastic gradient descent-like technique.
When the network is unrolled, each of the input state, hidden state (previous and next)
and output state corresponds to a shallow transformation, represented as a single layer
within a deep multilayer perceptron network.
To overcome the problem of vanishing gradients, a variant of the RNN termed Long
Short-Term Memory (LSTM) has been proposed. This architecture excels in learning
long-range sequences and avoids the long-term dependency problem [23, 24]. The key
novelty of the LSTM model is the use of a structure called the memory cell, which
consists of four key components, namely an input gate, a neuron with a self-recurrent
connection, a forget gate and an output gate. The input gate permits
the incoming signal to change the state of the memory cell or blocks it. The self-recurrent
connection is assigned a weight of 1.0, which guarantees that the state of the memory
cell remains stable from one time step to another. The gates in this model are used to
control the interactions between the memory cell and its environment. The forget gate
regulates the self-recurrent connection of the memory cell by allowing the cell to
recollect or forget its previous state. Finally, the output gate allows the state of the
memory cell to have an impact on other neurons or prevents it from doing so (Fig. 9).
Figure 10 depicts the LSTM memory block with a single cell. The most commonly
used gate activation function 'f' is the logistic sigmoid, and hence the activations are
bound to lie between 0 and 1, where 0 denotes that the gate is closed and 1 denotes that
the gate is open. 'tanh' or logistic sigmoid functions are generally used as the cell input
and output activation functions; however, in some cases the identity function is also
used. The dashed lines in the figure denote the weighted peephole connections, and the
remaining connections in the block are unweighted, meaning they have a fixed weight
of 1.0 [23]. The LSTM network is similar to a standard RNN; however, the summation
units in the hidden layer are replaced by the memory blocks.
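A short sketch of applying an LSTM to a sequence of per-frame feature vectors, as would be done in video-based re-identification; the feature and hidden sizes are illustrative:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
frames = torch.randn(4, 16, 128)          # batch of 4 sequences, 16 frames each
outputs, (h_n, c_n) = lstm(frames)        # outputs: (4, 16, 64); h_n: final hidden state
sequence_feature = outputs.mean(dim=1)    # temporal pooling into one vector per sequence
```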
Siamese Neural Networks
A Siamese Neural Network [25] comprises two or more identical sub-networks. The
word identical means that the sub-networks share the same architecture, the same
parameters and the same weights. Figure 11 shows a Siamese network having the same
weights shared between the networks. Based on the number of sub-networks used, the
architecture can be termed pairwise or triplet, and corresponding loss functions are
employed accordingly. This network is appropriate for the person re-identification
problem, as the output at the top of the network is a similarity score. The network also
addresses the data scarcity problem and achieves a good recognition rate.
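A minimal Siamese pair in PyTorch: one embedding network with shared weights is applied to both inputs and the Euclidean distance between the embeddings serves as the (dis)similarity score. The embedding network and input size are placeholders:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embedding_net):
        super().__init__()
        self.embed = embedding_net          # one network, reused for both inputs

    def forward(self, x1, x2):
        e1, e2 = self.embed(x1), self.embed(x2)
        return torch.norm(e1 - e2, p=2, dim=1)   # Euclidean distance per pair

embedding = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128), nn.ReLU(),
                          nn.Linear(128, 64))
net = SiameseNet(embedding)
dist = net(torch.randn(8, 3, 64, 32), torch.randn(8, 3, 64, 32))
```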
Activation Functions for Deep Learning Models
Activation functions are a crucial part of training deep neural networks. An activation
function makes the network more powerful, so that it can learn complex data and
represent non-linear functional mappings between inputs and outputs.
Sigmoid
The sigmoid activation function squashes its input into the range (0, 1) and is defined
in Eq. (1).

Y = 1 / (1 + e^(−x))   (1)
Vanishing gradient is a popular issue faced by sigmoid activation functions and this
issue is more severe in deep architectures. Moreover, sigmoid activation function is
not zero centered. Despite these issues, sigmoid functions are most widely in many
classification tasks.
Tanh
Hyperbolic Tangent (Tanh) activation function resolves the issue of zero centered in
sigmoid function. It ranges between −1 to +1. The activation function is defined in
Eq. 2.
Y = (e^x − e^(−x)) / (e^x + e^(−x))   (2)
ReLU
The Rectified Linear Unit (ReLU) activation function is defined in Eq. (3).

Y = max(0, x)   (3)

ReLU [13] is very simple and efficient and avoids the vanishing gradient problem; it is
widely used in very deep architectures. However, the ReLU activation function suffers
from the dying ReLU problem, where an excessive gradient flowing through a ReLU neuron
might affect the weight update in such a way that the neuron never gets activated on
any data point. It is limited to use only in hidden layers of deep architectures.
Leaky ReLU
Leaky ReLu is a kind of solution to overcome the problem of “dying ReLU problem”.
The function is designed in such a way that rather than the function being assigned
with zero when x < 0, a leaky ReLU will assign a slight negative slope. The same is
defined in Eq. 4.
Y = αx for x < 0, and Y = x for x ≥ 0   (4)
The α value in Leaky ReLU is 0.01. Though Leaky ReLU provides good results in a few
cases, it does not exhibit consistency at all times.
Parametric ReLU
Parametric ReLU (PReLU) adaptively learns the parameters of the rectifiers [18] and
improves accuracy at negligible extra computational cost. The difference between
parametric and leaky ReLU is that leaky ReLU uses a predetermined α value, whereas
parametric ReLU adaptively learns the parameter value from the neural network itself.
PReLU is defined as given in Eq. (5).

Y = αx for x < 0, and Y = x for x ≥ 0   (5)
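For reference, direct NumPy versions of Eqs. (1)–(5) (alpha plays the role of the Leaky ReLU slope, 0.01 in the text, or of the learned PReLU parameter):

```python
import numpy as np

def sigmoid(x):                                                          # Eq. (1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                                                             # Eq. (2)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):                                                             # Eq. (3)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):                                           # Eq. (4)
    return np.where(x < 0, alpha * x, x)

def prelu(x, alpha):                                                     # Eq. (5), alpha learned
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), leaky_relu(x))
```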
Maxout
The Maxout neuron [26] generalizes both ReLU and its leaky version. It has twice the
number of parameters. The activation function, defined in Eq. 6, takes the maximum of
two linear functions of the input:

Y = max(w1 · x + b1, w2 · x + b2)   (6)
Y = 1 / (1 + e^(−x))   (9)
Cosine Similarity loss function improves or maximizes the cosine value for
matching pairs and minimizes the cosine value for the negative pairs when the value
is less than the margin. The loss function is defined in Eq. (10):

ℓ(x1, x2, y) = max(0, cos(x1, x2) − m) if y = 1, and ℓ(x1, x2, y) = 1 − cos(x1, x2) if y = −1   (10)
The contrastive loss function [29] learns a mapping to a low-dimensional space in which
similar input vectors are mapped to nearby points and dissimilar ones to distant points.
The loss is computed as given in Eq. (11):

ℓ(x1, x2, y) = (1/2)(1 − y) Dist² + (1/2) y {max(0, m − Dist)}²   (11)

In Eq. (11), m is a margin parameter which is greater than zero and acts as a boundary.
The distance between the two feature vectors is computed as Dist = ||x1 − x2||2.
The average total loss over the pairwise loss functions given above is computed as per
Eq. (12):

loss(X1, X2, Y) = (1/n) Σ_{i=1}^{n} ℓ(xi1, xi2, yi)   (12)
The hinge loss function aims to reduce the squared hinge loss of a linear SVM, which is
the same as determining the maximum margin between the true person match and the
false person match during training. This hinge loss performs a convex approximation to
the 0–1 ranking error loss, which basically approximates the model's violation of the
ranking order specified in the triplet. The loss equation given in (14) has the margin
parameter g. It is a regularization parameter which regularizes the margin between the
distances of the two image pairs (imagei, imagei+) and (imagei, imagei−). Dist is based
on the Euclidean distance.
loss(imagei, imagei+, imagei−) = max(0, g + Dist(imagei, imagei+) − Dist(imagei, imagei−))   (14)
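A sketch of the triplet hinge loss of Eq. (14) in PyTorch; the margin value g = 0.3 is an assumption, not a value from the cited works:

```python
import torch
import torch.nn.functional as F

def triplet_hinge_loss(anchor, positive, negative, g=0.3):
    """Eq. (14): hinge on the margin g between the anchor-positive and
    anchor-negative Euclidean distances, averaged over the batch."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(g + d_pos - d_neg, min=0).mean()

a, p, n = (torch.randn(8, 64) for _ in range(3))
print(triplet_hinge_loss(a, p, n))
```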
Equation (15) is an improved triplet loss function, where N denotes the number of triplet
training examples and β is a weight factor to balance the inter-class and intra-class
constraints. The distance functions are based on the L2-norm distance:

loss(imagei, imagei+, imagei−, w) = (1/N) Σ_{i=1}^{N} [ max{dist_n(imagei, imagei+, imagei−, w), δ1} + β max{dist_p(imagei, imagei+, imagei−), δ2} ]   (15)
Cross entropy loss or Softmax loss: This loss function was proposed by McLaughlin
et al. [30] and the loss equation is defined in Eq. (16):

P(y = c | v) = exp(Wc v) / Σ_n exp(Wn v)   (16)
In Eq. 16, m denotes the Siamese margin and f¯ic , f¯jc are the temporally pooled
feature vectors for person i and j, respectively.
Binomial deviance loss function: Wu et al. [32] in 2018 proposed to use cosine
similarity and the binomial deviance loss function for training the neural network
model. The loss function used is given in Eq. (18):

loss = Σ_{i,j} W ⊙ ln(exp(−α(S − β) ⊙ M) + 1)   (18)

In Eq. (18), ⊙ is the element-wise multiplication operator, i and j index the training
samples, and S = [Si,j]n×n represents the similarity matrix for the image pairs, with n
the total number of training images and Si,j = cosine(xi, xj). α and β are the hyper-parameters.
The matrix M encodes the training supervision and is defined as

Mi,j = 1 for a matching pair, and Mi,j = −1 for a mismatching pair.
The image- and video-based person re-identification datasets used in the literature are
given in Table 1.
Different deep learning models used for person Re-Identification are given in Table 2.
The details provided are in terms of the architectural style used, activation functions,
loss functions and the corresponding re-identification datasets on which the archi-
tecture was employed.
Evaluation Metrics
The evaluation of the person re-identification models is generally performed using
the Cumulative Matching Characteristic (CMC) curve, Synthetic reacquisition rate,
and normalized Area under Curve (nAUC). CMC curves are used to evaluate the
person re-identification task as a ranking problem [102]. The curve generated is
based on the probability of identifying the correct match over the first k ranks. This
Table 1 Person Re-identification Datasets
Dataset Year # People # Cameras Crop image size # Images Image/Video Produced by/Detector
VIPeR [33] 2007 632 2 128 × 128 1264 Image Hand
ETHZ [34] 2007 146 Vary 4857 Video Hand
GRID [35] 2009 1025 8 Vary 1275 Image Hand
QMUL iLIDS [36] 2009 119 2 Vary 476 Image Hand
3DPeS [37] 2011 200 6 Vary 200,000 Video Hand
CAVIAR4REID [38] 2011 72 2 17 × 39, 72 × 144 1220 Video Hand
PRID [39] 2011 385 2 128 × 64 Image Hand
SAIVT-Softbio [40] 2012 152 8 Vary 64472 Video Hand
CUHK01 [41] 2012 971 2 160 × 60 3884 Image Hand
WARD [42] 2012 70 3 128 × 48 4786 Image Hand
CUHK02 [43] 2013 1816 10 160 × 60 7264 Image Hand
i-LIDS MCTS [44] 2014 Multiple Vary 479 Video Hand
CUHK03 [8] 2014 1360 6 Vary 13,164 Image DPM [46]/Hand
iLIDS-VID [45] 2014 300 2 Vary 42495 Video Hand
RAiD [47] 2014 43 4 128 × 64 6920 Image Hand
Airport [54] 2017 9651 6 128 × 64 39902 Video ACF
MSMT [55] 2018 4,101 15 Vary 126441 Video Faster RCNN
RPIfield [56] 2018 112 12 Vary 601581 Video ACF
Table 2 Deep learning architectures for Person Re-identification using pairwise models
References | Architecture | Activation function | Loss function | Datasets
Zhang et al. [57] | 8 layered Deep Convolutional Neural Network | Linear SVM (L2-SVM) | Margin-based square hinge loss | VIPeR, CAVIAR
Yi [7] | 5 layered Siamese deep neural network | ReLU | Fisher criterion and binomial deviance cost function | VIPeR, PRID
Li et al. [8] | 6 layered Filter pairing neural network (FPNN) | Softmax | Negative log-likelihood cost function | CUHK03, CUHK01, VIPeR, CUHK02
Ahmed et al. [58] | 8 layered Deep Convolutional Neural Network | Softmax | A stochastic approximation for average loss | CUHK03, CUHK01, VIPeR, CUHK02
Ding et al. [59] | 5 layered Deep Convolutional Neural Network | ReLU | The triplet-based loss function | iLIDS, VIPeR
Zhang et al. [60] | Bit-Scalable Deep Hashing Framework (10 layers) | tanh | The triplet-based loss function | MNIST, CIFAR-10, CIFAR-20, NUS-WIDE
Shi et al. [61] | Convolutional neural networks with Mahalanobis metric layers | ReLU | Mahalanobis distance | CUHK03, CUHK01, VIPeR
Iodice et al. [62] | Strict Pyramidal Deep CNN architecture | tanh | Cross entropy loss function | VIPeR
Cheng et al. [63] | Multi-channel parts-based CNN | ReLU | The triplet-based loss function | i-LIDS, VIPeR, PRID2011
measure can also be termed recall at k. The curve plots the probability that the correct match is
ranked at or below a particular rank against the size of the gallery set. The two key aspects of the
CMC curve are the rank-1 re-identification rate and the steepness of the curve: the steeper the curve,
the better the performance. The Synthetic Reacquisition Rate (SRR) curve is derived from the CMC
curve and measures the probability that any of the k best matches is correct. The normalized area
under the CMC curve (nAUC) summarizes the overall performance, i.e. how consistently the model ranks
a positive match above a negative match; the higher the nAUC value, the better the performance. The
main objective of all re-identification models is to improve the rank-1 recognition rate.
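As a rough sketch of how the CMC curve can be computed from a query-to-gallery distance matrix, the following NumPy function records, for each query, the rank at which the correct identity first appears; the toy distance matrix is illustrative, and the nAUC can be obtained as the mean of the returned curve.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=10):
    """Cumulative Matching Characteristic: for each query, sort the gallery by
    distance and record the rank at which the correct identity first appears;
    the curve is the fraction of queries matched within rank k (recall at k).
    Assumes every query identity is present in the gallery."""
    hits = np.zeros(max_rank)
    for i, q in enumerate(query_ids):
        order = np.argsort(dist[i])                          # closest gallery items first
        first_hit = np.where(gallery_ids[order] == q)[0][0]  # rank of first correct match
        if first_hit < max_rank:
            hits[first_hit:] += 1
    return hits / len(query_ids)

# Toy example: 2 queries against a gallery of 3 items
dist = np.array([[0.1, 0.5, 0.9],
                 [0.7, 0.2, 0.4]])
print(cmc_curve(dist, np.array([1, 2]), np.array([1, 3, 2]), max_rank=3))
```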
5 Experimental Setup
The Market-1501 dataset consists of 32,668 annotated bounding boxes of 1501 identities captured by
6 cameras, 5 of which are high resolution and 1 of which is low resolution. Each identity appears
in at least 2 cameras. It is one of the largest publicly available open-source re-identification
datasets.
The dataset employs the Deformable Part Model (DPM) to detect pedestrians in the images. For each
detected bounding box, a hand-drawn ground-truth bounding box is created and the intersection over
union (IoU) between the two is calculated. If the IoU value is greater than 50%, the detected
bounding box is marked as good; if it is over 20%, it is marked as a distractor; otherwise, it is
marked as junk.
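A minimal sketch of the IoU computation and the good/distractor/junk labelling rule described above might look as follows; the box coordinates are illustrative and the helper names are not part of the dataset tooling.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def label_detection(detected, ground_truth):
    """Labelling rule described above: >50% IoU -> good, >20% -> distractor, else junk."""
    score = iou(detected, ground_truth)
    if score > 0.5:
        return "good"
    if score > 0.2:
        return "distractor"
    return "junk"

print(label_detection((10, 10, 60, 110), (12, 8, 58, 105)))
```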
The setup contains a base, pretrained CNN as an embedding network to produce vector embeddings of
the images in an n-dimensional space. In our experiments, ResNet and Xception networks, both
pretrained on the ImageNet dataset, are used as the CNNs to extract these feature vectors (Fig. 12).
The current model focuses on embedding the images into an n-dimensional vector space, a process
essential to achieving re-identification. Once this embedding network is trained, it can be fed
validation images, which are mapped into the vector space such that the vector representations of
the same person start forming clusters. These clusters can then be extracted using clustering
algorithms such as K-Means in order to obtain a complete end-to-end re-identification system. This
experimental setup considers only the first half of the re-identification process, namely the
embedding of the images into the vector space.
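A minimal sketch of the second half of such a pipeline, assuming scikit-learn is available and that the number of identities is roughly known, could cluster the learned embeddings as follows; the random array stands in for the embeddings produced by the trained network.

```python
import numpy as np
from sklearn.cluster import KMeans

# One n-dimensional vector per detected person image, produced by the trained
# embedding network (random data stands in here for illustration).
embeddings = np.random.rand(200, 128)

# If the number of distinct identities is roughly known, K-Means can group
# vectors of the same person into one cluster.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)
cluster_of_each_image = kmeans.labels_   # cluster index assigned to every image
print(cluster_of_each_image[:10])
```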
The first experimental setup uses a 50-layer residual network as the embedding network. The images
are fed to the ResNet to obtain embeddings of the images. The obtained results are then passed to a
global average pooling layer to reduce them to a one-dimensional vector. Then, online hard mining is
carried out to mine the hardest triplet in each batch. These triplet vectors are used to calculate
the triplet loss as
$$loss = \max\left(0,\; d(a, p) - d(a, n) + margin\right)$$
where d(x, y) is the distance between the embeddings of x and y, a is the anchor image, p is the
positive image, n is the negative image and margin is a tunable hyperparameter. The network has a
total of 23,587,712 parameters, of which 23,534,592 are trainable and 53,120 are non-trainable.
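One common reading of "mining the hardest triplet in each batch" is the batch-hard scheme sketched below in NumPy; the margin value and the toy embeddings are illustrative, and a real training loop would operate on the network outputs and backpropagate through them.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Online hard mining within a batch: for every anchor, take the farthest
    positive and the closest negative, then apply the triplet loss
    max(0, d(a, p) - d(a, n) + margin)."""
    dist = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos = dist[i][same[i] & (np.arange(len(labels)) != i)]  # same identity, not self
        neg = dist[i][~same[i]]                                  # different identities
        if len(pos) and len(neg):
            losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))

emb = np.random.rand(8, 4)                 # 8 embeddings of dimension 4
ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # two images per identity
print(batch_hard_triplet_loss(emb, ids))
```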
The second experiment used an Xception network with modified depthwise separable convolutions as the
embedding network. The architecture has 36 convolutional layers which form the base for feature
extraction. The embeddings produced are passed on to a global average pooling layer and the resulting
vectors are mined for hard triplets, which are then used to calculate the triplet loss. The network
has a total of 20,861,480 parameters, of which 20,806,952 are trainable and 54,528 are non-trainable.
ResNet50 and Xception networks are among the best-performing networks on the ImageNet dataset and
hence are used as feature extractors to embed the image dataset into an n-dimensional space. We use
pretrained models with ImageNet weights for the embedding network.
Adadelta with the learning rate set to 1.0 is used as the optimizer, with parameters such as rho (the
decay factor) and the decay set to 0. A network trained using this method can be used to produce
image vectors that can then be passed through a clustering algorithm to achieve re-identification.
Fig. 13 Loss graph for architecture with Resnet50 for embedding network
The experiment was carried out on a Linux system with 16 GB RAM, a Core i7-8700K processor and an
NVIDIA Titan Xp graphics card with 12 GB of VRAM.
The obtained results are summarized in the form of graphs with the number of epochs on the x-axis
and the training and validation loss values on the y-axis.
In the first experiment, the features extracted from ResNet50, when used to train the model, were
unable to converge to a satisfactory degree after running for 300 epochs. The minimum validation loss
was obtained in the initial phases of training and was of magnitude 147.2. The loss then proceeded
to diverge despite using lower learning rates and other optimizers (Fig. 13).
Subsequently, in the second experiment, the features extracted using the Xception network, when used
in the architecture described before, were able to converge in 100 epochs to about 80.6 without
overfitting the training data (Fig. 14).
Comparing the two, we see that a network trained with an Xception network as the embedding network
performs better than one that uses ResNet50 as the embedding network.
Advantages of Deep Learning Models Towards Person Re-identification
1. Deep learning models attempt to learn high-level features in an incremental manner.
2. Automatic feature learning eliminates the need for a domain expert and for hand-crafted features
in person re-identification.
3. During both training and testing, deep learning algorithms generally work faster.
Fig. 14 Loss graph for architecture with Xception net for embedding network
Acknowledgements The authors thank VIT for providing ‘VIT SEED GRANT’ for carrying out
this research work. We gratefully acknowledge the support of NVIDIA Corporation with the dona-
tion of the Titan Xp GPU used for this research on person Re-identification.
References
1. Bedagkar-Gala, A., Shah, S.K.: A survey of approaches and trends in person re-identification.
Image Vis. Comput. 32(4), 270–286 (2014)
2. Zheng, L., Yang, Y., Hauptmann, A. G.: Person re-identification: Past, present and future
(2016). arXiv preprint arXiv:1610.02984
3. Saghafi, M.A., Hussain, A., Zaman, H.B., Saad, M.H.M.: Review of person re-identification
techniques. IET Comput. Vision 8(6), 455–474 (2014)
4. Zajdel, W., Zivkovic, Z., Krose, B.J.A.: Keeping track of humans: have I seen this per-
son before? In: Proceedings of the 2005 IEEE International Conference on Robotics and
Automation, 2005. ICRA 2005, pp. 2081–2086. IEEE (2005)
5. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotemporal
appearance. In: Null, pp. 1528–1535. IEEE (2006)
6. Bazzani, L., Cristani, M., Perina, A., Farenzena, M., Murino, V.: Multiple-shot person re-
identification by hpe signature. In: 20th International Conference on Pattern Recognition
(ICPR), 2010, pp. 1413–1416. IEEE (2010)
7. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Deep metric learning for person re-identification. In: Pattern
Recognition (ICPR), 2014 22nd International Conference on, pp. 34–39. IEEE (2014)
8. Li, W., Zhao, R., Xiao, T., Wang, X.: Deepreid: deep filter pairing neural network for person
re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 152–159 (2014)
9. Xu, Y., Ma, B., Huang, R., Lin, L.: Person search in a scene by jointly modeling people com-
monness and person uniqueness. In: Proceedings of the 22nd ACM International Conference
on Multimedia, pp. 937–940. ACM (2014)
10. LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision.
In: ISCAS Vol. 2010, pp. 253–256 (2010)
11. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
12. Charu, C.A.: Neural Networks and Deep Learning: A Textbook. Springer (2019)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
14. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European
Conference on Computer Vision, pp. 818–833. Springer, Cham (2014)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition (2014). arXiv preprint arXiv:1409.1556
16. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D.,… Rabinovich, A.: Going
deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1–9 (2015)
17. Dauphin, Y.N., de Vries, H., Chung, J., Bengio, Y.: RMSProp and equilibrated adaptive
learning rates for non-convex optimization. CoRR arXiv:1502.04390 (2015)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
19. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Berg, A.C.: Imagenet
large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
20. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
21. Graves, A.: Supervised sequence labelling. In: Supervised Sequence Labelling with Recurrent
Neural Networks, pp. 5–13. Springer, Berlin, Heidelberg (2012)
22. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal
Process. 45(11), 2673–2681 (1997)
23. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent
is difficult. IEEE Trans. Neural Netw. (1994)
24. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
25. Shanmugamani, R.: Deep Learning for Computer Vision: Expert techniques to train advanced
neural networks using TensorFlow and Keras. Packt Publishing Ltd (2018)
26. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks
(2013). arXiv preprint arXiv:1302.4389
27. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by
exponential linear units (elus) (2015). arXiv preprint arXiv:1511.07289
28. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by
symmetry-driven accumulation of local features. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2010, pp. 2360–2367. IEEE (2010)
29. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant map-
ping. In: Null, pp. 1735–1742. IEEE (2006)
30. McLaughlin, N., Martinez del Rincon, J., Miller, P.: Recurrent convolutional network for
video-based person re-identification. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1325–1334 (2016)
31. Chung, D., Tahboub, K., Delp, E.J.: A two stream siamese convolutional neural network for
person re-identification. In: The IEEE International Conference on Computer Vision (ICCV)
(2017)
32. Wu, L., Wang, Y., Li, X., Gao, J.: What-and-where to match: deep spatially multiplicative
integration networks for person re-identification. Pattern Recogn. 76, 727–738 (2018)
33. Gray, D., Tao, H.: Viewpoint invariant pedestrian recognition with an ensemble of local-
ized features. In: European Conference on Computer Vision, pp. 262–275. Springer, Berlin,
Heidelberg (2008)
34. Ess, A., Leibe, B., Van Gool, L.: Depth and appearance for mobile scene analysis. In: 11th
International Conference on Computer Vision, 2007. ICCV 2007. IEEE, pp. 1–8. IEEE (2007)
35. Loy, C.C., Xiang, T., Gong, S.: Time-delayed correlation analysis for multi-camera activity
understanding. Int. J. Comput. Vision 90(1), 106–129 (2010)
36. Zheng, W., Gong, S., Xiang., T.: Associating groups of people. In: BMVC (2009)
37. Baltieri, D., Vezzani, R., Cucchiara, R.: 3dpes: 3d people dataset for surveillance and foren-
sics. In: Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior
Understanding, pp. 59–64. ACM (2011)
38. Cheng, D.S., Cristani, M., Stoppa, M., Bazzani, L., Murino, V.: Custom pictorial structures
for re-identification. In: Bmvc, Vol. 1, No. 2, p. 6 (2011)
39. Hirzer, M., Beleznai, C., Roth, P. M., Bischof, H.: Person re-identification by descriptive and
discriminative classification. In: Scandinavian Conference on Image Analysis, pp. 91–102.
Springer, Berlin, Heidelberg (2011)
40. Bialkowski, A., Denman, S., Sridharan, S., Fookes, C., Lucey, P.: A database for person re-
identification in multi-camera surveillance networks. In: 2012 International Conference on
Digital Image Computing Techniques and Applications (DICTA), pp. 1–8. IEEE (2012)
41. Li, W., Zhao, R., Wang, X.: Human reidentification with transferred metric learning. In: Asian
Conference on Computer Vision, pp. 31–44. Springer, Berlin, Heidelberg (2012)
42. Martinel, N., Micheloni, C.: Re-identify people in wide area camera network. In: IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW),
2012, pp. 31–36. IEEE (2012)
43. Li, W., Wang, X.: Locally aligned feature transforms across views. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 3594–3601 (2013)
44. Wang, T., Gong, S., Zhu, X., Wang, S.: Person re-identification by video ranking. In: European
Conference on Computer Vision, pp. 688–703. Springer, Cham (2014)
45. Branch, H.O.S.D.: Imagery library for intelligent detection systems (i-lids). In: The Institution
of Engineering and Technology Conference on Crime and Security, pp. 445–448 (2006)
46. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9),
1627–1645 (2010)
47. Das, A., Chakraborty, A., Roy-Chowdhury, A.K.: Consistent re-identification in a camera
network. In: European Conference on Computer Vision, pp. 330–345. Springer, Cham (2014)
48. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification:
a benchmark. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 1116–1124 (2015)
49. Ma, L., Liu, H., Hu, L., Wang, C., Sun, Q.: Orientation driven bag of appearances for person
re-identification (2016). arXiv preprint arXiv:1605.02464
50. Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., Tian, Q.: Mars: a video bench-
mark for large-scale person re-identification. In: European Conference on Computer Vision,
pp. 868–884. Springer, Cham (2016)
51. Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., Tian, Q.: Person re-identification in
the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 1367–1376(2017)
52. Xiao, T., Li, S., Wang, B., Lin, L., Wang, X.: End-to-end deep learning for person search
(2016). arXiv preprint arXiv:1604.01850, 1(2)
53. Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data
set for multi-target, multi-camera tracking. In: European Conference on Computer Vision,
pp. 17–35. Springer, Cham (2016)
54. Camps, O., Gou, M., Hebble, T., Karanam, S., Lehmann, O., Li, Y.,… Xiong, F.: From the lab
to the real world: Re-identification in an airport camera network. IEEE Trans. Circuits Syst.
Video Technol. 27(3), 540–553 (2017)
55. Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer gan to bridge domain gap for person
re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 79–88 (2018)
56. Zheng, M., Karanam, S., Radke, R.J.: RPIfield: a new dataset for temporally evaluating person
re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pp. 1893–1895 (2018)
57. Zhang, G., Kato, J., Wang, Y., Mase, K.: People re-identification using deep convolutional
neural network. In: Computer Vision Theory and Applications (VISAPP), 2014 International
Conference on Vol. 3, pp. 216–223. IEEE (2014)
58. Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person re-
identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 3908–3916 (2015)
59. Ding, S., Lin, L., Wang, G., Chao, H.: Deep feature learning with relative distance comparison
for person re-identification. Pattern Recogn. 48(10), 2993–3003 (2015)
60. Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized
similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process.
24(12), 4766–4779 (2015)
61. Shi, H., Zhu, X., Liao, S., Lei, Z., Yang, Y., Li, S.Z.: Constrained deep metric learning for
person re-identification (2015). arXiv preprint arXiv:1511.07545
62. Iodice, S., Petrosino, A., Ullah, I.: Strict pyramidal deep architectures for person re-
identification. In: International Workshop on Neural Networks, pp. 179–186. Springer, Cham
(2015)
63. Cheng, D., Gong, Y., Zhou, S., Wang, J., Zheng, N.: Person re-identification by multi-channel
parts-based cnn with improved triplet loss function. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1335–1344 (2016)
64. Chen, S.Z., Guo, C.C., Lai, J.H.: Deep ranking for person re-identification via joint represen-
tation learning. IEEE Trans. Image Process. 25(5), 2353–2367 (2016)
65. Wu, S., Chen, Y.C., Li, X., Wu, A.C., You, J.J., Zheng, W.S.: An enhanced deep feature
representation for person re-identification. In: Applications of Computer Vision (WACV),
2016 IEEE Winter Conference on, pp. 1–8. IEEE (2016)
66. Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain
guided dropout for person re-identification. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 1249–1258 (2016)
67. Wu, L., Shen, C., Hengel, A.V.D.: Personnet: Person re-identification with deep convolutional
neural networks (2016). arXiv preprint arXiv:1601.07255
68. Li, S., Liu, X., Liu, W., Ma, H., Zhang, H.: A discriminative null space based deep learning
approach for person re-identification. In: 4th International Conference on Cloud Computing
and Intelligence Systems (CCIS), 2016, pp. 480–484. IEEE (2016)
69. Shi, H., Yang, Y., Zhu, X., Liao, S., Lei, Z., Zheng, W., Li, S.Z.: Embedding deep metric
for person re-identification: A study against large variations. In: European Conference on
Computer Vision, pp. 732–748. Springer, Cham (2016)
70. Varior, R.R., Shuai, B., Lu, J., Xu, D., Wang, G.: A siamese long short-term memory architec-
ture for human re-identification. In: European Conference on Computer Vision, pp. 135–153.
Springer, Cham (2016)
71. Wang, F., Zuo, W., Lin, L., Zhang, D., Zhang, L.: Joint learning of single-image and cross-
image representations for person re-identification. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1288–1296 (2016)
72. Franco, A., Oliveira, L.: A coarse-to-fine deep learning for person re-identification. In: IEEE
Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–7. IEEE (2016)
73. McLaughlin, N., del Rincon, J.M., Miller, P.C.: Person reidentification using deep convnets
with multitask learning. IEEE Trans. Circuits Syst. Video Techn. 27(3), 525–539 (2017)
74. Liu, J., Zha, Z.J., Tian, Q.I., Liu, D., Yao, T., Ling, Q., Mei, T.: Multi-scale triplet cnn for person
re-identification. In: Proceedings of the 2016 ACM on Multimedia Conference, pp. 192–196.
ACM (2016)
75. Wang, J., Wang, Z., Gao, C., Sang, N., Huang, R.: DeepList: learning deep features with
adaptive Listwise constraint for person reidentification. IEEE Trans. Circuits Syst. Video
Techn. 27(3), 513–524 (2017)
76. Liu, H., Feng, J., Qi, M., Jiang, J., Yan, S.: End-to-end comparative attention networks for
person re-identification. IEEE Trans. Image Process. 26(7), 3492–3506 (2017)
77. Wu, L., Shen, C., van den Hengel, A.: Deep linear discriminant analysis on fisher networks:
A hybrid architecture for person re-identification. Pattern Recogn. 65, 238–250 (2017)
78. Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Pose-driven deep convolutional model
for person re-identification. In: 2017 IEEE International Conference on Computer Vision
(ICCV), pp. 3980–3989. IEEE (2017)
79. Franco, A., Oliveira, L.: Convolutional covariance features: Conception, integration and per-
formance in person re-identification. Pattern Recogn. 61, 593–609 (2017)
80. Qian, X., Fu, Y., Jiang, Y.G., Xiang, T., Xue, X.: Multi-scale deep learning architectures for
person re-identification. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 5399–5408 (2017)
81. Zhu, J., Zeng, H., Liao, S., Lei, Z., Cai, C., Zheng, L.: Deep hybrid similarity learning for
person re-identification. IEEE Trans. Circuits Syst. Video Technol. 28(11), 3183–3193 (2018)
82. Cheng, D., Gong, Y., Chang, X., Shi, W., Hauptmann, A., Zheng, N.: Deep Feature Learning
via Structured Graph Laplacian Embedding for Person Re-Identification. Pattern Recogn.
(2018)
83. Mao, C., Li, Y., Zhang, Z., Zhang, Y., Li, X.: Pyramid Person Matching Network for Person
Re-identification(2018). arXiv preprint arXiv:1803.02547
84. Li, D., Chen, X., Zhang, Z., Huang, K.: Learning deep context-aware features over body and
latent parts for person re-identification. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 384–393 (2017)
85. Lin, J., Ren, L., Lu, J., Feng, J., Zhou, J.: Consistent-aware deep learning for person re-
identification in a camera network. In: The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Vol. 6 (2017)
86. Bai, X., Yang, M., Huang, T., Dou, Z., Yu, R., Xu, Y.: Deep-Person: Learning Discriminative
Deep Features for Person Re-Identification (2017). arXiv preprint arXiv:1711.10658
87. Chang, Y. S., Wang, M.Y., He, L., Lu, W., Su, H., Gao, N., Yang, X.A.: Joint deep semantic
embedding and metric learning for person re-identification. Pattern Recogn. Lett. (2018)
88. Chen, Y., Duffner, S., Stoian, A., Dufour, J.Y., Baskurt, A.: Deep and low-level feature-based
attribute learning for person re-identification. Image Vis. Comput. 79, 25–34 (2018)
89. Tao, D., Guo, Y., Yu, B., Pang, J., Yu, Z.: Deep multi-view feature learning for person re-
identification. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2657–2666 (2018)
90. Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X.: Structured deep hashing with convolutional neural
networks for fast person re-identification. Comput. Vis. Image Underst. 167, 63–73 (2018)
91. Su, C., Zhang, S., Xing, J., Gao, W., Tian, Q.: Multi-type attributes driven multi-camera person
re-identification. Pattern Recogn. 75, 77–89 (2018)
92. Wang, J., Zhou, S., Wang, J., Hou, Q.: Deep ranking model by large adaptive margin learning
for person re-identification. Pattern Recogn. 74, 241–252 (2018)
93. Wu, D., Zheng, S.J., Yuan, C.A., Huang, D.S.: A deep model with combined losses for person
re-identification. Cogn. Syst. Res. 54, 74–82 (2019)
94. Zhang, Z., Si, T., Liu, S.: Integration convolutional neural network for person re-identification
in camera networks. IEEE Access 6, 36887–36896 (2018)
95. Liu, Y., Song, N., Han, Y.: Multi-cue fusion: Discriminative enhancing for person re-
identification. J. Vis. Commun. Image Represent. 58, 46–52 (2019)
96. Wang, F., Zhang, C., Chen, S., Ying, G., Lv, J.: Engineering Hand-designed and Deeply-
learned features for person Re-identification. Pattern Recogn. Lett. (2018)
97. Yuan, C., Guo, J., Feng, P., Zhao, Z., Xu, C., Wang, T., … & Duan, K.: A jointly learned deep
embedding for person re-identification. Neurocomputing 330, 127–137 (2019)
98. Xin, X., Wang, J., Xie, R., Zhou, S., Huang, W., Zheng, N.: Semi-supervised person Re-
Identification using multi-view clustering. Pattern Recogn. 88, 285–297 (2019)
99. Wu, D., Zheng, S.J., Bao, W.Z., Zhang, X.P., Yuan, C.A., Huang, D.S.: A novel deep model
with multi-loss and efficient training for person re-identification. Neurocomputing 324, 69–75
(2019)
100. Zhou, S., Ke, M., Luo, P.: Multi-camera transfer GAN for person re-identification. J. Vis.
Commun. Image Represent. (2019)
101. Zhong, W., Jiang, L., Zhang, T., Ji, J., Xiong, H.: Combining multilevel feature extraction and
multi-loss learning for person re-identification. Neurocomputing (2019)
102. Fumera, B.L.G., Roli, F.: Multi-stage ranking approach for fast person re-identification. IET
Comput. Vis. 12(4), 513–519 (2018)
Deep Learning in Gait Analysis
for Security and Healthcare
O. Costilla-Reyes (B)
Brain and Cognitive Sciences, Massachusetts Institute of Technology,
77 Massachusetts Ave, Cambridge, MA 02139, USA
e-mail: [email protected]; [email protected]
R. Vera-Rodriguez
Biometrics and Data Pattern Analytics (BiDA) Lab - ATVS,
Universidad Autonoma de Madrid, Avda. Francisco Toms y Valiente, 11,
28049 Madrid, Spain
e-mail: [email protected]
A. S. Alharthi · S. U. Yunas · K. B. Ozanyan
School of Electrical and Electronic Engineering, The University of Manchester,
Manchester M13 9PL, UK
e-mail: [email protected]
S. U. Yunas
e-mail: [email protected]
K. B. Ozanyan
e-mail: [email protected]
1 Introduction
ranging from airport entry checkpoints and entry to buildings to home-based security
systems. Feature engineering has been central in automatic gait recognition research
[11]. The procedure involves the careful selection and design of complex and time-
consuming hand-crafted features from footstep data, employing geometric, holistic,
spectral and wavelet feature engineering approaches to name some [12].
For both themes, the research effort of this book chapter focused on designing
machine learning models based on Convolutional Neural Networks (CNN), a form
of deep learning [13], to allow the automatic extraction of features from the raw
spatio-temporal gait and footstep data.
The ImageNet Large Scale Visual Recognition Challenge [14] is one of the largest computer vision
competitions in the world. The challenge objective is to classify images into one of 1000 possible
labels such as "car", "plane", etc. The dataset contains around 1 million images. The breakthrough
of deep learning in modern times came from using this massive dataset for image classification with
convolutional neural networks. The best results of the challenge in recent years (from 2014 onwards)
have used convolutional neural network techniques at their core [14].
Gait analysis has been widely studied for a variety of applications including healthcare, biometrics,
sports, and many more [1]. Gait arises from the cooperation of several parts of the human body
including the brain, spinal cord, nerves, muscles, bones, and joints [1].
The classification of a person's emotional state from gait has also been explored in the literature.
A person's pride, happiness, neutral emotional state, fear, and anger have been classified with high
statistical confidence given only their gait pattern [15]. Generally, three types of gait monitoring
systems exist, namely cameras using image processing, floor sensors and wearable sensors [11]. The
use of cameras for gait is vulnerable to environmental conditions such as lighting. Besides that, the
use of cameras invades privacy in living environments, e.g. in healthcare [16], because of its
disadvantageous parallels to video surveillance. The disadvantage of wearable sensors is that the
sensors need to be attached to the body, may be uncomfortable to wear, and often require assistance
to attach them correctly. On the other hand, floor sensor systems have the advantage of being
non-invasive and even unobtrusive, less prone to environmental noise and undemanding of the subject's
attention, which has a positive effect on data quality.
Within a walking sequence, gait can be understood as a translation of human brain activity, projected
into the spinal cord, into patterns of muscle contractions. The command generated by the human brain
is transmitted through the spinal cord to initiate the neural centers, which eventually results in
patterns of muscle contractions supported by feedback from muscles, joints, and receptors. This
results in movement of the trunk and lower limbs in a connected way whilst the feet recursively touch
the ground surface and the center of mass of the body shifts. Gait can be defined as repetitive
cycles for each foot resulting in a sequence of periodic events. Each cycle can be divided into the
phases shown in Fig. 1, defined as follows:
• Stance Phase (approximately 60% of the gait cycle, with the foot in contact with
the ground). This phase is subdivided into four intervals (A, B, C, D).
• Swing Phase (approximately 40% of the gait cycle with the foot swinging and not
in contact with the ground). This phase is subdivided into three intervals (E, F, G).
The three main modalities to study gait are image processing, floor sensors and wearable devices.
The modalities are shown in Fig. 2 [18]. Gait patterns can be obtained from video streams, from
footstep pressure recorded by floor sensor systems, or from accelerometer signals (temporal signals).
Gait analysis in the context of this work deals with two main components. One
component is time: this refers to the temporal gait cycle pattern. The other component
is space: this refers to the spatial footstep shape characteristics of the gait pattern.
Here, we introduce a methodology to learn spatio-temporal features directly from raw sensor data
with deep learning models, that is, without the use of human feature engineering. The deep learning
models are based on ANN architectures of several layers that are able to learn features automatically
from raw sensor data.
Camera-based sensors. Here, images and videos obtained from cameras record human gait. Image
processing techniques, such as segmentation, are then used to identify gait from the images. This is
the most widely used approach in the literature, and it involves both model-free and model-based
analysis for gait recognition [12].
Table 1 shows the state-of-the-art research approaches for vision systems in biometrics applications.
The performance is indicated as classification rate (CR). Histogram-based systems have been found to
be the most successful in this area.
Floor sensors. The main research themes developed in this work are based on floor sensor systems.
There are mainly two types of floor sensors studied in the literature: one based on the ground
reaction force and the other based on switch sensors. The first obtains continuous signals while the
latter only delivers binary pressure signals. Table 2 shows the state-of-the-art footstep recognition
systems, listing the number of signals, the number of users, the sample type per model and the
performance results in EER.
The iMagiMat [24, 25] is an affordable floor sensor that allows spatio-temporal
sampling of the ground reaction force (GRF) resulting from footsteps. The gait data
can be recorded, stored and analyzed over large periods of time. The technology
is embodied in a 1 m by 2 m prototype [24]. Gait is measured by detecting light
attenuation caused by the bending of plastic optical fibers (POFs) while walking
on the surface. GRF in the active area of the sensor can then be reconstructed for
further data analysis [24, 25]. Adequate spatio-temporal sampling is ensured by applying tomography
principles to the floor sensor design and by a suitable spatial-frame acquisition frequency, set at
256 Hz.
2.1.1 Advantages
• Non-intrusive system
• There are no external factors affecting the analysis due to a controlled environment.
2.1.2 Disadvantages
Wearable sensors. In this approach, sensors are placed at different positions on the human body to
measure gait. The types of sensors that can be used are force sensors, accelerometers, gyroscopes,
extensometers, and inclinometers.
Inertial sensors. These sensors use the earth's gravitational field to obtain measurements of a
subject's velocity, acceleration, orientation or gravitational forces for gait analysis. Three-axis
accelerometers and gyroscope angular-velocity measurements are usually used for this type of
application.
Table 3 shows the state-of-the-art approaches for inertial systems. There is currently no consensus
in the research community on the approach or the location of the sensor for optimal analysis.
Ultrasonic sensors. Sound waves are used as the sensing mechanism. By measuring the distance between
the ultrasonic sensor and the progression of the gait pattern, gait can be measured and consequently
studied.
Electrogoniometer. This sensor, often installed at the hip or knee, allows continuous measurement of
the current joint angle of a human subject.
Exoskeletons. These are devices that cover the entire human body, usually made of solid materials.
They are combined with goniometer or potentiometer sensors to allow measurement of human kinematics.
2.2.1 Advantages
2.2.2 Disadvantages
In summary, the floor sensor system offers unique advantages over other sensing modalities for
analyzing gait: it is non-intrusive and resilient to noise in the environment. In contrast, some
inertial systems force the user to wear a device for the experiment, and camera systems are
susceptible to environmental conditions such as different levels of light or cross-view angles, which
make it difficult to acquire the data in a form suitable for analysis [35]. For these reasons, this
book chapter focuses on studying the ground reaction force from floor sensor systems for gait
analysis in two applications: healthcare and security.
Cameras, inertial sensors and floor sensor systems have all been used for gait analysis [11, 36].
Floor sensor systems have the advantage of being unobtrusive and resistant to surrounding noise; in
contrast, camera systems require adequate illumination while wearable inertial sensors require daily
placement and maintenance. A floor sensor system can be hidden in a home environment, allowing the
acquisition of natural gait signals over long periods of time. While floor sensor systems have been
built for automatic gait analysis applications [11], they have relied heavily on physiologically
defined, man-made features such as the body's center of pressure, stride length, and cadence, rather
than on raw sensor signals, to construct gait classifier models.
An example of a gait recognition system using a switch sensor system is the UbiFloorII system [37].
The switches in the UbiFloorII system are made from photo-interrupter sensors. Each switch sensor
outputs 0 V or 5 V (on-off) according to the weight exerted on the floor sensor system.
Force plates [38] have also been used for gait analysis to obtain the ground reaction force
perpendicular to the floor sensor system. Piezoelectric sensors are used as the sensing mechanism.
The piezoelectric effect measures the charge accumulated in solid materials as a response to
mechanical stress. In this case, the measured quantity is the response to the pressure exerted by
the weight of the subject walking on the floor. The change in pressure modifies the voltage level at
the piezoelectric sensor output, enabling the measurement of gait signals.
For the goal of classification of human postural and gestural movements using
floor sensor systems, Saripalle et al. [39] applied force platforms to infer the center of
pressure of individuals. Eleven body movements by volunteers were analyzed with
an accuracy ranging from 79 to 92% using linear and non-linear supervised machine
learning models. Feature selection is highlighted as a critical step for obtaining reli-
able accuracy scores, but this approach is limited by the lack of a single classification
model suitable for all types of mobility.
Floor sensor systems have been used to distinguish human movements as presented in [40]. The
recognition is achieved by analyzing the Ground Reaction Force (GRF) on a weight-sensitive floor.
The changes in the GRF arise from activities performed at the same position, including jumping,
sitting and rising. A hidden Markov model was used for human movement classification, and the
classification performance was close to 100%. One disadvantage of that study is that the postural
activities were performed statically at the same position.
These examples are just a small subset of the wide applications of supervised learning in industry.
Deep structured learning, or hierarchical learning, is inspired by the structure and function of
biological neural networks. It is based on the concept of the multi-layer Artificial Neural Network
(ANN), with the aim of learning data representations automatically; thus, deep learning becomes the
method of choice where the classification features, if known at all, are complex, with no
straightforward quantitative relation to the raw data. Typically, the term 'deep' refers to the
number of layers in the variety of possible network structures: Deep Belief Networks (DBN),
Feed-forward Deep Networks (FDN), Boltzmann Machines (BM), Generative Adversarial Networks (GAN),
Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory
(LSTM), a special kind of RNN. A comprehensive presentation of the theory of ANNs and deep learning
is not within the scope of this review and the reader is referred to established sources [13]. In
what follows, we focus on models with practical significance for gait applications, such as CNN and
LSTM.
The CNN model is suitable for processing 1D, 2D or 3D data that has a known grid-like topology [13].
The network has the ability to learn a high level of abstraction and features from large datasets by
applying the convolution operation to the input data. Commonly, the network consists of convolution
layers, pooling layers, and normalization layers, with a set of filters and weights shared within
these layers. The convolutional layers output a feature map harvested automatically from the raw
input data. The pooling layers are used to reduce the size of the representation and make the
convolution layer spatially invariant. The CNN model commonly uses two types of pooling layers: max
pooling and average pooling. All convolution layers and pooling layers have activation functions
(e.g. Sigmoid, Tanh, ReLU, Leaky ReLU) applied to the weighted sum of the neuron inputs plus a bias,
deciding whether to fire the neuron or not [45]. LSTM networks are favorable for processing
time-series data, where the order is of importance, such as gait data sequences. In essence, they
exploit recurrence by using information from a previous forward pass over the network.
The goal of using ANNs in gait analysis is to develop a model to extract gait
features and perform well on unseen real-world gait data. Commonly, for appropriate
training and testing, the model is trained and validated on 70% of the data and
tested on the remaining 30%. In supervised training, the procedure is launched by
initializing the weights randomly, processing the inputs and comparing the resultant
output against the desired output. During training, the weights and biases are adjusted
in every iteration, until the error is minimized, and validation is used to estimate the
model performance during training. Lastly, the model is tested with unseen data,
allowing over-training to be identified.
A widely used accuracy measure for ANN gait analysis is the confusion matrix. It is a table that
visualizes the number of predictions classified correctly and wrongly for each class, consisting of
the true positive, true negative, false positive, and false negative classification counts. One of
the advantages of the confusion matrix display is that it makes the decision confusions
straightforward to identify, thus allowing conclusions about the quality of the data involved.
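A minimal sketch of how such a table can be accumulated from predicted and true class labels is shown below; the labels are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (rows) was predicted as each class
    (columns); the diagonal holds the correctly classified samples."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_pred, n_classes=3))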
A core building block in CNN models is the convolutional layer, where the computational heavy lifting
of data processing takes place. This layer is based on convolution, a specialized kind of linear
operation [30]. The convolution operation is performed on two functions to produce a feature map,
where the first function is the input data and the latter is the filter or kernel. In this process,
the filter slides over the input data and, at each position, the sum of the element-wise products is
written into the feature map (see Fig. 3). The convolution layer output consists of different feature
maps produced by different kernels. An activation function is applied to produce nonlinear feature
maps and to make the training faster and more accurate. The most widely used activation function in
a convolutional layer is the Rectified Linear Unit (ReLU), which sets all negative values to 0 and
leaves positive values unchanged. A mathematical representation of the convolution operation given
an input I(t) and a kernel K(a) is given as
$$s(t) = \sum_{a} I(a)\, K(t - a) \quad (1)$$
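A minimal NumPy sketch of the discrete convolution in Eq. (1) applied to a short 1-D signal is given below; the toy GRF samples and the edge-like kernel are illustrative only.

```python
import numpy as np

def conv1d(signal, kernel):
    """Discrete 1-D convolution of Eq. (1): slide the flipped kernel over the
    input and sum the element-wise products at every valid position."""
    n, k = len(signal), len(kernel)
    out = np.zeros(n - k + 1)
    for t in range(len(out)):
        out[t] = np.sum(signal[t:t + k] * kernel[::-1])
    return out

grf = np.array([0.0, 0.2, 0.8, 1.0, 0.7, 0.3, 0.0])  # toy GRF samples
kernel = np.array([1.0, -1.0])                        # edge-like filter
print(conv1d(grf, kernel))                            # one row of a feature map
```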
Convolutional networks usually also contain pooling layers, of two types: max pooling and average
pooling. The objective of this layer is to recombine the convolutional layer output into more
compact, meaningful information. In a pooling layer, a window slides over the convolutional layer
output and the maximum or average value within the window becomes an element of the output matrix,
the pooling layer output.
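A minimal NumPy sketch of non-overlapping max and average pooling over a 1-D feature map might look as follows; the window size and the example values are illustrative.

```python
import numpy as np

def pool1d(feature_map, window=2, mode="max"):
    """Non-overlapping pooling over a 1-D feature map: keep either the maximum
    or the average value in each window."""
    trimmed = feature_map[: len(feature_map) // window * window]
    windows = trimmed.reshape(-1, window)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

fm = np.array([0.1, 0.9, 0.3, 0.4, 0.8, 0.2])
print(pool1d(fm, mode="max"))   # [0.9, 0.4, 0.8]
print(pool1d(fm, mode="avg"))   # [0.5, 0.35, 0.5]
```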
4.2 Background
4.3 Methodology
Healthy men and women between the ages of 20 and 65 years were invited to participate in the study.
Those with any condition that might affect a normal walking pattern, typically a history of falls
within 6 months prior to enrolment, were excluded from the study. Statistical information such as
gender and age was also captured to allow further analyses. All methods were performed in accordance
with guidelines and regulations of the University Research Ethics Committee (UREC) at the University
of Manchester. The experimental protocol was approved by the Ethics Committee with reference
ethics/15536 on January 25, 2016. Informed consent was obtained from the participants to take part
in this study.
4.3.2 Procedure
Four walking experiments were executed by each participant on the floor sensor
system. The participants initially undertook normal and fast walk experiments, followed by two
dual-task experiments. The first dual-task experiment was to spell five common words in reverse [3].
In the second, participants performed serial seven
subtractions starting from a random 3-digit number [5]. The experiments were per-
formed in a silent environment, with external distraction kept to a minimum. Par-
ticipants were allowed to wear any type of footwear during the experiments. Each
experiment lasted 5 min for a total of 20 min per participant. The number of cap-
tured experimental gait samples depended on the participant’s speed and manner
of walking, which varied among participants. The participants walked continuously
from one end of the walkway to the other during the experiment. An extra one-meter
length at the start and end of the floor sensor system was allowed to enable the par-
ticipants to accelerate and decelerate their walk. No cameras or video recording was
used since they can significantly compromise the privacy of participants [16] and
affect adversely the quality of the data.
The dual-task database collected, entitled UoM-Gait-69, comprised data from 69 cognitively and
physically healthy adults who participated in our study. The participants' ages ranged from 20 to
63 years; thirty-seven (53%) were female. The participants were given a unique identification number
(ID) for anonymization and experiment identification.
We designed a set of seven experimental cases (rows of Table 4), in which the database volunteers are
arranged in different groups for the experimental comparisons. For example, experimental case 1 has
3 groups: group 1 of 27 participants between 20–28 years, group 2 of 22 participants between 31–42
years and group 3 of 20 participants between 46–63 years. The experimental cases are summarized in
Table 4.
The experimental cases cover a wide range of age-cohort sets (group columns of Table 4) reflecting
age-related differences in the dual-tasks executed by the participants, with the aim of testing the
ability of a machine learning model to differentiate age-range sets with a variable number of
participants per set. This includes the experiment type, number of participants and age range. In
some experiments, a large age cohort of participants was contained in each group; for example,
experiment one had three decade-long age sets for classification with approximately 20 participants
in each set. A single age cohort was also used in other experiments, such as experiment seven, for
participants between 20 and 26 years of age.
Spatio-Temporal Raw Sensor Matrices
Spatio-temporal raw sensor matrices (RSMs) as described in [25] were constructed from the raw sensor
data in this study. This approach did not require tomography-reconstructed images; instead, it was
possible to derive the RSMs directly from the raw data. Therefore, RSMs were calculated for all the
experiments performed in this study.
(Figure: two-stream, inception-inspired architecture with parallel 1 × 1 × 140 and 5 × 5 × 140 convolutions, followed by feature concatenation and output classification.)
The Random Forest classifier returned an F-score of 56.12%, whilst the linear SVM returned the lowest
classification performance overall with an F-score of 23.67%. The deep learning methodology (F-score:
97%) improved the F-score of the Random Forest classifier by 40.88%, while the best improvement was
obtained against the linear SVM classifier by 63.5%. These results justify a conclusion of robust
classification performance of the deep machine learning methodology compared to shallow, ensemble
and linear machine learning models (Table 6).
Table 6 Comparison of the deep learning models against shallow models of experiment seven.
Classification of normal and dual-task two is shown. Classes are defined in Table 5
Optimization | Model | Precision (%) | Recall (%) | F-score (%) | Support
Early stop | Two-stream inception | 97 | 96.99 | 96.99 | 1197
Genetic programming | Gradient boosting classifier | 62.79 | 62.82 | 62.50 | 1197
None | Random forest | 57.28 | 57.31 | 56.12 | 1197
None | Linear SVM classifier | 26.36 | 27.82 | 23.67 | 1197
Here, experiments three and seven, described in Table 4, are further explored, since the former had a
large cohort of participants in two age ranges, whilst the latter delivered the best F-score over all
experiments, for classification of two single-age groups. Tables 7 and 8 show the detailed
performance results per class for experiments 3 and 7, respectively. The analysis included metrics
such as the Matthews correlation coefficient, informedness, markedness, and prevalence [56] to
further inform the classification performance results.
Figures 6a and 7a show the precision and recall curves [56] for experiments three and seven,
respectively, plotting the correspondence between precision and recall for a set of threshold values.
Figures 6b and 7b show the receiver operating characteristic (ROC) curves [56] for the same two
experiments; this curve shows the true positive and false positive rates of a machine learning model
over a range of thresholds. As in the case of the precision and recall curve, the experiment seven
model outperforms experiment three in the ROC curve.
4.8 Discussion
Fig. 6 Classification performance characteristics of experiment three: (a) precision and recall curve, (b) ROC curve
Fig. 7 Classification performance characteristics of experiment seven
load in participants compared to task one, resulting in a more pronounced gait pattern, which
impacted the ability of the machine learning model to classify gait patterns successfully. Moreover,
for participants to perform the arithmetic operations of the dual-task, coordination among several
processes, such as articulatory, phonatory and respiratory functions, was required, which might have
led to a greater demand on executive function processing [5]. High classification performance was
also observed with short age-range groups and with a large age gap between groups.
These characteristics tended to isolate the gait pattern even further.
The high classification performance obtained in the age-related experiments
demonstrated that the deep learning methodology presented here may be appro-
priate for gait data analysis from participants with MCI in large cohort studies [57].
People with impaired executive function in the context of a diagnosis of AD have
This study aims to establish a benchmark for the relationship between a managed cog-
nitive load and gait in cognitively healthy participants from a novel data analysis per-
spective. We will apply a novel analytic approach based on advanced computational
models known as deep machine learning [13]. Current methods in dual-task analysis
rely on specific statistical features such as gait speed and variability [3–5, 50]. This
focus has been influenced by human observation, which is intrinsically subjective
and has limited reach. Usually, only a few experimental gait samples per participant
have been included in the analysis.
In contrast, in this study, the dual-task effects are studied using deep machine
learning principles [13] to automatically define and extract optimal gait features
harvested from raw spatio-temporal gait data. The data were obtained from an original
tomographic floor sensor system [51] sampled directly from the raw sensor data
rather than from reconstructed data [52], which requires further processing. A large
cohort of 69 participants was recruited, resulting in a sizeable set of gait samples
per participant experiment that allowed statistical reliability [53]. These aspects are
features not commonly found in gait analysis research. Furthermore, a large dataset
is beneficial for the optimal application of deep learning models.
Footstep feature extraction and feature engineering have played a central role in
automatic footstep recognition research [12]. This procedure involves the careful
selection and design of very complex and time-consuming hand-crafted features for
footstep recognition. The features include geometric, holistic, spectral and wavelet approaches, to name a few [12]. Automatic feature learning models [13] have not been well studied for biometric footstep recognition using floor sensor systems.
Research studying footstep data as a biometric has collected footstep signals from: (i) switch sensors [59, 60], which analyze the spatial distribution of the footstep signals, and (ii) pressure sensors [26–28], which focus on the dynamic pressure information in the signals but have low spatial resolution. Qian et al. [61] use a high-resolution commercial pressure mat to extract centre-of-pressure information, therefore using temporal and spatial pressure information only for some selected key points (a geometric approach).
Recently, footstep signals in temporal and spatial domains were analyzed [29],
reporting experiments on the SFootBD. The spatial information was extracted from accumulated pressure images, and the temporal information from the average GRF and other hand-crafted features. Principal Component Analysis (PCA) was used for dimensionality reduction of the footstep data, and a non-linear SVM was used for biometric verification. Results in the range of 2.5–10% Equal Error Rate (EER) were achieved depending on the application setting. In [36] we reported
a pilot study of a convolutional neural network model to learn processed spatial
footstep features of the SFootBD database, suggesting significant improvements of
footstep recognition performance compared to existing work [29].
Table 2 shows the recognition performance of the approach compared to other
known biometric verification systems based on floor sensor data only. The other
Fig. 8 Two-stream spatio-temporal resnet architecture for raw footstep representation
studies do not use the SFootBD database, thus cannot be directly compared to this
work in terms of performance since the experiments differ in the number of clients
and footstep signals. However, we are using a much larger database and therefore
the performance results are more statistically significant.
In this report, we analyze the effect of evaluating a set of diverse footstep data
representations in machine learning models. Two representations worked best overall
for the spatio-temporal biometric verification problem presented here: raw footstep
data and processed footstep data.
The deep machine learning models used in this work are based on the state-of-the-art
resnet architecture [62].
The resnet architecture, illustrated in Fig. 8, consists of spatial and temporal streams for the raw representation. From input to output, each stream consists of the following layers: first, a resnet configuration 1 block (2ay) (Fig. 9 right), followed by two resnet configuration 2 blocks (2by and 2cy) (Fig. 9 left), then an average pooling layer, a fully connected (FC) layer and finally a softmax layer. The blocks consist of convolutional layers, batch normalization [63] and ReLU activation functions [64]. The residual units in the network can be expressed in general form as:
y_l = h(x_l) + G(x_l, W_l),    (2)

x_{l+1} = f(y_l),    (3)

where x_l is the input to the l-th residual block and x_{l+1} is its corresponding output, G is a non-linear residual function, h(x_l) = x_l is an identity mapping, and f is a ReLU activation function [64]. W_l = {W_{l,k} | 1 ≤ k ≤ K} is the set of weights and biases of the l-th residual block, and K is the number of layers in a residual unit. If f is also an identity mapping, then x_{l+1} ≡ y_l, and therefore Eq. 3 can be expressed as:

x_{l+1} = x_l + G(x_l, W_l).    (4)
Fig. 9 Resnet block configurations: each block consists of conv 2a/2b, batch normalization 2a/2b and ReLU layers
For any deeper unit L and shallower unit l, the forward propagation of the feature x_L can be expressed as an additive output:

x_L = x_l + Σ_{i=l}^{L-1} G(x_i, W_i).    (5)

Correspondingly, denoting the loss by γ, the gradient back-propagated to unit l is:

∂γ/∂x_l = (∂γ/∂x_L) · (∂x_L/∂x_l) = (∂γ/∂x_L) (1 + ∂/∂x_l Σ_{i=l}^{L-1} G(x_i, W_i)).    (6)
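To make the structure of the residual units in Eqs. 2–4 concrete, a minimal Keras sketch of one identity-shortcut unit is shown below. The filter count and kernel size are illustrative assumptions rather than the configuration used in this chapter.

```python
from tensorflow.keras import layers

def residual_unit(x, filters=64, kernel_size=3):
    """One identity-shortcut residual unit: y_l = x_l + G(x_l, W_l).

    The filter count and kernel size are illustrative; the input is assumed
    to already have `filters` channels so the addition is shape-compatible.
    """
    shortcut = x                                               # h(x_l) = x_l
    g = layers.Conv2D(filters, kernel_size, padding="same")(x)
    g = layers.BatchNormalization()(g)
    g = layers.Activation("relu")(g)
    g = layers.Conv2D(filters, kernel_size, padding="same")(g)
    g = layers.BatchNormalization()(g)                         # G(x_l, W_l)
    y = layers.Add()([shortcut, g])                            # x_l + G(x_l, W_l)
    return layers.Activation("relu")(y)                        # f = ReLU
```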
Using the resnet as a feature extractor eases the evaluation of the biometric verification system: the learned feature set is evaluated with a discriminative linear classifier, which saves computational resources and time. The linear classifier selected for the evaluation of the experiments was a linear Support Vector Machine (SVM), due to its high biometric performance when compared with other linear classifiers such as logistic regression or the perceptron. If u is the total number of clients for a given experiment, then u linear SVM classifier models are trained using the resnet models as feature extractors, instead of training u resnet models, which are computationally expensive.
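A minimal sketch of this evaluation scheme is given below, assuming a frozen resnet has already been used to extract an embedding per footstep sample; the helper names and the SVM regularization constant are illustrative assumptions, not the chapter's actual code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_client_verifiers(features, client_ids):
    """Train one linear SVM per enrolled client (client vs. impostors).

    `features` are embeddings produced by the frozen resnet feature extractor;
    `client_ids` holds the identity of each training sample. Names and C are
    illustrative assumptions.
    """
    verifiers = {}
    for client in np.unique(client_ids):
        labels = (client_ids == client).astype(int)   # 1 = client, 0 = impostor
        svm = LinearSVC(C=1.0)
        svm.fit(features, labels)
        verifiers[client] = svm
    return verifiers

# Verification score for a probe claiming identity `c`:
# score = verifiers[c].decision_function(probe_embedding.reshape(1, -1))
```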
The RMSprop [66] optimizer was selected to update the model's weights due to its stability at training time. All models were trained with a batch size of 32 samples. Initialization of the models with ImageNet Resnet-50 [67] weights for transfer learning was tested without major improvements; therefore, the weights were instead initialized by sampling values from a Gaussian distribution to ease the initialization process. The RMSprop learning rate was initially set to 0.001 and decreased by a factor of 10 once the learning error plateaued. An early stopping procedure was implemented: training stops once the validation error stops decreasing.
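A sketch of this training configuration in Keras is shown below; the monitored quantities and patience values are assumptions, since the chapter does not report them.

```python
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

optimizer = RMSprop(learning_rate=0.001)                 # initial learning rate 0.001
callbacks = [
    ReduceLROnPlateau(monitor="loss", factor=0.1,        # divide the rate by 10
                      patience=5),                       # once the error plateaus
    EarlyStopping(monitor="val_loss", patience=10,       # stop when the validation
                  restore_best_weights=True),            # error stops decreasing
]
# model.compile(optimizer=optimizer, loss="categorical_crossentropy")
# model.fit(x_train, y_train, batch_size=32, validation_data=(x_val, y_val),
#           callbacks=callbacks)
```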
As footstep GRF patterns tend to contain a large degree of fine-grained GRF variability, they are difficult for humans to visualise and evaluate. Figures 10 and 11 show a side-by-side comparison of stride raw (top) and processed (bottom) spatial footstep representations from 2 clients of the SFootBD, considering 2 samples per user. The comparison implies that effective footstep recognition based only on visual perception is a very challenging problem, as there can be high intra-user variability and low inter-user variability in some cases. Moreover, humans are not accustomed to recognizing this type of image, as opposed to other biometric traits such as faces. Machine learning has therefore been used in an attempt to differentiate the fine-grained GRF variability between clients and impostors.
The spatial and temporal footstep data share the same resnet architecture shown in Fig. 8. The input footstep representation determines the dimensions of the first convolutional (conv.) layer of the resnet model: it takes as input a stride footstep tensor of shape (n, m, c), where n × m is the 2D footstep sensor matrix and c the number of frames; c = 1 for the spatial case and c = 100 for the temporal component. The filter size of the resnet blocks (Fig. 9) and the number of channels change according to the input footstep tensor dimensions.
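As an illustration of these shape conventions, the two stream inputs could be declared as follows; the sensor matrix size used here is a placeholder, since the exact n × m dimensions are not stated at this point.

```python
from tensorflow.keras import Input

# Hypothetical 2D footstep sensor matrix size (placeholder values).
N_ROWS, N_COLS = 88, 88

spatial_input = Input(shape=(N_ROWS, N_COLS, 1))     # c = 1: spatial representation
temporal_input = Input(shape=(N_ROWS, N_COLS, 100))  # c = 100: stack of 100 frames
```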
The widely used deep network design introduced by the VGG net [68] is adopted for the resnet models: the spatial dimensions are decreased at the conv. layers while the number of filter maps is increased, from the left (input) to the right (output) layers of the network.
The verification system performance was evaluated by using the Detection error
trade-off (DET) curve [69], which displays a trade-off of missed detection and false
alarm errors. We also used the Equal Error Rate (EER) to summarise the biometric
verification performance of the system. The EER is the intersection in the DET curve
where the False Rejection Rate (FRR) and the False Acceptance Rate (FAR) are equal.
Therefore, we are giving equal importance to FRR and FAR for the evaluation of our
experiments.
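A simple sketch of how the EER can be estimated from verification scores is given below; it is a generic illustration of the FAR/FRR crossing point, not the chapter's evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Estimate the EER, i.e. the operating point where FAR equals FRR.

    `labels` are 1 for genuine (client) trials and 0 for impostor trials;
    `scores` are the verifier outputs (higher = more client-like).
    """
    far, tpr, _ = roc_curve(labels, scores)   # FAR = false acceptance rate
    frr = 1.0 - tpr                            # FRR = false rejection rate
    idx = np.nanargmin(np.abs(far - frr))      # threshold where the rates cross
    return (far[idx] + frr[idx]) / 2.0
```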
5.6 Results
For this benchmark, the fusion of the spatial and temporal domains performs best overall for the 3 representations considered. Separately, the raw representation delivered the best performance with (11.80%, 11.50%) EER (validation, evaluation), followed by the processed SVM representation with (8%, 12.50%) EER; the processed representation obtained (10.10%, 14.50%) EER.
For the fusion of representations, the raw and processed representations deliver (8.10%, 10.70%) EER, a better performance than considering the two representations separately, while the combination of the raw, processed and processed SVM representations delivers the optimal performance overall with (7.10%, 10.50%) EER. This improves the previously reported optimal performance [29] by 2% EER on the evaluation and 0.9% EER on the validation dataset. This benchmark considers the least amount of footstep data for training of the 3 benchmarks and exemplifies a real-world security application, where data is scarce.
Spatio-temporal fusion performs best overall for the 3 representations in this dataset. The processed SVM delivers the best performance of the 3 representations with (3.80%, 6.70%) EER. The raw and processed representations deliver the same evaluation performance of 8% EER; in validation, the raw representation obtains 6.10% EER while the processed representation delivers a better performance of 3.80% EER.
For the fusion of representations, the raw and processed representations deliver (3.20%, 5.30%) EER, which is better than any of the representations considered separately. The combination of the raw, processed and processed SVM representations delivers the optimal performance overall of (2.80%, 4.90%) EER for this dataset. This improves the previously reported optimal performance [29] by 1.8% EER on the evaluation and 1% EER on the validation dataset.
This benchmark considers a medium amount of footstep data for training among the 3 benchmarks; an office security environment exemplifies its real-world scenario.
Spatio-temporal fusion performs best overall for the 3 representations. The processed representation delivers the best performance with (1.80%, 2.60%) EER. The processed SVM follows with (2.10%, 3.20%) EER and, lastly, the raw representation obtained (1.70%, 5.60%) EER.
At the fusion-of-representations level, the raw and processed representations deliver (0.80%, 2.10%) EER, performing better than considering the representations separately, as in the previous benchmarks. The combination of the raw, processed and processed SVM representations delivers the optimal performance overall of (0.70%, 1.70%) EER for this dataset and across all experiments. This improves the previously reported optimal performance [29] by 2.3% EER on the evaluation and 1.4% EER on the validation dataset. These results are the best overall considering all experiments and benchmarks.
This benchmark considers the largest amount of footstep data for training of the 3 benchmarks, hence the best performance observed over all experiments. We argue that the best performance is observed here because the largest amount of footstep data is available for training the resnet models. A home environment exemplifies a real-world security application of this dataset, where the proposed methodology and models would work optimally.
5.8 Discussion
The partition of the test datasets into validation and evaluation subsets allows eval-
uation of the model’s generalisation performance with high confidence since the
Table 9 Biometric verification results in terms of EER (in %) for benchmarks B1, B2 and B3. Val./Eval. denote the validation/evaluation datasets

Domain                        Model            B1 (40 clients)     B2 (15 clients)     B3 (5 clients)
                                               Val.     Eval.      Val.     Eval.      Val.     Eval.
Raw representations
Temporal                      Resnet           14.70    18.00       8.20     6.70       4.60     8.00
Spatial                       Resnet           16.30    13.40      11.20    10.70       3.40    12.00
Spatio-temporal               Resnet           11.80    11.50       6.10     8.00       1.70     5.60
Spatio-temporal               DNN              27.65    27.93      14.33    17.33       5.76     6.57
Spatio-temporal               CNN              31.28    31.21      14.26    14.67       3.62     4.00
Processed representations
Temporal                      Resnet           12.20    18.00       6.60     9.30       3.90     2.00
Spatial                       Resnet           13.60    15.50       5.50     9.30       3.00     6.60
Spatio-temporal               Resnet           10.10    14.50       3.80     8.00       1.80     2.60
Spatio-temporal               DNN              17.25    21.00       6.10     6.66       2.80     3.00
Spatio-temporal               CNN              18.10    23.00       6.07     9.95       1.61     3.38
Processed SVM representations
Spatial-integrated temporal   SVM              12.10    16.50       9.30    12.00       6.10     8.20
Spatial                       SVM              11.70    17.50       5.90     9.20       3.80     2.60
Spatio-temporal               SVM               8.00    12.50       3.80     6.70       2.10     3.20
Fusion of representations
Raw and processed             Resnet            8.10    10.70       3.20     5.30       0.80     2.10
Raw, processed and            Resnet and SVM    7.10    10.50       2.80     4.90       0.70     1.70
processed SVM
evaluation dataset never influences the training process either directly (training set) or indirectly (validation set). Overall, the validation dataset EER is better than the evaluation dataset EER, because the evaluation set measures the generalisation of the model to held-out footstep data. We are able to provide better performance results in all benchmarks when compared with previously reported work [29].
The validation dataset performance influences the early stopping procedure at the
training time of the resnet models, thus indirectly influencing the generalization per-
formance of the system. However, this is a widely used procedure, and by providing
an EER performance in a held-out dataset (evaluation) a closer and more realistic
estimate of the generalization performance is provided.
Deep residual networks are known to show state-of-the-art performance for problems that use large amounts of data for model training, such as ImageNet [13, 67], which contains millions of training samples. This effect can be seen in both the validation and evaluation performance results in Table 9, as the data available per model increases.
The raw and processed resnet representations obtained very similar EER performance in the 3 datasets, as observed in Table 9. Therefore, the raw models are able to provide competitive performance from raw, unprocessed footstep data when compared with processed footstep data evaluated in a learning model.
This section has explored the important effects of testing spatio-temporal input footstep data representations in machine learning models based on deep residual networks. The representations are based on raw and processed footstep data. We compared their performance with a processed-representation approach using an SVM; the two methods delivered similar performance. The critical factors that affect footstep biometric verification performance are the spatio-temporal data representations and the amount of data considered for training.
Three datasets from the largest footstep database were considered for the spatio-temporal analysis. The datasets resemble data-driven real-world scenarios, including a small footstep dataset for security applications (Benchmark B1), a medium-size dataset for office-oriented applications (Benchmark B2), and a large dataset for home-based scenarios (Benchmark B3). These scenarios intend to cover the most common real-world settings.
The experiments performed here show that there is no single optimal representation for all datasets. Considering the representations separately, for Benchmark B1 the raw representation performs optimally, for Benchmark B2 the processed SVM delivers optimal verification performance, and for Benchmark B3 the processed representation performs best overall. This justifies this research in terms of evaluating several representations in machine learning models in order to obtain a robust footstep recognition model.
This result highlights the need for raw data representation analysis for automatic feature learning models. We have demonstrated that an ensemble of resnet and SVM models using processed and unprocessed footstep data obtains a robust footstep recognition model for biometric verification.
6 Conclusions
In this chapter, spatio-temporal gait and footstep representations have been studied with deep learning methodologies. In the healthcare theme, dual-task activity has been classified with robust classification performance, providing an F-score of 97.33% in the optimal case, while in the security theme, state-of-the-art footstep recognition performance has been obtained in a biometric verification scenario, with an optimal EER of 0.7%. Therefore, robust pattern recognition in gait and footstep analysis has been provided with high statistical significance. The methodologies used to obtain the optimal results were based on deep machine learning principles and convolutional neural networks.
In the healthcare theme, the link between cognitive activities and their effects on changes in human gait patterns was investigated. The research analyzed the effect of cognitive activities on the gait patterns of healthy individuals. The
Acknowledgements We express our gratitude to the participants for taking the time to participate
in this research and to David H. Foster for useful discussions. This work was supported by the
U.K. Engineering and Physical Sciences Research Council EP/K005294/1 EP/K503447/1, in part
by CONACyT (Mexico), grant 467373 and in part by the University of Manchester Data Science
Institute. O. Costilla-Reyes would like to acknowledge CONACyT (Mexico) for a studentship. We
acknowledge NVIDIA for the donation of the GPU used to perform some of the experiments of
this research.
References
4. Beurskens, R., Bock, O. Age-related deficits of dual-task walking: a review. Neural Plast. 2012
(2012)
5. Hausdorff, J.M., Schweiger, A., Herman, T., Yogev-Seligmann, G., Giladi, N.: Dual-task decre-
ments in gait: contributing factors among healthy older adults. J. Gerontol. Ser. A: Biol. Sci.
Med. Sci.63(12), 1335–1343 (2008)
6. Barnes, D.E., Yaffe, K.: The projected effect of risk factor reduction on Alzheimer’s disease
prevalence. Lancet Neurol. 10(9), 819–828 (2011)
7. Lundin-Olsson, L., Nyberg, L., Gustafson, Y.: Stops walking when talking as a predictor of
falls in elderly people. Lancet 349(9052), 617 (1997)
8. Costilla-Reyes, O., Vera-Rodriguez, R., Scully, P., Ozanyan, K.B.: Analysis of Spatio-temporal
representations for robust footstep recognition with deep residual neural networks. IEEE Trans.
Pattern Anal. Mach. Intell.41(2), 285–296 (2018)
9. Vacca, J.R.: Biometric Technologies and Verification Systems. Butterworth-Heinemann (2007)
10. Tsatsoulis, P.D., Jaech, A., Batie, R., Savvides, M.: Continuous authentication using biometrics. IGI Global, 68–88 (2012)
11. Muro-de-la Herran, A., Garcia-Zapirain, B., Mendez-Zorrilla, A.: Gait analysis methods: an
overview of wearable and non-wearable systems, highlighting clinical applications. Sensors
14(2), 3362–3394 (2014)
12. Eric Mason, J., Traoré, I., Woungang, I.: Machine Learning Techniques for Gait Biometric
Recognition. Springer (2016)
13. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
14. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M. et al.: Imagenet large scale visual recognition challenge. arXiv
preprint arXiv:1409.0575 (2014)
15. J.M. Montepare, Goldstein, S.B.: The identification of emotions from gait information. J.
Nonverbal Behav. 11(1), 33–42 (2016)
16. Martina, Z., Simon,H., Wilkowska, W.: When Your Living Space Knows What You Do: Accep-
tance of Medical Home Monitoring by Different Technologies. Springer (2011)
17. Alharthi, A.S., Yunas, S.U., Ozanyan, K.B.: Deep learning for monitoring of human gait: a
review. IEEE Sens. J. (submitted) (2019)
18. Connor, P., Ross, A.: Biometric recognition by gait: a survey of modalities and features. Comput.
Vis. Image Underst. 167, 1–27 (2018)
19. El-Alfy, H., Mitsugami, I., Yagi, Y.: A new gait-based identification method using local Gauss
maps. In: Asian Conference on Computer Vision, pp. 3–18. Springer (2014)
20. Ioannidis, D., Tzovaras, D., Damousis, I.G., Argyropoulos, S., Moustakas, K.: Gait recognition
using compact feature extraction transforms and depth information. IEEE Trans. Inf. Forensics
Secur. 2.3, 623–630 (2007)
21. Arora, P., Srivastava, S., Arora, K., Bareja, S.: Improved gait recognition using gradient his-
togram Gaussian image. Procedia Comput. Sci. 58, 408–413 (2015)
22. Liu, Y., Zhang, J., Wang, C. ,Wang, L.: Multiple HOG templates for gait recognition.2012 21st
International Conference on Pattern Recognition (ICPR) , pp. 2930–2933. IEEE (2012)
23. Sivapalan, S., Chen, D., Denman, S., Sridharan, S., Fookes, C.: Gait energy volumes and frontal
gait recognition using depth images. In: 2011 International Joint Conference on Biometrics
(IJCB), pp. 1–6. IEEE (2011)
24. Costilla-Reyes, O., Scully, P., Ozanyan, K.B.: Temporal pattern recognition in gait activities
recorded with a footprint imaging sensor system. IEEE Sens. J. 16(24), 8815–8822 (2016)
25. Costilla-Reyes, O., Scully, P., Ozanyan, K.B.: Deep neural networks for learning spatio-
temporal features from tomography sensors. IEEE Trans. Ind. Electron. 65(1), 645–653 (2018)
26. Cattin, P.C.: Biometric authentication system using human gait. PhD thesis. Diss., ETH Zurich,
Nr. 14603, pp. 1–140 (2002)
27. Stevenson, J.P., Firebaugh, S.L., Charles, H.K.. Biometric identification from a floor based
PVDF sensor array using hidden Markov models. Proc. SAS 7 (2007)
28. Mostayed, A., Kim, S., Mazumder, M.M.G., Park, S.J.: Foot step based person identification
using histogram similarity and wavelet decomposition. In: Proceedings of the 2nd International
Conference on Information Security and Assurance, , pp. 307–311. IEEE (2008)
29. Vera-Rodriguez, R., Mason, J.S.D., Fierrez, J., Ortega-Garcia, J.: Comparative analysis and
fusion of spatiotemporal information for footstep recognition. IEEE Trans. Pattern Anal. Mach.
Intell. 35(4), 823–834 (2013)
30. Zhong, Y., Deng, Y.: Sensor orientation invariant mobile gait biometrics. In: 2014 IEEE Inter-
national Joint Conference on Biometrics (IJCB), pp. 1–8. IEEE (2014)
31. Zhang, Y., Pan, G., Jia, K., Lu, M., Wang, Y., Wu, Z.: Accelerometer based gait recognition by
sparse representation of signature points with clusters’. IEEE Trans. Cybern. 45(9), 1864–1875
(2015)
32. Bours, P., Shrestha, R.: Eigensteps: a giant leap for gait recognition. In: 2010 2nd International
Workshop on Security and Communication Networks (IWSCN), pp. 1–6. IEEE (2010)
33. Gafurov, D., Snekkenes, E., Bours, P.: Improved gait recognition performance using cycle
matching. In: 2010 IEEE 24th International Conference on Advanced Information Networking
and Applications Workshops (WAINA), pp. 836–841. IEEE (2010)
34. Rong, L., Jianzhong, Z., Ming, L., Xiangfeng, H.: A wearable acceleration sensor system for
gait recognition. In: 2007 2nd IEEE Conference on Industrial Electronics and Applications,
ICIEA 2007, pp. 2654–2659. IEEE (2007)
35. Zifeng, W., Huang, Y., Wang, L., Wang, X., Tan, T.: A comprehensive study on cross-view gait
based human identification with deep CNNs. IEEE Trans. Pattern Anal. Mach. Intell. 39(2),
209–226 (2017)
36. Costilla-Reyes, O., Vera-Rodriguez, R., Scully, P., Ozanyan, K.B.: Spatial footstep recogni-
tion by convolutional neural networks for biometric applications. In: Proceedings of IEEE
SENSORS 2016. IEEE (2016)
37. Yun, J.: User identification using gait patterns on UbiFloorII. Sensors 11(3), 2611–2639 (2011)
38. Claude Cattin, P.: Biometric authentication system using human gait. PhD thesis. PhD disser-
tation Technische Wissenschaften ETH Zurich Nr. 14603 (2002)
39. Kanth Saripalle, S.: Classification of human postural and gestural movements using center of
pressure parameters derived from force platforms. PhD thesis. University of Missouri- Kansas
City (2010)
40. Headon, R., Curwen, R.: Recognizing movements from the ground reaction force. In: Proceed-
ings of the 2001 Workshop on Perceptive User Interfaces, pp. 1–8. ACM (2001)
41. Simeone, O.: A very brief introduction to machine learning with applications to communication
systems. In: IEEE Transactions on Cognitive Communications and Networking (2018)
42. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order
to the web (1999)
43. Zhao, W., Krishnaswamy, A., Chellappa, R., Swets, D.L., Weng, J.: Discriminant analysis of
principal components for face recognition. In: Face Recognition, pp. 73–85. Springer (1998)
44. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey
of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749
(2005)
45. Yi, N., Li, C., Feng, X., Shi, M.: Research and improvement of convolutional neural network.
In: 2018 IEEE/ACIS 17th International Conference on Computer and Information Science
(ICIS), pp. 637–640. IEEE (2018)
46. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in
videos. Adv. Neural Inf. Process. Syst., 568–576 (2014)
47. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K.,
Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
2625–2634 (2015)
48. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features
with 3d convolutional networks. In: Proceedings of the IEEE International Conference on
Computer Vision, pp. 4489–4497 (2015)
49. Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal residual networks for video action
recognition. Adv. Neural Inf. Process. Syst., 3468–3476 (2016)
50. Ji, S., Wei, X., Yang, M., Kai, Y.: 3D convolutional neural networks for human action recog-
nition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
51. Wang, X., Farhadi, A., Gupta, A.: Actions—transformations. In: 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 2658–2667 (2016)
52. Gafurov, D.: A survey of biometric gait recognition: approaches, security and challenges. In:
Annual Norwegian Computer Science Conference, pp. 19–21. Citeseer (2007)
53. Montero-Odasso, M., Muir, S.W., Speechley, M.: Dual-task complexity affects gait in people
with mild cognitive impairment: the interplay between gait variability, dual tasking, and risk
of falls. Arch. Phys. Med. Rehabil. 93(2), 293–299 (2012)
54. Owings, T.M., Grabiner, M.D.: Measuring step kinematic variability on an instrumented tread-
mill: how many steps are enough? J. Biomech. 36(8), 1215–1218 (2003)
55. Atkinson, H.H., Rosano, C., Simonsick, E.M., Williamson, J.D., Davis, C., Ambrosius, W.T.,
S.R. Rapp, Cesari, M., Newman, A.B., Harris, T.B.: Cognitive function, gait speed decline,
and comorbidities: the health, aging and body composition study. J. Gerontol. Ser. A: Biol. Sci.
Med. Sci. 62(8), 844–850 (2007)
56. Powers, D.M.W.: Evaluation: from precision, recall and F-measure to Roc, informedness,
markedness & correlation. J. Mach. Learn. Technol. 2(1), 37–63 (2011)
57. Gunn-Moore, D., Kaidanovich-Beilin, O., Iradi, M.C.G., Gunn-Moore, F., Lovestone, S.: Alzheimer's disease in humans and other animals: a consequence of postreproductive life span and longevity rather than aging. Alzheimer's Dement.: J. Alzheimer's Assoc. 14(2), 195–204 (2018)
58. Allan, L.M., Ballard, C.G., Burn, D.J., Anne Kenny, R.: Prevalence and severity of gait disorders
in Alzheimer’s and non-Alzheimer’s dementias. J. Am. Geriatr. Soc. 53(10), 1681–1687 (2005)
59. Middleton, L., Buss, A., Bazin, A., Nixon, M.: A floor sensor system for gait recognition. In:
Proceedings of Fourth IEEE Workshop on Automatic Identification Advanced Technologies,
pp. 171–176 (2005)
60. Vera-Rodriguez, R., Fierrez, J., Mason, J.S.D., Ortega-Garcia, J.: A novel approach of gait
recognition through fusion with footstep information. In: Proceedings IAPR International
Conference on Biometrics, ICB (2013)
61. Qian, G., Zhang, J., Kidane, A.: People identification using floor pressure sensing and analysis.
IEEE Sens. J. 10(9), 1447–1460 (2010)
62. He, K., Zhang, X., Ren, S., Sun,J.: Deep residual learning for image recognition. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
63. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing
internal covariate shift. In: Proceedings of the 32nd International Conference on Machine
Learning, pp. 448–456 (2015)
64. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of
the 14th International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 315–323
(2011)
65. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Proceedings
of European Conference on Computer Vision, pp. 630–645. Springer (2016)
66. Dauphin, Y., de Vries, H., Bengio, Y.: Equilibrated adaptive learning rates for non-convex
optimization. Adv. Neural Inf. Process. Syst. , 1504–1512 (2015)
67. Russakovsky, O., Deng, J., Hao, S.: Imagenet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015)
68. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition. In: International Conference on Learning Representations (ICRL), pp. 1–14 (2015)
69. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET curve in
assessment of detection task performance. Tech. Rep, DTIC (1997)
70. Lawson, J., Murray, M., Zamboni, G., Koychev, I.G., Ritchie, C.W., Ridha, B.H., Rowe, J.B.,
Thomas, A., Ffytche, D.H., Howard, R.J.: Deep and frequent phenotyping: a feasibility study
for experimental medicine in dementia. Alzheimer’s Dement.: J. Alzheimer’s Assoc. 13(7), pp.
1268–1269 (2017)
71. Gledson, A., Asfiandy, D., Mellor, J., Ba-Dhfari, T.O.F., Stringer, G., Couth, S., Burns, A., Leroi,
I., Zeng, X., Keane, J.: Combining mouse and keyboard events with higher level desktop actions
to detect mild cognitive impairment. In: 2016 IEEE International Conference on Healthcare
Informatics (ICHI), pp. 139–145. IEEE (2016)
72. Aharon Satt, A., Sorin, A., Hoory, R., Toledo-Ronen, O., Derreumaux, A., Manera, V., Verhey,
F., Aalten, P., Robert, P.H.: Automatic speech analysis for the assessment of patients with
predementia and Alzheimer’s disease. Alzheimer’s Dement.: Diagn., Assess. Dis. Monit. 1(1),
112–124 (2015)
Deep Learning for Building Occupancy
Estimation Using Environmental Sensors
Abstract Building energy efficiency has gained increasing attention in recent years. Occupancy level is a key factor for achieving building energy efficiency, as it directly affects energy-related control systems in buildings. Among the variety of sensors for occupancy estimation, environmental sensors have the unique properties of non-intrusiveness and low cost. In general, occupancy estimation using environmental sensors involves feature engineering and learning. Traditional feature extraction requires manually extracting significant features without any guidelines. This handcrafted feature extraction process requires strong domain knowledge and will inevitably miss useful and implicit features. To solve these problems, this chapter presents a Convolutional Deep Bi-directional Long Short-Term Memory (CDBLSTM) method that consists of a convolutional neural network with a stacked architecture to automatically learn local sequential features from raw environmental sensor data from scratch. Then, an LSTM network is used to encode the temporal dependencies of these local features, and the bi-directional structure is employed to consider past and future contexts simultaneously during feature learning. We conduct real experiments
Z. Chen (B) · M. Wu · X. Li
Institute for Infocomm Research (I2R), A*STAR, Singapore, Singapore
e-mail: [email protected]
M. Wu
e-mail: [email protected]
X. Li
e-mail: [email protected]
C. Jiang
School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China
e-mail: [email protected]
M. K. Masood
Department of Civil Engineering, Aalto University, Espoo, Finland
e-mail: [email protected]
Y. C. Soh
School of Electrical and Electronic Engineering, Nanyang Technological University,
Singapore, Singapore
e-mail: [email protected]
to compare the CDBLSTM with some state-of-the-art approaches for building occupancy estimation. The results indicate that the CDBLSTM approach outperforms all the state-of-the-art methods.
1 Introduction
To maintain the thermal comfort of indoor environments, around 40% of energy is consumed in the building sector [28]. Thus, much attention has been paid to building energy efficiency and sustainable development. A crucial factor for achieving this is the building occupancy information, i.e., the occupant number or range in buildings, which can be used for building climate and adaptive light control [28, 36]. Balaji et al. saved 17.8% of energy for HVAC systems by relying on actual occupancy levels in a designed experiment [1]. A light control system developed in [24] reported a reduction of 35–75% in energy consumption for building light control. However, obtaining an accurate and robust occupancy estimation system is a challenging task that remains unsolved.
Occupancy estimation can be performed with different sensors. For instance, Liu et al. presented the detection of the absence and presence of occupants via PIR sensors [27], although it is more meaningful to obtain the actual number or range of occupants indoors. To this end, methods relying on RFID and wearable devices were presented in [1, 25]. However, these approaches require users to wear specific devices, which is intrusive and inconvenient. Accurate occupancy estimation can be achieved by using cameras [42], but camera-based solutions often suffer from insufficient illumination and high computational load, besides raising privacy concerns. Some other methodologies rely on occupants' involvement, such as chair sensors [23] and appliance power usage data [22]; however, occupants who are not involved cannot be detected.
Recently, environmental sensors have been widely adopted for occupancy estimation, because they are low-cost and non-intrusive for users [21, 29, 40, 41]. Due to the complex relationship between environmental sensor measurements and occupancy levels, physical modeling offers limited performance. An alternative is to model this complex relationship using machine learning techniques, which work well for function approximation. Since environmental sensor data are noisy and not representative of different occupancy levels, machine learning models trained with raw sensory data may have limited performance. The common practice is to perform feature engineering, which intends to extract more informative representations for different occupancy levels [26]. However, traditional manual feature engineering does not have a guideline on which features should be extracted for occupancy inference. In addition, it requires strong domain knowledge and will inevitably miss implicit and useful features. To solve this problem, this
2 Literature Review
Many advanced algorithms have been presented for occupancy inferences in build-
ings using environmental sensor data. The authors in [13] presented an occupancy
estimation system for an open office room by using sensor networks that are able to
collect data of CO2 , CO, acoustics, PM2.5, motion, illumination, temperature and
humidity. Some statistical features, e.g., moving average of 20-min and 1st order
difference, were manually extracted. Next, the most important features were chosen
via the popular information gain theory. Finally, data-driven methods including Sup-
port Vector Machine (SVM), Artificial Neural Network (ANN) and Hidden Markov
Model (HMM) were utilized for occupancy estimation. They concluded that the most significant sensors are CO2 and acoustics, and that the HMM achieves the best performance for occupancy estimation.
The authors in [30] employed environmental sensors of temperature, CO2 , humid-
ity, and pressure, to estimate occupancy for a tutorial room. They extracted some
similar features used in [13]. An ELM-based wrapper algorithm was developed for
feature selection and occupancy inference.
In [38], the authors investigated various sensors including sound, motion, tem-
perature, door state, CO2 , humidity, passive infrared and light to infer occupancy
in both multi-occupant and single-occupant offices via some widely used machine learning algorithms. Instead of extracting more useful features, they used raw sensor data as features; here, the authors applied many informative sensors to guarantee a satisfactory performance of their proposed method. The contribution of the different sensors (features) was tested using information gain theory. Eventually, light level, door state and CO2 were shown to be the most important parameters, and among the different algorithms, the decision tree (DT) approach had the best performance.
Candanedo et al. developed an occupancy detection system with sensors of humid-
ity, CO2, temperature and light levels [3]. They also used the raw sensor data as features in this work, and utilized some statistical models to identify the two states of absence and presence of occupants. Different combinations of features with distinct statistical approaches were tried, so that the best sensors and models could be selected. They concluded that a satisfactory performance can be achieved when sensors and learning methods are properly selected.
Since occupancy dynamics has the Markov property [4, 7, 8], the HMM model
has achieved great success for building occupancy detection and estimation [13].
However, the traditional HMM often suffers from some limitations, such as the use of a Gaussian mixture model to estimate the emission probabilities and a fixed transition probability matrix. To solve these issues, the authors in [12] presented an IHMM-
MLR for environmental sensor based occupancy inference. Firstly, inhomogeneous
transition probability matrices for capturing occupancy dynamics at distinct time
steps were developed. Then, multinomial logistic regression to produce the emission
probabilities with environmental sensor data was designed. Two schemes, i.e., online
and offline, were formulated to infer occupancy in distinct situations.
Chen et al. presented another system to enhance the performance for occupancy
estimation by considering occupancy properties [6]. They performed a fusion of tra-
ditional machine learning algorithms with a well-developed occupancy model which
is able to capture occupancy properties. The sensors they utilized include CO2, humidity, pressure and temperature, which are widely available. The algorithms include
ELM, SVM, ANN, KNN, CART and LDA. They formulated a Bayes filter to fuse
the occupancy model and six data-driven algorithms for the estimation of occupancy.
A detailed survey for occupancy estimation can be found in [5].
Here, we leverage environmental sensors including temperature, CO2, pressure and humidity, which are common in normal HVAC systems [14], instead of applying specific sensors such as acoustic level [13, 38], motion [19, 38] and light level [3]. Rather than using the noisy raw sensor data as features or relying on handcrafted statistical features, we attempt to automatically extract useful local sequential features by using a convolutional neural network with a stacked structure. Then, the BLSTM network encodes the temporal dependencies of the sequential local features during high-level feature learning. We have made a comprehensive comparison with some state-of-the-art approaches using real experiments.
3 Methodology
3.1 Overview
For environmental sensor based occupancy estimation, the key part is to learn discrim-
inative representations (features) from raw data for distinct occupancy levels. Figure 1
presents the CDBLSTM framework for environmental sensor based occupancy
where m denotes the window size and ⊕ is the concatenation operation. Next, an activation function is applied to the filter–window product, shown as

c_i = g(v^T x_{i:i+m−1} + b)    (2)

where g(·) is the activation function, b is the bias term and ^T is the transpose operation. The widely used ReLU activation function [31] is adopted. By sliding the filter from the beginning of the input sequence to its end, we can produce a feature map, which is then compressed by max pooling with pooling size s:

z_j = [z_1, z_2, ..., z_{(r−m)/s+1}]    (4)

where z_i = max(c_{is−s}, c_{is−s+1}, ..., c_{is−1}). Hence, the pooling operation generates a compressed feature map z_j, j ∈ {1, 2, ..., k}. Eventually, the output of the convolutional neural network has a feature dimension of ((r − m)/s + 1) × k.

In general, assuming n samples, the input data has a dimension of n × r × d. The output of the convolutional neural network has a size of n × ((r − m)/s + 1) × k. It can be seen that the length of the input data is compressed from r to (r − m)/s + 1. In addition, the data dimension changes from d (the number of sensors) to k (the number of filters), where k is much larger than d. This means that the data become more informative. In other words, the convolutional neural network can be treated as a local feature learner which obtains more informative representations while preserving the temporal information of the raw environmental sensor data.
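A minimal Keras sketch of this local feature extractor, using the shape conventions above (sequence length r, d sensors, k filters, window size m, pooling size s), is shown below. It only illustrates the shape bookkeeping; the padding choice is an assumption made so that the pooled length matches (r − m)/s + 1.

```python
from tensorflow.keras import layers, models

# r = sequence length, d = number of sensors, k = filters, m = window, s = pooling size
r, d, k, m, s = 15, 8, 100, 3, 2    # values reported later in this chapter

cnn = models.Sequential([
    layers.Input(shape=(r, d)),
    layers.Conv1D(filters=k, kernel_size=m, activation="relu"),   # length r - m + 1
    layers.MaxPooling1D(pool_size=s, padding="same"),             # length (r - m)/s + 1
])
print(cnn.output_shape)   # (None, 7, 100) for the values above
```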
Recurrent Neural Network (RNN) is widely used for the modeling of time series data thanks to its strong sequential modeling capacity. However, the conventional RNN often has the problem of vanishing or exploding gradients during training, which dramatically affects the performance of the RNN in modeling long-term dependencies
in time-series data [2]. To solve this issue, the authors in [17] proposed a new archi-
tecture, named LSTM, which attempts to use some gates to control the information
for preserving or discarding, such that it is able to capture long-term dependencies
of the sequence. The LSTM network has been successfully employed in a num-
ber of important and challenging tasks, e.g., activity recognition [9, 10] and natural
language processing [34]. The conventional LSTM only considers the sequential
information in one direction, that is the forward direction. This is not adequate for
sequential modeling of environmental sensor data. The future information may also
be useful. To consider both the future and past contexts for occupancy inference, we
adopt the BLSTM which contains a forward layer and a backward layer to process
sequential data in the forward and backward directions.
Recently, deep structures have achieved great success in representation learning [16]. The Deep Bi-directional LSTM (DBLSTM), which stacks multiple BLSTM layers, is adopted in this study to encode the temporal dependencies and learn high-level features from the sequential local features extracted by the convolutional neural network. In addition, the DBLSTM allows the inputs to propagate through time and space (layers) simultaneously, such that the model parameters are distributed over layers instead of enlarging the memory size of the network. This results in a more efficient non-linear operation on the data, which is also the ultimate purpose of stacking multiple layers in deep learning [16]. Figure 3 illustrates a
hidden layer l at time step t − 1, t and t + 1 of the DBLSTM network, where the
arrows pointing to the left and right denote the backward and forward operations
respectively. Here, the forward operation from time step t − 1 to t is to capture the
past information, and the backward operation from time step t + 1 to t is to model
the future information. We use one hidden layer l at time step t as an example to show the detailed operation of the DBLSTM network. Assume that h_{l-1}^t is the hidden state of the previous layer, C_l^{t-1} is the memory cell state, w_l^f, w_l^i, w_l^C and w_l^o are the weights, b_l^f, b_l^i, b_l^C and b_l^o are the biases, and σ(·) denotes the sigmoid activation function. The forward and backward passes take the same form; the backward pass, whose quantities are marked with a left arrow (←), is given by:

←f_l^t = σ(←w_l^f [←h_l^{t+1}, h_{l-1}^t] + ←b_l^f)
←i_l^t = σ(←w_l^i [←h_l^{t+1}, h_{l-1}^t] + ←b_l^i)
←C̃_l^t = tanh(←w_l^C [←h_l^{t+1}, h_{l-1}^t] + ←b_l^C)    (6)
←C_l^t = ←f_l^t ∗ ←C_l^{t+1} + ←i_l^t ∗ ←C̃_l^t
←o_l^t = σ(←w_l^o [←h_l^{t+1}, h_{l-1}^t] + ←b_l^o)
←h_l^t = ←o_l^t ∗ tanh(←C_l^t)
The final output of the l-th hidden layer at time t of the DBLSTM network is a concatenation of the outputs of the forward and backward layers, which can be expressed as

h_l^t = →h_l^t ⊕ ←h_l^t    (7)

where →h_l^t updates the current hidden state using the past information, i.e., time steps 1 to t − 1, and ←h_l^t updates the current hidden state using the future information, i.e., time steps t + 1 to r.
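In Keras, this forward/backward concatenation is what the Bidirectional wrapper performs; a small sketch is given below, with an illustrative unit count.

```python
from tensorflow.keras import layers

# A bidirectional LSTM layer concatenates the forward and backward hidden
# states (Eq. 7), so its output feature size is twice the number of LSTM units.
blstm = layers.Bidirectional(
    layers.LSTM(units=100, return_sequences=True),   # one forward and one backward pass
    merge_mode="concat",                             # h_t = forward h_t ⊕ backward h_t
)
# Applied to input of shape (batch, timesteps, features), the output has
# shape (batch, timesteps, 2 * 100).
```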
The outputs of the DBLSTM network are high-level features, which are fed into some fully connected layers to get more abstract representations. The i-th fully connected layer can be expressed as

o_i = g(α_i μ_i + β_i)    (8)

where μ_i and o_i are the input and output of the i-th fully connected layer respectively, α_i and β_i are the weights and biases respectively, and g(·) is the activation function; ReLU is chosen in this study. Supposing that c fully connected layers are stacked, the output of the last fully connected layer, o_{c−1}, is the final representation of the input data, which is fed into a softmax classification layer to obtain the occupancy.
With the outputs of the CDBLSTM and the true labels (occupancy ranges), the errors can be calculated over all the training data; the error gradients are then derived and back-propagated to adjust the model parameters during the training of the CDBLSTM [37]. More precisely, given training data with the true occupancy levels, the network outputs are calculated, the cross-entropy losses are derived from the network outputs and the true occupancy levels, and the resulting gradients are back-propagated to adjust the model parameters via a gradient-based optimization algorithm. In this study, we adopt the popular RMSprop method [35]. Denoting by θ_t the parameters at step t and by L(θ_t) the loss function, the RMSprop parameter update of θ_{t+1} can be calculated as

g_t = γ g_{t−1} + (1 − γ)(∇L(θ_t))^2,
θ_{t+1} = θ_t − (η / √(g_t + ε)) ∇L(θ_t),

where g_t is a moving average of the squared gradient at time step t, ε is a small constant for numerical stability, and the learning rate η, the parameter γ and the decay rate are chosen to be 0.001, 0.9 and 0, respectively.
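In Keras terms, this corresponds roughly to the following optimizer and loss configuration; the mapping of γ to the `rho` argument is the standard one, and the commented compile call is only a sketch.

```python
from tensorflow.keras.optimizers import RMSprop

# Learning rate 0.001 and gamma (rho) 0.9, as stated above; no learning-rate decay.
optimizer = RMSprop(learning_rate=0.001, rho=0.9)
# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])   # cross-entropy over the occupancy ranges
```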
In order to alleviate the overfitting problem, we use the technique of dropout. With dropout, parts of the hidden nodes are randomly masked with probability p during training, so that a thinned architecture is preserved and trained each time. Given a network containing n nodes with a dropout probability p of 0.5, the network can be treated as an ensemble of 2^n thinned networks. Because these thinned networks share weights, the number of parameters remains the same. During testing, dropout is switched off and all the network nodes take effect for the model outputs, which is similar to an ensemble of distinct thinned networks. In addition, the random masking in each training iteration introduces some variation into the data, effectively enlarging the training data and making the trained network more robust. The dropout technique has been shown to be effective for preventing overfitting [33]. Therefore, in this study, we place one dropout layer between the DBLSTM and the first fully connected layer and another dropout layer between the two fully connected layers, where the masking probabilities are chosen to be 0.5 and 0.3 respectively.
Fig. 4 The operation of dropout. Left: the network without dropout; Right: the network after dropout. Crossed nodes have been dropped during model training [33]
4 Evaluation Results
In this section, we first introduce the data acquisition process. Then, the evaluation setup and experimental results are presented. After that, the generalization performance of the CDBLSTM is analyzed by randomly selecting the data for training and testing. Finally, to further demonstrate the performance of the CDBLSTM for building occupancy inference using environmental sensors, we present additional results obtained with data collected from another environment, i.e., a tutorial room.
The sensor data of CO2, temperature, air pressure and humidity have been collected from a research lab on a university campus. The lab has an office area which contains 24 cubicles and 11 open seats. Generally, nine postgraduate students and eleven research staff work in the office area. Besides, the lab also has six PCs for undergraduate students working on their final year projects and five PCs for other students. It is well known that identifying the exact occupancy (number) is very challenging and may require the use of high-cost sensors in a crowded space. Here, instead of estimating the exact occupancy, we divide it into the ranges zero, low, medium and high; these occupancy ranges are sufficient for common building control and scheduling systems [18]. To make the four ranges balanced, which will maximize the impact of state changes, we define low occupancy as 1–6 subjects, medium occupancy as 7–14 subjects, and high occupancy as more than 14 subjects.
We measure the pressure level with a Lutron MHB-382SD sensor, and CO2, temperature and relative humidity with a Rotronic CL11 sensor. The sampling frequency is one sample per minute for both sensors. During data collection, we first stored the data in the sensors' internal memory and then transferred them to a PC via a USB cable. Note that the area is air-conditioned by conventional Variable Air Volume and Active Chilled Beam systems, and is ventilated by an Air Handling Unit (AHU) that constantly provides fresh air.
Table 1 shows the accuracy and resolution of the sensors. During the experiments, we attach the sensors to supports at a height of 1.1 m from the ground. Figure 5 illustrates the layout of the space, which has a size of 20 m × 9.3 m × 2.6 m. We deploy two pairs of sensors in this space; the sensor placements are intuitively selected considering occupant density. To obtain the ground truth occupancy, we deploy three IP cameras, one at each door, to record occupant movements. The true occupancy is then counted manually with the help of motion detection software which takes pictures when occupants move. The entire space contains three doors: the main door (placement of camera 1) connects the space with the office area for administrative staff, another door, located at camera 2 in Fig. 5, opens to a lab space, and the third door is always closed. Note that all windows are closed, due to the operation of the air-conditioning and ventilation systems.
In total, we collected 31 days of data on workdays, where the first 26 days are used for model training and the remaining 5 days for model testing. Since building control systems have slow response times, a 15-min resolution is enough for occupancy estimation [39]. Because the original sensor data and occupancy have a resolution of 1 min, we first convert them to a 15-min resolution by simple averaging. Note that, since the number of occupants is an integer value, a rounding operation is applied after averaging the original occupancy.
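A small pandas sketch of this down-sampling step is given below; the file and column names are assumptions for illustration only.

```python
import pandas as pd

# 1-min records of the four sensors and the manually counted occupancy.
df = pd.read_csv("sensor_log.csv", parse_dates=["timestamp"], index_col="timestamp")

# Average to a 15-min resolution; round the averaged occupancy back to an integer.
sensors_15min = df[["co2", "temperature", "pressure", "humidity"]].resample("15min").mean()
occupancy_15min = df["occupancy"].resample("15min").mean().round().astype(int)
```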
sequence r is 15. With the 2 pairs of sensors shown in Fig. 5, the total number of sensors d is 8. Hence, the input has a dimension of 15 × 8 for environmental sensor based occupancy estimation. We use cross-validation on the training data to choose proper hyperparameters for all the approaches. Specifically, the DBLSTM consists of three BLSTM layers with 24, 75 and 100 hidden nodes, followed by two fully connected layers with 150 and 100 hidden nodes. For the CDBLSTM approach, the window size, the pooling size and the number of filters are chosen to be 3, 2 and 100, respectively. The CDBLSTM contains three BLSTM layers with hidden sizes of 100, 150 and 200, and its two fully connected layers have 200 and 300 hidden nodes. The deep algorithms, i.e., CDBLSTM and DBLSTM, are implemented in Keras; the other, shallow, algorithms are implemented in Matlab.
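Under the hyperparameters stated above, a Keras sketch of the CDBLSTM could look as follows; unstated details such as the pooling padding are assumptions, so this is an illustration rather than the authors' exact implementation.

```python
from tensorflow.keras import layers, models

def build_cdblstm(seq_len=15, n_sensors=8, n_classes=4):
    """Sketch of the CDBLSTM with the hyperparameters reported in the text."""
    return models.Sequential([
        layers.Input(shape=(seq_len, n_sensors)),
        layers.Conv1D(100, kernel_size=3, activation="relu"),      # local feature learning
        layers.MaxPooling1D(pool_size=2, padding="same"),
        layers.Bidirectional(layers.LSTM(100, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(150, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(200)),                    # last BLSTM layer
        layers.Dropout(0.5),                                       # DBLSTM -> first FC layer
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),                                       # between the two FC layers
        layers.Dense(300, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),             # four occupancy ranges
    ])
```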
Here, occupancy estimation is treated as a typical classification problem, so classification accuracy is adopted as a criterion for model performance evaluation. Besides, we use another widely used evaluation criterion, the Normalized Root Mean Square Error (NRMSE), which reflects the magnitude of the classification errors [38]. Since the detection of absence and presence is of great significance for building control systems, especially lighting control [32], the detection accuracy of these two states is also analyzed.
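As a reference for these criteria, a minimal sketch of the accuracy and NRMSE computations is shown below; normalizing the RMSE by the range of the true occupancy levels is an assumption, as the chapter points to [38] for the exact definition.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def nrmse(y_true, y_pred):
    """RMSE normalized by the range of the true occupancy levels (assumed form)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))
```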
The evaluation results for the different methodologies under the three evaluation criteria are shown in Table 2. Candanedo's and Yang's approaches, which use the raw data as features, perform the worst. Note that Candanedo et al. [3] and Yang et al. [38] used many sensors in their works to guarantee satisfactory performance, which is not practical due to the high cost and the inconvenience caused by constant maintenance. Masood's and Dong's approaches perform better than Candanedo's and Yang's, due to the use of statistical features instead of raw data. These results clearly show that feature extraction is compulsory and useful, especially with limited sensors. Since Masood's and Dong's methods used
Table 2 The evaluation results of different methods under the three evaluation criteria. P/A represents Presence/Absence

Criterion                        Dong's [13]   Yang's [38]   Masood's [30]   Candanedo's [3]   DBLSTM   CDBLSTM
Classification accuracy (%)      71.46         66.67         72.31           70.21             74.38    76.04
NRMSE                            0.1912        0.2509        0.2322          0.2297            0.1574   0.1169
Detection accuracy of P/A (%)    93.13         90.21         92.38           88.54             95.21    95.42
manually extracted features, which inevitably miss useful and implicit features, their performance is also limited for environmental sensor based occupancy estimation.
Owing to its deep structure for feature learning and temporal encoding, the DBLSTM approach is able to perform better than all the state-of-the-art methods under the three evaluation criteria. With the powerful local feature extractor provided by the convolutional network, the CDBLSTM further enhances the performance of the DBLSTM. It outperforms all the approaches, with an occupancy estimation accuracy of 76.04%, an NRMSE of 0.1169 and a presence/absence detection accuracy of 95.42%.
We also illustrate the occupancy estimation results of all the testing days in Fig. 6,
where useful insights can be concluded:
– Candanedo’s and Yang’s approaches perform worse than other approaches, due to
the use of raw data as features. With sensor noise and limited number of sensors,
the raw sensor data is not representative for different occupancy levels. The more
efficient way is to extract some representative features.
– Since Masood’s exhaustively searches the best integration of features with the
proposed wrapper method, it overfits on the testing data. Similarly, Dong’s method
also cannot track occupancy profiles well with the handcrafted features. It can be
concluded that handcrafted features lack a clear guideline and will inevitably miss
useful and implicit features, which limited the system performance.
– One interesting phenomenon is that the estimated occupancy suddenly increases at
midnight for Candanedo's, Masood's and Yang's approaches. By checking the data
carefully, we found that this is caused by a sudden increase of the CO2 readings. The
recorded video was then checked, and we found that one subject sitting near a pair of
sensors usually walks around to prepare to leave at that time. Optimal sensor placement
will therefore be considered in our future work [20]. Owing to the sequential modeling
capacity of the HMM and the BLSTM structure, Dong's approach, the DBLSTM and the
CDBLSTM are almost immune to this issue caused by the increase of the CO2 readings.
– With the deep structure for feature learning and the BLSTM network for temporal
encoding, the DBLSTM and CDBLSTM approaches outperform all the state-of-the-art
approaches.
– Owing to the convolutional network for local feature extraction, the CDBLSTM
further enhances the performance of the DBLSTM, and its superior performance over
all the other methodologies indicates the effectiveness of the CDBLSTM for building
occupancy inference based on environmental sensors.
Time complexity is a major concern for deep learning based methods. To assess
the time complexity of the CDBLSTM, we measured its training and testing time during
the experiments. The state-of-the-art algorithms, which are all based on manual feature
extraction and conventional machine learning, require much less training and
testing time than the CDBLSTM. The CDBLSTM is implemented on a computer with
dual Intel Xeon(R) E5-2697 v2 2.70 GHz CPUs and an NVIDIA Tesla K40c GPU.
Its training time is about 16 min and 40 s.
[Fig. 6 panels: a Dong's approach, b Yang's approach, c Masood's approach, d Candanedo's approach, e DBLSTM, f CDBLSTM; each panel plots the occupancy range (zero, low, medium, high) against time (0–120 h) and compares the estimate with the ground truth]
Fig. 6 The evaluation results of the testing data for all the methodologies [11]
Although this amount of training time is large, it is still acceptable because the training
only needs to be performed once, offline. The testing time of the CDBLSTM for all the
testing samples (480 samples) is 0.35 s, which is negligible for building control systems
with a resolution of 15 min. Hence, we can conclude that the CDBLSTM method
can be used for real-time occupancy estimation with environmental sensors.
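As a minimal illustration of how these times can be measured, the sketch below wraps model training and prediction with wall-clock timers; build_cdblstm, the data arrays and the epoch/batch settings are hypothetical placeholders rather than the exact configuration used here.

    import time

    # build_cdblstm() and the data arrays are hypothetical placeholders.
    model = build_cdblstm(time_steps, n_features)

    start = time.time()
    model.fit(x_train, y_train, epochs=100, batch_size=32)  # assumed settings
    print("training time: %.1f s" % (time.time() - start))

    start = time.time()
    model.predict(x_test)  # e.g. the 480 testing samples
    print("testing time: %.2f s" % (time.time() - start))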
4.4 Hyperparameters
Some hyperparameters are crucial for the CDBLSTM approach. Here, the masking
probabilities of the two dropout layers and the number of hidden layers are
investigated. We explored three masking-probability levels: high (0.7), medium (0.5)
and low (0.3). Figure 7 shows the occupancy estimation accuracy of the CDBLSTM
with different combinations of masking probability. The CDBLSTM may underfit, with
degraded performance, when high masking probabilities are used, such as the
combinations [0.7, 0.7], [0.7, 0.5], [0.5, 0.7] and [0.5, 0.5]. It is clear that a good
selection of this hyperparameter enhances the performance of the CDBLSTM. The number
of hidden layers is another key hyperparameter of the model. The estimation performance
of the model with different numbers of hidden layers is shown in Fig. 8. When the number
of hidden layers increases from 1 to 3, the model performance improves; but if the number
of hidden layers is larger than 4 in this study, the model may overfit, resulting in limited
performance. A model sketch that exposes these hyperparameters is given below.
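To make the role of these hyperparameters concrete, the following Keras sketch builds a CDBLSTM-style model with two configurable dropout (masking) probabilities and a configurable number of bidirectional LSTM layers. The filter counts, layer widths and four occupancy classes are illustrative assumptions, not the exact configuration reported in [11].

    from tensorflow.keras import layers, models

    def build_cdblstm(time_steps, n_features, n_classes=4,
                      drop1=0.3, drop2=0.3, n_blstm_layers=3, units=64):
        # drop1/drop2 are the masking probabilities of the two dropout layers
        # and n_blstm_layers is the number of hidden BLSTM layers (Sect. 4.4).
        model = models.Sequential()
        # Convolutional network for sequential local feature extraction.
        model.add(layers.Conv1D(32, kernel_size=3, activation='relu',
                                input_shape=(time_steps, n_features)))
        model.add(layers.MaxPooling1D(pool_size=2))
        model.add(layers.Dropout(drop1))      # first dropout layer
        # Deep bidirectional LSTM stack for temporal encoding.
        for i in range(n_blstm_layers):
            last = (i == n_blstm_layers - 1)
            model.add(layers.Bidirectional(
                layers.LSTM(units, return_sequences=not last)))
        model.add(layers.Dropout(drop2))      # second dropout layer
        model.add(layers.Dense(n_classes, activation='softmax'))
        model.compile(optimizer='rmsprop',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model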
Fig. 7 Occupancy estimation performance of the CDBLSTM with different combinations of masking probability [axes: layer-one and layer-two masking probabilities of 0.3, 0.5 and 0.7; vertical axis: estimation accuracy (%), roughly 70–80]
Fig. 8 Estimation performance of the CDBLSTM with varying number of hidden layers [horizontal axis: number of layers (1–5); vertical axis: estimation accuracy (%), roughly 70–78]
The CDBLSTM approach is almost immune to abnormal and noisy data, as analyzed in
Sect. 4.3, owing to its ability to consider the temporal dependencies in the data. To
further explore the robustness of the CDBLSTM to noisy data, we manually inject noise
into the raw sensor data. Figure 9 presents the performance of all the approaches under
different noise levels. Note that the signal-to-noise ratio (SNR) is ∞ when no noise is
added. As the SNR decreases (i.e., the data become noisier), the performance of all the
approaches degrades accordingly. Owing to their capability of modeling temporal
dependencies in the data, the noise impact on the HMM-based model (Dong's), the
DBLSTM and the CDBLSTM is smaller, which is consistent with the previous conclusion.
Fig. 9 Estimation performance with varying SNR [horizontal axis: SNR of 20, 10, 6, 3 and 0 dB; vertical axis: estimation accuracy (%); curves: Dong's, Yang's, Masood's, Candanedo's, DBLSTM and CDBLSTM]
This evaluation manifests that the CDBLSTM approach is robust against noise in the data.
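As an illustration of how such noise levels can be generated, the sketch below injects additive white Gaussian noise scaled per sensor channel to reach a target SNR; the actual noise model used in the experiments may differ.

    import numpy as np

    def add_noise(x, snr_db):
        # x: sensor data of shape (samples, channels); snr_db: target SNR in dB.
        x = np.asarray(x, dtype=float)
        signal_power = np.mean(x ** 2, axis=0, keepdims=True)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        noise = np.random.randn(*x.shape) * np.sqrt(noise_power)
        return x + noise

    # Example: evaluate robustness at the SNR levels shown in Fig. 9.
    # for snr in [20, 10, 6, 3, 0]:
    #     x_noisy = add_noise(x_test, snr)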
[Fig. 10 panels: a estimation accuracy (%), b NRMSE, c detection accuracy of P/A (%); each panel shows the results of Dong's, Yang's, Masood's, Candanedo's, DBLSTM and CDBLSTM for the three random splits (Times 1–3)]
Fig. 10 The evaluation results for the analysis of generalization performance a estimation accuracy,
b NRMSE and c detection accuracy of P/A
To further examine the generalization performance, we randomly select some of the days
of data for model testing and the rest for training. Note that each day of data has an equal
probability of being chosen for training or testing, which guarantees that the results reflect
the generalization capability of the CDBLSTM approach. We repeated the experiment three
times; Fig. 10 shows the results. It can be seen that the DBLSTM approach performs better
than the state-of-the-art approaches, and the CDBLSTM performs the best under the three
evaluation criteria. The conclusions are the same as in the previous analysis. This clearly
manifests the good generalization performance of the CDBLSTM method for environmental
sensor based occupancy detection and estimation.
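A day-wise random split of this kind can be sketched as follows; the array names and the number of testing days are hypothetical placeholders, since the exact split ratio is not restated here.

    import numpy as np

    def day_wise_split(day_of_sample, n_test_days, seed):
        # day_of_sample: per-sample day index; whole days go to either set,
        # so every day has an equal probability of being used for testing.
        rng = np.random.default_rng(seed)
        days = np.unique(day_of_sample)
        test_days = rng.choice(days, size=n_test_days, replace=False)
        test_mask = np.isin(day_of_sample, test_days)
        return ~test_mask, test_mask  # boolean masks over the samples

    # Repeat the experiment three times with different random splits.
    # for run in range(3):
    #     train_mask, test_mask = day_wise_split(day_of_sample,
    #                                            n_test_days=2, seed=run)
    #     (n_test_days=2 is only an illustrative value)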
5 Conclusion
This chapter introduces a deep learning algorithm, termed Convolutional Deep Bi-
directional Long Short-Term Memory (CDBLSTM), for environmental sensor based
occupancy inference in buildings. The CDBLSTM consists of a convolutional network
for sequential local feature extraction from the raw environmental sensor data
and a DBLSTM for temporal encoding and feature learning. To verify the performance
of the CDBLSTM, we perform experiments in a research lab environment and compare it
with several existing approaches as well as the DBLSTM method without the convolutional
operation. The results indicate that the DBLSTM outperforms the state-of-the-art approaches
and that the CDBLSTM achieves the best performance, which demonstrates the merits of the
convolutional network and the DBLSTM structure for temporal encoding and feature learning.
We also investigated several hyperparameters of the CDBLSTM and concluded that a proper
selection of model hyperparameters boosts its performance. Then, the impact of noise on
model performance was evaluated; the results manifest that the CDBLSTM is able to alleviate
the noise effect owing to its structure. After that, we tested the generalization performance of
the CDBLSTM by randomly selecting data for training and testing and obtained the same
conclusion in this scenario. Finally, we performed an additional test in a tutorial room, where
the CDBLSTM again achieved superior performance over all the other methodologies.
References
1. Balaji, B., Xu, J., Nwokafor, A., Gupta, R., Agarwal, Y.: Sentinel: occupancy based HVAC
actuation using existing WiFi infrastructure within commercial buildings. In: Proceedings of
the 11th ACM Conference on Embedded Networked Sensor Systems, p. 17. ACM (2013)
2. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is
difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
3. Candanedo, L.M., Feldheim, V.: Accurate occupancy detection of an office room from light,
temperature, humidity and CO2 measurements using statistical learning models. Energy Build.
112, 28–39 (2016)
4. Chen, Z., Jiang, C.: Building occupancy modeling using generative adversarial network. Energy
Build. 174, 372–379 (2018)
5. Chen, Z., Jiang, C., Xie, L.: Building occupancy estimation and detection: a review. Energy
Build. (2018)
6. Chen, Z., Masood, M.K., Soh, Y.C.: A fusion framework for occupancy estimation in office
buildings based on environmental sensor data. Energy Build. 133, 790–798 (2016)
7. Chen, Z., Soh, Y.C.: Modeling building occupancy using a novel inhomogeneous Markov chain
approach. In: 2014 IEEE International Conference on Automation Science and Engineering
(CASE), pp. 1079–1084. IEEE (2014)
8. Chen, Z., Xu, J., Soh, Y.C.: Modeling regular occupancy in commercial buildings using stochas-
tic models. Energy Build. 103, 216–223 (2015)
9. Chen, Z., Zhang, L., Cao, Z., Guo, J.: Distilling the knowledge from handcrafted features for
human activity recognition. IEEE Trans. Ind. Inform. (2018)
10. Chen, Z., Zhang, L., Jiang, C., Cao, Z., Cui, W.: WiFi CSI based passive human activity
recognition using attention based BLSTM. IEEE Trans. Mob. Comput. (2018)
11. Chen, Z., Zhao, R., Zhu, Q., Masood, M.K., Soh, Y.C., Mao, K.: Building occupancy estimation
with environmental sensors via CDBLSTM. IEEE Trans. Ind. Electron. 64(12), 9549–9559
(2017)
12. Chen, Z., Zhu, Q., Masood, M.K., Soh, Y.C.: Environmental sensors-based occupancy estima-
tion in buildings via IHMM-MLR. IEEE Trans. Ind. Inform. 13(5), 2184–2193 (2017)
13. Dong, B., Andrews, B., Lam, K.P., Höynck, M., Zhang, R., Chiou, Y.S., Benitez, D.: An infor-
mation technology enabled sustainability test-bed (ITEST) for occupancy detection through
an environmental sensing network. Energy Build. 42(7), 1038–1046 (2010)
14. Frodl, R., Tille, T.: A high-precision NDIR gas sensor for automotive applications. IEEE Sens.
J. 6(6), 1697–1705 (2006)
15. Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolu-
tional activation features. In: European Conference on Computer Vision, pp. 392–407. Springer
(2014)
16. Hinton, G.E.: Learning multiple layers of representation. Trends Cogn. Sci. 11(10), 428–434
(2007)
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
18. Iglesias, F., Palensky, P.: Profile-based control for central domestic hot water distribution. IEEE
Trans. Ind. Inform. 10(1), 697–705 (2014)
19. Jiang, C., Chen, Z., Png, L.C., Bekiroglu, K., Srinivasan, S., Su, R.: Building occupancy
detection from carbon-dioxide and motion sensors. In: International Conference on Control,
Automation, Robotics and Vision (ICARCV), pp. 931–936 (2018)
20. Jiang, C., Chen, Z., Su, R., Soh, Y.C.: Group greedy method for sensor placement. IEEE Trans.
Signal Process. 67(9), 2249–2262 (2019)
21. Jiang, C., Masood, M.K., Soh, Y.C., Li, H.: Indoor occupancy estimation from carbon dioxide
concentration. Energy Build. 131, 132–141 (2016)
22. Jin, M., Jia, R., Spanos, C.J.: Virtual occupancy sensing: using smart meters to indicate your
presence. IEEE Trans. Mob. Comput. 16(11), 3264–3277 (2017)
23. Labeodan, T., Zeiler, W., Boxem, G., Zhao, Y.: Occupancy measurement in commercial office
buildings for demand-driven control applications a survey and detection system evaluation.
Energy Build. 93, 303–314 (2015)
24. Leephakpreeda, T.: Adaptive occupancy-based lighting control via grey prediction. Build. Env-
iron. 40(7), 881–886 (2005)
25. Li, N., Calis, G., Becerik-Gerber, B.: Measuring and monitoring occupancy with an RFID
based system for demand-driven HVAC operations. Autom. Constr. 24, 89–99 (2012)
26. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspec-
tive. Springer Science & Business Media (1998)
27. Liu, P., Nguang, S.K., Partridge, A.: Occupancy inference using pyroelectric infrared sensors
through hidden Markov models. IEEE Sens. J. 16(4), 1062–1068 (2016)
28. Mantovani, G., Ferrarini, L.: Temperature control of a commercial building with model pre-
dictive control techniques. IEEE Trans. Ind. Electron. 62(4), 2651–2660 (2015)
29. Masood, M.K., Jiang, C., Soh, Y.C.: A novel feature selection framework with hybrid feature-
scaled extreme learning machine (HFS-ELM) for indoor occupancy estimation. Energy Build.
158, 1139–1151 (2018)
30. Masood, M.K., Soh, Y.C., Chang, V.W.C.: Real-time occupancy estimation using environmental
parameters. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
IEEE (2015)
31. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Pro-
ceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814
(2010)
32. Parise, G., Martirano, L., Cecchini, G.: Design and energetic analysis of an advanced control
upgrading existing lighting systems. IEEE Trans. Ind. Appl. 50(2), 1338–1347 (2014)
33. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
34. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In:
Thirteenth Annual Conference of the International Speech Communication Association (2012)
35. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its
recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2) (2012)
36. Tran, D., Tan, Y.K.: Sensorless illumination control of a networked LED-lighting system using
feedforward neural network. IEEE Trans. Ind. Electron. 61(4), 2113–2121 (2014)
37. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating
errors. Nature 323(6088), 533–536 (1986)
38. Yang, Z., Li, N., Becerik-Gerber, B., Orosz, M.: A systematic approach to occupancy modeling
in ambient sensor-rich buildings. Simulation 90(8), 960–977 (2014)
39. Yu, Z., Jia, L., Murphy-Hoye, M.C., Pratt, A., Tong, L.: Modeling and stochastic control for
home energy management. IEEE Trans. Smart Grid 4(4), 2244–2255 (2013)
40. Zhu, Q., Chen, Z., Masood, M.K., Soh, Y.C.: Occupancy estimation with environmental sensing
via non-iterative LRF feature learning in time and frequency domains. Energy Build. 141, 125–
133 (2017)
41. Zimmermann, L., Weigel, R., Fischer, G.: Fusion of nonintrusive environmental sensors for
occupancy detection in smart homes. IEEE Internet Things J. 5(4), 2343–2352 (2018)
42. Zou, J., Zhao, Q., Yang, W., Wang, F.: Occupancy detection in the office by analyzing surveil-
lance videos and its application to building energy conservation. Energy Build. 152, 385–398
(2017)
Index
C
Classification, 1–3, 5, 6, 25, 32–34, 36, 37, 39, 42, 48–55, 57, 61, 69, 73, 77, 78, 114, 161, 168, 193–197, 199, 214, 218, 226, 231, 232, 234–236, 239

E
Energy consumption prediction, 106, 336
Environmental sensors, 335–338, 345, 349, 351
Exploration geophysics, 130, 144, 154

G
Gait analysis, 299–301, 303–308, 310–312, 321

H
High dynamic range imaging, 189

I
Intelligent surveillance systems, 264
Inverse problems, 145

J
Jellyfish, 193, 195, 197, 213, 214, 216, 218–220, 224–226

L
Learning deep neural networks, 1
Load forecasting, 67, 68, 70, 103–121, 125, 126

M
Machine Learning (ML), 5, 14, 24, 32, 33, 49, 56, 71, 88, 98, 105, 106, 110, 112, 113, 121, 124, 125, 130, 159, 167, 195, 199, 234, 292, 300, 301, 307, 311, 314–317, 319, 321, 323, 325, 329, 330, 336–338, 349
Marine, 193–195, 226, 234

N
Neural Networks (NNs), 1–4, 6, 9, 10, 15, 16, 21, 27, 28, 43, 45, 49, 50, 56, 57, 70, 71, 113–115, 157, 167, 168, 196, 217, 226, 231, 232, 234, 264, 265, 270, 273, 274, 277, 278, 280, 283–287, 308, 310

O
Object detection, 43, 167, 168, 173, 185, 195–197, 214, 218, 226

P
Perturbation analysis, 31, 35
Posidonia oceanica, 193, 195, 197, 204

R
Regression, 2, 3, 5, 6, 42, 43, 70, 71, 77, 78, 81–84, 87, 88, 99, 111–113, 136, 145, 185, 338
Re-identification, 263–265, 274, 277, 280, 281, 283, 284, 289, 291, 292
ReLU, 10–12, 93, 123, 196, 240, 241, 244, 265, 267, 276, 277, 283–287, 340, 344
Renewable energy, 67, 69, 85, 104, 106, 107, 125
Representation learning, 67, 71–73, 78, 81, 85, 98, 99, 342

S
Seismic imaging, 129, 132, 154
Self Learnable Activation Function (SLAF), 1, 20, 21, 23–28
Semantic segmentation, 195, 196, 199, 200, 226
Smart grids, 71, 103–108, 118, 119, 125, 126
Statistical learning, 33, 56, 61, 124, 136

T
Time series, 2, 14, 68–71, 79–85, 87, 88, 95, 96, 98, 99, 106, 111–118, 121, 124, 337, 341
Tomography, 129–133, 135–137, 151, 154, 304, 311, 314
Traffic light recognition, 157–160, 163, 166, 167, 170

V
Vehicle signal recognition, 157–161, 184, 185
Video surveillance, 263, 264, 292, 301, 302