Machine Learning for Engineers
Using Data to Solve Problems for Physical Systems

Ryan G. McClarren
Department of Aerospace & Mechanical Engineering
University of Notre Dame
Notre Dame, IN, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents, Russell and Mary McClarren,
for their love and support, as well as the
supervised and unsupervised learning they
provided me.
Preface
This book arose from the realization that students in traditional science and
engineering1 disciplines can both apply machine learning technology to problems
that arise in their fields of study, and that the knowledge they have of science and
engineering makes them well suited to understanding how machine learning works.
The problem is that these students typically do not see these connections. Part of this
is that most curricula do not include machine learning as a tool to solve problems
alongside the mathematical methods and statistical techniques that are routinely
used. The other part is that most courses and course materials that introduce machine
learning motivate topics using examples that are foreign to students used to studying
systems in the physical world. Many students would not find it immediately obvious
that a model that can count the cats in an image might be useful for processing data
from an experimental measurement.
This work attempts to meet science and engineering students in the middle by
bringing machine learning techniques closer to problems in their disciplines and
showing how they might be applied. The target audience is any student who has 1–2
years of instruction in a science or engineering discipline and is, therefore, familiar
with calculus, a bit of differential equations, and a smattering of matrix algebra.
There are points where more advanced topics are used, but these can be skipped
by more junior readers. This book is also designed for more advanced students
in other disciplines who want to see how machine learning might apply to their
work. For example, graduate students (such as those in my research group) that are
familiar with advanced computational science techniques for the solution of partial
differential equations can use this book to learn how machine learning can augment
their work.
1 Here, by science and engineering I mean the science and engineering outside computer science
and engineering.
Deciding on a scope for a work like this is a challenge. It is even more challenging
in a field like machine learning where new results, models, and even new subfields
appear with startling rapidity. The topics considered herein were chosen with the
following two considerations in mind: (1) that the topics should give the reader
enough background to confidently branch out into more complex versions of the
topics discussed, and (2) that the topics have broad enough applicability to appeal
to a large swath of science and engineering students and practitioners. This work
does not present residual neural networks, but the discussion of convolutional neural
networks that is covered is a good springboard for readers that have special interest
in these models. There are several other examples of emerging trends (e.g., attention
networks and physics-informed models using automatic differentiation) that are
omitted from this work for the purpose of having a broad overview and appealing to
the large set of potential readers.
The case studies and examples herein have the requisite data and code to
reproduce them available on Github at https://github.com/DrRyanMc. The codes are
in Python and make heavy use of Scikit-learn and Tensorflow, along with Numpy
and Scipy. Any reader who is interested in plugging their own problems into the
models discussed in this work is encouraged to tinker with these examples.
The exercises at the end of each chapter come in two flavors. In the earlier
chapters, there are problems that are used to demonstrate principles and facts
regarding machine learning that most readers can show for themselves. There
are also problems that are more like mini-projects suitable for a report-length
deliverable. These mini-projects often involve the production of new data and fitting
a model. These problems are deliberately open-ended and should be treated as
reader-led case studies.
I want to acknowledge those who helped me in the production of this work. My
students in the Notre Dame study abroad program in Rome during the spring of 2019
helped review a couple of early chapters, Todd Urbatsch gave me feedback as well,
and Kelli Humbird has been gracious in answering several Tensorflow questions
that I had. I also want to thank the Steinbuch Centre for Computing at Karlsruhe
Institute of Technology, the Notre Dame Rome Global Gateway, and Universidad
Adolfo Ibáñez for hosting me at different points during the creation of this work.
Finally, I am writing this preface while I am sequestered in my home during the
events surrounding the COVID-19 pandemic during the early spring of 2020. I want
to express my deep gratitude to the healthcare workers who are on the front lines,
as well as others up and down the global supply chain that make it possible for
someone like me to work in relative comfort during such an extraordinary time.
Contents

Part I Fundamentals

1 The Landscape of Machine Learning
  1.1 Supervised Learning
    1.1.1 Regression
    1.1.2 Classification
    1.1.3 Time Series
    1.1.4 Reinforcement Learning
  1.2 Unsupervised Learning
    1.2.1 Finding Structure
    1.2.2 Association Rules
  1.3 Optimization and Machine Learning vs. Simulation
  1.4 Bayesian Probability
  1.5 Cross-Validation
  Problems
  References

2 Linear Models for Regression and Classification
  2.1 Motivating Example: An Object in Free Fall
  2.2 General Linear Model
    2.2.1 Nonlinear Transformations
  2.3 Logistic Regression Models
    2.3.1 Comparing to the Null Model
    2.3.2 Interpreting the Logistic Model
    2.3.3 Multivariate Logistic Models
    2.3.4 Multinomial Models
  2.4 Regularized Regression
    2.4.1 Ridge Regression
    2.4.2 Lasso and Elastic Net Regression
  2.5 Case Study: Determining Governing Equations from Data
    2.5.1 Determining the Form of the Coefficients
    2.5.2 Discussion of the Results
  Problems
  References

3 Decision Trees and Random Forests for Regression and Classification
  3.1 Decision Trees for Regression
    3.1.1 Building Regression Trees
  3.2 Classification Trees
  3.3 Random Forests
    3.3.1 Comparison of Random Forests and Tree Models
  3.4 Case Study: Predicting the Result of a Simulation Using Random Forests
    3.4.1 Using the Random Forest Model to Calibrate the Simulation
  Problems
  References

4 Finding Structure Within a Data Set: Data Reduction and Clustering
  4.1 Singular Value Decomposition
  4.2 Case Study: SVD to Understand Time Series
  4.3 K-means
    4.3.1 K-means Example
  4.4 t-SNE
    4.4.1 Computing t-SNE
    4.4.2 Example of t-SNE
  4.5 Case Study: The Reflectance and Transmittance Spectra of Different Foliages
    4.5.1 Reducing the Spectra to Colors
  Problems
  References

Index
Part I
Fundamentals

Chapter 1
The Landscape of Machine Learning: Supervised and Unsupervised Learning, Optimization, and Other Topics
We begin with defining the types of problems that machine learning can solve. If
we think of our data as a large box of toys and the machine learning algorithms as
children, the problems we pose are the ways we present the toys to the children. In
supervised learning, when the child grabs a toy, we ask what type of toy it is. When
the child responds correctly, we say yes, that is correct (and probably offer some other
praise); if the child is incorrect, we give the correct answer. Eventually, the child will
be able to develop internal rules for giving names to the toys. These rules will likely
be more accurate on toys seen before but could also be used for extrapolation. If the
child has learned the difference between “red blocks” and “blue blocks,” it would
be possible to properly identify a “red truck” when the child has only seen a “blue
truck” before.
Analogous to the children’s toy example, in supervised learning we have a set
of data. Each data point, or case, has a number of features about it, called the
independent variables that can be used to describe it. For toys it might be the shape,
appearance, how you play with it, etc. For scientific data it may be properties of the
system such as the material it is made out of, the mass, temperature, etc., as well as
environmental factors, experimental conditions, simulation parameters, or anything
else that varies from case to case.
In supervised learning the data must also have labels. These labels, also called
dependent variables, can be a number, such as the measured values in an experiment
or the result of a calculation or simulation, or they can be a categorical variable such
as success versus failure, or membership in a group, e.g., whether a leaf is from a particular
species of tree. These labels are the answers that we want machine learning to
predict from the independent variables.
The goal of supervised learning is to have a machine learning model that takes the
independent variables as input and produces the correct labels for each case. We
can measure the ability of the model to correctly predict the label using measures
of accuracy, and we desire to have a model that has a high degree of accuracy, a
point we will make more explicit later. A machine learning model has a number of
parameters that need to be set to make a prediction. We use a set of data to set these
parameters to maximize the accuracy of the model. The process of specifying the
model parameters to fit the data is known as training the model. The data used to
train the model is known as the training data.
Simply optimizing accuracy can lead to a problem known as overfitting. If we
have too complicated a model, with a large number of parameters, the model can
have a very high accuracy on the training data, but fail miserably when presented
with new data. This can occur for a variety of reasons, but two primary causes are:
(1) using a small training data set and having enough model parameters so that
each point in the training data can be uniquely identified and (2) having training
data that is not representative of the data that the model will see when we use it
to make predictions. Consider the case where all of our training data has perfect
measurements without noise, but when we want to use the model the independent
variables will have measurement error and uncertainty. Given that the model has
no “experience” with noisy inputs, the outputs from an overfit model in this case
will be disappointing. The problem of overfitting is something we address in detail
throughout our study.
In supervised learning we would also like to gain insight into our data, for
example, to understand which independent variables most strongly influence the predictions.
1.1.1 Regression
We now specify the supervised learning problem for dependent variables (labels)
that are numerical values in some continuous range. Consider that we have collected
the values of $J$ independent variables, $x_1, x_2, \ldots, x_J$, and $K$ dependent variables,
$y_1, y_2, \ldots, y_K$, for $I$ cases. We seek a function $\mathbf{f}$ that maps the independent variables to the dependent variables,

$\mathbf{y} = \mathbf{f}(\mathbf{x}).$   (1.2)
The form of f(x) depends on the method used. Some methods prescribe a form for
this function and as a result can only approximate certain classes of functions. In
general the mapping will not be exact, so we add an error term to Eq. (1.2):

$\mathbf{y} = \mathbf{f}(\mathbf{x}) + \boldsymbol{\epsilon}(\mathbf{x}, \mathbf{z}).$   (1.3)

The function $\boldsymbol{\epsilon}(\mathbf{x}, \mathbf{z})$ denotes the difference, i.e., error, between the function $\mathbf{f}$
produced by the supervised learning model and the true value of y. This error could
depend on the input variables, x, but could also depend on variables that our model
does not have access to that we denoted as z. These unknown variables could have
a large impact on the dependent variable, but they are not in our collected data set.
As an example, consider a scenario where one wants to predict the temperature of
an object that was outside. If we only had access to the ambient temperature, but
did not record the amount of sunlight striking the object (making this an unknown
variable to our model), we would have an error term that depended on the amount
of direct sunlight.
The goal of the training procedure is to set parameters in the supervised learning
model so that f(x) approximates the known values of y by minimizing a loss
function. Throughout our study we will use L to denote a loss function. The loss
function is typically a measure of the error, $\boldsymbol{\epsilon}$, but can include information about the
complexity of the model, its sensitivity to changes, or its behavior in some limit. An
example loss function is the squared-error loss function:
$L_{se} = \sum_{i=1}^{I} \sum_{k=1}^{K} \left(y_{ik} - f_k(\mathbf{x}_i)\right)^2.$   (1.4)
The squared-error loss function computes the square of the error for each dependent
variable in each case and then sums up those errors over the cases. We use the square
for two reasons: (1) we want to minimize the magnitude of the errors and (2) using
a square gives smoothness to the loss function in terms of derivatives that some
methods can take advantage of.
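As a concrete illustration (this sketch is mine, not the author's code), the squared-error loss of Eq. (1.4) can be evaluated directly with NumPy, assuming the targets and predictions are stored as arrays with one row per case:

import numpy as np

def squared_error_loss(y_true, y_pred):
    # Squared-error loss of Eq. (1.4): sum of squared errors over all cases and outputs
    return np.sum((y_true - y_pred) ** 2)

# small example with I = 3 cases and K = 1 dependent variable
y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.1], [1.9], [3.2]])
print(squared_error_loss(y_true, y_pred))  # prints approximately 0.06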
With a prescribed loss function, the training procedure then attempts to set the
parameters in the supervised learning model to minimize the loss function over
the training data. Consider that the model has $P$ parameters, $w_1, w_2, \ldots, w_P$, that
need to be set in the model. To minimize the loss function we seek to find sets of
parameters, w, where
$\dfrac{\partial L}{\partial w_p} = 0, \qquad p = 1, 2, \ldots, P.$
If the loss function is a convex function of the parameters, then we can be assured
that the value of w is a minimum of the loss function. For non-convex optimization
problems, there may be many local minima that are not the global minimum of the
loss function. We will return to the topic of loss functions for each of the supervised
learning models we study.
A potential loss function that includes information about the sensitivity of the
prediction could take the form
$L = \sum_{i=1}^{I} \sum_{k=1}^{K} \left[ \left(y_{ik} - f_k(\mathbf{x}_i)\right)^2 + \lambda \sum_{p=1}^{P} \left| \dfrac{\partial \mathbf{f}(\mathbf{x}_i)}{\partial w_p} \right| \right].$   (1.5)
This loss function balances squared error with the derivative of the model with
respect to the parameters. The relative importance of the two terms is given using
λ > 0. This loss function states a preference of the model builder: minimizing the
error in the model is not the only consideration, and the sensitivity of the model
to the parameters should also be minimized. This form of loss function attempts
to create a model where the predictions will not change wildly when new data is
introduced and the model is retrained. This can be an important consideration when
a model is going to be used over time and retrained when new data is obtained. If
the model produces very different predictions from each training period, it will be
harder to believe the model’s predictions.
Additionally, we can make the loss function penalize models that have too many
non-zero parameters. The principle behind such a loss function is parsimony: the
simplest model that can explain the data should be preferred to a more complicated
model where possible. As in Occam’s razor, the number of parameters should not
be multiplied beyond necessity.1 To have our loss function encourage parsimony we
could add a term to the squared-error loss function:
$L = \sum_{i=1}^{I} \sum_{k=1}^{K} \left(y_{ik} - f_k(\mathbf{x}_i)\right)^2 + \lambda \sum_{p=1}^{P} |w_p|.$   (1.6)
This loss function balances, through λ > 0, the error in the model with the
magnitude of the parameters. Such a loss function can lead to trained models that
have many parameters set to zero, and, in effect, a simpler model. In the next chapter
we will see an example of this loss function in a type of regularized regression called
lasso.
The principle of parsimony can also be applied to the number of independent
variables that influence the model’s prediction. If the independent variables that
are available to the model are not all useful in predicting the dependent variables,
the model should not include these variables when making a prediction. That is,
changing these independent variables should not affect the model’s value of f(x). In
many cases, such a situation corresponds to setting some of the model parameters
to zero.
The squared error is not the only possible measure of the error. The measure of
the error should always be positive, so that the error in one case does not cancel the
error in another case, but that is one of the few hard constraints on the loss function.
For example, we could take the absolute value of the error for each case, as opposed
to the square. We can also include multiple considerations in the loss function by
adding information about the sensitivity as well as encouraging parsimony.
There is one measure of the error that is often quoted for a regression model: $R^2$,
or the coefficient of determination. This quantity is the fraction of the variance in
the data explained by the model. The formula for $R^2$ is
1 Occam’s razor, commonly stated as entia non sunt multiplicanda praeter necessitatem (entities
must not be multiplied beyond necessity), is named for William of Ockham, an English Franciscan
Friar of the fourteenth century. However, the idea can be traced to much earlier statements,
including one by Aristotle.
$R^2 = 1 - \dfrac{\sum_{i=1}^{I}\sum_{k=1}^{K} \left(y_{ik} - f_k(\mathbf{x}_i)\right)^2}{\sum_{i=1}^{I}\sum_{k=1}^{K} \left(y_{ik} - \bar{y}_k\right)^2},$   (1.7)
where $\bar{y}_k$ is the mean of the data for the $k$th output. A “perfect” model will have
$R^2 = 1$ because the numerator in the second term will be zero. Also, the squared-error
loss function attempts to make $R^2$ as close to one as possible because it is
minimizing the numerator in Eq. (1.7). Nevertheless, a high $R^2$ does not necessarily
mean that the model is useful. The best advice is that a high value of $R^2$ indicates that the
model might be useful, but further investigation is needed using the techniques we
discuss below, such as cross-validation.
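To make the computation concrete, here is a small sketch of my own (not from the text) that evaluates Eq. (1.7) with NumPy; for single-output models scikit-learn's r2_score returns the same quantity:

import numpy as np

def r_squared(y_true, y_pred):
    # Coefficient of determination, Eq. (1.7); arrays have shape (I, K)
    ss_res = np.sum((y_true - y_pred) ** 2)               # model error (numerator)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)  # variance about the mean (denominator)
    return 1.0 - ss_res / ss_tot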
1.1.1.1 Overfitting
To see the need for parsimony we consider a simple relationship $y = 3x_1 + \delta$, where
$\delta$ is a random noise term given by a Gaussian distribution with mean zero. We have
50 realizations or cases, i.e., $I = 50$. For the independent variables we also have
50 variables so that $\mathbf{x} = (x_1, x_2, \ldots, x_{50})$, i.e., $J = 50$, where for $j > 1$ we have
random values for $x_j$. If we fit a model on all 50 variables, we can get an “exact” fit
so that $y = f(\mathbf{x})$ with no error, when in reality $y$ only depends on $x_1$. In fact, we
can fit a linear regression model (discussed in the next chapter) of the form

$y = b + w_1 x_1 + w_2 x_2 + \cdots + w_{50} x_{50},$
which will exactly reproduce the training data, even though there is no exact
relationship between y and x because of the added noise. The reason we can get
an exact model is that with 50 independent variables we have enough freedom in
the model parameters to set them so each data point is reproduced.2 This is shown
in the left panel of Fig. 1.1. However, when we try to predict the value of y for new
data, in the right panel of Fig. 1.1, the model gives completely unreasonable values.
However, if we build a model with a loss function that seeks parsimony as well as
accuracy, we get a model that is not “exact” for the training data, but much more
accurate when presented with new data, x, to predict y. This parsimonious model is
a type of regularized regression called lasso regression that we will discuss in the
next chapter. The end result of the parsimonious model is we have traded a bit of
accuracy on the training data for robustness when applying the model to new data.
In this case the “exact” model would have a value of $R^2 = 1$ when looking at
the training data. However, it is clear that this model is not perfect and that there are
aspects of the model that $R^2$ does not capture.
2 Technically,
we only need 49 independent variables in this case because of the intercept, or bias,
term in the model.
Fig. 1.1 Result of fitting a linear model with additional, unimportant, independent variables gives
an “exact” match of the training data (left panel) but fails miserably when it tries to predict the
relationship for 10 new cases not used to train the model (right panel). A parsimonious model has
error in the training data but performs much better on the new, testing data
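The following sketch (my own illustration of the idea, not the author's code for Fig. 1.1) reproduces the spirit of this experiment with scikit-learn: an ordinary least squares fit with 50 inputs interpolates the 50 noisy training cases, while a lasso fit with a parsimony penalty generalizes far better to new cases:

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
I, J = 50, 50
X = rng.uniform(0.0, 1.0, size=(I, J))        # only the first column actually matters
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, I)   # y = 3*x1 + Gaussian noise

X_new = rng.uniform(0.0, 1.0, size=(10, J))   # 10 new cases not used for training
y_new = 3.0 * X_new[:, 0] + rng.normal(0.0, 0.1, 10)

ols = LinearRegression().fit(X, y)            # enough parameters to fit the training data exactly
lasso = Lasso(alpha=0.01).fit(X, y)           # penalizes the magnitude of the parameters

print("OLS   train/test R^2:", ols.score(X, y), ols.score(X_new, y_new))
print("Lasso train/test R^2:", lasso.score(X, y), lasso.score(X_new, y_new))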
Regression
Regression is a supervised learning problem where we want to predict
numeric-valued dependent variables given the values of independent variables.
In these problems we define a loss function that specifies the magnitude of the
error in the prediction as well as other considerations we find important (e.g.,
parsimony, low sensitivity to small changes in the variables).
1.1.2 Classification
In classification problems the dependent variable is categorical, and the model is typically built to produce the probability of the dependent
variable taking each possible value. Then, if one wants the class to which a case
belongs, the highest value of the probability can be selected as the prediction
for the class. Having the model produce probabilities is a feature that allows us
to use the machinery of regression models discussed above. This is because we
have transformed the problem of predicting an integer into one where we predict
a probability that is a number between 0 and 1. Often, one more transformation is
used to map the interval [0, 1] to the whole real line. Therefore, the model predicts
a real number that is then transformed to a probability that is then used to assign the
class.
To formulate a classification problem we need to specify a loss function. Because
we are dealing with categorical variables, we need different loss functions than
we used for regression. The most commonly used loss function for classification
is the cross-entropy loss function. We first define the loss function for a binary
classification problem with K = 1 and where the cases either have y = 0 or y = 1.
We write the predicted probability for case $i$ that $y = 1$ as $h(\mathbf{x}_i)$; this implies that
$1 - h(\mathbf{x}_i)$ is the probability that $y = 0$. The true value for case $i$ is either $y_i = 0$ or
1. Therefore, the product $h(\mathbf{x}_i)^{y_i}\left(1 - h(\mathbf{x}_i)\right)^{1 - y_i}$ is the probability the model assigns to the
observed label. Taking the negative logarithm of this product and summing over cases gives the cross-entropy loss:

$L_{CE} = -\sum_{i=1}^{I} \left[ y_i \log h(\mathbf{x}_i) + (1 - y_i) \log\left(1 - h(\mathbf{x}_i)\right) \right].$   (1.8)
In this case, the loss will be zero when the model is perfect and greater than zero
otherwise.
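A minimal sketch (mine, with made-up numbers) of evaluating the binary cross-entropy of Eq. (1.8):

import numpy as np

def binary_cross_entropy(y_true, p_pred):
    # Cross-entropy loss of Eq. (1.8) for binary labels and predicted probabilities h(x_i)
    p_pred = np.clip(p_pred, 1e-12, 1.0 - 1e-12)  # avoid taking log(0)
    return -np.sum(y_true * np.log(p_pred) + (1.0 - y_true) * np.log(1.0 - p_pred))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.99])
print(binary_cross_entropy(y_true, p_pred))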
For a multiclass classification problem where $y$ can take on many values, we
instead use the model to predict a real number $z_\ell(\mathbf{x})$ for $y = \ell$ for $\ell = 0, \ldots, L$.
This real number is defined such that as $z_\ell \to \infty$, the probability that $y = \ell$ goes to
1, as $z_\ell \to -\infty$, the probability that $y = \ell$ goes to 0, and $z_\ell > z_{\ell'}$ implies $y = \ell$ is
more likely than $y = \ell'$. We then pass these values $z_\ell$ through the softmax function
to map them each to the interval $[0, 1]$:
$\mathrm{softmax}(\ell, z_0, z_1, \ldots, z_L) = \dfrac{\exp(z_\ell)}{\sum_{\ell'=0}^{L} \exp(z_{\ell'})} \in [0, 1],$   (1.9)

and these values sum to one,

$\sum_{\ell=0}^{L} \mathrm{softmax}(\ell, z_0, z_1, \ldots, z_L) = 1,$
so that we can interpret the softmax of each $z_\ell$ as the probability that the case has
$y_i = \ell$. Now for a given case, the loss from the prediction is based on the product

$\prod_{\ell=0}^{L} \mathrm{softmax}(\ell, z_0, \ldots, z_L)^{I(\ell, y_i)},$   (1.10)

where the indicator function is

$I(\ell, y_i) = \delta_{\ell, y_i} = \begin{cases} 1 & y_i = \ell \\ 0 & \text{otherwise} \end{cases}.$
The product in Eq. (1.10) will be 1 only if the model gives a probability of 1 for
the correct value and probability of 0 for every other value. As before we take the
logarithm and multiply by −1 for each case and then sum over all cases to define
the general cross-entropy loss function:
$L_{CE} = -\sum_{i=1}^{I} \sum_{\ell=0}^{L} I(\ell, y_i) \log \mathrm{softmax}(\ell, z_0, \ldots, z_L).$   (1.11)
The cross-entropy loss function provides a way to measure the accuracy of the
model, and a function to minimize to train the model. However, as with regression
we may want to add terms to the loss function to promote parsimonious models or
reduce sensitivity to parameters in the model. To accomplish this we can add the
same terms to the loss function as we discussed above for regression.
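As an illustration (this is my own sketch, not the author's code), Eqs. (1.9) to (1.11) can be evaluated directly in NumPy for a handful of cases:

import numpy as np

def softmax(z):
    # Softmax of Eq. (1.9); z has shape (I, L+1), one row of scores per case
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, y):
    # Multiclass cross-entropy of Eq. (1.11); y holds the integer class of each case
    p = softmax(z)
    return -np.sum(np.log(p[np.arange(len(y)), y]))

z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])   # scores for 2 cases and 3 classes
y = np.array([0, 2])              # true classes
print(cross_entropy(z, y))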
Classification
Classification looks to build supervised learning models to predict membership
in a certain class based on independent variables. These models typically
produce probabilities of a case belonging to a class.
• The loss functions for classification are different than those for regression.
• Binary (i.e., problems with two classes) and multiclass classification
problems have connected, but slightly different approaches to solution.
1.1.3 Time Series
Time series are sequences of data points collected in a particular order. They have
the property that for each variable evaluated at time $t$, we may know the values
for all times $t' < t$ but do not know the values in the future. Examples of time
series could be the reading of a gauge during the operation of a machine, the images
in a movie, and even the words that form a document. In a sense, modeling time
series using machine learning is the same as supervised learning with regression or
classification. The difference is that in a time series the current state and previous
states of the series are necessary components in the prediction. For example, in
predicting the reading of a temperature gauge, the current temperature in the system,
as well as the recent changes in the temperature and other independent variables,
would be necessary to include in the model.
Using the current state of the dependent variable in the model as an independent
variable introduces some changes in the approach to machine learning. In practice,
as we will see, machine learning for time series requires the model to determine how
long of a memory to have during the training process. Some processes only depend
on the current system state to predict the next state, but others require knowledge of
several states in the past (i.e., it must have memory). This requires specific types of
machine learning systems, and we will cover this in some detail in a later chapter.
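To make the connection to ordinary regression concrete, here is a short sketch of my own (not from the text) that builds lagged copies of a series to serve as independent variables; the memory length is a choice the model builder must make:

import numpy as np

def lagged_design(series, memory):
    # Row i contains the `memory` previous values; y is the value at the next time
    X = np.column_stack([series[m:len(series) - memory + m] for m in range(memory)])
    y = series[memory:]
    return X, y

series = np.sin(np.linspace(0.0, 10.0, 200))
X, y = lagged_design(series, memory=3)
print(X.shape, y.shape)  # (197, 3) (197,)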
1.1.4 Reinforcement Learning

In reinforcement learning the model learns by trial and error: it takes actions and receives feedback on whether those actions led to success. Typically, these
models begin with making random decisions, so it may take many trials before
the model gets a success. Indeed, many of the famous successes of reinforcement
learning, including the playing of video games [1] and Go [2], are in application
domains where many trials can be generated via computer. The machine learning
model can play video games continuously, with many machines playing in parallel to
generate the training data, and Go-playing models can play against each
other. Therefore, it is possible to have millions of cases for the purposes of training.
These are many more games than a human could play in a lifetime, and as a result the
machines played Go differently than expert humans did. Our study of reinforcement
learning will apply these principles to science and engineering problems.
1.2 Unsupervised Learning

1.2.1 Finding Structure

A common unsupervised learning task is data reduction. We seek a function $\mathbf{f}$ that maps the original independent variables $\mathbf{x}$, of length $J$, to a smaller set of variables $\mathbf{z} = \mathbf{f}(\mathbf{x})$ of length $\tilde{J} < J$, together with a function $\mathbf{g}$ that maps $\mathbf{z}$ back to an approximation $\hat{\mathbf{x}} = \mathbf{g}(\mathbf{z})$ of the original data. A loss function for this task could be

$L = \sum_{i=1}^{I} \sum_{j=1}^{J} \left( g_j(\mathbf{f}(\mathbf{x}_i)) - x_{ij} \right)^2 + \lambda \tilde{J}, \qquad \lambda > 0.$   (1.12)
This loss function attempts to minimize the difference between $\hat{\mathbf{x}}$ and the original
data $\mathbf{x}$ while also considering that the size $\tilde{J}$ should be minimized. If we did not
include the $\lambda \tilde{J}$ term, the loss function would be zero if the functions $\mathbf{g}$ and $\mathbf{f}$ were
just multiplication by an identity matrix. Picking the value of $\lambda$ gives a means to
balance the reconstruction error with the size of the transformed variables $\mathbf{z}$.
If we do find functions $\mathbf{f}$ and $\mathbf{g}$ where the reconstruction error is small, we can
consider $\mathbf{z}$ to be a low-dimensional representation of our original data. That is, we
can just store $\mathbf{z}$ for each case and use $\mathbf{g}(\mathbf{z})$ when we need the original values $\mathbf{x}$.
This can require much less storage of data when $\tilde{J} \ll J$ and can also improve
performance on supervised learning problems using $\mathbf{z}$ as the independent variables
rather than $\mathbf{x}$ because of the smaller dimensionality.
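As a concrete sketch of this idea (my own illustration; Chapter 4 develops the singular value decomposition properly), a truncated SVD supplies one such pair of functions f and g:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))  # 100 cases, 20 variables, rank 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
J_tilde = 2
Z = X @ Vt[:J_tilde].T        # f: reduce each case to J_tilde numbers
X_hat = Z @ Vt[:J_tilde]      # g: reconstruct the original 20 variables from Z

print(np.sum((X_hat - X) ** 2))  # reconstruction error is near zero for this rank-2 data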
Another example of finding structure in data is the concept of embedding.
Embedding refers to a type of model where large vectors of discrete numbers
(e.g., integers) are mapped to a smaller vector of real numbers. For illustration,
we could take a data set consisting of vectors of length 1000 containing only zeros
or ones and map it into a two-dimensional vector of real numbers. The embedding
is also specified so that the vectors of real numbers have large dot products when
the original vectors are “similar.” The definition of similar depends on the type of
vectors being embedded. Similarity may be the dot product of vectors in the high-
dimensional space, but it may also include other information such as relationships
between the vectors measured in another way. One approach to embedding, t-SNE
(t-distributed stochastic neighbor embedding), embeds high-dimensional vectors
into a 2-D or 3-D space so that data can be visualized. This embedding has the
effect that vectors in the high-dimensional space are close to each other in the 2-D
or 3-D space; the net result is that natural clusters in the data can be visualized.
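A minimal usage sketch (mine, not from the text) of scikit-learn's t-SNE implementation, embedding binary vectors of length 1000 into two dimensions for plotting:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# two groups of 50 cases: ones concentrated in the first or last 500 positions
group_a = np.hstack([rng.integers(0, 2, size=(50, 500)), np.zeros((50, 500), dtype=int)])
group_b = np.hstack([np.zeros((50, 500), dtype=int), rng.integers(0, 2, size=(50, 500))])
X = np.vstack([group_a, group_b]).astype(float)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(embedding.shape)  # (100, 2): one 2-D point per case, ready for a scatter plot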
1.2.2 Association Rules

Another type of unsupervised learning task is finding association rules. This type of
problem asks if there are particular combinations of variables that appear together,
or are associated with each other. A simple illustration of this is the market
basket problem. Here we consider the data generated by taking the inventory of
the purchases of each person at a grocery store over the course of the day. We
construct a data matrix of size I × J where I is the number of unique shoppers
and J is the number of items in the store. Each row corresponds to a shopper, and
each column in the data set represents an item (e.g., milk, eggs, bread, broccoli,
etc.) with the number in row i, column j corresponding to the number of items j
shopper i purchased. In this case we want to find items that are frequently purchased
together and estimate the percentage of time these items go together, so that we
could optimize the store layout, suggest purchases to customers, and perform other
tasks. An example association rule could find that 40% of the time when someone
purchased milk and eggs, they also purchased bread. We could use this information
to place these three items together in the store, for example. Online stores and video
streaming services use similar technologies to suggest items to you based on the
purchases of others or the watching habits of others.
Association rules are also at work behind many natural language processing
tasks. Much of language construction follows association rules. Sentences have a finite
number of structures and some words naturally follow others. Using unsupervised
learning we can feed in many examples of text and it can learn how language is
structured. This can be illustrated if we consider a data set that has a row for each
sentence in a collection of texts (often called a corpus). The columns have an integer
that corresponds to the word used in each position of the sentence: the first
column has an ID for the first word in the sentence, the second column has an ID for
the second word in the sentence, and so on. There is also usually a special integer to
identify the end of the sentence.
The association rule we are after, in such a case, will give probabilities for the
next word in a sentence, given the current words in a sentence. These association
rules can be used to write sentences. For example, we have an “association rule” for
the first word of the sentence (basically a probability for each word in our vocabulary
being the first word in a sentence). We randomly sample a starting word based on
these probabilities. Then we have an association rule that tells us, given the first
word, the probabilities of the second word being a particular word. We can
then sample a second word from these probabilities and continue using association
rules until we sample the end-of-sentence word. This description is not meant to be
a complete account of how natural language systems work in practice (there has
been too much research to do those systems justice in our brief coverage here);
rather, it uses language as an example of association rules.
In science and engineering we may want to learn association rules to understand
a complex system’s behavior. One case may be a manufacturing process where we
are interested in understanding the differences among operators of machinery in the
process. We can learn the ways that the operators behave using association rules.
Perhaps there are conditions that lead to some operators controlling the system in a
certain manner, while other operators behave differently. We could build a system
of association rules that uncovers a series of rules such as “when conditions A
and B occur, take action C” and so on. The analogy could extend into the ways
that users operate an engineering design code. Certain users with long experience
probably have a set of rules they follow when setting up a simulation in terms of
parameters, meshing, tolerances, etc. that they may not be able to express in words.
Using association rules, it may be possible to understand how these experts practice
their craft.
With association rules the problem of scale can quickly become apparent. If there
are $J$ different variables that one wants to look for association rules between, there
are $J!/(2(J-2)!)$ two-way association rules to check, $J!/(3(J-3)!)$ three-way
association rules, and so on.3 For a modest-sized data set of $J = 100$, this leads
to 4950 two-way rules and 323,400 three-way rules. Clearly, good algorithms are
needed to find the plausible rules by quickly eliminating those rules where there are
no examples in the data set.
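A quick check of these counts (my own sketch) using the formula quoted above:

from math import factorial

def n_rules(J, k):
    # Number of k-way association rules as counted above: J!/(k (J-k)!)
    return factorial(J) // (k * factorial(J - k))

print(n_rules(100, 2))  # 4950
print(n_rules(100, 3))  # 323400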
In the context of association rules vector embeddings can also play a large role.
Rather than representing a word as a vector with length equal to the number of
words in the vocabulary, we can find embeddings that have large dot products when
words appear nearby in the text. The representation as a vector of real numbers with
this property gives the representations meaning that a vector of zeros or ones does
not have. The embedding can also work for the market basket problem to reduce the
data from potentially huge vectors of integers to smaller dimensional real vectors.
Unsupervised Learning
Unsupervised learning looks for structure in the data. These machine learning
problems do not require an independent/dependent variable relationship (i.e.,
no labeling of cases is required). Unsupervised learning can be used to find:
• Low-dimensional representations of the data (i.e., transformations to the
variables that capture the variability in the data)
• Vectors of real numbers called embeddings to represent large, discrete
valued variables
• Association rules to know when certain conditions imply another condition
to be more likely
3 These counts assume that A implies B is the same as B implies A because we have no basis to
assume a causal relationship, and we are really just looking for correlations.
1.3 Optimization and Machine Learning vs. Simulation
find a time-dependent geometry that gives nearly the same energy production, but is
robust to perturbations [3].
Optimization
Optimization plays a key role in machine learning:
• The training process is optimization: we try to minimize the loss function.
• Machine learning models can be used to find optimum values of the
functions they are trained to learn.
1.4 Bayesian Probability

One concept that is important and widely used in machine learning methods is
Bayesian probability. The idea of Bayesian probability is that we begin with an
estimate of the behavior of the probability for a random variable and update our
estimates once we have collected more data. We can consider the parameters in a
machine learning model to be random variables because we have no expectation
that there is a single “correct” value for these parameters; rather, these parameters exist
as distributions. Consider a parameter $w$ that we consider to be between −1 and 1,
but we have no other information about this parameter. We specify our knowledge
of this parameter through a prior distribution. In this case we may specify the prior
distribution as the probability density function
$\pi(w) = \begin{cases} \tfrac{1}{2} & -1 \le w \le 1 \\ 0 & \text{otherwise} \end{cases}.$
We then collect training data and define a likelihood function

$p(D|w) = \exp\left(-L(w)^2\right),$
where D represents our training data set and L(w) is the loss on training set D for
a given value of w. The function p(D|w) is called a likelihood function and the
notation indicates that it is the likelihood that we would have the training data D if
w were a particular value. In a sense this seems backward: we have our training data
and we want to find w. However, we can think about it as what is the likelihood that
our model would produce the data in D if w were the value of the parameter. The
form of the likelihood above makes it so that the function is maximized when the
loss function is small; we could have chosen a function other than the exponential
that has this property. The likelihood then says that when the loss is small, the
likelihood that our model would produce D is high.
Bayes’ rule (or Bayes’ theorem) says that given a likelihood function and a prior
distribution, we should update the prior distribution to get a posterior distribution
as the following:
$\pi(w|D) = \dfrac{p(D|w)\,\pi(w)}{\displaystyle\int p(D|w)\,\pi(w)\, dw}.$   (1.13)
This equation tells us, given the data we have collected, what our new probability
distribution for w is. It does not give us a single value for w; rather, it provides
a distribution of possible w values. The denominator in Eq. (1.13) is there to
make the posterior, π(w|D), a properly normalized probability. The integral in the
denominator is over the range of possible w values.
In the sense that Bayes’ rule gives us a distribution for the parameter w, it
is giving us a distribution of machine learning models. This gives us the ability
to produce a distribution of predictions in a supervised learning problem. This
distribution can be characterized as an uncertainty in the prediction and used to
understand where in input space a model is more confident in its prediction.
In more general form we can write Bayes’ rule for a vector of parameters, w, as
$\pi(\mathbf{w}|D) = \dfrac{p(D|\mathbf{w})\,\pi(\mathbf{w})}{\displaystyle\int p(D|\mathbf{w})\,\pi(\mathbf{w})\, d\mathbf{w}}.$   (1.14)
Using Bayes’ rule to find parameters is different than the approach discussed
previously of minimizing the loss function. Finding the values of the parameters that
minimize the loss function is equivalent to finding the maximum of the likelihood
function used in Bayes’ rule. For this reason sometimes minimizing the loss function
is called a maximum likelihood method.
It turns out that as difficult as maximizing the likelihood (or minimizing the
loss function) is, determining a distribution of parameters using Bayes' rule is
harder. This is because of the integral in the denominator. In most cases this
integral cannot be analytically computed. If the parameter vector is large (as it
usually is) performing numerical integration would require an enormous number
of evaluations of the loss function at different parameter values, making Bayes’ rule
impractical for determining distributions of parameters. Thankfully, there have been
algorithms developed to sample w from the posterior distribution without evaluating
the denominator. These algorithms are examples of Markov Chain Monte Carlo
methods with the Metropolis–Hastings algorithm being the most famous version
[4]. These algorithms will provide a set of sampled w from the posterior distribution
that specify a set of machine learning models that can then be used to make a
range of predictions. Although knowledge of the uncertainty in a machine learning
model is useful, the cost of these sampling methods is typically much greater than
maximizing the likelihood.
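To make the sampling idea concrete, here is a minimal random-walk Metropolis sketch for a single parameter (my own illustration, not a library routine); it assumes the uniform prior on [−1, 1] and the likelihood exp(−L(w)²) introduced above, with a made-up toy loss function:

import numpy as np

def log_posterior_unnormalized(w, loss):
    # log of likelihood times prior, ignoring the normalizing denominator of Eq. (1.13)
    if w < -1.0 or w > 1.0:
        return -np.inf            # outside the support of the prior
    return -loss(w) ** 2          # log of exp(-L(w)^2); the flat prior only adds a constant

def metropolis(loss, n_samples=5000, step=0.1, w0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    samples, w = [], w0
    lp = log_posterior_unnormalized(w, loss)
    for _ in range(n_samples):
        w_prop = w + step * rng.normal()                  # propose a nearby value
        lp_prop = log_posterior_unnormalized(w_prop, loss)
        if np.log(rng.uniform()) < lp_prop - lp:          # Metropolis acceptance test
            w, lp = w_prop, lp_prop
        samples.append(w)
    return np.array(samples)

def toy_loss(w):
    return 5.0 * abs(w - 0.3)   # pretend the training data prefers w near 0.3

samples = metropolis(toy_loss)
print(samples.mean(), samples.std())  # the posterior samples concentrate near 0.3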
Bayesian Probability
Bayesian probability gives us a way to express our current expectation for a
random variable in terms of a prior distribution and then to use data to update
the expectation. Bayesian methods give us a distribution of machine learning
models, at the cost of making models harder to train.
1.5 Cross-Validation
Often we are most interested in how a machine learning model will perform on new
data that has not been used to train the model. We would like to have a
sense of how the model will perform before we use it to make decisions, to operate
on its own, or in other critical situations. To approximately quantify the expected
performance of the model, faute de mieux, we turn to cross-validation.
In cross-validation for supervised learning we split the available data into a
training and test set. The training set is used to set the model parameters (i.e., train
the model), and the test set is not used to train the model; rather, the model attempts
to predict the dependent variable for the cases in the test set. This allows us to see
how the model will perform on data it has not seen. We can inspect for overfitting,
biases in the model, and other problems. Using the test set we can see what the
expected accuracy of the model will be on new data and determine if there is a
fundamental difference in the model performance on the training and test data. We
can also use cross-validation for unsupervised learning problems to test the results
(i.e., association rules, low-dimensional representations, etc.) on data not used to
formulate the model and test whether the results of the analysis make sense or are
otherwise not extendable.
By splitting the data into a test and training set we are potentially limiting
the performance of the machine learning model because more data almost always
makes for a better model. To address this, k-fold cross-validation is used. In this
case the data is split via random assignment into k equally sized groups of cases;
k − 1 of these groups are used as training data, and the group left out is the test data.
This procedure is repeated k times (implying that k models must be trained), giving
k different sets of results on test data. This mitigates the impact of not including all
the data to train the model and gives us a larger set of test runs to base our judgment
of the model's performance on.
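A short usage sketch (mine) of k-fold cross-validation with scikit-learn; the linear model and synthetic data are stand-ins for whatever model and data set are at hand:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=60)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 groups
    scores.append(model.score(X[test_idx], y[test_idx]))         # assess on the held-out group
print(np.mean(scores))  # average test R^2 over the 5 folds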
The extreme case of k-fold cross-validation is leave-one-out cross-validation
where k is equal to the number of cases, that is, k = I . In this case, we build all
models with all of the cases except one and then try to predict that single case, and
this is repeated I times. Leave-one-out cross-validation is useful when the number
of cases is small and we cannot afford to lose any data, but still want to know how
1.5 Cross-Validation 21
the model would perform on the next case it would encounter. For large data sets
leave-one-out cross-validation may be too costly to implement because it would
involve building thousands or millions of models in many instances.
The random assignment to the k groups is important if there was any structure
in the data collection that would not be expected to be replicated when the model
is deployed. For example, if the data were collected using a technique where only
a single independent variable is changed at a time, before then changing several
variables, one would not want all of the cases in a group to only have a single
independent variable changing. However, random assignment is not always the
appropriate method for assigning cases. If we are interested in predicting a time
series, the test and training sets must respect the arrow of time: we cannot use data
from $t' > t$ in a model to predict the outcome at $t$. Therefore, the cross-validation
procedure should break data into potential groups based on the time it was collected
to see how the model would perform in predicting the dependent variables at the
next time.
Cross-validation is not just for making inferences about model performance on
new data; we can use it to set parameters in the model in some cases. In models
that have a penalty term, such as λ in Eqs. (1.5) and (1.6), we can use a training
and test data set to get an idea of which values of the penalty appropriately balance
accuracy and the penalty. Nevertheless, these models must then undergo further
cross-validation to observe their performance on data not used to train the model or
set the penalty.
Finally, we mention that cross-validation is not the only step required to make a
machine learning model acceptable for deployment. It is sound practice to have a
warm start or phased deployment of any model. For instance, one could deploy a
machine learning model to give predictions alongside a human operator for a period
of time. During that time the human operator may not even know what the machine
is predicting, but we can go back and check the accuracy. The next step may be to
have the machine make predictions with the human approving or agreeing to the
decisions for a period of time. As automatic use of the model grows, the amount of
human involvement can then be decreased.
Another important aspect of cross-validation and model deployment is knowing
when the machine learning model is uncertain. There are types of models that can
automatically indicate how uncertain a given prediction is. This information can be
used to highlight decisions for human intervention. Additionally, it can be possible
to determine if a case is very different than the cases used to train the model. For
example, if the case has a particular value of an independent variable that is an order
of magnitude larger than any in the training set, we would be wary of the confidence
to place in the model’s prediction. It is these cases where human intervention (or
post-prediction review) may be necessary.
Cross-Validation
Cross-validation splits the available data into training and test sets in an
attempt to determine how the model will perform on new data. The machine
learning model is trained using the training data, and then the model’s
performance is assessed on the test data. This process can be repeated by
splitting the data into different sets repeatedly. A common form of cross-
validation is k-fold cross-validation where:
• The available data is split randomly into k groups.
• k − 1 of the groups are used to build the model, and the remaining group
is the test data.
• This process is then repeated k times.
• When the number of groups is equal to the number of cases, we call this
leave-one-out cross-validation.
Problems
1.1 Use Bayesian statistics to give the posterior probability of the toss of a
particular coin landing on “heads,” θ . Use a prior distribution of
$\pi(\theta) = \begin{cases} 1 & \theta \in [0, 1] \\ 0 & \text{otherwise} \end{cases},$
and for the data likelihood use the probability from a binomial random variable
$p(h, n|\theta) = \dbinom{n}{h} \theta^h (1 - \theta)^{n - h},$
where h is the number of heads in n tosses of the coin. Assuming you have flipped
the coin 10 times and obtained 6 heads, plot the posterior distribution for θ . Now if
you toss the coin 100 times and get 55 heads, how does the distribution change?
1.2 Consider the minimization problem to minimize the function $f(x, y) = (2x + y - c)^2 + (x - y - d)^2$ subject to the constraint $x^2 + y^2 \le 1$. Use Lagrange multipliers
to show that there is an equivalent loss function of the form

$L(x, y) = (2x + y - c)^2 + (x - y - d)^2 + \lambda \left(x^2 + y^2 - 1\right), \qquad \lambda \ge 0.$
References
1. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
2. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van
Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot,
et al. Mastering the game of go with deep neural networks and tree search. Nature,
529(7587):484, 2016.
3. J L Peterson, K D Humbird, J E Field, S T Brandon, S H Langer, R C Nora, B K Spears, and P T
Springer. Zonal flow generation in inertial confinement fusion implosions. Physics of Plasmas,
24(3):032702, 2017.
4. W. R. Gilks and D. J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and
Hall/CRC, 1996.
5. Ryan G McClarren. Uncertainty Quantification and Predictive Computational Science.
Springer, 2018.
Chapter 2
Linear Models for Regression and Classification

Abstract This chapter discusses linear regression and classification, the foundations
for many more complex machine learning models. We begin with a motivating
example considering an object in free fall to then use regression to find the
acceleration due to gravity. This example then leads to a discussion of least squares
regression and various generalizations using logarithmic transforms. The topic of
logistic regression is presented as a classification method, with an example given
to predict failure of a part based on the temperature it was exposed to. Then the
topic of regularization based on the ridge, lasso, and elastic net regression methods
is presented as a means to select variables and prevent overfitting. These methods
are then used in the case study to determine the laws of motion of a pendulum using
data.
2.1 Motivating Example: An Object in Free Fall

Consider the experiment where an object at rest is dropped from a known height.
We take an object up to a given height above the ground, call this height h, and drop
it. We then measure the time t it takes to reach the ground. Newton’s laws will tell
us that in the absence of drag, h and t are related by the equation
$h = v_0 t + \dfrac{g}{2} t^2,$   (2.1)

where $v_0$ is the initial vertical velocity of the object and $g$ is the acceleration due
to gravity. Suppose we perform 8 experiments at different values of h and t, and
we want to use this data to inform a prediction of the height an object was dropped
from given the time it took to reach the ground. That is we wish to find h(t) for
our particular experimental setup. We know that in our experiment there will be
measurement uncertainty so we do not expect (2.1) to hold exactly; moreover our
experiments may have a bias in that the apparatus might systematically drop objects
from slightly higher or lower by an amount h0 that is subtracted from the left-hand
side of (2.1). Nevertheless, we would like to use our data to estimate the terms in
the model and find any systematic biases in the measurements.
Using (2.1) we have a model form for the function with two unknown constants
$v_0$ and $g$, and an unknown bias term. If we do not assume that $v_0 = 0$, because
we are not sure our experimental apparatus can assure zero initial velocity, we can
think of this as a regression problem with input $t$ and a single output $h$. However,
the model we are trying to fit has a $t^2$ term in it, and it offers no complication to call
this another input given that, if one knows $t$, $t^2$ can be directly computed. We can
cast this model in generic form by defining $x_1 = t$, $x_2 = t^2$, $b = -h_0$, and $y = h$ to
write

$y = w_1 x_1 + w_2 x_2 + b.$   (2.2)
This model is shown graphically in Fig. 2.1. The inputs are denoted by the circles
on the left. These inputs are connected to the output via the weights w1 and w2 ; the
intercept (or bias) also connects to the output. The output node in the figure has Σ
inside it to indicate that it computes the sum of the inputs times the weight on the
connecting arrow, i.e., this node computes w1 x1 + w2 x2 + b.
Now that our model is specified we can use the data we collected to determine
the three free parameters in the model w1 , w2 , and b. To do this we specify a loss
function for the model. Given that we have performed 8 measurements and recorded
$y_i$ and $(x_1)_i$, $(x_2)_i$ for $i = 1, \ldots, 8$, we would like our model to minimize the
difference between $w_1 (x_1)_i + w_2 (x_2)_i + b$ and $y_i$. One way to do this would be to
minimize the sum
$L = \sum_{i=1}^{8} \left(\hat{y}_i - y_i\right)^2,$   (2.3)
where $\hat{y}_i \equiv w_1 (x_1)_i + w_2 (x_2)_i + b$ is the predicted value for the model for case $i$. We
choose the sum of the squares for two reasons: each term in the sum is positive so
there will not be cancellation of errors (if one term is positive and another negative,
we could get a small sum even if each error was large), and it is a differentiable
function. The function L is called the loss function for our model. Other loss
functions could be used (e.g., the sum of the absolute values of error), but the sum
of the squared errors is easy to work with.
To find the minimum of L we will need to take the derivative with respect to each
of the parameters w1 , w2 , and b and set the result of each to zero. This gives us the
three equations
∂L/∂w1 = Σ_{i=1}^{8} 2(ŷi − yi)(x1)i = 0,   (2.4a)

∂L/∂w2 = Σ_{i=1}^{8} 2(ŷi − yi)(x2)i = 0,   (2.4b)

∂L/∂b = Σ_{i=1}^{8} 2(ŷi − yi) = 0.   (2.4c)
These three equations can be written in matrix form as the normal equations

X^T X w = X^T y.

Notice that X^T X is a square matrix, so in principle this equation can have a unique solution. In practice, as long as the number of measurements is larger than the number of unknown parameters and the measurements are not all linearly dependent, there will be a solution.
To finish the example we specify the data from the experiments. For the 8
measurements we have
h [m] = (10.1, 25.7, 17.5, 12.3, 9.8, 26.7, 28.4, 21.0)^T,   t [s] = (1.46, 2.3, 1.9, 1.6, 1.43, 2.35, 2.42, 2.08)^T.
These correspond to
X = [ 1  1.46  2.12
      1  2.3   5.3
      1  1.9   3.62
      1  1.6   2.56
      1  1.43  2.06
      1  2.35  5.5
      1  2.42  5.85
      1  2.08  4.34 ],

X^T X = [  8      15.54   31.35
          15.54   31.35   65.3
          31.35   65.3   139.74 ],      X^T y = [ 151.5
                                                  315.94
                                                  676.64 ].
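The text does not list code for this example, but as a minimal sketch (assuming NumPy is available) the normal equations above can be assembled directly from the measured h and t and solved; the recovered values estimate −h0, v0, and g/2 from the noisy measurements.

import numpy as np

h = np.array([10.1, 25.7, 17.5, 12.3, 9.8, 26.7, 28.4, 21.0])   # heights [m]
t = np.array([1.46, 2.3, 1.9, 1.6, 1.43, 2.35, 2.42, 2.08])     # times [s]

X = np.column_stack([np.ones_like(t), t, t**2])   # columns: 1, x1 = t, x2 = t^2
w = np.linalg.solve(X.T @ X, X.T @ h)             # solve the normal equations
b, w1, w2 = w
print(b, w1, w2)   # b estimates -h0, w1 estimates v0, and w2 estimates g/2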
More generally, for J independent variables the linear model is

y = Σ_{j=1}^{J} wj xj + b,   (2.5)
where, as before, b is the bias in the model. In this model the matrix X is of size
I × (J + 1), where I is the number of observations and
X = [ 1  x11  x12  . . .  x1J
      1  x21  x22  . . .  x2J
      ⋮    ⋮     ⋮            ⋮
      1  xI1  xI2  . . .  xIJ ].   (2.6)
Minimizing the squared-error loss with respect to the weights and the bias again leads to the normal equations

X^T X w = X^T y,   (2.7)

with w = (b, w1, . . . , wJ)^T.
The resulting values of b and wj are the values that minimize the sum of the
squared-error loss function given by
L = Σ_{i=1}^{I} (ŷi − yi)²,   (2.8)
where ŷi is the prediction of the model for case i. For this reason the resulting linear
model is known as the least squares linear regression model or just least squares (LS)
model. This is the simplest linear model that can be defined. There has been much
research into when such models work and how to diagnose them. It has been shown
that these models work best in practice when the error in the model is independent
of the value of the prediction, i.e., the difference ŷi − yi does not depend on the
magnitude of yi . This can be checked by creating a scatter plot of ŷ versus ŷ − y
and looking for an obvious relationship between the two.
We can visualize the linear model as a mapping from inputs to an output in
Fig. 2.2. This figure represents the inputs as nodes at the left that are connected
with weights wj to a node that sums its inputs to produce the output y. The bias is
connected to the summation node with a weight of b.
One feature of linear models is that they lend themselves to interpretation. The
wj tells us how much we expect y to change for a unit change of xj , if every other
variable stays the same. Thinking about it another way, wj is the derivative of y
with respect to xj . This can be very useful to explain what is driving the dependent
variable. Care must be taken, however, if the model includes independent variables
that are transformations of each other. In the example we had x1 = t and x2 = t²: in such a scenario it is not possible to keep x2 fixed while adjusting x1. This
scenario would require transforming back to the original t variable and then taking
the derivative.
We can also interpret the bias, b, as the value y would take if all of the
independent variables were zero. Sometimes this can cause consternation among
non-technical stakeholders because it may not make sense (or even be possible) for
all of the independent variables to be zero in the model.1 It is possible to remove
the bias from the model (by removing the column of ones in the data matrix X).
Alternatively, one can center the inputs by defining new variables x̃j ≡ xj − x̄j ,
where x̄j is the mean of the j th independent variable. In this case, the bias would
be the predicted response when all the inputs were at their average, or mean, value.
Linear models can also describe certain nonlinear relationships. Suppose, for example, that the dependent variable follows an exponential model of the form

y = exp(b + w1 x1 + · · · + wJ xJ).   (2.9)

This model can be cast as a linear model by defining a new variable using the natural logarithm²

z = log y.   (2.10)
1 A personal anecdote regarding this. I was explaining to an executive why a linear model predicted
that the number of people who would visit a retail store would be negative if no one lived within
5 miles of the store. There was no training data with fewer than 10,000 people living within that
radius; the model has no idea what is going to happen in that scenario.
2 We will use the convention that log x denotes the natural logarithm of x. When we need a logarithm in a different base, the base will be indicated explicitly.
Therefore, we can take the natural logarithm of both sides of Eq. (2.9) to get
z = Σ_{j=1}^{J} wj xj + b,   (2.11)
which is a general linear model of the form Eq. (2.5) with a dependent variable given
by z = log y. Therefore, to fit this model we only need to interpret the dependent
variable as the logarithm of the original dependent variable. When fitting the model,
we will be minimizing the sum of (ẑi − zi )2 , that is, minimizing the sum of the
squared difference in the logarithm of the dependent variable. Using the properties
of the logarithm, we find
ẑi − zi = log(ŷi / yi),   i = 1, . . . , I.
In this model the weight wj is the derivative of the transformed variable z with respect to xj:

∂z/∂xj = wj.

At the same time, by the chain rule,

∂z/∂xj = ∂(log y)/∂xj = (1/y) ∂y/∂xj.

Equating these two expressions gives

∂y/∂xj = wj y.   (2.12)
This implies that the effect of increasing xj is multiplicative: as the value of the dependent variable increases, the absolute change produced by increasing xj also increases.
A useful rule of thumb can be found by supposing a small change δ to xj and using a Taylor series to find the relative change in y:

[y(xj + δ) − y(xj)] / y(xj) = δwj + (1/2)δ²wj² + (1/6)δ³wj³ + · · · ≈ δwj.   (2.13)
The fitted linear model can then be transformed back to the original variables as y = a exp(w1 x1 + · · · + wJ xJ), where a = e^b.
Power-Law Models
It is also possible to fit power-law models of the form
y = a t1^{w1} t2^{w2} · · · tJ^{wJ} = a ∏_{j=1}^{J} tj^{wj}.   (2.14)
Defining xj = log tj, z = log y, and b = log a, we take the logarithm of both sides of Eq. (2.14) to get
z = b + w1 log t1 + w2 log t2 + · · · + wJ log tJ = Σ_{j=1}^{J} wj xj + b.   (2.15)
Therefore, we can transform a power law into a linear model by taking the natural
logarithm of both the independent and dependent variables.
That is, we fit the linear model z = Σ_{j=1}^{J} wj xj + b using least squares. The weights from the linear model are the exponents in the power-law model, and a = e^b.
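As a minimal sketch of this procedure (with made-up data generated from a known power law, so the recovered exponents can be checked), the fit can be done with ordinary least squares on logarithms:

import numpy as np

rng = np.random.default_rng(1)
t1 = rng.uniform(1, 10, size=50)
t2 = rng.uniform(1, 10, size=50)
y = 2.0 * t1**1.5 * t2**-0.5 * rng.lognormal(sigma=0.05, size=50)  # noisy power law

# Transform: z = log y, x_j = log t_j, then fit z = b + w1*x1 + w2*x2
X = np.column_stack([np.ones_like(t1), np.log(t1), np.log(t2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
b, w1, w2 = coef
print(np.exp(b), w1, w2)   # expect roughly a = 2.0, w1 = 1.5, w2 = -0.5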
For a binary outcome y ∈ {0, 1}, the odds of y = 1 given the independent variable x are Pr(y = 1|x)/[1 − Pr(y = 1|x)], and the log-odds are the natural logarithm of this ratio,

z = log [Pr(y = 1|x) / (1 − Pr(y = 1|x))].   (2.16)

The log-odds map the probability from the interval [0, 1] to the real line, (−∞, ∞). The log-odds are also invertible to find the probability. If z is the log-odds of y = 1 given x, then
Pr(y = 1|x) = 1/(1 + e^{−z}) ≡ logit(z).   (2.17)
The function logit(z) is known as the logistic function, and it is this function that
gives logistic regression its name. The probability of y = 1 as a function of the
log-odds is shown in Fig. 2.3. From the figure we can see that if the log-odds are
zero (or the odds are 1), then there is an equal probability that the observation is in
class 1 or 0. As z approaches infinity, the probability approaches 1; similarly, the
probability approaches zero as z goes to negative infinity.
We write a linear model for the log-odds of the form
log [Pr(y = 1|x) / (1 − Pr(y = 1|x))] = wx + b.   (2.18)
To fit this model we need to find the weight, w, and bias, b, that minimize a loss
function. The loss function we use should go to zero as the number of correct class
predictions goes up. To this end we assume that at a given value of the independent
variable x there is a unique probability of an observation having y = 1 using
Eq. (2.18). Therefore, for a given observation, the model predicts the observed class with probability

h(x)^y (1 − h(x))^{1−y},

where h(x) = 1/(1 + exp(−(wx + b))). By inspection this formula indicates that to
maximize the probability of the model being correct when y = 1, we want h(x) →
1; conversely, if y = 0, we want h(x) to go to zero to be correct.
Now consider we have I observations yi with independent variable xi . We want
to maximize the probability that the model is correct on all of the observations. In
other words, we want to maximize the product
∏_{i=1}^{I} h(xi)^{yi} (1 − h(xi))^{1−yi}.
This product would be 1 if the model were perfect. We define the loss function as
the negative logarithm of this product (the negative is chosen to make the optimal
solution at the minimum of the loss function):
L = −Σ_{i=1}^{I} log[h(xi)^{yi} (1 − h(xi))^{1−yi}].   (2.19)
This loss function does not lead to a neat, closed form solution like the least squares
model did. Nevertheless, the minimum of the loss function can be found using
standard techniques such as gradient descent.
Returning to our example, suppose we have the following 10 observations
(I = 10):
T (°C)              281  126  111  244  288  428  124  470  301  323
Result (Fail = 1)     1    0    0    0    0    1    0    1    0    1
[Figure: the observed results (Fail = 1) versus temperature T (°C), together with the fitted logistic regression model]

Minimizing the loss function of Eq. (2.19) for these data gives the fitted log-odds

z = 0.04831 T − 14.46073.
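A hedged sketch of how such a fit could be computed is shown below: it minimizes the loss of Eq. (2.19) numerically with SciPy (BFGS stands in here for the gradient descent mentioned above). The result should land near the quoted coefficients, though exact values may depend on the optimizer settings.

import numpy as np
from scipy.optimize import minimize

T = np.array([281, 126, 111, 244, 288, 428, 124, 470, 301, 323], dtype=float)
y = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 1], dtype=float)

def loss(params):
    w, b = params
    z = w * T + b
    # Numerically stable form of -sum(y*log h + (1-y)*log(1-h)) with h = 1/(1+exp(-z))
    return np.sum(np.logaddexp(0.0, -z) + (1.0 - y) * z)

result = minimize(loss, x0=[0.0, 0.0], method="BFGS",
                  options={"gtol": 1e-8, "maxiter": 10000})
print(result.x)   # roughly (0.048, -14.5)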
One question we might naturally ask is how much better is our model than random
chance. In our example we have 4 failures and 6 successes. Therefore, we could say
that any case has a 40% chance of being a failure, regardless of the temperature.
Using this chance we could predict that no case is expected to fail and be right 60%
of the time. This is the so-called null model where we say temperature has no effect
on failure. In this case the loss function is
L_null = −Σ_{i=1}^{I} log[θ^{yi} (1 − θ)^{1−yi}],

where θ = 0.4 is the overall fraction of failures.
Another relevant question is how to interpret the weight and bias in the model. The
bias tells us what the probability of y = 1 is when x = 0 via the formula 1/(1+e−b ).
In the example, b ≈ −14.46, making Pr(y = 1|T = 0) ≈ 5 × 10−7 indicating that
there is little chance that at zero temperature we will observe failure.
The weight is a bit trickier to interpret. Clearly, every unit increase in the
independent variable increases z by w. The effect on the probability is trickier due
to the nonlinear relationship between z and the probability. A useful reference is the
“divide by four rule” to get an estimate of the maximum impact of the variable on
the probability. We compute the derivative of the probability of y = 1 with respect
to x as
∂/∂x [1/(1 + e^{−(wx+b)})] = w e^{−(wx+b)} / (1 + e^{−(wx+b)})².   (2.20)
To maximize this derivative, and therefore maximize the sensitivity of the proba-
bility to x, we take the derivative of Eq. (2.20) and set it to zero and find that the
maximum sensitivity for Pr(y = 1|x) occurs where z = 0. This can be seen visually
in Fig. 2.3. At z = 0 Eq. (2.20) becomes
∂/∂x [1/(1 + e^{−(wx+b)})] |_{z=0} = w/(1 + e⁰)² = w/4.   (2.21)

That is, a unit change in x changes the probability of y = 1 by at most about w/4; this is the origin of the divide by four rule.
Thus far we have discussed logistic classification models with a single independent
variable. There is no major complication in extending them to multiple variables.
Given J independent variables we write the model as
z = Σ_{j=1}^{J} wj xj + b,    Pr(y = 1|x) = 1/(1 + e^{−z}).   (2.22)
This model is visualized in Fig. 2.5. From the figure we can see that the only
difference between the linear regression model and the logistic model is the
transformation in the output layer.
In the multivariate case, the loss function is the same as Eq. (2.19). To fit a
multivariate model the loss function is minimized with respect to the bias and the J
weights. The coefficients can be interpreted using the divide by four rule, as before,
and the interpretation of the bias is identical to the single variable case.
Although logistic models are simple models, they can be powerful tools in
making predictions. Due to their ease of fitting and evaluation, these models can be
deployed on large data sets in a variety of settings. For example, in a manufacturing
process logistic regression could be used to understand what conditions lead to
products failing quality assurance tests, and we could use this insight to improve
operations.
We also note that the logistic model is our first neural network in the sense that
it uses a nonlinear thresholding function (in this case the logistic function) to give a
nonlinear response to an affine transformation of the inputs.3 We will return to this
idea later.
3 By affine transformation of the inputs, we mean a linear combination of the inputs plus a constant term. The bias supplies the constant term and the weights define the linear combination.
When there are more than two classes, say K of them, the logistic model can be generalized by writing a separate linear combination of the inputs for each class:

z^(1) = Σ_{j=1}^{J} wj^(1) xj,   (2.23)

z^(2) = Σ_{j=1}^{J} wj^(2) xj,

⋮

z^(K) = Σ_{j=1}^{J} wj^(K) xj.
Using the fact that the sum of the probabilities must be 1 (Pr(y = 1) + · · · + Pr(y =
K) = 1), we define the probability that Pr(y = c) as
Pr(y = c|x) = e^{z^(c)} / Σ_{k=1}^{K} e^{z^(k)} ≡ softmax(c, z^(1), . . . , z^(K)),   c = 1, 2, . . . , K.   (2.24)
Equation (2.24) defines the softmax function. The function is so named because, of the K variables z^(c), the largest drives its softmax value close to one while the others return values close to zero. Suppose that K = 3
and the z(c) are 0, 2, and 4 for c = 1, 2, 3. The values of the softmax function will be
softmax(1, 0, 2, 4) = 1/(1 + e² + e⁴) ≈ 0.016,

softmax(2, 0, 2, 4) = e²/(1 + e² + e⁴) ≈ 0.117,

softmax(3, 0, 2, 4) = e⁴/(1 + e² + e⁴) ≈ 0.867.
These values sum to one, and the largest number (4 in this case) gave a result closest
to one.
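These numbers are easy to verify numerically; for instance, a quick check with NumPy:

import numpy as np

z = np.array([0.0, 2.0, 4.0])
p = np.exp(z) / np.exp(z).sum()
print(p)        # approximately [0.016, 0.117, 0.867]
print(p.sum())  # the probabilities sum to one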
The loss function used in fitting must also be changed to deal with the fact that
there are multiple categories:
L = −Σ_{i=1}^{I} Σ_{k=1}^{K} I(k, yi) log softmax(k, z^(1), . . . , z^(K)),   (2.25)
where
I(k, yi) = δ_{k,yi} = {1 if yi = k; 0 otherwise}
is known as the indicator function. The loss function defined in Eq. (2.25) is called
the cross-entropy function. This function is a measure of the distance between the
model’s predictive distribution and the distribution of the data (i.e., the empirical
distribution).
The multinomial model is shown in Fig. 2.6. The fact that there are more weights
is apparent in the figure, but the overall architecture of the model is similar to the
logistic regression model with the logit node replaced by a softmax node.
Fig. 2.6 Schematic of the multinomial classification model in the case of K = 4 classes. There are J inputs to the model, the independent variables. Only two of the 4J weights are labeled but each line between the input layer and the "Σ" layer has a weight. As before, the Σ denotes that the sum of the inputs to the node is computed. The softmax function computes the probability of each class that is then returned in the output layer.
In the previous sections we detailed several different models for making predictions
that were variations on the theme of linear, least squares regression. All of the
models had a loss function related to the model error, and fitting the models required
minimizing that error. However, minimizing the error is not our only objective as
we demonstrate now. Consider a variable y that is uniformly, randomly distributed
between 0 and 1. We have two samples of y given by y1 = 0.235 and y2 = 0.642.
We then produce samples from another random number x that are x1 = 0.999, and
x2 = 0.257.
Using this data we construct a linear regression model using least squares of the
form
y = wx + b.
From the data we find w = −0.5485 and the bias to be b = 0.7830. This model
is exact in that yi = wxi + b for i = 1, 2. However, recall that y and x were
independent random numbers so there should be no relation between them. We see
the drawback of this model when we consider a third set of randomly-sampled
values of y and x: y3 = 0.610, x3 = 0.962. In this case the model predicts
ŷ3 = 0.255343, which misses the true value by more than a factor of two; so much for the perfect model. As
we continue sampling y and x points, the model’s predictive ability goes to zero.
This phenomenon is known as overfitting.
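The two-point example above is easy to reproduce numerically; a small sketch:

import numpy as np

x = np.array([0.999, 0.257])
y = np.array([0.235, 0.642])

# Fit y = w*x + b exactly through the two points
w = (y[1] - y[0]) / (x[1] - x[0])
b = y[0] - w * x[0]
print(w, b)                # about -0.5485 and 0.7830

# The "perfect" model fails badly on a new independent sample
x3, y3 = 0.962, 0.610
print(w * x3 + b, y3)      # prediction ~0.255 versus the true 0.610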
Overfitting can also occur when we have many data points. We can “improve”
the model by adding variables to the model to make the model’s error on the known
data go to zero. For instance, if we have N realizations of the dependent variables,
then a model with N independent variables, even if some of those independent
variables are meaningless noise, will be perfect on the known data. Nevertheless,
on new data the model will have very little predictive power (see Sect. 1.1.1.1 for a
demonstration). This is part of the motivation for cross-validation discussed in the
previous chapter.
There are also occasions where we have the opposite problem: we have more
independent variables than realizations. If we consider an experiment where the
number of controllable parameters in the experiment is J but the number of
realizations of the experiment we can perform is I with I < J , how can we decide
which of the independent variables to include in a model when we assume that only
a few of the independent variables affect the dependent variables?
The idea of “regularized regression” attempts to address both the problem of
overfitting to the known data and variable selection. The idea is to change the
loss function to balance the objectives of minimizing the error with reducing the
magnitude of the weights and biases in the model. If the model prefers weights and biases that are smaller in magnitude, it can be more robust to overfitting and, as we will see, the penalty can also help select variables.
Ridge regression [2], also known as Tikhonov regularization, adjusts the loss
function by adding the Euclidean norm, also called the L2 norm, of the weights
and biases. For linear regression the loss function becomes
L_ridge = Σ_{i=1}^{I} (ŷi − yi)² + λ‖w‖²₂,   (2.26)

where ‖w‖²₂ = b² + w1² + · · · + wJ² is the squared Euclidean norm of the bias and weights.
In the ridge loss function the parameter λ ≥ 0 is the penalty strength: it gives the
relative weight of the penalty to the error when minimizing the loss function. When
λ = 0, the loss function becomes the same as the least squares model.
For λ > 0 ridge regression will give a result even in the case where J > I . As a
consequence ridge regression will allow us to fit a line given a single point! To see
that ridge regression will give a unique solution when J > I we take the derivative
of Eq. (2.26) with respect to the J weights and the bias and set the result to 0 to get
a system of J + 1 equations for the weights and bias:

(X^T X + λI) w = X^T y,   (2.27)

where I is an identity matrix of size (J + 1) × (J + 1), and the other matrices and
vectors are the same as in Eq. (2.7). Notice that this equation will have a solution
when λ > 0 because the penalty adds a constant to the diagonal (i.e., the ridge) of
the matrix X^T X. Therefore, if the rank of X^T X is less than J + 1, the penalty fixes
the rank.
To further understand what the ridge penalty does, we change the problem from
minimizing the loss function to a constrained optimization problem. If we think of
λ as a Lagrange multiplier we can form the equivalent minimization problem to
minimizing Lridge as
w_ridge = min_w Σ_{i=1}^{I} (ŷi − yi)²   subject to   ‖w‖₂ ≤ s.
There is a one-to-one relationship between s and λ, but we will not derive this here
(this was given as an exercise in the previous chapter). We can say that as λ → 0, s
will go to infinity, implying there is no constraint on the weights.
To illustrate the effect of the ridge penalty, consider a system with J = 2 as
shown in Fig. 2.7. The loss function in Eq. (2.26) is a quadratic function in the two
parameters b and w1 . The contours of this quadratic function will appear as ellipses
in the (b, w1 ) plane. In the center of these ellipses is the least squares estimate.
The circle in Fig. 2.7 has a radius s, and the solution must lie inside or on this
circle. Because the LS estimate is outside the circle, the solution will be where
the circle intersects a contour line of the sum of the squared errors at the minimal
possible value of the error. Notice that the magnitude of both b and w1 has decreased
in the ridge estimate compared with the LS estimate, and that both are non-zero.
A feature of ridge regression is that the larger the value of λ, the smaller the
values in ŵridge will be in magnitude relative to the values of ŵLS , when the LS
values exist. This makes λ a free parameter that must be chosen based on another
consideration. We will discuss using cross-validation to choose this parameter later.
As a simple example of ridge regression, consider the problem of estimating a function of the form y = w1 x + b given the single data point y(2) = 1. That is, we are interested in fitting a line to a single data point. This problem is formulated as minimizing

L_ridge = (2w1 + b − 1)² + λ(w1² + b²).

Setting the derivatives with respect to w1 and b to zero gives the system

(4 + λ) w1 + 2b = 2,    2w1 + (1 + λ) b = 1,   (2.28)

which has the solution

w1 = 2/(λ + 5),    b = 1/(λ + 5),

for λ > 0. From this we can see that the limit of this solution as λ approaches zero from above is

lim_{λ→0+} w1 = 2/5,    lim_{λ→0+} b = 1/5.

Notice that for λ > 0, the fitted solution does not pass through the data, that is, 2w1 + b ≠ 1. Also, we note that it is possible to take the ridge solution in the limit of λ → 0, but setting λ = 0 in Eq. (2.28) does not yield a unique solution.
In this example we can see that we can fit a line to a single data point, but the result
is not necessarily faithful to the original data, though it does give us a means to fit a
solution when I < J . This property will be useful for estimating local sensitivities
when we have fewer cases than parameters.
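A quick numerical check of this example (assuming, as in Eq. (2.26), that the bias is penalized along with the weight) can be done by solving the regularized normal equations directly:

import numpy as np

X = np.array([[1.0, 2.0]])   # one observation: a column of ones and x = 2
y = np.array([1.0])

for lam in [0.1, 1.0, 10.0]:
    A = X.T @ X + lam * np.eye(2)
    b, w1 = np.linalg.solve(A, X.T @ y)
    # the solved values should match 1/(lambda+5) and 2/(lambda+5)
    print(lam, b, 1.0 / (lam + 5.0), w1, 2.0 / (lam + 5.0))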
If we change the loss function that we used in the ridge regularization to be the
L1 norm of the weights and biases, that is if we include the sum of the absolute
values of the wj ’s and b, we get the method known as least absolute shrinkage and
selection operator, or the “lasso” for short [3]. The loss function for lasso is
L_lasso = Σ_{i=1}^{I} (ŷi − yi)² + λ‖w‖₁   (2.29)

        = Σ_{i=1}^{I} (ŷi − yi)² + λ|b| + Σ_{j=1}^{J} λ|wj|.
This seemingly small change has an important impact on the resulting method.
For instance, the equations are now more difficult to solve because the absolute value
function is not differentiable when its argument is zero. Additionally, the resulting
w found by lasso tends to set a subset of the elements of this vector to zero. It
“lassos” the important independent variables. In this sense, it is a “sparse” method
in that lasso can find a solution where only a subset of the independent variables
contributes to the model prediction. One feature of the lasso result is that for I data
points used to train the model, at most I coefficients of w will be non-zero.
We can see how lasso sets some coefficients to zero by investigating the solution
for J = 2. In this case, level curves in the (b, w1) plane with a constant value of ‖w‖₁ will be diamond-shaped rather than circular; this is shown in Fig. 2.8. Given
this diamond shape the ellipses that represent the curves of equal squared error in
the loss function will likely touch the diamond at one of its vertices. These vertices
correspond to only one of the coefficients having a non-zero value. In the figure
we also notice that the prediction error from lasso will be larger than that for ridge
because the ellipse that touches the diamond is further away from the least squares result.

[Fig. 2.8: the (b, w1) plane showing the diamond-shaped L1 constraint region and the circular L2 region, with the least squares estimate ŵLS, the ridge estimate ŵridge, and the lasso estimate ŵlasso]
We can also combine lasso and ridge loss functions to get the elastic net
method [4]. In an extension of the metaphor of the method capturing the important
coefficients, this time we use a net rather than a rope. The elastic net loss function is
L_el = Σ_{i=1}^{I} (ŷi − yi)² + λ[α‖w‖₁ + (1 − α)‖w‖²₂],   (2.30)
with α ∈ [0, 1]. This loss function transitions between lasso and ridge regression: setting α = 1 gives lasso, α = 0 gives ridge, and values of α between 0 and 1 give a mixture of the two. The elastic net allows more than I coefficients to be non-zero
and demonstrates the grouping effect: independent variables that are correlated have
weights that are nearly identical. The relaxation of the strictness of the lasso allows
elastic net to outperform lasso in many problems in terms of minimizing the error
in the predicted values of the dependent variables.
In our visualization of the case of J = 2, the effect of the elastic net is to blunt the points of the diamond-shaped level curve of the lasso method. As shown in Fig. 2.9, the elastic net solution lies between the ridge and lasso solutions.
[Fig. 2.9: the (b, w1) plane showing the elastic net estimate ŵel, whose constraint region is intermediate between the L1 diamond and the L2 circle]

One key feature of regularized regression techniques is that the scale of the variables matters. In standard regression, if we multiply an independent variable by a constant, it does not change the predictions of the model: the coefficients will adjust to compensate. With a penalty on the magnitude of the coefficients this is no longer true, so the independent variables are typically standardized before fitting, for example as

x̃i = (xi − μ_{xi}) / σ_{xi}.   (2.31)
The mean could be replaced by another measure of the centrality of the variable,
such as the median or mode, and the standard deviation could be replaced by the
average deviation or some other measure of the range of the variable.
We have not talked about how to determine the penalization strength λ. This is
typically handled using cross-validation. A subset of the data is used to produce
models with different values of λ. Then the model is tested on the portion of the
data not used to train the model. Repeating this several times at each value of λ
gives an indication of the model accuracy at each λ value. By varying λ one can see
where the ideal trade-off between model accuracy and sparseness/robustness lies. A
standard way to choose λ is to pick the largest value of λ that yields a model with
mean error within one standard deviation of the minimum error over all λ tested.
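As a hedged illustration of this selection procedure (with synthetic data standing in for a real problem, and scikit-learn assumed to be available), the sketch below scans a grid of penalty strengths, estimates the cross-validated error at each, and picks the largest λ whose mean error is within one standard deviation of the best.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

lambdas = np.logspace(-3, 1, 30)
mean_err, spread = [], []
for lam in lambdas:
    model = make_pipeline(StandardScaler(), Lasso(alpha=lam))
    errs = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_err.append(errs.mean())
    spread.append(errs.std())

mean_err, spread = np.array(mean_err), np.array(spread)
best = mean_err.argmin()
# Largest lambda whose mean CV error is within one standard deviation of the best
lam_choice = lambdas[mean_err <= mean_err[best] + spread[best]].max()
print(lam_choice)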
Regularization Techniques
The regularization techniques we discussed all modify the loss function so
that the model minimizes the sum of the squared error and a penalty term.
• Ridge regression adds a penalty term in the form of the L2 norm of the
coefficients to the loss function to shrink the magnitude of the coefficients.
• Lasso adds a penalty term in the form of the L1 norm of the coefficients
to the loss function. The result, due to the properties of the L1 norm, will
typically set many coefficients to zero.
• Elastic net is a convex combination of the lasso and ridge penalties. It sets
groups of parameters to zero and can give models with more predictive
power, but the model will not necessarily be sparse.
In all cases the penalty term has a regularization strength parameter that must
be chosen using cross-validation. Furthermore, the independent variables
should be normalized and non-dimensional so that each affects the penalty
in a comparable manner.
2.5 Case Study: Determining Governing Equations from Data
In this case study we will use regularized regression to determine the governing
differential equations that generated a data set. In our case we will be dealing with
the equations of motion for a pendulum. The pendulum we consider is drawn in
Fig. 2.10. The angle between the vertical line crossing the point where the pendulum
is mounted and the pendulum is θ . The differential equation governing θ (t) is
θ̈ + (b/m) θ̇ + (g/L) sin θ = 0,   (2.32)
where a dot denotes a time derivative.
A first-order differential equation can be written by defining ω as the angular
speed so that
ω̇ = −(b/m) ω − (g/L) sin θ,   (2.33a)

θ̇ = ω.   (2.33b)
Given measured values of ω at discrete times ti, the angular acceleration can be approximated with the finite difference

θ̈ = ω̇ ≈ [ω(ti) − ω(ti−1)] / (ti − ti−1).

The estimated acceleration is then regressed onto a dictionary of candidate functions x built from θ and ω (for example powers of θ and θ̇, products such as θθ̇, and trigonometric functions of these variables). This dictionary of functions is meant to capture all the possible functions of θ and ω that the acceleration θ̈ could depend on. This dictionary is not unique and could be much larger or much smaller depending on how much knowledge we have about the potential law of motion.
For a first test, we use Runge–Kutta time integration to solve 100 different initial value problems for the pendulum with initial positions θ0 randomly
chosen between ±π , with an initial angular speed of θ̇0 randomly chosen in the
interval ±10 radians per second. From each of the solutions we select 10 random
time points to collect the pendulum position, θ , and θ̇ to compute θ̈ and the
dictionary x. In our simulations we set b = 0.1 kg/s, m = 0.25 kg, g = 9.81 m/s²,
and L = 2.5 m. The collected data are normalized by subtracting the mean and
dividing by the norm for each of the elements of x. The penalty parameters for the
regularized regression methods are estimated using the cross-validation procedure
in the Python library scikit-learn [5].
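The text does not list the code for this experiment, but the following is a minimal sketch of the workflow, assuming NumPy, SciPy, and scikit-learn are available. The dictionary below is only an illustrative subset of the one used in the text, and for simplicity the acceleration is evaluated from the right-hand side of Eq. (2.33a) rather than by finite differencing.

import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
b, m, g, L = 0.1, 0.25, 9.81, 2.5

def pendulum(t, state):
    theta, omega = state
    return [omega, -(b / m) * omega - (g / L) * np.sin(theta)]

thetas, omegas, accels = [], [], []
for _ in range(100):                                   # 100 random initial value problems
    s0 = [rng.uniform(-np.pi, np.pi), rng.uniform(-10.0, 10.0)]
    t_eval = np.sort(rng.uniform(0.0, 10.0, size=10))  # 10 sample times per solution
    sol = solve_ivp(pendulum, (0.0, 10.0), s0, t_eval=t_eval, rtol=1e-8)
    theta, omega = sol.y
    thetas.append(theta)
    omegas.append(omega)
    accels.append(-(b / m) * omega - (g / L) * np.sin(theta))

theta = np.concatenate(thetas)
omega = np.concatenate(omegas)
accel = np.concatenate(accels)

# Dictionary of candidate functions (an illustrative subset)
X = np.column_stack([theta, theta**2, theta**3, omega, omega**2,
                     np.sin(theta), np.cos(theta), theta * omega])
names = ["theta", "theta^2", "theta^3", "omega", "omega^2",
         "sin(theta)", "cos(theta)", "theta*omega"]

fit = LassoCV(cv=5).fit(X, accel)
for name, coef in zip(names, fit.coef_):
    print(f"{name:12s} {coef: .3f}")
# We would expect coefficients near -b/m = -0.4 for omega and -g/L = -3.924
# for sin(theta), with the remaining coefficients close to zero.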
The collected values for θ̈ versus several of the dictionary elements are shown in
Fig. 2.11. The figure also includes the proper linear combination of θ̇ and sin θ as
well as linear combinations containing the Taylor series of sin θ . From the figure
we can see that there is a strong relationship between several of the dictionary
elements that are not the “correct” elements of sin θ or θ̇ , such as θ , θ 3 , and
θ²θ̇. It is also clear that the correct relationship of (b/m)θ̇ + (g/L) sin θ is borne out by the data. Using this data we can investigate whether the true relationship
can be discovered by regression methods. We use least squares regression and the
regularized regression methods lasso, ridge, and elastic net. Using these methods
we get different results for the coefficients of the dictionary elements. In Fig. 2.12
the coefficients are compared to the true values of −b/m = −0.4 s−1 for the
θ̇ coefficient, −g/L = −3.924 s−2 for the sin θ coefficient, and 0 for the other
coefficients. From the figure we can see that for the θ̇ coefficient all of the methods
give values close to the correct value. However, for the coefficient on sin θ , lasso’s
result agrees with the true value with the other methods giving a low estimate.
Moreover, regularizations using elastic net and ridge give spurious values for the
coefficient of θ . This is due to the fact that θ is the leading term in the Taylor series
of sin θ . Therefore, these methods are sharing the contribution to θ̈ between these
inputs. Least squares regression has a spurious non-zero coefficient for θ as well as
several others including θ 2 and cos θ .
While lasso was able to find the proper coefficients in the differential equation
from the data, this exercise was a bit too easy in some senses. The only discrepancy
we should have expected between the model and the data is numerical error.
Fig. 2.11 Scatter plots of the numerical results relating the acceleration of the pendulum with several members of the dictionary x. The relationship determined by the differential equation is shown in the lower left panel.
Fig. 2.12 Fit coefficients for a subset of the dictionary elements of x using different regression techniques (lasso, elastic net with α = 0.9 and α = 0.5, ridge, and least squares) compared with the true differential equation coefficients, where only θ̇ and sin θ have non-zero coefficients.
Therefore, the acceleration of the pendulum was very nearly exactly what the
differential equation predicted. To better test the methods, and to emulate the case
where there is measurement error, we repeat the exercise by adding simulated
measurement error.
We repeat the exercise above using the same parameters and change the recorded
values of θ and θ̇ to each having a random error given by a Gaussian distribution
with mean 0 and a standard deviation of 0.01. This in effect gives an unbiased
measurement uncertainty to the angle and angular speed measurements of the
pendulum. The resulting data, shown in Fig. 2.13, no longer has the precise
relationship between θ̈ , θ̇ , and sin θ that the noiseless case demonstrated. It is clear
there is a relationship here, but it is less obvious. Also, it is harder to tell the
difference between the influence of sin θ and its Taylor series approximation due to the presence of noise.

Fig. 2.13 Scatter plots of the numerical results relating the acceleration of the pendulum with several members of the dictionary x. The relationship determined by the differential equation is shown in the lower left panel.

Fig. 2.14 Fit coefficients using the noisy pendulum data for a subset of the dictionary elements of x using different regression techniques compared with the true differential equation coefficients, where only θ̇ and sin θ have non-zero coefficients. Note the least squares coefficients for θ and sin θ are not shown because their magnitude is greater than 10.
We would expect that the presence of noise would affect the regression estimates
for the coefficients. This is most obvious in the results for least squares, as we can
see in Fig. 2.14. The least squares results give coefficients of significant magnitude
for several dictionary elements, and the coefficients for sin θ and θ were greater
than 10. The other regression methods also get worse when noise is added, with
the exception of lasso. For cos θ̇ and sin θ̇ elastic net with α = 0.5 and ridge give
coefficients significantly different than zero, in addition to the non-zero estimate for
θ . Once again lasso gives accurate estimates for sin θ and θ̇ .
In the exercise above we used the same pendulum parameters for each of the
simulations. We were able to infer the proper coefficients in the differential
equation using lasso, but we cannot say how those coefficients will vary as the
pendulum mass, length, and damping parameter vary. Given the success of the
lasso in determining the coefficients, we will vary the pendulum parameters and
repeat the estimation of θ̈ as a linear combination of θ̇ and sin θ using lasso. For
each of the different pendula we estimate the coefficients for θ̇ and sin θ . Using
these coefficients we then estimate how they vary as a function of the pendulum
parameters.
To posit the variables that the coefficients of θ̇ and sin θ depend on, we look at units. Knowing that the units of θ̈ are s⁻², we look for combinations of the parameters g (m s⁻²), L (m), m (kg), and b (kg s⁻¹) that give units of s⁻¹ for the coefficient of θ̇, Cθ̇, and s⁻² for the sin θ coefficient, Csin θ. From this argument we consider the following possible linear combinations:

Csin θ = w1 (g/L) + w2 (b²/m²) + w3 (gm/(bL))²,   (2.35a)

Cθ̇ = w1 (b/m) + w2 √(g/L) + w3 (gm/(bL)).   (2.35b)
Table 2.1 Data collected to estimate the coefficients as a function of the pendulum parameters

  Csin θ     Cθ̇       g/L      b²/m²   (gm/(bL))²   b/m     √(g/L)   gm/(bL)
  −6.795    −0.355    6.722    0.299     150.848    0.547    2.592    12.282
  −4.667    −1.142    5.171    1.424      18.771    1.193    2.274     4.332
  −3.529    −0.660    3.648    0.544      24.464    0.737    1.910     4.946
  −3.762    −2.037    4.340    4.255       4.427    2.063    2.083     2.104
 −12.346    −0.283   12.500    0.204     762.995    0.452    3.535    27.622
 −19.873    −0.468   19.735    0.329    1183.872    0.573    4.442    34.407
  −9.954    −0.492   10.019    0.293     342.102    0.541    3.165    18.496
  −3.650    −0.893    3.496    1.010      12.099    1.005    1.869     3.478
  −3.653    −0.262    3.611    0.091     142.594    0.302    1.900    11.941
  −2.214    −0.413    2.507    0.265      23.709    0.514    1.583     4.869

This case study in determining nature's laws from data required more than just machine learning blindly determining the laws based on raw data. For starters, we had to formulate the dictionary of functions that the acceleration could depend on. This step was important because if we did not include the correct functions, our regression procedure may not have found the correct relationships. One of the benefits of the regularized regression techniques is that we can add as many possible dictionary functions as we like.
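As a hedged sketch of the last step, the code below fits Csin θ as a linear combination of the three candidate parameter groups in Eq. (2.35a) using the values in Table 2.1; plain least squares is used here for simplicity, though the lasso procedure from the text could be substituted.

import numpy as np

C_sin = np.array([-6.795, -4.667, -3.529, -3.762, -12.346,
                  -19.873, -9.954, -3.650, -3.653, -2.214])
g_over_L = np.array([6.722, 5.171, 3.648, 4.340, 12.500,
                     19.735, 10.019, 3.496, 3.611, 2.507])
b2_over_m2 = np.array([0.299, 1.424, 0.544, 4.255, 0.204,
                       0.329, 0.293, 1.010, 0.091, 0.265])
gm_over_bL_sq = np.array([150.848, 18.771, 24.464, 4.427, 762.995,
                          1183.872, 342.102, 12.099, 142.594, 23.709])

A = np.column_stack([g_over_L, b2_over_m2, gm_over_bL_sq])
w, *_ = np.linalg.lstsq(A, C_sin, rcond=None)
print(w)  # we would expect a coefficient near -1 on g/L (since C_sin_theta
          # is approximately -g/L) and the other coefficients near zero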
Additionally, in the step where we determined the form of the coefficients in
the differential equation as a function of the pendulum parameters we had to
use knowledge of dimensional analysis to propose the possible combinations of
parameters. Had we used arbitrary combinations of those parameters it may have
been more difficult to capture the correct behavior. Indeed we may have come up
with a “law of motion” that made little physical sense.
All of this is to say that scientists and engineers cannot forget their expertise
when applying machine learning. Throwing a data set at regularized regression
would not have been nearly as fruitful had we not used some scientific knowledge.
For example, it would be hard to explain a model with very esoteric units on the
coefficients. As it stands, even if we did not know the correct answer for this
exercise, we could at least show that the form of the differential equation made
sense from a dimensional perspective.
Notes
Problems
2.1 Show that scaling an independent variable does not affect the prediction of least
squares regression model, but that it does affect the ridge model.
2.2 Consider a data set consisting of 10 samples of three independent variables
x1 , x2 , x3 and one dependent variable y where all three variables are randomly
drawn from a standard normal distribution (i.e., a Gaussian with mean 0 and
standard deviation 1). Fit three linear least squares models of the form (1) y =
ax1 + b, (2) y = ax1 + bx2 + c, (3) y = ax1 + bx2 + cx3 + d. All of these models
are nonsensical because they are fitting random noise. Which has a larger value for
R²? Repeat this several times and comment on your findings.
2.3 Construct a logistic regression model using the following data:
x 0.950 0.572 0.915 0.920 0.520 0.781 0.479 0.461 0.876 0.700
y 1 0 1 1 1 1 0 1 1 0
References
1. Michael Gold, Ryan McClarren, and Conor Gaughan. The lessons Oscar taught us: data science
and media and entertainment. Big Data, 1(2):105–109, 2013.
2. Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to nonorthogonal
problems. Technometrics, 12(1):69–82, 1970.
3. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
4. Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, April 2005.
5. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830, 2011.
6. Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery
of partial differential equations. Science Advances, 3(4):e1602614, April 2017.
7. Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations
from data by sparse identification of nonlinear dynamical systems. Proceedings of the National
Academy of Sciences, 113(15):3932–3937, April 2016.
Chapter 3
Decision Trees and Random Forests for
Regression and Classification
Mr. Fox: It’s not exactly an evergreen, is it? Aren’t there any
pines on the market this side of the river?
Weasel: Pines are pretty hard to come by in your price range.
—from the film Fantastic Mr. Fox
Abstract This chapter covers the topics of decision tree models and random
forests. We begin with a discussion of how binary yes/no decisions can be used to
build a model for a regression problem by dividing, or partitioning, the independent
variables for a simple problem with 2 independent variables. This idea is then
generalized for regression problems of more variables before continuing on to
classification problems. We then introduce the idea of combining many trees
together in the random forest model. Random forests have each tree in the forest
built on different training data and use randomization when building the trees as
well. The prediction of each of these trees is averaged in a regression problem
or votes on a class for a classification problem to give the predicted value of the
dependent variable. We demonstrate the efficacy of random forests through a case
study of predicting the output of an expensive computer simulation.
In daily life we often encounter a series of nested binary (i.e., yes/no) decisions. If
the sign says “Don’t Walk,” I do not cross the street. If the weather forecast calls for
rain, I bring an umbrella when I leave the house, and if it looks clear later, I may
leave the umbrella in my office when I leave for lunch. Decision trees are a way of
formalizing these decisions so that they can be cast as machine learning problems.
Consider the following data set with two independent variables, x1 and x2, and one dependent variable y:

y    1.01  0.97  0.97  1.03  0.92  1.97  1.91  2.09  1.93  1.94  0.22  0.26  0.35
x1   0.06  0.10  0.24  0.16  0.07  0.42  0.96  0.48  0.85  0.78  0.79  0.80  0.50     (3.1)
x2   0.27  0.44  0.03  0.38  0.12  0.71  0.87  0.75  0.59  0.57  0.20  0.20  0.41

[Fig. 3.1: the data set (3.1) plotted in the (x1, x2) plane]
This data set is visualized in Fig. 3.1. We would like to determine a way to
split the data orthogonal to an axis to divide the data into two regions. In each of
these regions we then assign all of the points a value c that minimizes the squared
difference between the true solution and the value c.
In this data set we will split the data using x2 = 0.505 and set the value for
all of the points above the split to ŷ = 1.968; below the split we set all values to
ŷ = 0.7162. The values of 1.968 and 0.7162 are the mean values of the training data
that fall into the respective boxes. This splitting of the independent variable space is
shown in Fig. 3.2. In the figure we graphically split the independent variable space
using a horizontal line at x2 = 0.505; we also display the model as a decision tree.
At this point there is only one decision; if x2 is greater than 0.505 we follow the
“yes” path of the tree to arrive at ŷ = 1.968; otherwise we follow the “no” path to
get ŷ = 0.7162. If we use a squared-error loss function, the loss for this simple tree
model is given by
L = Σ_{xi2 > 0.505} (yi − 1.968)² + Σ_{xi2 < 0.505} (yi − 0.7162)² = 0.9641.

Fig. 3.2 Visualization of the first decision boundary as a tree and in the independent variable space. (a) Two-leaf decision tree for predicting y = f(x1, x2). (b) Visualization of the decision boundary at x2 = 0.505.
Looking at Fig. 3.2b, we see that it may be possible to improve the decision tree:
below the split it seems that points toward the right side have smaller values than
those to the left. We therefore draw a vertical line downward from the horizontal
line at x2 = 0.505 at x1 = 0.37.
Drawing this extra division in the independent variable space gives us two more
“leaves” on the tree (that is endpoints in the decision tree). The values of these leaves
are ŷ = 0.2767 when x1 > 0.37 and x2 < 0.505, and ŷ = 0.98 when x1 < 0.37
and x2 < 0.505. Once again the values of ŷ assigned are the average values of the
training data inside the boxes. The loss function for this tree model is given by
L= (yi − 1.968)2 + (yi − .2767)2 + (yi − 0.98)2 = 0.0365,
xi ∈b1 xi ∈b2 xi ∈b3
where xi ∈ b indicates the values of xi leading to the prediction at leaf in the tree,
i.e., xi is the box corresponding to leaf . For example, if x2 > 0.505, the point is in
box b1 .
By adding the additional split to the independent variable space, and, as a result,
adding another leaf to the tree, we have decreased the loss function by a factor of
about 26. This model is much better at explaining the data. We could continue to
add splits/leaves until the error was zero by isolating each data point in a box, but
this would lead to a very complicated tree model that had nearly as many decisions
in the tree as there are data points. In this example, it appears that adding another split beyond the two we have will not appreciably improve the model (Fig. 3.3).

Fig. 3.3 Visualization of the two decision boundaries as a tree and in the independent variable space. (a) Three-leaf decision tree for predicting y = f(x1, x2). (b) Visualization of the decision boundaries.
With this decision tree we can make a few observations. Given the way that the
decision boundaries work, all of the boundaries are orthogonal to an axis, and they
create rectangular boxes. Using these boxes, if the independent variables fall into a
given box, we have a single value for the predicted dependent variable based on the
mean of the training data inside that box. Furthermore, we can determine which box
a given value of x falls into by answering a series of binary yes/no questions.
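As an aside, the three-leaf tree above can be reproduced with scikit-learn (assumed available here); the following minimal sketch limits the tree to three leaves to mirror the example, and we would expect it to recover splits near x2 = 0.505 and x1 = 0.37 with leaf values close to 1.968, 0.98, and 0.2767.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

y = np.array([1.01, 0.97, 0.97, 1.03, 0.92, 1.97, 1.91, 2.09, 1.93, 1.94,
              0.22, 0.26, 0.35])
x1 = np.array([0.06, 0.10, 0.24, 0.16, 0.07, 0.42, 0.96, 0.48, 0.85, 0.78,
               0.79, 0.80, 0.50])
x2 = np.array([0.27, 0.44, 0.03, 0.38, 0.12, 0.71, 0.87, 0.75, 0.59, 0.57,
               0.20, 0.20, 0.41])
X = np.column_stack([x1, x2])

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))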
The decision tree we have created will handle extrapolation in a somewhat
graceful manner. If we imagine the decision boundaries extend to infinity when
they do not intersect another decision boundary, then any value of x will lead to
a leaf on the decision tree or a box in the independent variable space. Each of the
leaves on the decision tree is the mean of the dependent variables in some subset
of the training data. Therefore, the tree model will always give a prediction that
is within the range of the training data. This contrasts with the linear models we
discussed in the previous chapter where, if an independent variable has a positive
coefficient in the model, extrapolating with that variable can lead to a prediction
that is arbitrarily large. This is not to say that tree models give the correct answer
for extrapolation, but rather their predictions are limited by the training data, and
this can be a desirable feature for certain model uses.
We have not yet discussed how we chose where to place the decision
boundaries. Notice that if we were to shift the decision boundaries slightly, the
predictions of the decision tree, and the loss function value, would not change. We
placed the decision boundaries so that they were equidistant between the nearest
training points. This reflects our belief that if multiple places to put the decision
boundary give the same loss, we should place the decision boundaries as far from
the training data as we can so that a small perturbation to the training data does not
give a large change to the prediction.
If we were to generalize the idea of a decision tree to more than two independent
variables, the decision boundaries would be a plane in three dimensions, and a
hyperplane in higher dimensions. The decision boundaries would still be orthogonal
to an axis, and all of the decisions in the tree would still be binary. The tree itself
would still have the same appearance as the decision trees above, but visualizing the
partitioning of independent variable space becomes much more difficult.
Tree Models
Tree models split the space of independent variables into rectangular regions
that each correspond to a different predicted value for the dependent variable.
This splitting of the independent variable space is equivalent to a series of
yes/no decisions that can be visualized as a decision tree.
To grow the tree we search greedily¹ over each independent variable j and each candidate split location s for the split that minimizes the loss

L(j, s) = Σ_{xij ≤ s} (yi − c−)² + Σ_{xij > s} (yi − c+)²,   (3.2)

where

c− = (1/N−) Σ_{xij ≤ s} yi,    c+ = (1/N+) Σ_{xij > s} yi,   (3.3)
1 A greedy algorithm searches for the best possible immediate outcome given the known data. An
example greedy algorithm would be the hill-climber algorithm in optimization where one always
moves in the direction of the gradient of the objective function (i.e., climbs upward on the hill). A
greedy algorithm can get stuck at local extreme values, rather than a global maximum/minimum
because it typically does not explore enough of the solution space. In the hill-climber algorithm,
the solution it finds will be the top of the nearest hill, not the top of the tallest hill. We can use a
greedy algorithm to build decision trees because of the pruning step we do later.
and N+ and N− are the number of points above and below (or at) the split,
respectively. The values of c− and c+ are the average values of the dependent
variables on either side of the split. To find this split and variable we search through
the range of each independent variable to find the split that minimizes the loss.
Once we find this split, we now have two data sets, (X1^−, y1^−) for all of the cases below the first split and (X1^+, y1^+) for the cases above. For each of these we can find
a best split and variable to split on. That is, we repeat the procedure on the split data
sets; we continue in this manner until some stopping criteria are met. A few possible
stopping criteria are that we split until:
• There are a certain number of cases remaining in the partitioned data set.
• The cases in the set to be partitioned all have a value within some tolerance.
• The split does not improve the loss function sufficiently.
The first two of these procedures are used in practice to “grow” decision trees.
The third option, splitting only when it gives enough improvement in the loss function, is not used in practice because it is possible that a split that provides only marginal improvement at one level could enable a more important partition later on.
Pruning the Tree
The algorithm for growing the tree, using recursive partitions of the data set
looking for improvement in the loss function, can grow trees that are very large.
Additionally, because we do not account for how much improvement each partition
of the data set gives, we do not know what decisions in the tree are responsible for
the performance of the tree. Also, because the tree is large with many decision paths,
it is possible that the tree will be vulnerable to overfitting (i.e., because the decision
tree model is not parsimonious, the likelihood that it is overfitting the training data
is increased).
To deal with the size of the tree, we extend the arboreal metaphor and “prune”
the trees to remove unnecessary branches/decisions in the tree. We do this by
regularizing the loss function in a similar manner to regularized linear regression
by adding a penalty to the loss function.
Let T be a decision tree, with T0 being the tree fit from the procedure outlined
above. We then write the number of leaves in the tree, that is the number of places
where the tree ends and there are no more decisions to be made, as |T |. The
regularized loss function for a given decision tree, T , is written as
L(T) = Σ_{i=1}^{I} (cT(xi) − yi)² + λ|T|,   (3.4)
where cT (xi ) is the predicted value of the dependent variable for tree T given the
independent variables for case i, xi . In this loss function, λ is again a parameter that
balances the accuracy of the tree in the first term with the complexity of the tree. As
in regularized regression, λ can be chosen using cross-validation.
Using this loss function we search all of the subtrees in T0 to find the one that
minimizes the loss function. We can produce subtrees by removing a split that leads
to a leaf. This fuses two leaves together and we can then assess the loss function in
Eq. (3.4). To choose the leaves we fuse, we find the leaf that produces the smallest
increase in the mean-squared error, the first term in the loss function, when it is
removed. We repeat this procedure, fusing leaves and assessing the loss function
until we reach the original split in the tree. Comparing the loss function for all of
the trees produced will lead to the minimum value of L(T ).
Regression Trees
The procedure for building trees is recursive because each split creates new
regions in independent variable space. We then look for partitions in these
new regions and repeat until we reach the stopping criteria. However, the tree
should be pruned to avoid overfitting through a regularization process where
we look for a tree that balances the number of leaves with the error in the
predictions.
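This pruning idea is available in scikit-learn as minimal cost-complexity pruning, where the parameter ccp_alpha plays a role analogous to λ in Eq. (3.4). The sketch below, on synthetic data, shows how increasing the penalty produces trees with fewer leaves; in practice the penalty would be chosen by cross-validation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 1] > 0.5, 2.0, 1.0) + rng.normal(scale=0.1, size=200)

# Grow a large tree, then inspect the sequence of pruned subtrees
big_tree = DecisionTreeRegressor().fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::5]:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha).fit(X, y)
    print(alpha, pruned.get_n_leaves())   # larger penalties give fewer leaves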
For classification problems we work with the proportion of training cases in a region bℓ that belong to each class k,

pk = (1/Nℓ) Σ_{xi ∈ bℓ} I(yi = k),

where Nℓ is the number of cases in region bℓ and the indicator function I(yi = k) is given by

I(yi = k) = {1 if yi = k; 0 otherwise}.
G = Σ_{k=1}^{K} Σ_{k′≠k} pk pk′.   (3.5)
The Gini index will be higher when there are many different classes in a region, and it will be zero if all of the cases are of the same class. If we consider a two-class problem, K = 2, the Gini index simplifies to G = 2p1(1 − p1). Another measure of the quality of a region is the misclassification rate: if k̂ is the most common class in the region, then

E = (1/Nℓ) Σ_{xi ∈ bℓ} I(yi ≠ k̂).
To prune a classification tree we again penalize the number of leaves, using, for example, the misclassification rate in the regularized loss

L(T) = Σ_ℓ (1/Nℓ) Σ_{xi ∈ bℓ} I(yi ≠ k̂ℓ) + λ|T|,   (3.6)
where once again |T | is the number of leaves in the tree. As with regression, to
prune the tree we find the subtree of the original tree that minimizes this regularized
loss.
Once we have our final, pruned tree, we would like to use it to make predictions.
For a given leaf, we can simply give the predicted class as the class that is most
common at that leaf. Additionally, we could quote the confidence in the prediction
by reporting the misclassification rate or the Gini index for the leaf.
In categorical prediction problems the confusion matrix, C, can be a useful device
for reporting the behavior of the model. For a K-class problem, it is a K ×K matrix.
The entries Cmn are defined as the number of cases in a data set (training or test)
that were actually of class m and predicted to be class n by the model, i.e.,
Cmn = Σ_{i=1}^{I} I(yi = m) I(ŷi = n),   (3.7)
and ŷi is the predicted class for case i. A perfect model will have a diagonal
confusion matrix because all of its predictions will match the true value. The off-
diagonals indicate in which way the model is inaccurate. In Table 3.1 a confusion
matrix is shown for a hypothetical model that predicts the weather conditions.
Looking at this table we see that there were 67 cases where the weather was clear
and the model predicted correctly. There were also 12 cases where the weather was
clear, but the model predicted rain; 6 times the weather was clear and the model
predicted snow. The confusion matrix also shows us that the model is more likely
to be wrong by predicting a clear day when the actual conditions were rain or snow
(15 + 18 = 33) than predicting rain or snow when the conditions were clear. We can
also infer from the confusion matrix that the model is not very good at predicting
snow: it predicted snow correctly 21 times, and 20 times true conditions of snow
were predicted to be rain or clear.
Confusion Matrix
The confusion matrix is a generally applicable technique to understand how
classification models perform. It is a matrix of size K × K for a K-class
classification problem. Each entry in the matrix Cmn is the number of times
that the model predicted class m when the true class was class n. It allows
one to assess which classes are accurately predicted or find blind spots in
the model by looking at the magnitude of the off-diagonal terms relative to
the diagonal values in the matrix. The confusion matrix is the classification
analog to a plot of the predicted values versus actual values from a regression
problem. The confusion matrix can be computed separately for training and
test data to look for overfitting.
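As a small illustration of Eq. (3.7), the sketch below builds a confusion matrix for made-up class labels and cross-checks it against scikit-learn's implementation (assumed available).

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])   # actual class indices
y_pred = np.array([0, 1, 1, 0, 2, 2, 1, 0, 1, 0])   # model predictions

K = 3
C = np.zeros((K, K), dtype=int)
for m in range(K):
    for n in range(K):
        C[m, n] = np.sum((y_true == m) & (y_pred == n))
print(C)
print(confusion_matrix(y_true, y_pred))  # should agree with C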
In some problems certain misclassifications are more costly than others. We can encode this with a risk matrix² R, whose entry Rkk′ gives the relative cost of confusing class k with class k′, and use it to weight the Gini index:

G_R = Σ_{k=1}^{K} Σ_{k′≠k} Rkk′ pk pk′.   (3.8)

2 This is also called a loss matrix, but to avoid confusion with the loss function we use the term risk matrix.
Modifying the Gini index this way in the construction of the tree will try to find
splits that lead to minimizing incorrect predictions with higher corresponding values
in the risk matrix.
The risk matrix will not work for a two-class problem, K = 2, because the Gini
index will just account for the sum of the risk matrix elements (this is demonstrated
in an exercise). To account for asymmetric loss in a two-class problem, we need to
use a different loss function than the Gini index in the tree construction. We can use the misclassification rate in training and weight it by the risk.
Classification Trees
Classification trees are built in a similar way to regression trees; we partition
the independent variable space to find regions where the training data has the
same class. The trees are then pruned via a regularization process.
It is possible to combine many trees together to get a better model of the input–
output relationship between independent and dependent variables in a technique
known as random forests. The overall idea is that if we can build many small trees
that have errors that are not correlated and we average those trees together, we can
get a better prediction. We can indulge in a bit of analogizing here: trees grow
in forests rather than having one giant tree. In this way we want smaller trees to
combine to get a better model than one very large tree.
In random forests we use an extension of the concept of bagging. Bagging builds
multiple models using different training data: the available data is sampled with
replacement (this is known as bootstrapped sampling) to produce a suite of training
3.3 Random Forests 65
sets that models are built on. Because the models are built using different training
data, their predictions will be different and can be averaged together.
Random forests [1] modify the concept of bagging to include random sampling
of the independent variables. To grow a tree for random forests we produce a
bootstrapped sample of the training data. Then using this data we choose where
to create splits as in a normal tree, except we only consider a random subset
of the independent variables to split on. Using only a subset of the independent
variables, coupled with building each tree on a different training set, makes the trees
behave differently: each tree sees different training data and a different set of candidate
inputs as it grows. Because the training data and independent variables for the splits are
chosen randomly, the trees will have a low correlation with each other so that when
we average them together we can have errors cancel, and we get a better prediction.
Bagging
Bagging is a way to produce different models by modifying the training data
that each model uses. We take our original training data and sample with
replacement. That is, we pick a random case to put in a training set, and repeat
this process until we have the desired number of sampled cases, allowing
the cases previously selected to be selected again. Each training set that we
produce from this sampling with replacement is then used to build a different
model. These models will give different predictions because they were built
on different training data.
Random forests modify the concept of bagging by having each tree select
from a random subset of the independent variables when choosing where to
split the data.
The algorithm for producing a random forest model with B trees is given below
in Algorithm 1. This will produce a set of B trees, Tb . Each of these trees will give
a predicted value for the dependent variable Tb (x). We then produce an estimate for
the dependent variable ŷ as a simple average for regression,
$\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$,   (3.9)
or via voting for classification where the class predicted by the most trees is the
predicted class for the forest. The value of m is the number of independent variables
to consider for a split. The originators of the random forest model suggest $\sqrt{J}$,
rounded down to the nearest integer, for classification problems and $J/3$, rounded
down, for regression problems.
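Algorithm 1 is not reproduced here, but the following sketch illustrates the procedure just described using scikit-learn decision trees. The function names are illustrative, and the only choices taken from the text are the bootstrap sampling, the random subset of m variables per split, and the averaging of Eq. (3.9).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def grow_random_forest(X, y, B=100, seed=None):
    """Sketch of random forest training: B trees, each grown on a bootstrap
    sample of the data and restricted to m random candidate variables per split."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    m = max(1, J // 3)                                # suggested m for regression: J/3 rounded down
    trees = []
    for b in range(B):
        idx = rng.integers(0, I, size=I)              # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_features=m)  # m random candidate variables per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Average the individual tree predictions, as in Eq. (3.9)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)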
66 3 Decision Trees and Random Forests for Regression and Classification
Random Forests
When we combine many decision trees together, we get a forest. The random
forest model combines trees but adds randomization in two ways:
1. Each decision tree is built using a different sample of the training data.
2. When choosing how to partition the data, only a random subset of the
independent variables are considered for each split.
This added randomization makes the trees give different results that, when
averaged together in a regression problem or when they vote on a class, are
superior to a single large decision tree.
Additionally, due to the way that the random forest is built, we can perform a
type of cross-validation while the model is built. Because each tree in the forest sees
a different subset of the training data, we can estimate the test error for each tree
by using it to evaluate the cases that were not used to train it. Therefore, for each
tree we have a prediction at each case not used to train it. These predictions are the
out-of-bag estimates (because they were not in the “bag” of training samples). We
can combine the out-of-bag estimates from each tree in the forest to get a prediction
for each case using the average of the out-of-bag estimates in the regression case, or
the majority vote in the classification case. The error in these out-of-bag estimates
is an estimate of the test error from cross-validation and can be used to determine
how many trees should be in a forest, for example.
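A small illustration of the out-of-bag idea, assuming trees are trained on bootstrap samples as in the sketch above; the array sizes here are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
I = 10                                            # number of cases (small, for illustration)
idx = rng.integers(0, I, size=I)                  # bootstrap sample used to train one tree
in_bag = np.unique(idx)                           # cases the tree was trained on
out_of_bag = np.setdiff1d(np.arange(I), in_bag)   # cases never seen; use these for OOB predictions
print(in_bag, out_of_bag)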
From the random forest we can estimate the importance of the independent
variables in making a prediction. In each tree in the forest we can record the
improvement in the loss function due to splits in each variable. Summing up these
improvements over the trees gives an estimate of how important that independent
variable is to the model. In addition to the variable importance, we could estimate
the uncertainty in a prediction from the random forest by looking at the variability
in the estimates from each tree. For a regression problem this could be the standard
deviation or variance in the estimate, and for classification we could use the Gini
index. Such measures can be useful when one wants to identify important regions
in input space where the model is less certain.
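With scikit-learn's random forest both quantities can be read off directly. The data below is a placeholder, while feature_importances_ and estimators_ are the library's actual attributes.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + 0.1 * X[:, 1] + 0.01 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=200).fit(X, y)
print(forest.feature_importances_)        # loss improvement per variable, summed over trees

# The spread of the per-tree predictions gives a rough uncertainty estimate
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print(per_tree.mean(axis=0))              # forest prediction for the first five cases
print(per_tree.std(axis=0))               # standard deviation over the trees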
One of the benefits that random forests provide over a single tree is the ability
to naturally include partitions of independent variable space that are not necessarily
orthogonal to the axes. This feature arises from the fact that different trees in the
forest will likely put splits in different places on the same variable because of the
different cases used to train the trees. Therefore, when we average these slightly
different splits, the effective partition of independent variable space will not be
orthogonal, which will be seen in an example below.
Furthermore, due to the bagging of the training data, on a regression problem
the estimate for the value assigned to a region of independent variable space (i.e.,
the value at a leaf of the tree) will vary between trees because each tree will have
a different mean for the values at a node. This allows random forests to capture the
numerical change in the dependent variable without adding too many splits to the
tree.
To see how random forests behave, and to compare the method to a single tree, we
attempt to fit the function
$$y = \begin{cases} 10 & (x_1 - 3)^2 + (x_2 + 3)^2 < 4^2 \\ -10 & (x_1 - 2)^2 + (x_2 - 2)^2 < 1 \\ 2x_1^2 + x_2^2 & \text{otherwise} \end{cases}$$   (3.10)
Fig. 3.4 Comparison of contours of y(x1 , x2 ) for random forest models with different numbers
of trees and a single tree model on two training sets of 2000 points. The numbers in parentheses
denote the training set (1 or 2). The training points are also shown. (a) True function contours. (b)
10 tree forest (1). (c) 10 tree forest (2). (d) Tree model (1). (e) 100 tree forest (1). (f) 100 tree forest
(2). (g) Tree model (2). (h) 1000 tree forest (1). (i) 1000 tree forest (2)
using training data that is generated by randomly sampling values of x1 and x2 from
a normal distribution with mean 0 and standard deviation 1.25. Near the origin the
function has a smoothly varying region adjacent to two discontinuities leading to a
constant value in two circular regions.
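One possible way to set this experiment up in Python is sketched below. The test function follows Eq. (3.10), while the random seed and the scikit-learn settings (trees grown to single-case leaves, both variables available at every split) are illustrative choices that mirror the description in the text.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def true_function(x1, x2):
    """The piecewise test function of Eq. (3.10)."""
    y = 2.0 * x1**2 + x2**2
    y = np.where((x1 - 3.0)**2 + (x2 + 3.0)**2 < 4.0**2, 10.0, y)
    y = np.where((x1 - 2.0)**2 + (x2 - 2.0)**2 < 1.0, -10.0, y)
    return y

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.25, size=(2000, 2))
y = true_function(X[:, 0], X[:, 1])

single_tree = DecisionTreeRegressor(min_samples_leaf=1).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, max_features=2,
                               min_samples_leaf=1).fit(X, y)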
In Fig. 3.4 we show the results for fitting a variety of models using two different
sets of 2000 training points. The true contour plot of the value of y as functions of
x1 and x2 is shown in the top left with contour lines at y = 1, 2, 4, 6, and 8. In the
first column is the result from a tree model using each of the two training sets. With
this large number of training points the differences between the two contour plots
are minor. The most noticeable differences are in the regions where there are few
training points at the top right and lower left of the plot.
The results from random forest models with different numbers of trees are shown
in columns two and three of Fig. 3.4. We have allowed the trees to fit until there was
a single case at each leaf when growing the trees and at each split both independent
variables were used (rather than a random subset). The rows correspond to different
numbers of trees in the forest: 10 trees in row one, 100 trees in row two, and 1000
trees in row three. Between the 10 and 100 tree forests the primary difference is
in the estimation of the boundary between the circular region in the lower right
and the region where y varies linearly. This transition is smoother in the 100 tree
case, although both results are smoother than the single tree model. Additionally,
the change in the results for the two different training data sets is more pronounced
in the 10 tree model.
The differences between the 100 and 1000 tree forests seem minor when looking
at these contour plots. The exact location of the contours varies but it seems to
the eye that the contours are in general in agreement with the true contours in
the top left of the figure. In the 100 and 1000 tree forests there are some artifacts
of the production of the contour plot. There are points where different trees slightly
affect the value predicted by the model, and these artifacts produce very
narrow contours in certain regions of the problem. These tend to occur in places
where the amount of data is small, such as the left side of the plot. The exact location
of these regions does seem to depend on the training data.
If we retrain the models using only 200 training points, we can exaggerate the
differences between models. The results for this less accurate training are shown in
Fig. 3.5. In this figure we see that the differences in the tree model between the two
training sets are much more pronounced, and that the tree model produces contours
that appear to be rectangular, rather than the rounded shapes of the true function’s
contours. Additionally, we see that the 10 tree random forest model now demonstrates
more differences in the results for the two training sets. The random forest models
do not estimate as smooth a transition to the circular region in the lower right as
when they were trained with more data. We also note that for a particular data set
the difference between the 100 and 1000 tree forests remains minor, as we found in
the case with more training points.
To further investigate the behavior of the random forest model we repeat the
above exercise using different numbers of trees in the forest: we vary the number
of trees from 1 to 200. For each forest we compute the out-of-bag (OOB) error as
the variance in the out-of-bag errors for each tree divided by the variance in the
dependent variable in the out-of-bag samples. A perfect score for this metric is zero
because in this case the errors in the OOB predictions are all zero. This error metric
is sometimes called the fraction of variance unexplained because it deals with the
variance in the data that cannot be adequately captured in the model.
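A sketch of how this OOB error could be computed with scikit-learn, assuming X and y hold the training data generated in the sketch above; oob_score_ is the R² of the OOB predictions, so one minus it is the fraction of variance unexplained.

from sklearn.ensemble import RandomForestRegressor

oob_errors = {}
for n_trees in [25, 50, 100, 150, 200]:
    forest = RandomForestRegressor(n_estimators=n_trees, max_features=2,
                                   min_samples_leaf=1, oob_score=True).fit(X, y)
    # 1 - oob_score_ is the fraction of variance unexplained discussed in the text
    oob_errors[n_trees] = 1.0 - forest.oob_score_
print(oob_errors)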
Fig. 3.5 Comparison of contours of y(x1 , x2 ) for random forest models with different numbers of
trees and a single tree model on two training sets of 200 points. The numbers in parentheses denote
the training set (1 or 2). The training points are also shown. (a) True function contours. (b) 10 tree
forest (1). (c) 10 tree forest (2). (d) Tree model (1). (e) 100 tree forest (1). (f) 100 tree forest (2).
(g) Tree model (2). (h) 1000 tree forest (1). (i) 1000 tree forest (2)
The OOB error for random forest models on this problem as a function of the
number of trees in the forest is shown in Fig. 3.6. From the figure we can see that
there is clearly a difference in the error in the models based on the number of training
points: the error level the models can reach is about an order of magnitude smaller
in the case with 2000 training points. Also, from the figure we can see why the
results in Figs. 3.4 and 3.5 demonstrated that the 100 tree forest was different than
the 10 tree forest model, but not noticeably different than the 1000 tree forest. It seems
that the OOB error has converged within the variability of the training set variations
between 25 and 50 trees in the forest. Beyond this point it appears that adding more
trees has little effect, as we saw above. In hindsight, we could have used this
result to decide how many trees were enough for our data.

Fig. 3.6 The OOB error given as the fraction of variance unexplained by the model

3.4 Case Study: Predicting the Result of a Simulation Using Random Forests
As a case study for random forests we will consider the problem of predicting the
output of a computational physics simulation code. Many computer simulations
are expensive to run in terms of both computational time and human time. The
computational time cost is understood by any person who has had to wait for a
simulation to complete as part of a larger analysis. The cost of human time, while
harder to measure, is also important. Some computer simulations require effort to set
up, whether in problem specification or in generating a computational mesh or other
setup costs. Other costs include dealing with failed runs due to incorrect setting of
parameters in the simulation such as iterative tolerances, time steps, or other model
parameters.
The particular problem we will discuss is the computational simulation of a laser-
driven shock launched into a beryllium (Be) disc as part of a high-energy density
physics experiment [2–4]. An illustration of the experiment is shown in Fig. 3.7. In
this problem we use the radiation-hydrodynamics code Hyades [5] to predict when
a shock wave created by a high-energy laser will exit the Be disc, that is, when the
shock “breaks out” of the Be. The shock is created when the laser ablates some
of the Be from the disc. The momentum of this ablated material creates a counter
Fig. 3.7 Illustration of the experiment being modeled in the simulations: a laser strikes a beryllium
disc, launching a shock wave into the beryllium that eventually leaves the beryllium, i.e., “breaks
out” into the gas-filled, polymer tube. The disc thickness is exaggerated in the drawings because it
is only about 20 microns thick. (a) Side view. (b) End view
flowing shock into the disc. This shock eventually reaches the back surface of the
disc and breaks out into the gas-filled tube. At this breakout time the conditions
of the disc are used to feed another simulation of the shock propagating down a
gas-filled polymer tube.
Hyades computes the hydrodynamic motion using a computational mesh that
moves with the materials. This means that if there is significant mixing or turbulent
flow, the mesh can become twisted to the point of the simulation crashing due to
mesh zones being twisted in such a way that the effective volume of a zone is
negative. To deal with such mesh twisting sometimes requires hand adjusting of
the mesh to remove these tangled zones. This is a source of much human effort,
along with the computational cost of the simulations.
To this end one may want to build a machine learning model, often called a
surrogate model, to predict what the Hyades output would be, in this case the shock
breakout time. For these particular simulations there are 5 parameters that we could
consider varying. The Be disc thickness can be varied because the thickness of
the disc can vary within several microns (μm) due to machining tolerances. The
second parameter is the laser energy because there is shot-to-shot variation in the
actual amount of energy that the laser can deliver to the disc. In addition to these
two physical parameters, there are three model parameters to vary. These model
parameters are often set to match some experimental observation, and as such they
are sometimes called tuning parameters. These tuning parameters are used to deal
with the fact that the simulation uses some approximate models to compute the
system evolution. In the code’s mathematical models we use a gamma-law equation
of state for Be, and the value of γ in the equation of state is a parameter we can
vary. Additionally, the value of the flux limiter for the electron diffusion model
is another parameter that can vary, as well as a parameter called the wall opacity
that is a multiplier on the optical thickness of the tube wall. These five parameters
(2 physical and 3 model parameters) will be our independent variables and the
dependent variable will be the shock breakout time.
Fig. 3.8 The 104 sets of input parameters for the Hyades simulation runs. The diagonal plots show
the histogram of the samples for each independent variable, and the off-diagonals show the scatter
plot of the independent variables projected onto the 2-D planes of each pair of variables. The scatter
plot points are shaded so that points with a shorter breakout time are darker
For these five parameters we have sampled 104 points³ and run Hyades calculations
for each and recorded the shock breakout time. The sampled points projected
onto the two-dimensional planes of each pair of coordinates are shown in Fig. 3.8,
as well as the histograms for the samples of each independent variable. The samples
do a reasonable job of covering the five-dimensional input space. Additionally,
this figure shades the points in the scatter plots so that simulations with a shorter
breakout time are darker.
3 Theseare a subset of the original planned points, but several simulations could not be run to
completion.
Fig. 3.9 The OOB error for the shock breakout time model as a function of the number of trees in the forest
Our task now is to use these simulations to produce a random forest model to
predict the shock breakout time as a function of the five inputs. For this we use
the implementation of the random forest model in the library scikit-learn
for Python [6]. We break our data into training and test sets with 15 simulations
in the test set. We then compute the OOB error, which is the fraction of variance
unexplained, for random forest models using different numbers of trees. We see in
Fig. 3.9 that by 100 trees per forest, the OOB error seems to have converged. Also,
given our relatively small data size there is little downside in terms of model cost by
using 200 trees in the forest so we use that for our final model.
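A sketch of this workflow follows. Because the simulation data is not distributed with the text, the arrays below are randomly generated placeholders with the same shape and parameter ranges; the scikit-learn calls are the standard ones.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the study: 104 cases, 5 inputs
# (laser energy, Be thickness, Be gamma, wall opacity, flux limiter), one output.
rng = np.random.default_rng(0)
lo = [3600.0, 18.0, 1.40, 0.7, 0.050]
hi = [4000.0, 22.0, 1.75, 1.3, 0.075]
X = rng.uniform(lo, hi, size=(104, 5))
y = 0.40 - 0.10 * (X[:, 2] - 1.4) + 0.01 * (X[:, 1] - 20.0)   # stand-in response, not the real physics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=15, random_state=0)
surrogate = RandomForestRegressor(n_estimators=200, oob_score=True).fit(X_train, y_train)
print("OOB fraction of variance unexplained:", 1.0 - surrogate.oob_score_)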
The next step is to build the model and compare its accuracy on the test and
training sets. In Fig. 3.10 we show the predicted values for the shock breakout
from the random forest model versus the true values in the data. A perfect model
would have all of the results fall on the 45° line x = y. From the results we see
that the training data falls closer to the correct predictions relative to the test data.
However, there does not appear to be any bias or systematic flaws in the predictions
for the training points. For the test data the mean-absolute error was 0.013 ns and the
mean-absolute percent error was 3.35%. These are both acceptable given the large
variation of breakout times in the simulations. The mean-absolute error (MAE) is
defined as
$L_{\mathrm{MAE}} = \frac{1}{I} \sum_{i=1}^{I} |\hat{y}_i - y_i|$,   (3.11)

and the mean-absolute percent error (MAPE) is defined as

$L_{\mathrm{MAPE}} = \frac{100}{I} \sum_{i=1}^{I} \frac{|\hat{y}_i - y_i|}{|y_i|}$.   (3.12)
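Both metrics are a couple of lines of numpy; the snippet below assumes the surrogate model and test split from the placeholder sketch above.

import numpy as np

def mae(y_true, y_pred):
    """Mean-absolute error, Eq. (3.11)."""
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    """Mean-absolute percent error, Eq. (3.12)."""
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

y_pred = surrogate.predict(X_test)
print(mae(y_test, y_pred), mape(y_test, y_pred))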
Fig. 3.10 Predicted versus actual shock breakout times (in nanoseconds) for the 200 tree random
forest model
Table 3.2 The variable importances provided by scikit-learn for the shock breakout
random forest model
Variable Laser energy Be thickness Be gamma Wall opacity Flux limiter
Importance 0.010 0.188 0.731 0.013 0.058
Now that we have a random forest model that can predict the shock breakout
time given the five inputs, what should we do with it? We can use this model to
understand what inputs are most important in the simulation. The random forest
model in scikit-learn does give a variable importance measure based on the
decrease in variance attributed to each input. The numbers are reported as a number
for each independent variable and the total importances sum to 1. These values
are reported in Table 3.2. From these outputs we see that the value used in the
gamma-law equation of state, Be Gamma in the table, has the largest importance,
followed by the thickness of the Be disc, and the flux limiter value. The laser energy
and the wall opacity have seemingly little importance. For the laser energy, this
can be explained by the fact that the range of laser energies was not that large and
the fact that differences in the laser energy can be counteracted by changing other
parameters. The wall opacity not being important is due to the fact that the optical
properties of the polymer tube do not influence the shock behavior prior to breakout.
To get a better sense of how these parameters can affect the predicted breakout
time, we will plot the main effects of the model. The main effect for an independent
variable is the average behavior of the output when we average the predictions over
all independent variables except for one. In Fig. 3.11 we plot the main effects for
each of the independent variables. To compute these main effects we compute the
average model prediction at a given value of a single independent variable, using
the available data for all of the other independent variables. For example, the main
effect for variable x1, the laser energy, would be computed as
Fig. 3.11 The main effects, as calculated via Eq. (3.13), for the random forest model. The dashed
lines are plus/minus one standard deviation of the model outputs as a function of the independent
variable. For reference in the lower right the histogram of the breakout times in the simulation data
is shown
$m(x_1) = \frac{1}{I} \sum_{i=1}^{I} f(x_1, x_{i2}, x_{i3}, x_{i4}, x_{i5})$,   (3.13)
where I is the number of available data points and xij is the ith value of variable
j in the data set. The value f (x1 , xi2 , xi3 , xi4 , xi5 ) is computed using the random
forest model. This calculation can be performed for each independent variable. The
variables that are not important in the model will have their main effect functions be
relatively flat, indicating that varying just this independent variable does not affect
the mean output of the model. Also, at each value of the independent variable at which
we compute the main effect, we can compute a standard deviation of the I
model outputs at that point.
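A sketch of a main-effect calculation following Eq. (3.13); the model and data are assumed to be the placeholder surrogate and input matrix from the earlier sketches.

import numpy as np

def main_effect(model, X, j, values):
    """Main effect for variable j, Eq. (3.13): fix variable j at each requested
    value, keep the observed values of every other variable, and average the
    model predictions (also recording their spread)."""
    means, stds = [], []
    for v in values:
        Xv = X.copy()
        Xv[:, j] = v
        preds = model.predict(Xv)
        means.append(preds.mean())      # m(x_j)
        stds.append(preds.std())        # spread of the I model outputs at this value
    return np.array(means), np.array(stds)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
m_laser, s_laser = main_effect(surrogate, X, j=0, values=grid)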
For the main effects plotted in Fig. 3.11 we can see that the value of Be
gamma and the disc thickness are two important variables. The mean prediction
changes noticeably as these variables change, and the standard deviation bands
are also narrower. We see that increasing the value of Be gamma decreases the
shock breakout time. The main effects also show that increasing the disc thickness
increases the breakout time, as one might expect.
The flux limiter parameter demonstrates some influence on the model predic-
tions: the main effects for this variable change between the lowest value and the
largest value of the flux limiter. However, the laser energy and the wall opacity have
main effects curves that appear flat. This indicates that changing these variables is
much less important than changing the other variables. Moreover, the ±1 standard deviation
band for these variables is nearly the entire range of the breakout data, as can
be seen from the histogram. In this case the importances we infer from the main
effects are the same as from the reported importances from scikit-learn, but
the extra work is worth it because we can get a better understanding of how the
model behaves. The cost of computing the main effects is also modest because we
are just evaluating the random forest model repeatedly.
The random forest model can be thought of as an inexpensive way to estimate what
a full simulation would calculate the shock breakout time to be. One possible use
of this tool is to determine what the values of the simulation parameters should
be to get a desired result. For a given experiment there will be a value of the disc
thickness and the laser energy, but the value of the Be gamma, the flux limiter, and
the wall opacity are free to be modified to match the experimental breakout time. We
consider a scenario where we have a disc of thickness 20.16 µm, a laser energy of
3820 J, and a measured breakout time of 0.39 ns. We wish to use our random forest
model to determine what the correct values of Be gamma, the flux limiter, and the
wall opacity are to match this experiment.
To do this we first set the wall opacity to the median value from the previous
simulation data, 1.005. This is done because our analysis above showed that it had
minimal impact on the shock breakout time. We now need to find the values of Be
gamma and the flux limiter that we need to match the simulation. We know that our
random forest model is not perfect because there is some error in the predictions.
Using the test data we estimate the standard deviation of the error on the test data to
be σ = 0.01566 ns. We then use this standard deviation to score a particular value
of a random forest prediction as
$\mathrm{score}(x) = \exp\left(-\frac{(f(x) - 0.39)^2}{\sigma^2}\right)$.   (3.14)
We vary the values of Be gamma and the flux limiter over their respective ranges in
the original data while fixing the values of the laser energy, disc thickness, and wall
opacity at the values given above. The value of the score will be 1 if the random
forest model predicts the desired value of 0.39 ns and the score decreases as the
prediction moves away from the desired value for the shock breakout time. The
standard deviation scales how fast this decrease occurs. If the model predicts a value
that is less than a standard deviation away from the desired value, the score will be
higher than if the prediction is farther away. This also implies that our calibration
will be better if the standard deviation is small because in that scenario only values
quite close to the desired value will get a high score.
Note that our scoring mechanism is symmetric in that predictions above or below
the target of the same magnitude would get the same score. We have implicitly
assumed that the random forest’s errors are symmetric. This is reasonable because
in our test we did not observe a strong bias in the predictions. If we had seen this, we
might have to adjust the way we score a prediction. Furthermore, if we had evidence
that σ was also a function of the inputs, we might have to adjust our scoring as well.
In Fig. 3.12 we plot the score as a function of the two calibration parameters. To
create this figure we evaluated the random forest model at a grid of 200 × 200 points
for Be gamma and the flux limiter. In this figure the dark regions correspond to high
score regions of input space. The calibration reflects the fact that Be gamma has a
stronger influence on shock breakout than the flux limiter: for every value of the flux
limiter we can get a high score, but only certain values of Be gamma give a high
score.
We can use the calibration to decide, for example, where we should run the full
Hyades code to get the desired breakout time. This could represent a large saving
in human and computer time because we would not be required to run multiple
simulations to perform the calibration. One question that naturally arises is what
value of Be gamma and the flux limiter should one use in running the code. In
Fig. 3.12 we see that there is a region between Be gamma of about 1.60 and 1.65
and a flux limiter range of 0.0525 and 0.0575 where the score is near one. Given
that this plateau of high score is the largest such region, it is sensible to pick a value
in the centroid of this region. This selection is preferred to, for example, a point on
the ridge of high score just below Be gamma of 1.60. At this value of Be gamma it
seems any value of the flux limiter would allow a high score, but a slight change in
Be gamma has the score drastically decrease for certain values of the flux limiter.
If multiple runs of the code were possible, one could sample points for Be gamma
and the flux limiter using the scores and rejection sampling [7]. One picks at random
values for Be gamma and the flux limiter in their respective ranges. Then one picks
another random number, ξ , between 0 and 1. If the score at the selected values of Be
gamma and the flux limiter is larger than ξ , we then run those values through a full
simulation. Otherwise, the point is rejected and we sample the values and ξ again.
Doing this repeatedly will pick points that have their density distributed in the same
way as the scores in Fig. 3.12.
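A sketch of the scoring of Eq. (3.14) and the rejection sampling just described. The fixed values, target, and parameter ranges are taken from the text, while the surrogate model and the column ordering of the inputs are assumptions carried over from the earlier placeholder sketches.

import numpy as np

sigma = 0.01566                      # estimated standard deviation of the test error (ns)
target = 0.39                        # measured breakout time (ns)
energy, thickness, opacity = 3820.0, 20.16, 1.005

def score(gamma, flux):
    """Score of Eq. (3.14) for one candidate (Be gamma, flux limiter) pair."""
    x = np.array([[energy, thickness, gamma, opacity, flux]])
    f = surrogate.predict(x)[0]
    return np.exp(-(f - target) ** 2 / sigma ** 2)

# Rejection sampling of calibration points
rng = np.random.default_rng(0)
accepted = []
while len(accepted) < 10:
    g = rng.uniform(1.40, 1.75)      # Be gamma range in the data
    fl = rng.uniform(0.050, 0.075)   # flux limiter range in the data
    if score(g, fl) > rng.uniform():
        accepted.append((g, fl))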
This calibration exercise is just one way we can use our random forest model. We
could perform a calibration for multiple values. Say we have a suite of experiments
with different disc thicknesses and laser energies, we could have the score reflect
what values of the inputs match multiple experiments by multiplying exponentials
together, for instance. Additionally, we found in our model that Be gamma was an
important parameter in determining the shock breakout time. This indicates that an
investment in improving the equation of state model in the simulation (where the
Be gamma is a parameter) would have a bigger impact on the ability of the code to
model the shock breakout time than, for example, improving the electron diffusion
calculation, because the flux limiter parameter in that model had a much smaller
impact.
Notes
Our discussion of tree models did not touch on some other machine learning models
that are based on trees. For a detailed overview of these models, and a more
mathematically rigorous treatment of random forest models, please see [8]. The
models include:
Problems
3.1 Consider a two-class classification problem and a generic risk matrix R. Show
that the weighted Gini index given in Eq. (3.8) will not affect the building of a
decision tree.
The remaining problems can be completed using a library such as scikit-learn.
3.2 Construct a training data set for a regression problem using the formula y =
10x1 + x2 + 0.1x3 + 0.01x4 + 0.001x5 by sampling each xi from a standard normal
distribution (that is a normal distribution with mean 0 and standard deviation 1).
Use 100 cases, that is 100 samples for the xi and 100 corresponding values for y.
Compare the following models: (1) a single decision tree, (2) random forests, and
(3) linear regression using a test sample. Repeat the comparison using 10 and 1000
cases in the training and comment on the changes.
3.3 Consider a problem where there are two independent variables z and τ and the
dependent variable is U (z, τ ). In Table 3.3 blank entries are zero. Fit the data using
a decision tree and random forests.
3.4 Build decision trees and random forests on the Rosenbrock function
by sampling 100 points randomly in the interval [−1, 1] to create the training set.
Use the models to find the location of the minimum of the function and compare
your result with the true values of x = −1 and y = 1.
3.5 Generate a classification data set by sampling values for x1 and x2 randomly in
the interval [0, 1]. Then evaluate the function
$f(x_1, x_2) = \exp\left(-\frac{x_1^2 - x_2^2}{10}\right)$,
and compare the values to another random number, ξ , in the interval [0, 1]. If
ξ > f (x1 , x2 ), the case is in class 1; otherwise, the case is in class 0. Build a
random forest model and decision tree model for the data you generate. Visualize
the decision boundaries and evaluate the confusion matrix for models built with 10,
100, and 1000 cases.
References
The previous chapters on linear regression and tree-based models focused on super-
vised learning problems where we wanted to understand the functional relationship
between independent variables and dependent variables, that is, we wanted to
find approximations to the relationship y = f (x). In this chapter we approach
the problem of unsupervised learning where we have a set of data comprised of
many variables and we want to find patterns or relationships between the different
variables.
There are many scenarios where one may have several different measurements
for each case. These measurements could be the size of different body parts (head,
arms, feet, etc.) in a human being, the performance characteristics of an automobile,
or the pixel values in an image. These measurements are telling us about the
underlying cases, but not all the measurements are independent. For example, in
the case of the human body someone with large arms will tend to have large feet,
but not always. We would like to take the measurements that we have and find ways
to combine them into uncorrelated measurements that tell us the ways in which our
data tends to vary.
To do this we consider a data set that is comprised of I cases and that each
case has J measurements or variables associated with it. We write each case as
xi = (xi1 , xi2 , . . . , xij . . . , xiJ ) for i = 1, . . . , I . We can assemble these cases into
a matrix X of size I × J where each row is a case and each column corresponds to
a given variable. We are interested in how the columns of X are related: that is how
the measurements may vary together or be correlated. We accomplish this using
the singular value decomposition (SVD). For the discussion below we assume that
I > J.
The SVD of the matrix X factors X into three matrices:
$X = U S V^T$,   (4.1)

$X = U_r S_r V_r^T$,   (4.2)

1 A matrix is orthogonal if the product of the matrix with its transpose is the identity matrix, I.
4.1 Singular Value Decomposition 85
The matrix U has the same number of rows as the original data matrix X. Each
column in U represents the value of a new variable that is uncorrelated with the other
columns. Each of these columns is a linear combination of the original columns in
the data matrix. This linear combination is given by the columns of the matrix V
(or, equivalently, the rows of VT ) times the corresponding singular value. Given
that the singular values are ordered in decreasing magnitude, the columns of U give
the uncorrelated variables in decreasing order of importance. Therefore, we can
understand the cases of our data by looking at the first few columns in U in many
instances.
The columns of V define a linear combination of the original measurement values
to form a new uncorrelated variable. We can try to interpret what these mean by
looking at the changes as we add terms to the SVD reconstruction of the original
matrix. This will indicate what effects the different uncorrelated variables contribute
to. In the case study below we use this approach to interpret the uncorrelated
variables.
SVD to Approximate a Matrix
Even if all p singular values are non-zero, it is possible that several of the singular
values are approximately zero or that the first few singular values are much larger
than the others. Consider the fraction $m_r$ defined as

$m_r = \frac{\sum_{n=1}^{r} s_n}{\sum_{n=1}^{p} s_n}$.   (4.3)

If $m_r$ is close to one for a value of $r$ much smaller than $p$, we can approximate the matrix using only the first $r$ singular values and singular vectors:

$X \approx U_r S_r V_r^T$.   (4.4)
From the matrix S we see that the first element in the diagonal is much larger than
the other entries. This indicates that the columns are not independent measurements
and that there is some dependence between them. We can further see this by looking
at the values of $m_r$: $m_1 = 0.8861$, $m_2 = 0.9825$, and $m_3 = 0.9976$. This indicates
that we can approximate the matrix A using just two vectors (the first column of
U and the first row of VT ) and the first entry on the diagonal of S and expect to
produce a reasonable approximation because 88.6% of the variability in the data
will be captured by the r = 1 approximation:2
$$\underbrace{\begin{pmatrix} 0.95 & 0.37 & 0.58 & 0.18 \\ 0.59 & 0.15 & 0.40 & 0.11 \\ 0.29 & 0.22 & 0.11 & 0.03 \\ 0.12 & 0.12 & 0.04 & 0.05 \\ 0.23 & 0.23 & 0.07 & 0.03 \\ 1.24 & 0.51 & 0.74 & 0.23 \end{pmatrix}}_{A} \approx \underbrace{\begin{pmatrix} 0.554 \\ 0.341 \\ 0.169 \\ 0.076 \\ 0.139 \\ 0.723 \end{pmatrix}}_{U_1} \underbrace{\begin{pmatrix} 2.141 \end{pmatrix}}_{S_1} \underbrace{\begin{pmatrix} 0.801 & 0.328 & 0.478 & 0.148 \end{pmatrix}}_{V_1^T}$$
$$= \begin{pmatrix} 0.95 & \mathbf{0.39} & \mathbf{0.57} & 0.18 \\ \mathbf{0.58} & \mathbf{0.24} & \mathbf{0.35} & 0.11 \\ 0.29 & \mathbf{0.12} & \mathbf{0.17} & \mathbf{0.05} \\ \mathbf{0.13} & \mathbf{0.05} & \mathbf{0.08} & \mathbf{0.02} \\ \mathbf{0.24} & \mathbf{0.10} & \mathbf{0.14} & \mathbf{0.04} \\ 1.24 & 0.51 & 0.74 & 0.23 \end{pmatrix}.$$
2 The SVD of the matrix is written only to three decimal places to conserve space. In practice we
would use more digits.
4.1 Singular Value Decomposition 87
In the above approximate version of A the entries that are not correct to 2 decimal
places are written in bold. Examining this approximation we see that one row is
exactly reproduced, and the first column has no errors larger than 0.01. Depending
on our application, this approximation may be sufficient. In such a case we would
only need a single variable for each case, stored in the vector $U_1$, for our data
matrix, and then use $S_1$ and $V_1$ to reproduce the original variables in A.
If we use the two term approximation, r = 2, to approximate A, we get nearly a
perfect approximation:
$$\underbrace{\begin{pmatrix} 0.95 & 0.37 & 0.58 & 0.18 \\ 0.59 & 0.15 & 0.40 & 0.11 \\ 0.29 & 0.22 & 0.11 & 0.03 \\ 0.12 & 0.12 & 0.04 & 0.05 \\ 0.23 & 0.23 & 0.07 & 0.03 \\ 1.24 & 0.51 & 0.74 & 0.23 \end{pmatrix}}_{A} \approx \underbrace{\begin{pmatrix} 0.554 & -0.101 \\ 0.341 & -0.443 \\ 0.169 & 0.516 \\ 0.076 & 0.323 \\ 0.139 & 0.650 \\ 0.723 & 0.007 \end{pmatrix}}_{U_2} \underbrace{\begin{pmatrix} 2.141 & 0 \\ 0 & 0.233 \end{pmatrix}}_{S_2} \underbrace{\begin{pmatrix} 0.801 & 0.328 & 0.478 & 0.148 \\ -0.047 & 0.865 & -0.496 & -0.061 \end{pmatrix}}_{V_2^T}$$
$$= \begin{pmatrix} 0.95 & 0.37 & 0.58 & 0.18 \\ 0.59 & 0.15 & 0.40 & 0.11 \\ 0.28 & 0.22 & 0.11 & 0.05 \\ 0.13 & 0.12 & 0.04 & 0.02 \\ 0.23 & 0.23 & 0.07 & 0.03 \\ 1.24 & 0.51 & 0.74 & 0.23 \end{pmatrix}.$$
From this result we see that the largest error is 0.03, and nearly all of the entries
are exact to 2 decimal places. Therefore, we can replace the 4 measurements in
the original data with 2 linear combinations of the original columns, given by the
first two columns of V. To see how such a reduction could be used in practice, we
examine a case study next.
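The approximations above are easy to reproduce with numpy; note that the library may return singular vectors with the opposite sign from those printed in the text, which does not change the reconstructed matrix.

import numpy as np

A = np.array([[0.95, 0.37, 0.58, 0.18],
              [0.59, 0.15, 0.40, 0.11],
              [0.29, 0.22, 0.11, 0.03],
              [0.12, 0.12, 0.04, 0.05],
              [0.23, 0.23, 0.07, 0.03],
              [1.24, 0.51, 0.74, 0.23]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
m = np.cumsum(s) / np.sum(s)                 # the fractions m_r of Eq. (4.3)
print(m)

A1 = s[0] * np.outer(U[:, 0], Vt[0])         # one-term (r = 1) approximation
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]      # two-term (r = 2) approximation
print(np.abs(A - A2).max())                  # largest error of the two-term approximation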
Data Reduction with SVD
When we replace a matrix by its r-term SVD, we reduce the storage required. If the
original matrix was of size I × J, it required IJ numbers in storage/memory. For
an r-term SVD we need to store Ir numbers for U_r, r numbers for S_r because only
the non-zero terms need to be stored, and Jr numbers for V_r. Therefore, the total
storage for the r-term SVD is r(I + J + 1). Therefore, if r is much less than I or J,
the storage is greatly reduced.
Consider the case of having I = 10⁶ with J ≪ I. This is a common case
in data applications where there might be millions of cases of tens to hundreds of
measurements. In this case the storage required in the r-term SVD is r(I + J +
1) ≈ rI. The fraction of data needed to be stored, relative to the original matrix,
is then rI/(IJ) = r/J. As a result, the storage is reduced by a factor r/J. It is
commonplace that r could be an order of magnitude smaller than J so that the data
storage requirement goes down by about an order of magnitude.
4.2 Case Study: SVD to Understand Time Series

Fig. 4.1 Schematic of nominal hohlraum geometry for the simulated experiments, with nominal dimensions Zhoh = 0.5 cm, Zbaf = 0.2 cm, RLEH = 0.155 cm, Rapt = 0.1155 cm, and Rhoh = 0.28875 cm. The sample to be irradiated is shown for illustration purposes in the center of the geometry

Fig. 4.2 The mean temperature response as a function of time (ns), plotted using cubic splines to connect the data points
Rapt /Zbaf = Rhoh /Zhoh holds. The variable “rapt” multiplies Rapt . Each of these
three parameters, scale, scl, and rapt, can take values of 0.8, 0.85, 1, 1.05, 1.2, 1.25.
We have data from 15 simulations for the average temperature of the gold near
where the laser is focused at 30 different times from −0.399979 to 3.50014 ns; the
zero time is when the main part of the laser pulse begins. We assemble this data into
a matrix of 15 rows and 30 columns. We then compute the mean of each column
of the matrix to get the average temperature profile over all of the simulations. This
mean temperature response is shown in Fig. 4.2; the units of temperature used are
keV where 1 keV ≈ 11.6 × 10⁶ K. Before computing the singular value decomposition of the data
matrix the mean of each column is subtracted from the data matrix so that each
column represents the difference from the mean for each simulation as a function of
time. We call this mean temperature profile T̄.
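A sketch of the mean-subtraction and SVD steps just described; the 15 × 30 data matrix below is a random placeholder standing in for the simulated temperature histories.

import numpy as np

# Placeholder data matrix: 15 simulations (rows) by 30 time points (columns).
rng = np.random.default_rng(0)
data = 0.1 + 0.05 * rng.random((15, 30))

T_bar = data.mean(axis=0)            # mean temperature profile over the simulations
centered = data - T_bar              # subtract the column means before the SVD

U, s, Vt = np.linalg.svd(centered, full_matrices=False)
m = np.cumsum(s) / np.sum(s)         # m_1, m_2, ... as in Eq. (4.3)
print(m[:3])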
We now compute the SVD of the data matrix, after subtracting the mean. The
values for m from the SVD indicate that the first few singular values capture nearly
all the variations between the different simulations. The value for m1 is nearly 0.7,
Fig. 4.3 The vectors v1 and v2, the first two columns of V
implying that the largest singular value explains about 70% of the variation between
cases. The next two values are m2 = 0.836 and m3 = 0.882. Given that we care
most about the overall behavior of the temperature as a function of time, the first
two singular components appear to be sufficient to capture the temperature profile.
The means that we can write the vectors of temperature for each case i at the 30
time points, T, as the sum of the mean and contributions from the first two columns
of V:
of V:

$T_i \approx \bar{T} + u_{i1} s_1 v_1 + u_{i2} s_2 v_2$.   (4.5)

Fig. 4.4 Temperature profiles and their SVD-based approximations. Cases 1 and 4 had the largest positive and negative magnitudes, respectively, for ui1; cases 5 and 9 had the largest positive and negative magnitudes, respectively, for ui2. The full data is shown using a solid black line and the approximations using 1 and 2 singular values use a dashed and dotted line, respectively. (a) Cases 1 (upper curve) and 4 (lower curve). (b) Cases 5 and 9
geometry is larger there is a larger volume to contain the energy and the energy
density goes down. A smaller geometry gives a higher energy density and a higher
temperature.
The impact of the contribution from ui2 is shown in Fig. 4.4b. Here we show
the two cases with the largest positive and negative values for ui2 : cases 5 and 9,
respectively. For these cases an approximation based solely on the first singular
value cannot capture the behavior at early and late times near the plateau of the
temperature profile. When the second singular component is included, the approximate
profiles are indistinguishable from the true profiles. We can also see that the
influence of the second singular component is much smaller than the first singular
component, as expected from the difference between m1 and m2 . Cases 5 and 9 had
the extreme values for the sample chamber length: case 5 had the minimum value
for this quantity in the data, and case 9 had the maximum value. From the figure it
appears that having a large value of sample chamber length makes the temperature
plateau rise faster and fall off sooner; case 5 has a small value for this parameter and
the plateau is reached at a later time and remains there longer.

Fig. 4.5 Visualization of the linear regression models for ui1 and ui2
Using the results above we will build two linear regression models, as described
in Sect. 2.2, to predict the values of ui1 and ui2 as a function of the three parameters
via Eq. (4.5). This will allow us to predict the time profile of the temperature for
any given modification to the hohlraum that can be described via the parameters
scale, scl, and rapt. After some investigation (including cross-validation) we find
that adequate models are given by
The linear models are visualized in Fig. 4.5. From these models we can see that for
ui1 the most important parameter is the scale parameter. This is consistent with our
findings from the above where the simulation cases with extreme values of scale
had extreme values for ui1 . For ui2 it appears that scl is the parameter that has the
largest impact.
We can compare the predicted temperature profiles with the data used to train the
model. When doing so we find that on a graph the model output and the data are
nearly indistinguishable.
4.3 K-means
K-means partitions the I cases into K groups, $S_1, \ldots, S_K$, by minimizing the loss function

$L = \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - \mu_k\|^2$,   (4.8)

where $\mu_k$ is the vector of the means of each of the variables over the observations in set $S_k$:

$\mu_k = \frac{1}{N_k} \sum_{x_i \in S_k} x_i$.   (4.9)
When the minimum of L is found, we have each case assigned to a group, Sk , and
each group is represented by the mean value of the group. Therefore, we can think
of each group, or cluster, as observations “nearby” the mean of the group.
This is a useful way to understand the different clusters that arise in the data, but
there are reasons that these clusters may not be ideal. For instance, just because we
have minimized L in Eq. (4.8) does not mean that all the cases are close to a group
mean. Moreover, in minimizing L we have to pick the number of clusters K. The
value of the loss function will likely change with K, and it may be difficult to know
ahead of time what K should be.
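A sketch of this kind of K-means experiment with scikit-learn; the generating centers and sample counts below are assumptions chosen to resemble the example, using the standard deviation of 0.6 mentioned later in the text.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0.8, 0.0], [1.75, 2.0], [4.0, 0.5]])   # assumed generating centers
X = np.vstack([rng.normal(c, 0.6, size=(100, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.cluster_centers_)    # estimated cluster centers
print(km.inertia_)            # the value of the loss L in Eq. (4.8) at the solution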
K-means Clustering
K-means finds K different clusters in a set of data by grouping each case
based on how close it is to the average of the cases in a cluster. This method
requires the user to specify the number of clusters to find, K. Also, the clusters
found may change when the algorithm is run multiple times.
Fig. 4.7 The center of each cluster chosen by K-means is indicated with a large marker, and clustered points identified by K-means that disagree with the distribution the original data was drawn from are plotted using an unfilled symbol. (a) Original data. (b) Clusters from K-means
The centers of the clusters as identified by K-means are x̄1 = (0.825, −0.0525),
x̄2 = (1.761, 2.003), x̄3 = (4.083, 0.491). These values are near the true centers
of the distributions used to produce the data. It is worthwhile to remember that
the K-means algorithm has no information about the original distribution, and we
only told it to look for 3 clusters. Given how it has performed, we could have
reasonable confidence that a new point generated from one of the distributions could
be correctly identified as which distribution it came from.
We can make the clustering problem more difficult by increasing the standard
deviations of the distributions so that there is more overlap in the data. Specifically,
we increase the standard deviation in the normal distributions to be 0.75. This causes
the samples from each distribution to significantly overlap, as shown in the top
panel of Fig. 4.8. We would not expect K-means to give identical answers as before
because there are many points that are closer to the center of a different distribution
than the center of the distribution it was sampled from. However, the centers of the
distributions are not changed so it may be possible to identify these clusters using a
centroid.

Fig. 4.8 The center of each cluster chosen by K-means is indicated with a large marker, and points clustered by K-means that disagree with the original data are drawn as an outline. (a) Original data. (b) Clusters from K-means
In the lower panel of Fig. 4.8 we see that there are many more points identified
with a different cluster than the distribution it was generated from. This comes
from the fact that K-means only uses distances in its clustering. The cluster centers
identified by K-means are x̄1 = (0.752, −0.046), x̄2 = (1.679, 2.018), x̄3 =
(3.855, 0.605). These centers are less accurate than what we saw when the clusters
were better separated, but they still are reasonable approximations and qualitatively
they indicate the three clusters where we would expect them to be.
In a sense we cheated a little bit in this example. The primary user input to K-
means is the value of K. We used K = 3 so far in the example, and K-means
found our three clusters without much difficulty. The question naturally arises of
how K-means would behave if it were run with a different number of clusters, K.
The results using K = 2 and K = 5 on the data using a standard deviation
of 0.6 in the distributions used to generate the samples are shown in the top and
bottom panels of Fig. 4.9, respectively. In the figure we see that when K = 2 the
two clusters identified seem to be divided solely on the value of x1 . This is likely
due to the fact that the range of x1 in the data is greater than the range in x2 . Just
viewing the data in the figure this clustering seems reasonable, but it is possible to
spot the problem. In the left cluster the center is not near many data points: it appears
that there are two clusters that this point is splitting. This is one way to identify that
there may be a problem. We can also look at the value of the loss function, Eq. (4.8).
The value of L for K = 2 is 226.5 and for K = 3 it is L = 101.9. The magnitude
of the loss function is smaller for K = 3, but we should weight the loss by the number
of clusters by multiplying L by K. This will help deal with the fact that if K = I , that is,
the number of clusters is equal to the number of data points, the value of L = 0. For
this example LK = 453.0 for K = 2 and LK = 305.6 for K = 3. Based on this
measure we can see that K = 3 is superior to K = 2.
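The LK diagnostic can be computed by reusing KMeans over a range of K. This sketch assumes the sample array X from the clustering sketch above; inertia_ is scikit-learn's value of the loss in Eq. (4.8).

from sklearn.cluster import KMeans

LK = {}
for K in range(1, 20):
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    LK[K] = K * km.inertia_        # L times K, the quantity plotted in Fig. 4.10
print(LK)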
Fig. 4.10 The value of the loss function, L, times the value of K used in the K-means clustering
for the two data sets considered
The case of K = 5 as shown in the lower panel of Fig. 4.9 has the resulting
clusters appearing to split two of the three “true” clusters into two different clusters.
As with the K = 2 case, the clusters identified seem reasonable to the eye at first
look. The value of LK for K = 5 is 353.6, significantly higher than the value for
K = 3. This would give us a reason to select K = 3 over K = 5.
We can further investigate the idea of using LK to choose K by examining these
values for both the data sets we have generated (one using 0.6 as the standard
deviation in the normal distributions and the other using 0.75 for the standard
deviation). The value of LK is plotted as a function of K in Fig. 4.10 for each of
these data sets. In both curves there is an obvious local minimum at K = 3. Though
it appears that large values of K will have a lower value of LK, if we use parsimony
as a guide, we can easily settle on K = 3 rather than some value K > 20. Looking
at a plot such as this one helps guide the selection of K.
4.4 t-SNE
where f (x|μ, σ ) is the probability density for x given a distribution mean of μ and
standard deviation σ . The Cauchy distribution (i.e., the t-distribution with a single
degree of freedom) has a probability density function given by
$f(x \mid \mu, \gamma) = \frac{1}{\pi \gamma} \frac{\gamma^2}{(x - \mu)^2 + \gamma^2}$.   (4.11)
For the Cauchy distribution μ is also the mean of the distribution. The parameter γ
gives information about the tails of the distribution, but the Cauchy distribution has
an undefined standard deviation and variance because the distribution approaches
zero so slowly outside of the mean. In Fig. 4.11 normal and Cauchy distributions
are compared using μ = 0 and σ = γ = 1. The tails of the Cauchy distribution
are much heavier than that for the normal as shown by the distribution approaching
zero more slowly as x goes away from the mean. If we look at the value at x = 4,
we get that the value for the Cauchy distribution is about 140 times greater than that
for the normal.
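This factor is easy to check numerically; a one-line comparison using scipy (an assumed dependency, as the text itself only mentions scikit-learn):

from scipy.stats import cauchy, norm

x = 4.0
print(cauchy.pdf(x, loc=0, scale=1) / norm.pdf(x, loc=0, scale=1))   # roughly 140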
t-SNE uses both of these distributions: a normal distribution is used to find the
clusters in the high-dimensional space, and these clusters are then mapped into a
Cauchy distribution in 2-D or 3-D to visualize the data. The normal distribution is
used in the high-dimensional space because it is easy to work with and has nice
properties when used on large data sets. The Cauchy distribution is used for the
visualization because the long tails tend to spread out the clusters so that they are
easier to visualize.

Fig. 4.11 Comparison of the normal and Cauchy probability densities with μ = 0 and σ = γ = 1. Notice that the Cauchy distribution approaches zero much more slowly as x gets farther from the mean
t-SNE
High-dimensional data sets, such as images, are difficult to cluster and
visualize the clusters. t-SNE maps clusters found in a high-dimensional data
set to 2-D or 3-D in such a way that clusters can be visualized. The trick
it uses is to find groupings in the high-dimensional space using a normal
distribution and then map that distribution into a t-distribution with one degree
of freedom (also called the Cauchy distribution) in 2-D or 3-D. Using the
Cauchy distribution in the low-dimensional space helps to spread out the
clusters so that they can be seen visually.
In the high-dimensional space, t-SNE defines the conditional probability that case $i'$ is a neighbor of case $i$ as

$p_{i'|i} = \frac{\exp\left(-\|x_i - x_{i'}\|^2/(2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2/(2\sigma_i^2)\right)}$,   (4.12)

where the squared distance between two cases is

$\|x_i - x_{i'}\|^2 = \sum_{j=1}^{J} (x_{ij} - x_{i'j})^2$.   (4.13)
This conditional probability has a form that makes it so that the further the distance
between two points, the smaller the probability that they are neighbors. The distance
is measured in units of σi . The value of σi is chosen based on the data so that σi is
larger in regions where there are many points and smaller in regions where there
are few data points. In implementations of t-SNE one sets a parameter called the
perplexity to control how this density of points is determined.
Once we have computed the conditional probabilities $p_{i'|i}$ we want to create
an equivalent low-dimensional distribution. We use $q_{ii'}$ to denote the probability
density in this low-dimensional distribution and $y_i$ to denote the points in the low-
dimensional space, where $y_i = (y_{i1}, y_{i2})$ or $y_i = (y_{i1}, y_{i2}, y_{i3})$ depending on if the
low-dimensional space is 2- or 3-dimensional. We call the number of dimensions
d. The probability of $y_i$ and $y_{i'}$ being neighbors is based on the multidimensional
Cauchy distribution
$q_{ii'} = \frac{\left(1 + \|y_i - y_{i'}\|^2\right)^{-1}}{\sum_{\ell=1}^{I} \sum_{\ell' \neq \ell} \left(1 + \|y_\ell - y_{\ell'}\|^2\right)^{-1}}$.   (4.14)
Using the Cauchy distribution for $q_{ii'}$ allows the numerator to go to zero more
slowly than using an exponential (as the normal distribution does). This means that
the difference between points far away is more likely to have a large value in the
low-dimensional space, and therefore the distribution can tell the difference between
points once they are far away. Or, to think about it another way, the long tails of the
Cauchy distribution allow it to push points farther away in the low-dimensional
space than they would be in the high-dimensional space. This helps to separate the
clusters.
The question arises of how we pick the points $y_i$. To find these values we ask
that the high-dimensional and low-dimensional distributions, $p_{ii'}$ and $q_{ii'}$, match in
some sense. To make these distributions match we minimize the loss function

$L = \sum_{i=1}^{I} \sum_{i'=1}^{I} p_{ii'} \log\frac{p_{ii'}}{q_{ii'}}$,   (4.15)
where $p_{ii'} = 0.5\,(p_{i'|i} + p_{i|i'})$. This is called the Kullback–Leibler (KL) divergence.
Due to the logarithm in the formula, the minimum value of L occurs when all $q_{ii'} =
p_{ii'}$. The values of $y_i$ are chosen so that L is minimized. This then gives us the
values for yi that are distributed by a Cauchy distribution that matches the high-
dimensional normal distribution. The minimization of L is accomplished by using
standard optimization techniques. These optimization techniques often rely on a
parameter called the learning rate that we will discuss in some detail later.
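To make the loss in Eq. (4.15) concrete, the following sketch computes the low-dimensional affinities of Eq. (4.14) and the KL divergence with NumPy. It assumes the symmetrized high-dimensional probabilities P have already been computed; the array names are illustrative rather than taken from any library.

```python
import numpy as np

def low_dim_affinities(Y):
    """Cauchy (Student-t, one degree of freedom) affinities q_{ii'} from Eq. (4.14)."""
    # squared pairwise distances between the low-dimensional points
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)          # a point is not its own neighbor
    return num / num.sum()

def kl_loss(P, Y):
    """Kullback-Leibler divergence of Eq. (4.15), given symmetrized P and points Y."""
    Q = low_dim_affinities(Y)
    mask = P > 0                         # 0 log 0 is taken to be 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
```

In practice the gradient of this loss with respect to the y_i is what the optimizer (with its learning rate) follows.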
To demonstrate how t-SNE works we consider an image recognition data set called
the Fashion MNIST image data set [2]. This is a set of 28 × 28 grayscale images
that contain one of ten classes of clothing and fashion accessories. The classes are
given in Table 4.1. Figure 4.12 contains example images from this data set for each
of the 10 classes.
Though in science and engineering we rarely deal with fashion items, this set of
images is related to the important problem of image classification. Whether we are
considering an autonomous vehicle, drone flight, or other automated tasks, object
recognition is an important problem. In our current study we ask if there are natural
groupings in the data that t-SNE can uncover. We treat each image as having 28² =
784 degrees of freedom, i.e., J = 784, where each value is the grayscale value of a
pixel. In this manner we convert a set of images into a standard data matrix of size
I × J where I is the number of images. In the application of t-SNE the algorithm
has no information about the class of images, just the raw pixel data.
We apply t-SNE using the scikit-learn implementation on a sample of 6000
images from this data set. Before performing t-SNE we use an SVD on the data
set with ℓ = 200 to reduce the 784 degrees of freedom in the original data. This
makes the t-SNE algorithm run faster because the step of computing the normal
distribution will be over a smaller dimensional data set. We then apply t-SNE with
d = 2 to produce a 2-D clustering; the additional parameters we set are a perplexity
of 40 and a learning rate of 1200 (see Sect. 4.4.1).
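A minimal sketch of this workflow with scikit-learn is shown below. The route used to download Fashion MNIST (fetch_openml) and the random subsample are assumptions made for illustration; the SVD rank, perplexity, and learning rate follow the values quoted above.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# load the images and take a random sample of 6000
X, labels = fetch_openml("Fashion-MNIST", version=1,
                         return_X_y=True, as_frame=False)
idx = np.random.default_rng(0).choice(X.shape[0], size=6000, replace=False)
X_sub, labels_sub = X[idx], labels[idx]

# reduce the 784 pixel values to 200 SVD components before t-SNE
X_red = TruncatedSVD(n_components=200).fit_transform(X_sub)

# 2-D embedding with the perplexity and learning rate quoted in the text
emb = TSNE(n_components=2, perplexity=40, learning_rate=1200,
           init="random").fit_transform(X_red)
# emb[:, 0] and emb[:, 1] can now be scattered and colored by labels_sub
```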
The resulting clustering by t-SNE is shown in Fig. 4.13 where the location of
the points is determined by t-SNE but we label the points by the class the point
belongs to. The resulting figure looks like the map of an archipelago: several
large islands with small groups of points scattered in between. Using the map analogy we refer
to compass directions with north pointing to the top of the figure. At the top left
(north west) of the figure is an island of pants, and we see that there are few pant
points that are not on this island. From this we can see that trousers/pants are easy
for t-SNE to distinguish from the other images. We also notice that at the bottom of
the figure there is a region of bags.
Fig. 4.12 Examples of images from each class of the Fashion MNIST data set
There are several other large islands that we can identify that contain several
classes. There is a large island in the lower part of the figure that contains two larger
regions connected by a thin strip. This island contains primarily shirts, pullover
shirts, and coats in its north-west part and contains shirts, pullover shirts, and T-
shirts/tops in the south-east part. On the right side of the figure we see an island of
sneakers and sandals: to t-SNE these images are similar. In the center of the figure
there is an island containing nearly all of the dresses in the western part, and several
other classes in the eastern part.
Fig. 4.13 Resulting clustering from t-SNE on the Fashion MNIST data set
To get further insight into the clustering we can put the images at the locations
identified by t-SNE; this is shown in Fig. 4.14. In this figure we do not plot every
data point to avoid overlapping. Plotting the actual images can give us some insight
into how the islands were formed. For instance we can directly see why the island
with the two large ends connected by a thin strip was formed. This island contains
images of roughly the same shape but the two ends of the island separate darker
items from lighter images. We can see this effect in other islands as well (e.g., the
center island). We also notice that t-SNE found two different types of boots. It looks
like one has more prominent heels than the other.
By applying t-SNE to this data we can understand how the images themselves
have different structures beyond the classifications that we have imposed. For
instance, based solely on the data a light-colored sandal is more similar to a light-
colored low-top sneaker than that sneaker is to a dark high-top sneaker. These
insights can be used to understand how to pose
supervised learning problems and to gain a better understanding of high-dimensional
data. For instance, if we could normalize the data so that only the shape of the object
was important and the differences in brightness/lightness were removed, it might be
easier to differentiate items. Or, it could indicate that the classification that we give
to data points may not make the most sense as a function of the raw data.
Fig. 4.14 Rendering of the t-SNE clustering where the points are replaced by the image. Not all
images are shown to avoid overlapping
Interpreting t-SNE
The results from a t-SNE clustering can tell us information about the data that
may not be apparent beforehand. For instance, in the Fashion MNIST data if
we classify images based on pixels alone it appears that light-colored shoes of
different types are more similar than light shoes and dark shoes of the same
type. This type of insight could be useful when devising supervised learning
models that work with the data.
Fig. 4.15 The number of leaves by species in the ANGERS leaf database. Note that there is a
large number of sycamore maple leaves relative to the other leaves
the long wavelength measurements above about 2500 nm. From the SVD we can
look at the values of m_ℓ and determine what rank, r, we need to capture the leaf-to-leaf
variability in the data set. The plot of m_ℓ is shown in Fig. 4.17. From this figure
we can see that using just a few terms of the SVD we can capture a majority of the
variability in the data. For instance, with ℓ = 5 we can capture 85% of the variability
in the data. Additionally, 90% of the variability can be captured with ℓ = 14.
We choose to use a rank 4, i.e., ℓ = 4, SVD approximation of the data matrix
to find clusters in the data using K-means. When we look at the values of L_K for
this data, we find that 5 clusters is a reasonable number to search for. To visualize
the 5 clusters we create scatter plots of the values of the columns of the matrix
U, u_ij, for the first four columns j = 1, . . . , 4 in Fig. 4.18. In this figure we plot
all the pairs (u_ij, u_ij') on the x- and y-axes, respectively. The marker and color for
each point indicate what cluster it belongs to. From this figure we notice that cluster
3 contains points that have a large value of u_i3, and that these points form a cluster
that becomes obvious when visualizing the data in this form. Additionally, cluster 1
looks like it contains points that have large values of u_i1, u_i2, and u_i4.
Another way to understand the clusters that K-means reveals is to look at the
reflectance and transmittance spectra that correspond to the center of each cluster,
and the data point that is closest to that center. The center of each cluster is the four
values of ū_j for j = 1, . . . , 4; to reconstruct the spectra we use Eq. (4.2) with ℓ = 4
and where U is replaced with the row vector (ū_1, ū_2, ū_3, ū_4). The reflectance
and the transmittance spectra for the centers of the clusters and the nearest leaf in
the data are shown in Fig. 4.19.
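The following sketch outlines the SVD-plus-K-means procedure just described. The data matrix X here is a random stand-in for the assembled leaf spectra, so only the structure of the calculation, not the results, should be taken from it.

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in data matrix: one row per leaf, one column per wavelength sample
rng = np.random.default_rng(1)
X = rng.random((300, 500))

# rank-4 SVD: the columns of U hold the coordinates u_{ij} used for clustering
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U4, s4, Vt4 = U[:, :4], s[:4], Vt[:4, :]

# K-means on the first four columns of U, searching for 5 clusters
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(U4)

# reconstruct the spectrum of each cluster center (Eq. (4.2)-style synthesis)
centers = km.cluster_centers_                 # the \bar{u}_j values, one row per cluster
center_spectra = centers @ np.diag(s4) @ Vt4  # rank-4 reconstruction of the spectra
```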
Fig. 4.16 Example reflectances and transmittances for leaf samples from three different species
of plant (white poplar, geranium, and Portugal laurel). (a) Reflectance. (b) Transmittance
The spectra shown in Fig. 4.19 give further insight to the observations we made
from Fig. 4.18. We notice that cluster 3 has a significantly elevated value for the
reflectance and transmittance in the visible part of the spectrum on the left. This
tells us that a large value for ui3 indicates that leaf i has high values in the visible
range for reflectance and transmittance. In Fig. 4.19 we also observe that cluster 4
has a large value for the reflectance in the range of wavelengths above the visible,
but a low value for the transmittance in this range. Leaves in clusters 2 and 5 appear
to have the opposite behavior with a low reflectance and high transmittance in the
wavelengths above visible; the difference between these clusters is their behavior in
the visible range.
Though we have optical data for the leaves well beyond the visible range, we can
compute the color of the reflected and transmitted light from a leaf as a visual
representation of the spectra. To do this we will need a bit more optical theory.
On a computer we typically represent colors as a combination of red, green, and
blue called RGB. A color is a vector of three numbers that give the contribution of
each color mixed together to get the final color; typically the values are normalized
so that the vector has a maximum entry of 1. Each wavelength in the visible range
can be mapped to a color; this was codified in the CIE 1931 RGB color space3 [4].
This will allow us to convert the light reflected or transmitted at each wavelength
into a color, but we still must determine how the colors mix to get the ultimate color
that we would see.
To do this we assume that we shine light with a particular spectrum onto the leaf.
If the light has a known spectrum, we can integrate the product of the light intensity
at a particular wavelength, the color values at that wavelength, and the reflectance
or transmittance value at that wavelength to get a vector corresponding to the RGB
values for the reflectance or transmittance. For the light intensity we use a blackbody spectrum.
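A sketch of this color calculation is given below. The CIE color-matching functions and the measured reflectance are assumed to be available as arrays on a common wavelength grid, and the blackbody temperature used for the illumination is an illustrative choice rather than a value taken from the text.

```python
import numpy as np

# physical constants for Planck's law
h, c, kB = 6.626e-34, 2.998e8, 1.381e-23

def blackbody(lam_nm, T=5800.0):
    """Blackbody spectral intensity at wavelengths given in nm (T is an assumed value)."""
    lam = lam_nm * 1e-9
    return 2*h*c**2 / lam**5 / (np.exp(h*c/(lam*kB*T)) - 1.0)

def spectrum_to_rgb(lam_nm, reflectance, cmf_r, cmf_g, cmf_b, T=5800.0):
    """Integrate intensity x color-matching function x reflectance for each channel."""
    I = blackbody(lam_nm, T)
    rgb = np.array([np.trapz(I * cmf * reflectance, lam_nm)
                    for cmf in (cmf_r, cmf_g, cmf_b)])
    return rgb / rgb.max()     # normalize so the largest entry is 1
```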
Fig. 4.18 Scatter plots of the values of uij , j = 1, . . . , 4 for the reflectance and transmittance
spectra for the leaf data set. The shape/color of the points indicates which of the 5 clusters the leaf
belongs to
Fig. 4.19 The reflectance and transmittance spectra for the centers of the 5 clusters from K-means
and the leaf nearest the center of the cluster. The dashed lines are the reconstructed spectra for the
cluster centers, and the solid lines are the leaf closest to the cluster center. The plant name for the
leaves closest to the center is also provided (1 - Sycamore maple, 2 - European chestnut, 3 -
Wintercreeper, 4 - Leatherleaf arrowwood, 5 - Sycamore maple). (a) Reflectance. (b) Transmittance
clusters 1, 2, and 4. When we look at the spectral data in Fig. 4.19 we see that cluster
5 has higher values in the visible range than clusters 1, 2, and 4. Also, we notice that
there are no sycamore maple leaves in clusters 3 and 4. This is interesting because
there were a large number of leaves from this plant in the data. This indicates that it
may be possible to have high confidence that a leaf did not come from this plant if
its spectrum falls into cluster 3 or 4.
Fig. 4.20 Effective color of the reflected and transmitted light for the different clusters of leaves.
Each row corresponds to a cluster, with cluster 1 being the top row. The columns contain the 8
leaves closest to the center of the cluster. Each circle has the top half color being the reflected light,
and the bottom half being the transmitted light. The names of the plants shown are:
Cluster 1: Sycamore maple, Wintercreeper, European smoketree, Laurustinus, Common lilac, Laurustinus, English ivy, Sweetgum
Cluster 2: European chestnut, Sycamore maple, Sycamore maple, Sycamore maple, Staghorn sumac, Sycamore maple, European chestnut, Sycamore maple
Cluster 3: Wintercreeper, Black locust, English ivy, Geranium, Wintercreeper, English holly, Red-barked dogwood, Variegated boxelder
Cluster 4: Leatherleaf arrowwood, Leatherleaf arrowwood, Silver linden, Holly oak, Silver linden, Holly oak, White poplar, White poplar
Cluster 5: Sycamore maple, Sycamore maple, Sycamore maple, Sycamore maple, Wine grape, Sycamore maple, Sycamore maple, Sycamore maple
Beyond these observations it becomes somewhat difficult to describe the differ-
ences between the clusters using color information alone because we are reduced to
using terms like greenish-yellow or pale green to describe colors. This is one of the
benefits of studying the actual spectra rather than colors. Moreover, in the spectra
we have information from outside the visible range that we can use to distinguish
leaf types. This is most apparent in cluster 4 where the differences in the spectra
outside the visible range are most stark.
When we look at just the visible range for the reflectance in Fig. 4.21 we see
that 3 of the clusters have similar shapes below 700 nm. These clusters, clusters
1, 2, and 4, have a green color in Fig. 4.20. If we only had data from the visible
range, our analysis may not have found as many clusters as we found from using the
hyper-spectral data.
Fig. 4.21 Comparison of cluster centers (dashed lines) and nearest leaf in the data set for
reflectance in the visible range
Notes
Additional discussion of the SVD and its properties can be found in [5].
We have only covered a small subset of unsupervised learning techniques. We
will return to this problem later when we discuss autoencoders. There are variants
of the SVD that have different features. These include independent component analysis
[6], factor analysis [7], and non-negative matrix factorization [8]. Non-negative
factorization is especially interesting in applications where the variables are known
to be positive (as occurs often in science and engineering). As mentioned above,
there are variants of K-means clustering that are useful. Descriptions of many of
these methods can be found in [9].
Problems
for t = 0, 0.05, 0.1, . . . , 1.0 and w1 and w2 are random numbers between 0 and 1
that are constant for each time series and ε_t is a sample from a normal distribution
with mean 0 and standard deviation 0.05 at each time point. Using this data, create
a data matrix, perform an SVD, and see what rank you need to describe 95% of the
data.
4.2 The iris classification data set is available in scikit-learn. Use K-means and
t-SNE to cluster this data. There are three classes in the data. For K-means with
k = 3, do the clusters correspond to the 3 classes? Can you see the three clusters in
the t-SNE results?
4.3 The Olivetti face data set of 400 grayscale images of 64 × 64 pixels is available
in scikit-learn. Each image is the face of one of 40 people. Consider each face as a
vector of length 4096 to create a data matrix of size 400×4096. Perform an SVD on
this data matrix and visualize the first 5 columns as images. What do these images
tell you about the faces in the data set? Now perform a K-means clustering of the
faces data set with k = 40. Does K-means find the forty different people? Finally,
apply t-SNE to the data and see what groupings it finds in the data.
References
1. Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of
Machine Learning Research, 9(1):2579–2605, November 2008.
2. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for
benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
3. S. Jacquemoud, L. Bidel, C. Francois, and G. Pavan. ANGERS leaf optical properties database
(2003). Data set. Available online: http://ecosis.org (accessed on 19 June 2019), 2003.
4. T Smith and J Guild. The C.I.E. colorimetric standards and their use. Transactions of the Optical
Society, 33(3):73–134, Jan 1931.
5. Ryan G McClarren. Uncertainty Quantification and Predictive Computational Science.
Springer, 2018.
6. J.V. Stone. Independent Component Analysis: A Tutorial Introduction. A Bradford Book. MIT
Press, 2004.
7. An Gie Yong and Sean Pearce. A beginner’s guide to factor analysis: Focusing on exploratory
factor analysis. Tutorials in quantitative methods for psychology, 9(2):79–94, 2013.
8. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Data Mining, Inference, and Prediction, Second Edition. Springer Science and Business Media,
New York, NY, August 2009.
9. Anil K Jain, Richard C Dubes, et al. Algorithms for clustering data, volume 6. Prentice Hall,
1988.
Part II
Neural Networks
Chapter 5
Feed-Forward Neural Networks
y = w1 x1 + w2 x2 + · · · + wJ xJ + b.
In this definition of the intermediate variable we have used a function σ (u) called
an activation function or a sigmoid function. There are a variety of forms for the
activation function that we will detail later.
The dependent variable is then a function of a linear combination of the z_k:
$$y = o\left(w_{o1} z_1 + w_{o2} z_2 + \cdots + w_{oK} z_K + b_o\right). \qquad (5.2)$$
Fig. 5.1 Schematic of a simple feed-forward neural network defined by Eqs. (5.1) and (5.2) with
information flowing from the left to right. The inputs xj are multiplied by weights, added together
with the bias, and passed to the function σ to get the intermediate values zk , also known as hidden
units. The zk are then combined with a bias and fed to the output function o to get the value of y
not seen by the user of the neural network. The individual zk are sometimes called
hidden units, nodes, or neurons. The hidden units are then combined via a linear
combination with the bias added to produce the output after being passed to an
output function, o.
The particular network in Fig. 5.1 is sometimes called a feed-forward neural
network because information travels only in one direction, left to right in this case:
information is fed on the left and that information flows to the hidden layer before
flowing to the output. This contrasts with recurrent neural networks and other forms
where information might travel in loops.
One nonlinear function that we can model with this simple neural network is the
exclusive-OR function (XOR). This function takes two binary inputs that are 0 or 1
and returns 1 only if exactly one of the inputs is 1, and returns 0 if both inputs are 0 or
both are 1:
$$y(x_1, x_2) = \begin{cases} 1 & x_1 + x_2 = 1 \\ 0 & \text{otherwise} \end{cases}, \qquad x_1, x_2 \in \{0, 1\}. \qquad (5.3)$$
We can exactly fit this function with the simple neural network if we define
$$\sigma_S(u) = \begin{cases} 1 & u \ge 0 \\ 0 & \text{otherwise} \end{cases}. \qquad (5.5)$$
Fig. 5.2 Visualization of the neural network that reproduces the XOR function. The activation
function used in the hidden layer is the step function σS
$$y = z_1 - 2 z_2 + z_3 = \sigma_S(x_1 - 1) - 2\,\sigma_S(x_1 + x_2 - 2) + \sigma_S(x_2 - 1). \qquad (5.6)$$
For x_1 = x_2 = 1 the hidden units are
$$z_1 = \sigma_S(1 - 1) = 1, \quad z_2 = \sigma_S(1 + 1 - 2) = 1, \quad z_3 = \sigma_S(1 - 1) = 1,$$
and the output is
$$y = 1 - 2 + 1 = 0.$$
For x_1 = 1 and x_2 = 0 we have
$$z_1 = \sigma_S(1 - 1) = 1, \quad z_2 = \sigma_S(1 + 0 - 2) = 0, \quad z_3 = \sigma_S(0 - 1) = 0,$$
$$y = 1 - 0 + 0 = 1.$$
We can see that if x1 = 0 and x2 = 1 the function will also return y = 1. Finally,
x1 = x2 = 0 gives y = 0:
Fig. 5.3 The flow of information through the neural network for the cases: (a) x1 = 1, x2 = 0 and
(b) x1 = x2 = 1. Only connections that are non-zero are drawn for each case. We observe that the
hidden layer shuts off information transfer depending on the inputs
$$z_1 = \sigma_S(0 - 1) = 0, \quad z_2 = \sigma_S(0 + 0 - 2) = 0, \quad z_3 = \sigma_S(0 - 1) = 0,$$
$$y = 0 - 0 + 0 = 0.$$
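The XOR network can be written out directly from the weights in Fig. 5.2 and Eq. (5.6); a short NumPy transcription is given below.

```python
import numpy as np

def step(u):
    """Step activation sigma_S of Eq. (5.5)."""
    return np.where(u >= 0.0, 1.0, 0.0)

def xor_net(x1, x2):
    z1 = step(x1 - 1.0)               # hidden unit 1: weight 1 on x1, bias -1
    z2 = step(x1 + x2 - 2.0)          # hidden unit 2: weights 1 and 1, bias -2
    z3 = step(x2 - 1.0)               # hidden unit 3: weight 1 on x2, bias -1
    return z1 - 2.0*z2 + z3           # output weights (1, -2, 1), Eq. (5.6)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0 for the four inputs
```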
One feature of neural networks is that the activation functions in the hidden
layer control how information flows through the network from the inputs to the
output. In particular we can see how the cases x1 = 1, x2 = 0 and x1 = x2 = 1
have information flow through the network in Fig. 5.3. In the figure we see that in
the case of x1 = 1, x2 = 0 the activation functions in z2 and z3 “shut off” and
propagate no further information. However, when x1 = x2 = 1 all of the hidden
layer functions flow information to the output. The weights of the connections to
the output function assure that the information that does flow to the output node is
combined appropriately to give the correct value.
To this point we have seen that feed-forward networks have the property that
information flows from the inputs to the outputs and that the activation functions
act to shut off information flow through some channels. We have not stated why this
might be called a “neural” network. The primary reason for this is that the units in
the hidden layer, the intermediate values we call zk , act in a very similar way to the
biological neurons found in the brain. A neuron is a cell that has several inputs—
called dendrites—that receive chemo-electrical signals. When a large enough signal
is received at the dendrites, the neuron activates, or, more colloquially, fires, by
sending an electrical signal through an output channel called an axon. The axon can
be connected to other neurons and the firing of the axon can cause other neurons to
fire. The signal required for the neuron to fire can be affected by chemicals in the
brain, called neurotransmitters.
The structure of a neural network resembles the biological neurons. Loosely
speaking, the dendrites are the inputs to a node in the hidden layer, the bias acts like the
neurotransmitters, the activation function, σ, shuts off the signal unless a threshold
is reached, and the outputs of the hidden layer are the axons. The feed-forward
network we have considered is only a crude approximation of neural structure, but
the analogy can be helpful to think about how the network functions.
The activation function, σ (u), typically has the property that for u below a certain
value, the function returns 0. We saw this above with the step function σS (u): when
u < 0 the function returns 0. The activation function then acts to shut off the
flow of information through the network when the signal received is less than some
threshold, and this threshold is controlled by the bias.
The step function is one of several possibilities of the activation function.
Another function that is commonly used is one that we have already seen, the
logistic function:
$$\sigma_L(u) = \frac{1}{1 + e^{-u}}. \qquad (5.7)$$
Notice that this function has the same limits as the step function: as u → ∞,
σL (u) → 1, and u → −∞, σL (u) → 0. One difference between the logistic
function and the step function is that the logistic function is smooth: its derivative
exists everywhere and is positive everywhere. This is not true for the step function
where the derivative is zero everywhere except at u = 0 where the derivative is
undefined. The logistic function is plotted in Fig. 5.4.
Fig. 5.4 Comparison of several activation functions. (a) The functions that have characteristic
“s” shape, the logistic, hyperbolic tangent, and arctangent are compared. The identity function is
shown for comparison near the origin. (b) The ReLU and softplus functions have similar behavior,
but the softplus function is smooth
There are several other functions that have similar behavior to the logistic
function. One function is the hyperbolic tangent function, tanh(u). This function
has a similar shape to the logistic function but has a range from −1 to 1. We denote
an activation function using the hyperbolic tangent as σ_TH(u):
$$\sigma_{TH}(u) = \tanh(u). \qquad (5.8)$$
Though this function does not “shut off” by outputting zero when the input is low
enough, it has the property that if the input is much less than zero, the function
returns a known value of −1. In this way, the information can be shut off if we add
1 to the output of the function.
Additionally, the arctangent function can be an activation function:
$$\sigma_{AT}(u) = \tan^{-1}(u). \qquad (5.9)$$
The hyperbolic tangent and arctangent activation functions are smooth, like the
logistic function, and they have the additional property that they behave like the
identity function near the origin. At u = 0 both σTH (0) = 0 and σAT (0) = 0. Also,
the derivative of these functions with respect to u is 1 at the origin. This makes the
function locally similar to the identity. This will have important applications when
we train the networks later. The similarity to the identity can be seen in Fig. 5.4
where the hyperbolic tangent and arctangent are compared to the identity.
In Fig. 5.4 we notice that the limits of the arctangent and the hyperbolic
tangent functions are not the same as the logistic function (or the step function).
Nevertheless, one could shift and scale these functions so that they had the same
range as the logistic function.
Another important type of activation function is the rectified linear unit (ReLU)
function that returns 0 for u ≤ 0 and u for u > 0:
$$\sigma_{ReLU}(u) = \begin{cases} 0 & u \le 0 \\ u & u > 0 \end{cases} = \frac{1}{2}\left(|u| + u\right). \qquad (5.10)$$
This function “shuts off” when u is negative and can increase without bound as
u increases above 0. This function has a derivative defined everywhere, but the
derivative jumps from 0 to 1 at u = 0.
There is also a smooth version of the ReLU called the softplus that smooths the
transition to zero:
$$\sigma_{SP}(u) = \log\left(1 + e^u\right). \qquad (5.11)$$
This function behaves similarly to the identity function as u ≫ 0 because σ_SP(u) ≈
log(e^u) = u for large u, and approaches 0 as u → −∞. The derivative of the
softplus function is the logistic function, and the derivative is therefore well-defined
and positive everywhere.
The ReLU and softplus activation functions are shown in Fig. 5.4b. In the figure
it is clear that as u → ±∞ the value of the softplus function approaches that of the
ReLU function; near the origin the softplus function is smooth.
Another function that we can use in the role of an activation function is the
softmax function that was discussed in Sect. 2.3.4. In this case the function takes in
J values and outputs J values. These will be used when we consider classification
models with neural networks.
Activation Functions
There are a variety of functions that we can use for the activation of a hidden
unit in a neural network. Most of them mimic a biological neuron by having
the ability to return a value of zero below a certain value. This is how the
network can control which units have a non-zero value based on the inputs to
the unit.
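For reference, the activation functions above can be written in a few lines of NumPy; these follow Eqs. (5.5), (5.7), (5.10), and (5.11), with the hyperbolic tangent and arctangent available directly as np.tanh and np.arctan.

```python
import numpy as np

def step(u):        return np.where(u >= 0, 1.0, 0.0)     # Eq. (5.5)
def logistic(u):    return 1.0 / (1.0 + np.exp(-u))        # Eq. (5.7)
def relu(u):        return 0.5 * (np.abs(u) + u)           # Eq. (5.10)
def softplus(u):    return np.log1p(np.exp(u))             # Eq. (5.11), log(1 + e^u)

u = np.linspace(-4, 4, 9)
print(relu(u))        # zero for u <= 0, equal to u otherwise
```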
To understand how we determine the weights and biases of the neural network
we will consider a single layer network, like that shown in Fig. 5.1. We consider
a training set of data consisting of a single pair of independent variables, x, and
dependent variable y. We want to minimize the error between the neural network
prediction ŷ and the true value y. We use the squared error as our loss function:
$$L = \left(\hat{y} - y\right)^2. \qquad (5.12)$$
To minimize this loss function we want to compute the derivative of the loss function
with respect to each of the parameters, i.e., all the weights and biases. Then we
want to adjust each parameter to make the loss function smaller in magnitude.
For instance, if we have the derivative of the loss function with respect to a given
weight, wkj , we can use the gradient descent method to decrease the loss function
by updating the weight as
$$w_{kj}^{\text{new}} = w_{kj} - \eta\,\frac{\partial L}{\partial w_{kj}}, \qquad (5.13)$$
where η > 0 is called the learning rate and governs how large the magnitude of the
change to the weight needs to be. In the gradient descent method we repeatedly
update the weights until we find a minimum value of the loss function. This is
called gradient descent because we adjust the weights in the opposite direction of
the gradient.
In what follows we present how to compute the derivative of the loss function
with respect to the weights and biases. We repeat the equations for the kth value in
the hidden layer, z_k, and the equation for ŷ:
$$z_k = \sigma(u_k), \qquad (5.14)$$
$$\hat{y} = o(u_o). \qquad (5.15)$$
In these equations we have written the sums that form the inputs to the activation
function and the output function using uk = wk1 x1 + wk2 x2 + · · · + wkJ xJ + bk ,
and uo = wo1 z1 + wo2 z2 + · · · + woK zK + bo . This will make it clear that these
functions only take a single input for the purposes of taking derivatives.
Now we compute the derivative of the loss function with respect to one of the
weights in the output layer, w_{ok}:
$$\frac{\partial L}{\partial w_{ok}} = (\hat{y} - y)\frac{\partial \hat{y}}{\partial w_{ok}}
= (\hat{y} - y)\frac{do}{du_o}\frac{\partial u_o}{\partial w_{ok}}
= (\hat{y} - y)\frac{do}{du_o}\, z_k
= \delta_o z_k. \qquad (5.16)$$
Here we have defined a quantity δo as the product of the derivative of the output
function and the difference of ŷ and y:
$$\delta_o = (\hat{y} - y)\frac{do}{du_o}.$$
Similarly, if we consider the derivative with respect to the bias in the output layer,
we have
$$\frac{\partial L}{\partial b_o} = (\hat{y} - y)\frac{\partial \hat{y}}{\partial b_o}
= (\hat{y} - y)\frac{do}{du_o}\frac{\partial u_o}{\partial b_o}
= \delta_o. \qquad (5.17)$$
Notice that the derivative of the loss function with respect to the bias and the weights
in the output layer depend on the difference between the output and the true value of
the dependent variable times the derivative of the output function. In the case of the
weights, there is the additional term containing the value of the hidden layer. This
makes the derivative of the loss function with respect to the weights easy to evaluate
provided that we can compute the derivative of the output function. Also, notice that
we only have to compute δo once and can use it to evaluate all K + 1 derivatives we
need.
We can do the same calculation for the weights and biases that form the hidden
layer values, zk . The derivative with respect to wkj is
$$\frac{\partial L}{\partial w_{kj}} = (\hat{y} - y)\frac{\partial \hat{y}}{\partial w_{kj}}
= (\hat{y} - y)\frac{do}{du_o}\frac{\partial u_o}{\partial z_k}\frac{\partial z_k}{\partial w_{kj}}
= \delta_o w_{ok}\frac{d\sigma}{du_k}\frac{\partial u_k}{\partial w_{kj}}
= \delta_o w_{ok}\frac{d\sigma}{du_k}\, x_j
= \delta_k x_j. \qquad (5.18)$$
This allows us to notice that the derivative with respect to a weight feeding the
hidden layer can be expressed in terms of a product of the derivative of the output
function, the derivative of the activation function, and the weight connecting the
hidden layer to the output layer.
The derivative with respect to the bias bk can be computed in a similar fashion:
$$\frac{\partial L}{\partial b_k} = (\hat{y} - y)\frac{\partial \hat{y}}{\partial b_k}
= (\hat{y} - y)\frac{do}{du_o}\frac{\partial u_o}{\partial z_k}\frac{\partial z_k}{\partial b_k}
= \delta_o w_{ok}\frac{d\sigma}{du_k}\frac{\partial u_k}{\partial b_k}
= \delta_o w_{ok}\frac{d\sigma}{du_k}
= \delta_k. \qquad (5.19)$$
This procedure for calculating the derivative of the loss function with respect to
the weights/biases by starting at the output and working toward the inputs is called
back propagation. This is because we first compute the term δo for the output layer
and then use it to form δk for the hidden layer.
Above we considered only a single training point in computing the derivative of the
loss function with respect to weights/biases. If we have I training points, we can
write the input data as a matrix X of size I × J with each row containing the J
inputs for case i. This also implicitly defines a matrix Z of size I × K where each
row contains the K values for zk for case i. Using this notation, we then write the
loss function as
$$L = \sum_{i=1}^{I}\left(\hat{y}_i - y_i\right)^2. \qquad (5.20)$$
We can then write the derivatives of the loss function, using the definitions
$$\delta_{oi} = (\hat{y}_i - y_i)\frac{do}{du_o}, \qquad \delta_{ki} = \delta_{oi}\, w_{ok}\frac{d\sigma}{du_k}, \qquad (5.21)$$
as
$$\frac{\partial L}{\partial w_{ok}} = \sum_{i=1}^{I}\delta_{oi}\, z_{ik}, \qquad \frac{\partial L}{\partial b_o} = \sum_{i=1}^{I}\delta_{oi}, \qquad (5.22)$$
$$\frac{\partial L}{\partial w_{kj}} = \sum_{i=1}^{I}\delta_{ki}\, x_{ij}, \qquad \frac{\partial L}{\partial b_k} = \sum_{i=1}^{I}\delta_{ki}.$$
These equations tell us that we can compute the derivative of the loss function
with respect to the weights and biases over a training set by adding the derivatives from
each training point.
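The back propagation formulas in Eqs. (5.16)–(5.22) and the update of Eq. (5.13) can be assembled into a short training loop. The sketch below uses a logistic activation, an identity output, and synthetic data, all of which are illustrative choices; dividing the summed gradients by I merely rescales the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 200, 3, 8                      # training points, inputs, hidden units
X = rng.normal(size=(I, J))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=I)

W1, b1 = rng.normal(size=(K, J)), np.zeros(K)    # hidden-layer weights/biases
wo, bo = rng.normal(size=K), 0.0                 # output weights/bias
eta = 0.05                                       # learning rate

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))              # logistic activation

for epoch in range(2000):
    U = X @ W1.T + b1                    # u_k for every training point
    Z = sigma(U)                         # hidden units z_k
    y_hat = Z @ wo + bo                  # identity output, so do/du_o = 1

    delta_o = y_hat - y                          # Eq. (5.21) with do/du_o = 1
    delta_k = delta_o[:, None] * wo * Z*(1 - Z)  # Eq. (5.21), dσ/du = σ(1-σ)

    # Eq. (5.22): sum per-point contributions, then apply the Eq. (5.13) update
    wo -= eta * (delta_o @ Z) / I
    bo -= eta * delta_o.sum() / I
    W1 -= eta * (delta_k.T @ X) / I
    b1 -= eta * delta_k.sum(axis=0) / I
```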
At this point it is a good time to discuss one consequence of the nonlinear trans-
formations that arise from the activation functions and the gradient computation we
just performed: the necessity for normalization of network inputs and outputs. As
discussed above the activation functions have specific behavior near u = 0, namely
that many of them have the behavior that near u = 0 they behave somewhat like the
identity function. This makes
dσ
≈ 1,
du
for u ≈ 0. Also, in Eq. (5.22) we see that the gradient of the loss function contains
a term that is multiplied by the input value x_ij. If the x_ij values vary wildly in
magnitude between the independent variables (i.e., the range of x_ij for one variable is much
larger than the range of x_ij' for another), the affected weights will have a large variation in
the magnitude of their gradients. Additionally, if within a single independent variable
there is a large range between cases i and i', the sum in Eq. (5.22) will be adding
contributions of very different scales (and perhaps signs). Furthermore, all of the
terms in Eqs. (5.21) and (5.22) have (ŷ_i − y_i) in them. If the outputs have a large
variation to them, the magnitude of this term will have a large variation. Combined,
these effects will cause the contribution from each training point and independent variable to
have very different effects on the gradient computation.
While it is possible to apply gradient descent to problems where the gradients
have large variations in magnitude and sign, it will train much faster when the
contribution from each independent variable is on the same scale and when the
outputs are on a bounded or nearly bounded scale. For this reason it is the standard
practice to normalize the independent variables so that each has a fixed range and
variability. For instance, we can subtract from each variable its mean and then divide
by its standard deviation (this is sometimes called computing a z-score in statistics).
In this process we produce a new variable x̃ij as
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad (5.23)$$
where x̄j is the mean of the j th independent variable and sj is the sample standard
deviation of this variable. The normalization constants x̄j and sj must be stored so
that new data and validation data can be placed on the same scale. Furthermore,
we can apply the same transform to the output data y to normalize the dependent
variables. After applying the normalization the data will have mean 0 and standard
deviation of 1. In other words the data have a uniform variability between the
variables and all variables have the value of 0 correspond to the mean value of that
variable.
This type of normalization of the inputs and outputs of a neural network is the
standard practice and makes training the networks much easier in practice. We use
such normalizations (or variants when noted) throughout this work when dealing
with neural networks.
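A sketch of the z-score normalization of Eq. (5.23) is shown below; the training array is synthetic, and the point is that the constants x̄_j and s_j are stored and reused for new data.

```python
import numpy as np

# synthetic training data: one column per independent variable
X_train = np.random.default_rng(0).normal(loc=50.0, scale=10.0, size=(100, 3))

x_bar = X_train.mean(axis=0)          # \bar{x}_j for each variable
s = X_train.std(axis=0, ddof=1)       # sample standard deviation s_j
X_tilde = (X_train - x_bar) / s       # normalized training data

# reuse the stored constants to put new or validation data on the same scale
X_new = np.array([[45.0, 60.0, 52.0]])
X_new_tilde = (X_new - x_bar) / s
```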
for each batch, and then shuffling the training points into different batches. The
algorithm looks like the following:2
• Repeat until the loss function is small enough:
  – Shuffle training points between batches randomly.3
  – For each of the b batches:
    · Evaluate Eq. (5.22) using gradients estimated over the I_b points in the batch.
    · Update the weights using Eq. (5.13).
A beneficial feature of stochastic gradient descent is that it works well if more
training points become available while we are training the model: the new points
are just more data to shuffle between the batches. Additionally, if we have a large
number of training points we do not have to use all of the training points. We can
randomly select a subset of them in each iteration and shuffle them between batches.
This will give us a representative way to update models without evaluating the loss
function gradients at every training point available.
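The shuffle-and-batch loop can be sketched as follows; gradient_step is a stand-in for evaluating Eq. (5.22) on one batch and applying Eq. (5.13), not a function from any particular library.

```python
import numpy as np

def sgd(X, y, params, gradient_step, batch_size=32, n_epochs=100):
    """Mini-batch stochastic gradient descent skeleton."""
    rng = np.random.default_rng(0)
    I = X.shape[0]
    for epoch in range(n_epochs):
        order = rng.permutation(I)                 # shuffle points between batches
        for start in range(0, I, batch_size):
            batch = order[start:start + batch_size]
            params = gradient_step(params, X[batch], y[batch])
    return params
```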
Above we mentioned that we can iteratively update the weights and biases by
adjusting them in the direction opposite to the derivative of the loss function. This
requires derivatives to be calculable and defined, which eliminates some choices for
activation functions, such as step functions, unless the derivative is approximated.
Another, more vexing, issue with the gradient descent approach is that because each
update decreases the loss function, it is susceptible to finding a local minimum rather
than the global minimum. This can be seen in the example shown in Fig. 5.5. In the
2 Sometimes this method is called mini-batch stochastic gradient descent, with SGD reserved for
the case where the batch size is 1. Given the ubiquity of the algorithm, dropping the “mini-batch”
descriptor should not lead to any confusion.
3 It is possible to skip this shuffling step, but in general shuffling the batches assures that the order
Fig. 5.5 Demonstration of four iterations of the gradient descent finding a local minimum: the
smooth curve is the loss function and we have an initial value of the parameter w at the top right.
The value is updated by gradient descent, as shown by the symbols. Different magnitudes of the
learning rate, η, lead to different minima found (the learning rates shown are η = 0.07, 0.18, and 0.327)
figure we have an example loss function that is the function of a single parameter,
w. If the initial value is w = 3.9 and we update w using gradient descent, then
the minimum found depends on the learning rate used. It is also possible that if the
learning rate is too large the method will not converge.
Given that the minimum found is sensitive to the learning rate in this simple, one
parameter optimization problem, it is not hard to imagine that in applying gradient
descent to a large number of weights and biases that local minima are likely to
be found. For this reason there is much active research in finding the weights and
biases that minimize the loss function. Most software libraries for building neural
networks have multiple optimization options beyond simple gradient descent (or its
stochastic version). These methods are outside the scope of the discussion, but it
is important to know that a neural net is only as good as the optimization solution
found. Some of these methods use the concept of momentum to help the stochastic
gradient descent converge. The idea is to make the new update a linear combination of
the previous change to the weights and the estimated change from the gradient:
$$w_{kj}^{\text{new}} = w_{kj} - \eta\,\frac{\partial L}{\partial w_{kj}} + \alpha\,\Delta w_{kj},$$
where Δwkj is the change in wkj from the previous iteration, and α > 0. In this case
we can think of the weights as having their update depending on the most recent
update, as though it is an object traveling with momentum. The weights move along
their update path and that path is adjusted by the “force” represented by the gradient
of the loss function.
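A sketch of the momentum update is shown below; the array names and the values of η and α are illustrative.

```python
import numpy as np

def momentum_step(w, grad, delta_w, eta=0.1, alpha=0.9):
    """One momentum update: blend the new gradient step with the previous change."""
    delta_w_new = -eta * grad + alpha * delta_w
    return w + delta_w_new, delta_w_new

w, delta_w = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
w, delta_w = momentum_step(w, grad, delta_w)   # keep delta_w for the next iteration
```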
Another technique that has been found to be effective and simple is to start with
a relatively large learning rate and decrease it as the number of iterations increases.
The idea here is to try and skip over local minima in the early iterations and then
zoom in on the global minimum in the later iterations.
The real power of neural networks arises when the network has more than a single
hidden layer. Neural networks with multiple hidden layers are called deep neural
networks (DNN). These networks are an extension of the simple networks we have
studied earlier in this chapter. In DNNs there are several intermediate values rather
than a single set. A neural network with 3 hidden layers is shown in Fig. 5.6. This
network has K hidden units per hidden layer. Notice that as we add layers the
number of connections between nodes grows, making the number of weights and
biases that we must find grow as well. Note that this is still a feed-forward network
in that information is only moving from left to right in the figure: there are no
connections between units in the same layer or going backward from right to left.
To describe a deep neural network we need to define some notation for the
weights and biases. The number of hidden layers is L. We define a matrix A^{ℓ→ℓ'}
to be the matrix containing the weights and the bias that connect layer ℓ to layer
ℓ'. We also write that layer ℓ has K^{(ℓ)} hidden units and layer ℓ' has K^{(ℓ')} hidden units; this
matrix has the size (K^{(ℓ')} + 1) × (K^{(ℓ)} + 1), and the additional 1 is for the bias. We write
the vector of hidden units in layer ℓ as z^{(ℓ)} = (z_1^{(ℓ)}, z_2^{(ℓ)}, . . . , z_{K^{(ℓ)}}^{(ℓ)}, 1)^T including
the 1 for the bias. To make the notation general, we consider the input layer as layer
number 0 of length K^{(0)} = J and z^{(0)} = (x_1, x_2, . . . , x_J, 1)^T. The output layer is
considered layer L + 1 of size K^{(L+1)} = 1. The matrix A^{ℓ→ℓ'} will have the form
Fig. 5.6 Schematic of a neural network with J inputs, 3 hidden layers, each with K nodes, and a
single output
$$A^{\ell\to\ell'} = \begin{pmatrix}
w_{11}^{\ell\to\ell'} & w_{12}^{\ell\to\ell'} & \cdots & w_{1K^{(\ell)}}^{\ell\to\ell'} & b_1^{\ell\to\ell'} \\
w_{21}^{\ell\to\ell'} & w_{22}^{\ell\to\ell'} & \cdots & w_{2K^{(\ell)}}^{\ell\to\ell'} & b_2^{\ell\to\ell'} \\
\vdots & & & & \vdots \\
w_{K^{(\ell')}1}^{\ell\to\ell'} & w_{K^{(\ell')}2}^{\ell\to\ell'} & \cdots & w_{K^{(\ell')}K^{(\ell)}}^{\ell\to\ell'} & b_{K^{(\ell')}}^{\ell\to\ell'} \\
0 & 0 & \cdots & 0 & 1
\end{pmatrix}. \qquad (5.24)$$
In the matrix we have written the weight that connects node j in layer ℓ to node
i in layer ℓ' as w_{ij}^{ℓ→ℓ'} and the bias that feeds node i in layer ℓ' as b_i^{ℓ→ℓ'}. Note that the
matrix connecting the last hidden layer to the output, A^{(L→L+1)}, will not have the
final row.
For the three hidden layer network of Fig. 5.6 the hidden layers and output are then
$$z^{(1)} = \sigma\!\left(A^{0\to1} z^{(0)}\right), \qquad (5.25a)$$
$$z^{(2)} = \sigma\!\left(A^{1\to2} z^{(1)}\right) = \sigma\!\left(A^{1\to2}\,\sigma\!\left(A^{0\to1} z^{(0)}\right)\right), \qquad (5.25b)$$
$$z^{(3)} = \sigma\!\left(A^{2\to3} z^{(2)}\right) = \sigma\!\left(A^{2\to3}\,\sigma\!\left(A^{1\to2}\,\sigma\!\left(A^{0\to1} z^{(0)}\right)\right)\right), \qquad (5.25c)$$
$$y = o\!\left(A^{3\to4} z^{(3)}\right) = o\!\left(A^{3\to4}\,\sigma\!\left(A^{2\to3}\,\sigma\!\left(A^{1\to2}\,\sigma\!\left(A^{0\to1} z^{(0)}\right)\right)\right)\right). \qquad (5.25d)$$
These equations show how the output y is a function of the inputs. The inputs go
through several linear combinations and several nonlinear functions. We will now
demonstrate one of these networks on a simple example.
Consider the function sin(x) for x ∈ [0, 2π ]. A neural network to approximate this
function is displayed in Fig. 5.7; this network uses ReLU for the activation functions
and the output function is the identity. Using our notation the weight/bias matrices
are
$$A^{0\to1} = \begin{pmatrix} 1 & -\pi \\ -1 & \pi \\ 0 & 1 \end{pmatrix}, \qquad
A^{1\to2} = \begin{pmatrix} 1 & 0 & -\tfrac{\pi}{2} \\ -1 & -10^6 & \tfrac{\pi}{2} \\ 0 & 1 & -\tfrac{\pi}{2} \\ -10^6 & -1 & \tfrac{\pi}{2} \\ 0 & 0 & 1 \end{pmatrix}, \qquad (5.26)$$
$$A^{2\to3} = \begin{pmatrix} -0.664439 & -0.664439 & -10^6 & -10^6 & 1.15847 \\ -10^6 & -10^6 & -0.664439 & -0.664439 & 1.15847 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix},$$
$$A^{3\to4} = \begin{pmatrix} -1 & 1 & 0 \end{pmatrix}.$$
To see how this network works, we will look at two potential inputs. If x_1 = π/4,
as shown in Fig. 5.8a, the ReLU activation functions force information to flow only
along the path that goes from the input to the output as x_1 → z_2^{(1)} → z_3^{(2)} → z_2^{(3)} →
o. Using the definition of the weight matrices, we get the hidden layers’ values as
Fig. 5.8 Illustration of the information flow through the neural network to approximate the sine
function for different inputs. The only connections that are non-zero are drawn. (a) x1 = π/4. (b)
x1 = 7π/4
$$z^{(1)} = \begin{pmatrix} 0 \\ 2.3561 \\ 1 \end{pmatrix}, \qquad
z^{(2)} = \begin{pmatrix} 0 \\ 0 \\ 0.78539 \\ 0 \\ 1 \end{pmatrix}, \qquad
z^{(3)} = \begin{pmatrix} 0 \\ 0.63662 \\ 1 \end{pmatrix}, \qquad y = 0.63662. \qquad (5.27)$$
If the input is x1 = 7π/4, the path through the network is different, as illustrated
in Fig. 5.8b. Here information flows along the topmost path of the network. In this
case the hidden layers are
$$z^{(1)} = \begin{pmatrix} 2.3561 \\ 0 \\ 1 \end{pmatrix}, \qquad
z^{(2)} = \begin{pmatrix} 0.78539 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}, \qquad
z^{(3)} = \begin{pmatrix} 0.63662 \\ 0 \\ 1 \end{pmatrix}, \qquad y = -0.63662. \qquad (5.28)$$
Notice that this function preserves the property of the sine function that sin x =
− sin(x + π ).
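The sine network can be evaluated directly from the matrices in Eq. (5.26) by composing the layers as in Eq. (5.25); the sketch below reproduces the values in Eqs. (5.27) and (5.28) to the precision of the printed weights.

```python
import numpy as np

relu = lambda u: np.maximum(u, 0.0)

A01 = np.array([[1.0, -np.pi],
                [-1.0, np.pi],
                [0.0, 1.0]])
A12 = np.array([[1.0, 0.0, -np.pi/2],
                [-1.0, -1e6, np.pi/2],
                [0.0, 1.0, -np.pi/2],
                [-1e6, -1.0, np.pi/2],
                [0.0, 0.0, 1.0]])
A23 = np.array([[-0.664439, -0.664439, -1e6, -1e6, 1.15847],
                [-1e6, -1e6, -0.664439, -0.664439, 1.15847],
                [0.0, 0.0, 0.0, 0.0, 1.0]])
A34 = np.array([-1.0, 1.0, 0.0])

def net(x1):
    z0 = np.array([x1, 1.0])          # input plus the bias entry
    z1 = relu(A01 @ z0)               # Eq. (5.25a)
    z2 = relu(A12 @ z1)               # Eq. (5.25b)
    z3 = relu(A23 @ z2)               # Eq. (5.25c)
    return A34 @ z3                   # identity output, Eq. (5.25d)

print(net(np.pi/4), net(7*np.pi/4))   # approximately 0.63662 and -0.63662
```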
Going beyond these two example inputs, we point out what this network is really
doing: it is using a piecewise linear approximation to the sine function using four
Fig. 5.9 Comparison of the deep neural network with the sine function. Note that the DNN is
computing a piecewise linear approximation
different lines defined on each of the intervals of length π/2 between 0 and 2π . The
network is really just determining which of the four lines to use and then computing
where on the line the input falls. A comparison between the neural network and
the true function is shown in Fig. 5.9. This neural network is not the best possible
neural network for approximating this function, but it does a reasonable job; in fact,
this network was fit “by hand” for this example. If we trained a neural network using
training data from the sine function, it is likely we would have obtained something
very different.
Above we discussed the process of training a simple neural network using back
propagation where the chain rule is used to get the derivative of the loss function
with respect to a weight or bias. The process is the same for DNNs, but the details are
significantly messier. This is due to the fact that in a DNN there are typically many
weights and biases. Using our notation for a DNN the derivative of the squared-error
loss function for a weight in layer ℓ of an L layer network is
$$\frac{\partial L}{\partial w_{ij}^{\ell\to\ell'}} = \delta_i^{(\ell')}\, z_j^{(\ell)}, \qquad (5.29)$$
where u_j^{(ℓ)} is the value of node j in hidden layer ℓ before applying the activation
function σ(u). In this case the value of δ_i^{(ℓ)} is the sum of all of the δ's that z_i connects to:
$$\delta_i^{(\ell)} = \begin{cases}
\left(\displaystyle\sum_{k=1}^{K^{(\ell+1)}} w_{ki}^{\ell\to\ell+1}\,\delta_k^{(\ell+1)}\right)\dfrac{d\sigma}{du_i^{(\ell)}} & \ell < L+1 \\[2ex]
(\hat{y} - y)\,\dfrac{do}{du_o} & \ell = L+1
\end{cases}. \qquad (5.30)$$
Because the value for a derivative for weights in the initial layers depends on the
deltas from the deeper layers, there can be problems with deltas growing too large
or growing too small. If the deltas grow too large, then the updating procedure will
be unstable, and, if the deltas are too small, then the network will not train. We
will return to this topic later when we discuss recurrent neural networks in a future
chapter.
There is also the issue of how to initialize the weights when training the model.
Typically, the initial weights and biases are randomly drawn from a standard normal
distribution (i.e., a normal distribution with mean zero and standard deviation one).
Other approaches include the Xavier initialization scheme [1] where the weights are
drawn from a normal distribution with mean zero and standard deviation inversely proportional to
the square root of the number of inputs to the layer. The initialization of weights is
important because, as we can see in Eq. (5.30), the magnitude of the weights leads
directly to the magnitude of δ_i^{(ℓ)}: too large initial weights make the network update
unstable and too small initial weights make the network not update enough.
To this point we have only considered regression problems with feed-forward neural
networks. Everything we have discussed can be applied to classification problems
with only minor adjustments. The output layer for a K-class classification problem
will be the softmax function (see Eq. (2.24)) that takes as input K different units
from the final hidden layer and returns K probabilities, one for each class. For
the loss function we can use the cross-entropy function (see Eq. (2.25)) or similar
variants.
We only need to make changes to the loss function, the output function, and the size of the final hidden layer.
The other features of the network will be the same as regression: we have choices
for the activation functions, the number of layers, and the number of units in the
layers. The training is also the same, and we use gradient descent to decrease the
loss function during the training.
where λ > 0 is the strength of the regularization and α ∈ [0, 1] gives the balance
between the L1 and L2 penalties (sometimes called weight decay). Both of these
parameters need to be chosen by the user and can be chosen in a similar way as
in regularized regression: λ is chosen to be the largest value that gives a model
within one standard error of the best model over all the λ tested. Choosing α can be guided based
on whether sparsity, that is having many zero weights and biases, is desired for a
network. If sparsity is desired, α should be close to 1.
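A sketch of a combined L1/L2 penalty consistent with this description is given below; since the exact form of the penalty equation is not reproduced here, this is an assumed but standard elastic-net-style combination of the two terms.

```python
import numpy as np

def penalty(weights, lam=1e-3, alpha=0.5):
    """Elastic-net style penalty: lam sets the strength, alpha the L1/L2 balance."""
    w = np.concatenate([wi.ravel() for wi in weights])
    return lam * (alpha * np.sum(np.abs(w)) + (1.0 - alpha) * np.sum(w**2))

# the penalty is simply added to the data-fit loss before computing gradients, e.g.
# total_loss = data_loss + penalty([W1, wo])
```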
5.4.1 Dropout
different training data. In this sense the network is a combination of many smaller
networks each specialized to different data.
When we use a neural network trained using dropout, we do not zero out any
of the nodes; instead we multiply all of the outgoing weights by the probability p of retaining a unit.
This makes the network effectively an average of all the sub-networks used during
training.
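The following sketch shows dropout on a single layer in the way described above: units are retained with probability p during training, and at prediction time nothing is dropped and the outputs are scaled by p (which is equivalent to scaling the outgoing weights).

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer_dropout(z, p=0.5, training=True):
    """Apply dropout to the hidden-unit values z with retention probability p."""
    if training:
        mask = rng.random(z.shape) < p    # keep each unit with probability p
        return z * mask
    return z * p                          # scale instead of dropping at test time

z = np.ones(10)
print(hidden_layer_dropout(z, training=True))    # roughly half the units zeroed
print(hidden_layer_dropout(z, training=False))   # all units scaled by p = 0.5
```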
Dropout was first presented by Srivastava, Hinton, et al. [2] and they suggest that
the probability of retaining a unit should be about p = 0.5 for hidden units
and closer to p = 1 for input units. Their paper shows that dropout works well
with other regularization techniques (L2 and L1 decay, for example). They also
find that dropout works with a variety of activation functions and network types.
They also give different inspirations for dropout including the superiority of sexual
reproduction to asexual reproduction to produce advanced species where different
parents can specialize in different tasks, or the superiority of having many small
conspiracies to one large conspiracy—10 small conspiracies of 5 people each are
likely to have more success than a single large conspiracy of 50 people. A different
biological explanation is that the human brain is using dropout all the time: our
neurons misfire for all kinds of reasons; maybe our brains are a natural average of
many “sub-brains.” Whatever the motivation, dropout is a standard regularization
technique for neural networks.
aggregate. The dependent variable for our study is the compressive strength of the
concrete in MPa. There are 1030 cases in the data.
Figure 5.10 shows scatter plots of the compressive strength of the concrete versus
each of the independent variables. From these plots we can see trends with respect
to the age of the concrete (i.e., new concrete has reduced strength), water content
(lower water content seems to indicate stronger concrete), and more cement tends to
make stronger concrete. The standard deviation of the compressive strength in the
data set is 16.698 MPa.
For our fitting we split the data randomly into a training and test set with 80%
of the data being used in the training set. We also normalize the inputs by dividing
each by the L2 norm of each input.
We begin by fitting a neural network with a single hidden layer of 8 hidden units and an
output layer with a single node for the compressive strength. There are 81 trainable
parameters in this model: 72 weights and biases connecting the input layer to the
hidden layer, and 9 weights and biases connecting the output layer to the hidden
layer. We use the Adam optimization method for training [5], a type of stochastic
gradient training algorithm, and train the model for 10^5 epochs with a batch size
of 32. The activation function we use is the hyperbolic tangent function; this was
chosen after some experimentation. We used Xavier initialization for the weights
and biases and used an L2 penalty of λ = 0.0005.
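One possible way to set up this single-hidden-layer model is sketched below using Keras; the text does not state which library was actually used, so the specific API calls, and the placeholder arrays X_train and y_train, are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="tanh",
                          kernel_initializer="glorot_normal",          # Xavier initialization
                          kernel_regularizer=tf.keras.regularizers.l2(5e-4),
                          input_shape=(8,)),
    tf.keras.layers.Dense(1)              # single output for the strength in MPa
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="mse", metrics=["mae"])
# history = model.fit(X_train, y_train, epochs=100_000, batch_size=32,
#                     validation_data=(X_test, y_test))
```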
For the single hidden layer neural network we achieve a mean-absolute error,
that is, the average absolute value of the difference between the true and predicted
values, of 5.52 MPa on the test set. This model also has R² = 0.800 on the
test set. From these numbers the model appears to be a useful predictor of the
compressive strength given the 8 inputs. To further inspect the model we can look
at the convergence of the mean-absolute error as a function of training epoch, as
shown in Fig. 5.11a. In this figure we can see that the error is decreasing for both
the training and test sets as the number of training epochs is increased. For this
reason, we can be reasonably assured that the model is not overfit to the training
data.
To attempt to improve on the results of a single hidden layer network, we fit a
two hidden layer network. It would be possible to use a randomly initialized network
and train it as we did with the single layer model, but in this case it makes sense to
start from the one layer model. We create a new neural network; this one has two
hidden layers; the first has 8 hidden units and the second has 2 hidden units. We
initialize the weights and biases in the first layer to be the same as what were fit in
the single layer network above. Then we train the other weights and biases on the
same training set. The idea here is to find a model that is more complicated and can
“correct” any deficiencies that we could not capture in a single layer model. The
convergence of the error for this model is shown in Fig. 5.11b. Here again we see
Fig. 5.10 Scatter plots showing the concrete strength versus the 8 different independent variables.
(a) Age. (b) Cement. (c) Fly ash. (d) Slag. (e) Water. (f) Superplasticizer. (g) Fine aggregate. (h)
Coarse aggregate
Fig. 5.11 Convergence of the mean-absolute error for the neural networks using the raw indepen-
dent variables. (a) Single layer. (b) Two layers. (c) Three layers
that the error is decreasing in both the test and training sets as the number of epochs
increases. It may be beneficial to continue training the model for more epochs, but
we stop here for this example. The mean-absolute error for the two-layer model is
4.96 MPa and R² = 0.831; both are an improvement over the single layer model.
If we continue and build a three hidden layer model by adding a new layer with
two hidden units, and initialize this model with the weights and biases from the first
two layers of the two hidden layer model, we get a model that performs worse on the
test data. In this case we have a mean-absolute error of 5.05 MPa and R² = 0.830.
Furthermore, we can see in Fig. 5.11c that the test error and training error have a
much noisier convergence plot. This is likely due to the added complexity of the
model: the three layer model has 99 parameters to fit.
This is an example of a more complicated model not necessarily performing
better than the simpler model. Some of this is undoubtedly due to the fact that the
optimization procedure does not find the absolute best model. It is minimizing the
loss function, but it might not be finding the global minimum. If we retrain the model
many times, we might be able to get a better model due to the random initialization
of some of the weights and biases, but there is no guarantee.
The original paper by Yeh defined a water to binder ratio as another independent
variable and also used the logarithm of the age in days of the concrete. The water to
binder ratio is a conventional parameter in studying concrete strength, as mentioned
by Yeh. The water to binder ratio is defined as the amount of water divided by the
sum of the cement and fly ash component. The water to binder ratio does seem to
have a strong relationship to the compressive strength, as seen in the scatter plot of
the compressive strength versus water to binder ratio in Fig. 5.12a. Using the water
to binder ratio and the logarithm of the age in days as part of a one-layer model
with all other network parameters the same (i.e., number of hidden units, activation
function, initialization, and regularization) gives a mean-absolute error of 4.95 MPa
and R² = 0.831. Initializing a two-layer model with the one-layer model, as we did
Fig. 5.12 Scatter plots showing the concrete strength versus the water to binder ratio and the
results predicted by the regression model in [3]. (a) Water to binder ratio. (b) Regression model
Fig. 5.13 Convergence of the mean-absolute error for the neural networks using the new
independent variables of log days and water to binder ratio. (a) Single layer. (b) Two layers
above, gives a model with a mean-absolute error of 4.84 MPa and R² = 0.834. The convergence of these models is shown in Fig. 5.13. These results indicate that adding new variables to the model, especially if they have physical meaning, can improve the model without adding much complexity.
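The two engineered features used here are simple to construct. Below is a minimal sketch assuming the concrete data is held in a pandas DataFrame; the file and column names are assumptions, not the names used in the original data set.

```python
# A sketch of constructing the engineered features described above.
import numpy as np
import pandas as pd

df = pd.read_csv("concrete.csv")  # hypothetical file name

# Water-to-binder ratio: water divided by the sum of cement and fly ash.
df["water_binder"] = df["water"] / (df["cement"] + df["fly_ash"])

# Logarithm of the age in days.
df["log_age"] = np.log(df["age"])
```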
The original paper containing the concrete data also fit a linear regression model
to the compressive strength as
f_c = a X^b (c \ln t + d),   (5.32)
where fc is the compressive strength and t is the age in days. Values for the regres-
sion coefficients a, b, c, and d are given in the paper for 4 different randomized
training sets. We use the average of these regression parameters to form a new independent variable for the model, fc: the predicted compressive strength from the linear regression model. The best value for R² on a test set for a linear
model reported by Yeh is 0.779. In Fig. 5.12b we see that the regression prediction
is directionally correct in that the regression model can qualitatively indicate the
compressive strength, but the quantitative prediction can be inaccurate.
Adding the regression prediction as an independent variable to the model is a means to give an input that is “close” to the correct answer. It is common to have
approximate models that need to be corrected, and we want the neural network to
learn this correction. We know that the linear regression model is missing some
features of the data; we want the neural network to improve the linear regression
model.
A one-layer model that includes all 11 independent variables (the original 8 inputs, the water to binder ratio, the log of age in days, and the linear regression prediction) gives a mean-absolute error of 5.13 MPa and R² = 0.821. This is an improvement over the one-layer model with only the raw data, but not an improvement over the model that includes the water to binder ratio and log age. Adding a second layer with two hidden units and initializing from the one-layer model gives a mean-absolute error of 4.96 MPa and R² = 0.836. This two-layer model has the highest R² of any model tested, but its mean-absolute error is slightly worse than that of the two-layer model with the new independent variables.
What should we take away from this case study? First, we found that for this
“small” data set of 1030 cases it is possible to predict the compressive strength of the
concrete to within about 5 MPa without too much trouble. We can also improve on
a linear regression model built with expert knowledge with simple neural networks.
Our study was not comprehensive in that we could have tried more varied network
architectures (different numbers of hidden units, etc.) and run more training epochs.
We did learn, however, that deeper networks (that is, networks with more hidden layers) do not always outperform shallower networks.
Notes and Further Reading
There are many resources for further reading on neural networks, deep neural
networks, and the issues with training them. The paper in Nature by LeCun, Bengio,
and Hinton is a great read to see some of the application areas and basics of these
networks [6]. Additionally, there are topics that we did not cover in this chapter that
may be of interest to readers. One topic that we did not cover is batch normalization
[7]. Batch normalization is an approach to speed up the training of networks by making sure that the variations of the inputs within a batch are accounted for. Finally, there
are whole books written about optimizers for training neural networks. Readers
interested in learning more should see [8].
Problems
5.1 Consider a neural network with two hidden layers. By hand compute the
derivative of the loss function with respect to a weight from each of the hidden
layers and show that it is equivalent to Eq. (5.29).
5.2 Consider a neural network with one hidden layer. By hand compute the back
propagation formula for a network regularized with L1 and L2 penalties as in
Eq. (5.31).
5.3 Repeat problem 3.3 using neural networks of 1, 2, and 3 layers. How accurate
can you get your models? Plot your results as a function of z at time τ = 10.
5.4 Repeat problem 5.3 by building models that include a) dropout, b) L2 regularization, and c) L1 regularization.
5.5 Build neural networks of 1, 2, and 3 layers to approximate the Rosenbrock
function
by sampling 100 points randomly in the interval [−1, 1] to create the training set.
Try several different activation functions and regularizations. Use the models to find
the location of the minimum of the function and compare your result with the true
value of x = −1 and y = 1.
References
1. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, pages 249–256, 2010.
2. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine
Learning Research, 15(1):1929–1958, 2014.
3. I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks.
Cement and Concrete Research, 28(12):1797–1808, 1998.
4. Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
5. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
6. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444,
2015.
7. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
8. Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016.
Chapter 6
Convolutional Neural Networks for Scientific Images and Other Large Data Sets
Abstract For problems where there is a large number of inputs, such as an image
where each pixel can be considered an input, a feed-forward neural network would
have a truly huge number of weight and bias parameters to fit during training.
For such problems rather than considering each input to be independent, we take
advantage of the fact that the input has structure, even if we do not know what that
structure is, by using convolutions. In a convolution we apply a function of a particular form repeatedly over the inputs to create the hidden layer variables. In image data,
this allows the network to highlight important features in an image. We begin the
discussion of convolutions by defining and giving examples of 1-D convolutions
for functions and for vectors, before extending to multiple dimensions. We then
talk about a special type of operation applied in machine learning called pooling,
before showing how convolutions work in neural networks. We apply convolutional
neural networks to the Fashion MNIST data set and give a case study of using these
networks, along with a concept called transfer learning, to find volcanoes on Venus.
6.1 Convolutions
The neural networks that we discussed in the previous chapter had many possible
unknown parameters (i.e., the weights and biases) to train. The number of parame-
ters increases as the number of inputs is increased and as the network gets deeper
by adding more hidden layers. In this chapter we investigate how we can reduce the
number of parameters to fit when we have a large number of inputs to the model, as,
for example, if we have an image as the input to the model. This is done by making
sets of weights have the same values and applying those weights repeatedly in a
pattern.
We begin with the definition of a convolution of two functions. Consider two
functions f and g; the convolution f ∗ g is defined as
(f ∗ g)(t) = \int_{-\infty}^{\infty} f(τ)\, g(t − τ)\, dτ = \int_{-\infty}^{\infty} f(t − τ)\, g(τ)\, dτ.   (6.1)
If g(t) is nonzero only on the interval [−a, a], then the convolution becomes

(f ∗ g)(t) = \int_{-\infty}^{\infty} f(t − τ)\, g(τ)\, dτ = \int_{-a}^{a} f(t − τ)\, g(τ)\, dτ.   (6.2)
This shows that the convolution evaluated at t is the integral of f around the point t
times the function g.
Oftentimes the convolution will produce a smoothed version of the function f (t)
if g(t) is a function that has finite support. Some functions that perform a smoothing
are the rectangular function,
g_{rect}(t) = \frac{1}{2a} \bigl( H(t + a) − H(t − a) \bigr),   (6.3)
where H (t) is the Heaviside unit step function, and the triangle function
g_{tri}(t) = \frac{1}{a^2} (a − |t|)_+,   (6.4)
where again the plus subscript denotes that the function returns zero if its argument
is negative. If g(t) is the rectangular function, the convolution becomes a simple
integral:
Fig. 6.1 The weighting functions g_rect, g_tri, and g_Gauss plotted as functions of t
(f ∗ g_{rect})(t) = \frac{1}{2a} \int_{-a}^{a} f(t − τ)\, dτ.   (6.5)
This convolution replaces f (t) with the average of f (t) over a finite range. As
integrals/averages are almost always smoother than the original function, f ∗ grect
is a smoothed version of f (t). Similarly, when g is the triangular function, we get
an average of f that is weighted so that points near t are more strongly weighted.
We could define many possible functions g that act as a weighting function
in averaging f , and thereby smoothing it. An extreme example is the Gaussian
function that exponentially approaches zero:
g_{Gauss}(t) = \frac{1}{\sqrt{2π σ^2}}\, e^{-t^2/(2σ^2)},   (6.6)
where σ > 0 is a parameter. The integral of the Gaussian over the entire real line is
1 and it has a maximum at gGauss (0), so it will produce a weighted average of f that
is centered at t.
In Fig. 6.1 we plot the values of the different example g functions defined above.
We see that with a = 2σ the Gaussian and triangular functions have roughly the
same shape, with the Gaussian function more smoothly transitioning to zero and not
reaching as high of a value. We will see that convolutions with these two functions
behave in a similar manner.
Because convolutions smooth functions, they can be useful for treating noisy
data or functions before differentiating them. For example, if one wants to take a
derivative of a signal, the differentiation operation will amplify the noise. If we
Fig. 6.2 Demonstration of the smoothing of a signal f (t) and using convolutions with grect , gtri ,
and gGauss with a = 4 and σ = 2
perform a smoothing convolution first, we can improve the derivative estimate. This
behavior is shown in Fig. 6.2 where a noisy signal is convolved with the three g
functions defined above using a = 4 and σ = 2 to get a much smoother function.
Here we see that the rectangular function smooths the function more drastically,
making the peaks much lower. All three functions remove the narrow peaks that
appear next to the “main” peak, but the triangular and Gaussian functions result in
a convolution that better preserves the shape of the large peaks.
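The smoothing shown in Fig. 6.2 can be reproduced numerically by discretizing the weighting functions and applying a discrete convolution. The following sketch assumes NumPy; the test signal and grid spacing are assumptions, while a = 4 and σ = 2 follow the figure.

```python
# A sketch of smoothing a noisy signal with discretized versions of
# g_rect, g_tri, and g_Gauss (the test signal is an assumption).
import numpy as np

t = np.arange(0.0, 100.0, 0.25)          # assumed time grid
dt = t[1] - t[0]
rng = np.random.default_rng(0)
f = (np.exp(-0.5 * (t - 30)**2) + 2.0 * np.exp(-0.5 * (t - 60)**2)
     + 0.3 * rng.standard_normal(t.size))

a, sigma = 4.0, 2.0
tau = np.arange(-3 * a, 3 * a + dt, dt)  # kernel support grid
g_rect = np.where(np.abs(tau) <= a, 1.0 / (2 * a), 0.0)
g_tri = np.maximum(a - np.abs(tau), 0.0) / a**2
g_gauss = np.exp(-tau**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Multiplying by dt makes the discrete sum approximate the continuous integral,
# so each kernel still integrates to (approximately) one.
smooth_rect = np.convolve(f, g_rect, mode="same") * dt
smooth_tri = np.convolve(f, g_tri, mode="same") * dt
smooth_gauss = np.convolve(f, g_gauss, mode="same") * dt
```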
The derivative of a convolution is related to the convolution of a derivative. To
see this we mechanically apply the derivative to a convolution:
\frac{d}{dt}(f ∗ g)(t) = \frac{d}{dt} \int_{-\infty}^{\infty} f(τ)\, g(t − τ)\, dτ   (6.7)
= \int_{-\infty}^{\infty} f(τ)\, \frac{d}{dt} g(t − τ)\, dτ
= \left( f ∗ \frac{dg}{dt} \right)(t) = \left( \frac{df}{dt} ∗ g \right)(t).
Fig. 6.3 Demonstration of the approximate derivative calculated by taking the derivative of a
convolution with the rectangular function, grect , and f (t) using a = 2
Applying this result to the rectangular function, whose derivative is a pair of delta functions at ±a, gives

\frac{d}{dt}(f ∗ g_{rect})(t) = \frac{1}{2a} \int_{-\infty}^{\infty} f(t − τ)\, \bigl( δ(τ − a) − δ(τ + a) \bigr)\, dτ = \frac{f(t − a) − f(t + a)}{2a}.   (6.8)
When we use other weighting functions, we get smoother versions of the derivative,
but all of them require convolving with the derivative of the weighting function.
We demonstrate that d/dt (f ∗ g_rect)(t) gives the negative of the finite difference estimate of the derivative in Fig. 6.3. Here we plot the derivative of the convolution as well as the true value of −1 times the derivative of f(t). In the derivative of the signal we see a large negative value followed by a large positive value when the original signal goes through a peak. The result from d/dt (f ∗ g_rect)(t) smooths the
derivative noticeably. This is a result of the signal being smoothed by convolving
with grect before taking the derivative. This is important if we believe that there is
noise in the signal, but we need to compute the derivative.
Before continuing on, we point out that convolutions can be extended to d
dimensions in a straightforward manner. If f and g are functions of x1 , . . . , xd ,
then the convolution is
(f ∗ g)(x_1, \dots, x_d) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} d\hat{x}_1 \cdots d\hat{x}_d\, f(\hat{x}_1, \dots, \hat{x}_d)\, g(x_1 − \hat{x}_1, \dots, x_d − \hat{x}_d).   (6.9)
The commutativity and linearity of the convolution also hold in multiple dimen-
sions. The special case of d = 2 is given by
(f ∗ g)(x_1, x_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d\hat{x}_1\, d\hat{x}_2\, f(\hat{x}_1, \hat{x}_2)\, g(x_1 − \hat{x}_1, x_2 − \hat{x}_2).   (6.10)
Convolutions
A convolution involves the integral of the product of functions. This operation
can smooth a function under many circumstances and can provide a way to
take a derivative of a noisy function without amplifying the noise.
In many cases we do not have a continuous function to work with and we must deal
with the function at a finite number of points. In this case we consider a convolution
of two vectors. If f is a vector of length N, we write the nth component of the discrete
convolution, with subscripts denoting vector elements with zero-based indexing,1 as
(f ∗ g)_n = \sum_{m=-\infty}^{\infty} f_m\, g_{n−m} = \sum_{m=-\infty}^{\infty} f_{n−m}\, g_m, \qquad n = 0, \dots, N − 1,   (6.11)
with the convention that all indices out of bounds for f, e.g., negative values or values
greater than N − 1, are zero. This is called zero padding: we add enough zeros to
the ends of f so that the result of the convolution has the same size as the original
vector. We can define the vector g to have negative indices, however, and this will
be worthwhile, as an example shows. The vector g that is used in the convolution is
often called a “kernel.” The size of the kernel is the number of entries in g.
The discrete convolution can be used to define smoothing in a similar way to the
continuous case. For example, if we set
g_{avg,m} = \begin{cases} \frac{1}{3} & m = −1, 0, \text{ or } 1 \\ 0 & \text{otherwise} \end{cases},   (6.12)
1 We need to use zero-based indexing for our vectors for the formulas in this section to make sense
without adding several non-intuitive constants to our indices.
then the convolution is a three-point moving average of f:

(f ∗ g_{avg})_n = \frac{1}{3} (f_{n+1} + f_n + f_{n−1}).   (6.13)
In this case the kernel gavg has a size of 3.
Note that the way we have defined the discrete convolution, near the end of the
vector f, the formulas will be slightly different. For instance, the convolution that
gives the average of f will behave differently near the edge:
(f ∗ g_{avg})_0 = \frac{1}{3}(f_1 + f_0), \qquad (f ∗ g_{avg})_{N−1} = \frac{1}{3}(f_{N−1} + f_{N−2}).   (6.14)
If the vector is long, i.e., N is large, these edge effects will have a small impact on
the result from the discrete convolution. We could also make other changes, such as
making the vector periodic by setting f−1 = fN −1 and fN = f0 , or by changing g
at the boundary. In general, we will be able to ignore these edge effects in our use
of convolutions.
We can also make the convolution give us a finite difference approximation of a
second derivative. If we define
g_{D2,m} = \begin{cases} \frac{1}{h^2} & m = −1 \text{ or } 1 \\ −\frac{2}{h^2} & m = 0 \\ 0 & \text{otherwise} \end{cases},   (6.15)
the discrete convolution will be the finite difference approximation to the second
derivative of f :
(f ∗ g_{D2})_n = \frac{1}{h^2}(f_{n+1} − 2 f_n + f_{n−1}).   (6.16)
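The averaging and second-derivative kernels can be applied with a few lines of NumPy. The sketch below uses np.convolve with mode="same", which zero-pads the ends of the vector; the test signal f = x² is an assumption, chosen because its second derivative is exactly 2.

```python
# A sketch of the discrete convolutions with g_avg and g_D2 from above.
import numpy as np

h = 0.1
x = np.arange(0.0, 2.0 + h, h)
f = x**2                                   # test signal with second derivative 2

g_avg = np.array([1/3, 1/3, 1/3])          # Eq. (6.12): three-point average
g_D2 = np.array([1.0, -2.0, 1.0]) / h**2   # Eq. (6.15): second-derivative kernel

f_smooth = np.convolve(f, g_avg, mode="same")   # zero-padded moving average
f_second = np.convolve(f, g_D2, mode="same")

print(f_second[1:-1])   # interior entries are close to 2, the exact second derivative
```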
When the data form a matrix F, such as an image, the kernel is also a matrix G, and we can think of the matrix G as giving a stencil to average F over. For instance, if G is
G_{mean} = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},   (6.18)
the convolution (F ∗ G)m,n gives the average of the nine values near Fm,n :
(F ∗ G_{mean})_{m,n} = \frac{1}{9} \sum_{i=−1}^{1} \sum_{j=−1}^{1} F_{m−i,\,n−j}.   (6.19)
In practice this kernel Gmean is called the mean filter of size 3 × 3. We could define
mean filters of different sizes that would average F over a larger area.
Also, we can take the discrete Laplacian, ∇² = ∂²/∂x² + ∂²/∂y², of the signal with the proper definition of G:
G_{Lap} = \frac{1}{h^2} \begin{pmatrix} 0 & 1 & 0 \\ 1 & −4 & 1 \\ 0 & 1 & 0 \end{pmatrix}.   (6.20)
This makes the convolution compute the finite difference approximation of the
Laplacian
(F ∗ G_{Lap})_{m,n} = \frac{1}{h^2} \left[ (F_{m−1,n} − 2F_{m,n} + F_{m+1,n}) + (F_{m,n−1} − 2F_{m,n} + F_{m,n+1}) \right].   (6.21)
When considering discrete convolutions we need to consider what happens at the
edge of the data. Above we mentioned that we add zeros around the edges of the data
so that the convolution result has the same size as the original data. Alternatively,
one can add no zeros, in which case the data size shrinks because the convolution cannot be applied at the edges. To demonstrate this we consider the data
F = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix},
and we want to apply the mean filter to this data given by Eq. (6.18). If we do not
use padding, then there is only one entry in F that we can apply the convolution to,
the value 5 in the middle:
F ∗ G_{mean} = \begin{pmatrix} 5 \end{pmatrix} \qquad \text{(no padding)}.
If we pad the data with a row or column on each side, we get a result that has the
same size as F:
F ∗ G_{mean} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 3 & 0 \\ 0 & 4 & 5 & 6 & 0 \\ 0 & 7 & 8 & 9 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} ∗ G_{mean} = \begin{pmatrix} \frac{4}{3} & \frac{7}{3} & \frac{16}{9} \\ 3 & 5 & \frac{11}{3} \\ \frac{8}{3} & \frac{13}{3} & \frac{28}{9} \end{pmatrix} \qquad \text{(with padding)}.
The number of zeros that need to be added in the padding depends on the size of the
convolution matrix (in this case G). If the matrix F is large, the effects at the edge
may be negligible and the issue of padding choice may not matter.
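The padded and unpadded results above can be reproduced with a 2-D convolution routine. The sketch below assumes SciPy's convolve2d is available; mode="valid" applies no padding, while mode="same" zero-pads the edges.

```python
# A sketch of the 3x3 mean filter applied to the small example matrix F.
import numpy as np
from scipy.signal import convolve2d

F = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
G_mean = np.ones((3, 3)) / 9.0

no_padding = convolve2d(F, G_mean, mode="valid")       # 1x1 result: [[5.]]
with_padding = convolve2d(F, G_mean, mode="same",      # 3x3 result, zero-padded edges
                          boundary="fill", fillvalue=0)
print(no_padding)
print(with_padding)
```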
Figure 6.4 demonstrates the application of convolutions in 2-D. This convolution
has a 3 × 3 matrix for G. The top portion of the figure shows F, G, and F ∗ G. The
middle panel shows how 9 entries in F are combined to get an interior entry, and
the bottom panel demonstrates the necessity of padding to keep the size of the
result identical to the original matrix.
For a further example of 2-D convolutions we will consider an image and apply
different convolutions to it. In Fig. 6.5a we show a grayscale image normalized so
that each pixel takes on a value between 0 and 1. When we apply the 2-D Laplacian
from Eq. (6.21) to the image, the result is Fig. 6.5b. The resulting image seems to
highlight the edges of the original image. We see the outline of the figures and
objects in the image; this demonstrates that the Laplacian convolution can find
where there are sharp changes in the data.
The Laplacian convolution is sensitive to noise in the data, however. In Fig. 6.5c
the original image has had noise added to it in the form of a Gaussian with mean 0
and standard deviation 0.025. The noise is hardly perceptible to the eye, but when
the Laplacian convolution is applied to the image, as shown in Fig. 6.5d, we see that
the result has amplified noise and that the edges of the image are not perceptible.
We can remedy the situation by first applying a mean filter. In Fig. 6.5e we apply a mean convolution of size 9 × 9 to the noisy image. The resulting image is said to have a mean filter applied, and it shows noticeable smoothing. However,
when the mean-filtered image has the Laplacian convolution applied to it, we can
again see the edges of the image in Fig. 6.5f. The result does have the artifact that the
edges have spread out, i.e., they are thicker, compared with Fig. 6.5b; nevertheless,
the result is superior to the Laplacian applied to the noisy image.
Many of the ideas behind convolutional neural networks come from image processing. Many images have multiple channels, with 3 being the most common: a
Fig. 6.4 Illustration of a 2-D convolution. At the top F and G are defined with the resulting
convolution shown at the top right. The middle of the figure shows how G is applied to get an
entry in F ∗ G, and the bottom portion demonstrates how the convolution needs padded values to
preserve the size of F
Fig. 6.5 Convolutions applied to an original image (a) and a version of the image with Gaussian
noise. (a) Original image. (b) 2-D Laplacian convolution of original. (c) Original image with noise.
(d) 2-D Laplacian convolution of noisy image. (e) Mean convolution on noisy image. (f) 2-D
Laplacian convolution of mean-filtered noisy image. Photo credit: the author
channel for the red, green, and blue levels in the image. This means that an image
is actually 3 matrices. When applying a convolution to this type of input, a different
convolution is applied to each matrix and the results are summed up. To define this
convolution we consider C channels stored in a tensor of size M × N × C (i.e., C is
the number of matrices of size M × N ). Call this tensor F and we wish to convolve
it with another tensor G. The convolution will be the sum of C 2-D convolutions
F ∗ G = \sum_{c=1}^{C} F_c ∗ G_c,   (6.22)

where F_c and G_c are the cth channels of F and G.
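A small sketch of this channel-summed convolution, assuming SciPy for the 2-D convolutions (the array sizes are assumptions), is:

```python
# A sketch of Eq. (6.22): convolve each channel separately and sum the results.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
C, M, N = 3, 8, 8
F = rng.standard_normal((M, N, C))      # C channels, each of size M x N
G = rng.standard_normal((3, 3, C))      # one 3x3 kernel per channel

result = np.zeros((M, N))
for c in range(C):
    result += convolve2d(F[:, :, c], G[:, :, c], mode="same")
```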
Discrete Convolutions
In machine learning we typically deal with inputs and outputs that are of a
finite number of dimensions. In this case convolutions turn into operations
on vectors or matrices depending on whether the data layout is 1-D or 2-
D. Whereas continuous convolutions involve multiplication and integration,
discrete convolutions look like elementwise multiplication and summation.
When applying discrete convolutions we can smooth a signal or take its
derivative as before. Additionally, we usually have to pad the data if we want
to get an output from the convolution that is the same size as the input. One
of the most important features of a discrete convolution is that we can define
the kernel G in 2-D of size K × K using only K² values.
6.2 Pooling
A pooling operation reduces the size of a vector by combining neighboring entries. For a vector x of length N, the average pooling operation with stride S is defined as

\mathrm{AvgPool}(x)_k = \frac{1}{S} \sum_{i=(k−1)S+1}^{kS} x_i, \qquad k = 1, \dots, N/S.   (6.23)
The length of the resulting vector is N/S, i.e., the pooling reduces the length of the
vector by a factor of S. As an example consider a vector of length 9 given by
In this example, the input vector had a large value in the middle of the vector, and the result from the average pooling operation did as well. Notice that if we shift the location of the large value by one index either way (to make the large value the fourth or sixth entry in the vector), the value from the average pooling operation does not change. That is, if we shift to the left or right (denoted by a ∓ subscript),
However, if we shifted the position of the large value in the vector too far, the
average pooling result would change.
This ability of average pooling to be somewhat insensitive to the exact position
of features in the input vector is one of the reasons they are valuable in machine
learning. There are many cases where shifting a vector by a small amount should
not affect the behavior of a machine learning model, and average pooling helps add
this insensitivity. The insensitivity to shifts in the input is a function of the size of
the stride, S. If S is large, then one can shift the input by a lot and not change the
result much with the side effect of reducing the size of the signal more. When S is
small, the pooling result will be more sensitive to shifts in the data.
The other common type of pooling called “max” pooling involves finding the
maximum value in the stride. It behaves in much the same way as the average filter.
The max pooling operation of stride S is
Our example above gives a slightly different result using max pooling than with average pooling, but the character of the result is similar.
Notice that max pooling preserved the largest value of the input vector, whereas
average pooling reduced it through the averaging process. In this example shifting
by one element does not change the value, as with average pooling.
There is a special name for pooling operations when the stride S is equal to the
vector length N . These operations are called global pooling operations and they
reduce the input to a single value. Clearly, global pooling is not sensitive to shifting
the inputs.
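The one-dimensional pooling operations can be written compactly with NumPy when the vector length is a multiple of the stride; the example vector below is an assumption used only to illustrate the behavior described above.

```python
# A sketch of 1-D average and max pooling with stride S, following Eq. (6.23).
import numpy as np

def avg_pool(x, S):
    return x.reshape(-1, S).mean(axis=1)

def max_pool(x, S):
    return x.reshape(-1, S).max(axis=1)

x = np.array([0., 0., 0., 0., 10., 0., 0., 0., 0.])   # assumed length-9 example
print(avg_pool(x, 3))   # approximately [0., 3.33, 0.]
print(max_pool(x, 3))   # [0., 10., 0.]
```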
Pooling can also be applied to a matrix X of size M × N by pooling over S × S patches. The average pooling operation is then

\mathrm{AvgPool}(X)_{kℓ} = \frac{1}{S^2} \sum_{i=(k−1)S+1}^{kS} \sum_{j=(ℓ−1)S+1}^{ℓS} X_{ij}, \qquad k = 1, \dots, M/S, \quad ℓ = 1, \dots, N/S.   (6.25)
Similarly, the max pooling operation is
Fig. 6.6 Max pooling of stride 9 applied to the Laplacian-convolved images from Fig. 6.5. (a)
2-D Laplacian convolution of original. (b) Max pooling of (a). (c) 2-D Laplacian convolution of
mean-filtered noisy image. (d) Max pooling of (c)
Pooling
Pooling is an operation that can be applied to vectors or matrices to find
features in the data. The pooling operation decreases the size of the signal and
has a result that is somewhat insensitive to the exact location of a feature in the
signal. This insensitivity will be used by machine learning models. Common
types of pooling include average pooling that returns the average of the input
over a particular patch of the input and the max pooling operation that returns
the maximum value over a patch of the input.
Fig. 6.7 Schematic of a typical convolutional neural network. The number of planes in a layer indicates the number of channels, and the plane area represents the size of the data. The dense layers before the output nodes are not shown
We have already seen the fashion MNIST data set in Chap. 4. It is a set of grayscale
images of size 28 × 28 each containing one of ten types of clothing articles (e.g.,
shirt, sneaker, dress, boot, etc.). Here we use a convolutional neural network to try
to classify these images. There are many examples of CNNs applied to this problem
available and the idea for the network structure we use comes from [1].
Our network begins with a convolutional layer applied to the original 28 × 28
image. This convolution layer has 64 channels and each convolution is a 2 × 2
convolution. This first convolution layer then has 2 × 2 × 64 + 64 = 320 unknown
parameters. The additional 64 are for a bias in each channel. This layer is then
followed by a max pooling layer with stride 2. After the max pooling layer there
are 64 channels each with a 14 × 14 image. The max pooling layer is followed by a
dropout layer that randomly sets some of the values in the 64 channels to zero.
Next, there is a convolutional layer of size 2 × 2 applied to create 32 channels.
Each of 32 convolutions is 2 × 2 for each of the 64 input channels so that the total
number of parameters is (2 × 2) × 64 × 32 + 32 = 8 224. The result from this
second convolutional layer is 32 channels each with a size 14 × 14. This passes
to a max pooling layer with stride 2 and again to a dropout layer, resulting in 32
channels each with size 7 × 7. This then passes to a fully connected layer with 256
neurons and the ReLU activation function. In the fully connected layer there are
256 × 32 × 7 × 7 + 256 = 401 664 parameters. Dropout is applied before a final fully connected layer with 10 neurons that computes the softmax output for the 10 different classes of images. This final layer has 256 × 10 + 10 = 2 570 weights to
fit.
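A sketch of a network with this structure, written with the Keras API (assumed here; the dropout rates and the activations of the convolutional layers are not specified above and are assumptions), is given below. The parameter counts in the comments match those in the text.

```python
# A minimal sketch of the CNN architecture described above for Fashion MNIST.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    # 64 channels of 2x2 convolutions: 2*2*1*64 + 64 = 320 parameters
    layers.Conv2D(64, (2, 2), padding="same", activation="relu"),  # activation assumed
    layers.MaxPooling2D(pool_size=(2, 2)),       # 28x28 -> 14x14
    layers.Dropout(0.25),                        # dropout rate assumed
    # 32 channels of 2x2 convolutions: 2*2*64*32 + 32 = 8 224 parameters
    layers.Conv2D(32, (2, 2), padding="same", activation="relu"),  # activation assumed
    layers.MaxPooling2D(pool_size=(2, 2)),       # 14x14 -> 7x7
    layers.Dropout(0.25),                        # dropout rate assumed
    layers.Flatten(),
    # fully connected layer: 7*7*32*256 + 256 = 401 664 parameters
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.25),                        # dropout rate assumed
    # softmax output for the 10 classes: 256*10 + 10 = 2 570 parameters
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # assumes integer class labels
              metrics=["accuracy"])
model.summary()  # should report 412 778 trainable parameters
```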
This network overall has 412 778 parameters to fit. We fit it with the Adam
optimizer and train it with a batch size of 64 for 10 epochs using 60 000 training
cases and 10 000 test cases. After training the model has a training accuracy of
87.88% and a test accuracy of 89.98%, indicating that we can expect the model to
correctly determine what type of item is in the picture about 90% of the time.
Our real interest in this example is how it illustrates the behavior of the trained
convolution layers. In Fig. 6.8 we show the values of the intermediate channels for
two images. At the top we show the original image that is input to the CNN. Below
that we show the 64 channels that result from applying the first convolutional layer
(the images are arranged in a 4 × 12 grid). Then the max pooling layer is applied,
and the result is a shrinking of each channel by a factor of 2 in each dimension. Then
the results of the 32 channel convolution are shown below that followed by another
max pooling layer. The result of this layer is then fed to the fully connected layers
and then the output.
Looking at Fig. 6.8 we see what the convolutions are doing: they are detecting the
edges of the items. For instance, in the coat we see that the shoulders of the coat are
highlighted and then in the final layer before the fully connected layer, the locations
of the dark patches indicate the overall shape of the item. The sandal at the bottom
of the figure has different areas highlighted in this final layer: it is these patterns the
model uses to distinguish items.
6.5 Case Study: Finding Volcanoes on Venus with Pre-fit Models
One of the challenges of convolutional neural networks is that they can take a long
time to train and can be difficult to properly design (i.e., choose the convolution
sizes, etc.) for a given task. For this reason it is often productive to begin with a
network that has been pre-trained on some representative task and augment it to
solve the problem at hand. This will often take the form of loading a pre-trained
model and adding a number of fully connected layers to the end of the network
and training just those added layers on the task at hand, freezing the weights in the
pre-trained network.
In this case study we will do just that using the EfficientNet network developed
by Google [2]. The pre-trained version of EfficientNet that we use has been trained
to solve the “ImageNet” classification problem where an image is classified as one
of 1000 classes [3]. The version of EfficientNet that we use takes in an image of
size 240 × 240 with 3 color channels (i.e., the input is of size 240 × 240 × 3). The
EfficientNet model has 7.8 million parameters to train, and thankfully this work has
already been done for us. We will use this pre-trained network to find volcanoes3 on
Venus.
3 The Oxford English Dictionary indicates that both volcanoes and volcanos are acceptable plurals. Based on the quotations provided therein, volcanoes seems to be the more common form in modern usage.
Taking a model that has been fit for one task and then adapting it for another task is called transfer learning. The idea behind transfer learning is that complicated networks such as EfficientNet require a large amount of data and computational power to fit. However, if we change the network slightly, such as adding an
additional layer before the output, we can use a small amount of data to train that
layer to learn how to adjust the result of the original network to the new task.
This idea has successfully been applied in scientific applications to transfer learn
the result of an experiment from a neural network that is trained on low fidelity
simulations [4].
The data set that we use is based on the images collected by radar on the Magellan
spacecraft from 1990–1994 [5]. The images are each of size 1024 × 1024 with a
single gray color channel (i.e., the images are of size 1024 × 1024 × 1) [6]. Each
image comes with a key specifying the location in the image of a volcano and
the confidence that the object is actually a volcano with the following meaning:
1 indicates a definite volcano, 2 is for a probable volcano, 3 is a possible volcano, and 4 indicates only a pit is visible. Given that no human has visited Venus to observe the volcanoes (nor, for that matter, could with current technology), this labeling
does contain uncertainty. As provided by the UCI machine learning repository [7],
there are 134 images.
To turn this into a useable data set for EfficientNet we take random 240 × 240
samples from the available images by randomly selecting a corner pixel for the
240 × 240 box. This provides the input data for our model. To get the dependent
variables we use the original data to determine if there is a volcano in the sampled
image and label the image with the lowest value volcano in the image (e.g., if there
is a “1” volcano in the image, the dependent variable for that image is 1, regardless
if there are less probable volcanoes in the image). If there is no volcano in the image,
its label is set to 5. We created 11 596 images from the original data set and used an
80/20 training/test split. The classes in the set have roughly one-third of the images
having no volcano and one-sixth of the images having each of the four classes of
volcano probabilities.
This created data set is of size 240 × 240 × 1, whereas EfficientNet expects a
color image with 3 channels. We make the data fit the expected input by simply copying
the image data into three identical channels. Additionally, EfficientNet outputs a
vector of size 1000 giving the probability of the image being each of 1000 classes.
Our classification problem only has 5 classes for the volcano probability. To deal
with this we add a softmax layer containing 5 nodes that takes as input the 1000
EfficientNet outputs. Therefore, our neural network will take as input a 240×240×3
image that is fed into EfficientNet and the outputs from EfficientNet are passed
through a softmax layer to get the probability of the image belonging to each of the
five classes.
The softmax layer has a weight connecting each of the 5 nodes to the 1000
EfficientNet outputs. Including a bias for each node results in 5005 parameters in
this final layer. When we train the model, we only train these 5005 parameters and
leave the pre-trained EfficientNet network’s parameters static. This is the transfer learning step: we transfer the output of EfficientNet to solve a different problem by training only this small added layer.
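A minimal sketch of this setup, assuming the Keras applications interface (EfficientNetB1 is assumed here because its default input size is 240 × 240; data loading and the channel-copying step are omitted), is:

```python
# A sketch of the transfer-learning setup described above.
import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained network with its original 1000-class ImageNet output kept.
base = tf.keras.applications.EfficientNetB1(weights="imagenet", include_top=True)
base.trainable = False                      # freeze all pre-trained parameters

inputs = layers.Input(shape=(240, 240, 3))
probs_1000 = base(inputs, training=False)   # vector of 1000 ImageNet probabilities
# New softmax layer: 1000*5 + 5 = 5005 trainable parameters.
outputs = layers.Dense(5, activation="softmax")(probs_1000)
model = models.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # assumes integer class labels
              metrics=["accuracy"])
```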
Transfer Learning
When one has a network that is trained to solve one task, it can be applied
to another, similar task with a smaller amount of data by adding layers to the
original network and training the model to learn a correction to the original
model’s outputs. This is called transfer learning. Because only the added layers are trained, it allows sophisticated models with a large number of parameters, which would otherwise require a large amount of data to train, to be applied to new problems with only a small amount of data.
Fig. 6.9 The 6 images with the highest predicted probability of being each class are shown in each
column with the first column being images the neural network identified as being in class 1. The
numbers in each image are the correct class
Notes and Further Reading
The topic of convolutional neural networks is a rapidly evolving field, with new tweaks on the concept being added continually. One aspect of these models that
has shown promise is residual layers where the inputs to a layer are operated on
by a convolution and then added back to the inputs [8]. These ideas have been
Fig. 6.10 The 6 images with the highest predicted probability of being in a class but with the
wrong prediction are shown in each column with the first column being images the neural network
identified as being in class 1. The numbers in each image are the correct class
shown to make more readily trainable networks that have high accuracy. A complete
discussion of these networks is outside the scope of our study, but via transfer
learning these models can be applied to a problem of the reader’s interest.
Problems
6.1 Produce values of the function sin(t) at 100 equally spaced points from 0 to 2π .
Add random noise to each point, where the noise at each point is a random number
between −0.1 and 0.1. Call these values f.
• Compute the convolution of f with g_avg as defined by Eq. (6.12) and plot the result
versus the sine function that generated f.
• Compute the convolution of f with gD2 using 2π/100 for h and plot the result
versus − sin(t).
6.2 Construct a matrix of size 100 × 100 of random numbers between 0 and 1. Call
this matrix F.
• Define a matrix G of size 3 × 3 where the matrix elements are
• Take the convolution F ∗ G and plot it as a 2-D grayscale image, and compare
this to the original matrix F.
References
Chapter 7
Recurrent Neural Networks for Time Series Data
Abstract In this chapter, we discuss neural networks that are specifically tailored
to solve problems where the inputs are time series data, i.e., inputs that represent a
sequence of values. We begin with the simplest such network, the basic recurrent
neural network, where the output of a neuron can be an input to itself. We discuss
how these networks function and demonstrate some issues with such a construction,
including vanishing gradients when the input sequences are long. We then develop
a more sophisticated network, the long short-term memory (LSTM) network to deal
with longer sequences of data. Examples include predicting the frequency and shift
of a signal and predicting the behavior of a cart-mounted pendulum.
In this chapter, we build machine learning models for prediction problems where
the dependent variable is influenced by independent variables and the value of the
dependent variable at previous times. A scenario where the data set changes over
time is called a time series. Time series prediction problems are difficult because
they usually require combining information about the history of the dependent
variable with other independent variables. For instance, if we considered a time
series of the temperature at a given location, the temperature 1 h later is likely
influenced by the temperature right now. However, if other variables change, such
as it starts raining or the sun sets, these can have a large effect on the temperature
in 1 h.
Fig. 7.1 Visualization of a basic recurrent neural network. Note that the output of z acts as an
input to z with weight v
Time series prediction problems need to deal with the historical behavior of the
function as well as quantify the influence of other variables. We accomplish this by
making the neural network have recursion: that is, the output of a neuron serves as
an input to itself. This enables the network to learn how to consider the past as well as
other variables in the model. There are several types of recurrent neural networks,
and we begin with the simple recurrent neural network.
In its simplest form, a recurrent neural network (RNN) consists of a single neuron
where the neuron output serves as an input. If zt is the output of the neuron at time
t, and xt is an independent variable at time t, then the neuron would take a value of

z_t = σ(v z_{t−1} + w x_t + b),   (7.1)

and the output of the network is

y_t = o(w_o z_t + b_o),   (7.2)

where o(u) is the output activation function, wo is a weight, and bo is a bias. This
RNN has 5 parameters that need to be trained: w, v, b, wo , and bo .
Another way to picture an RNN is to “unroll” it, that is, to show the recursion by repeating the neural network at each time level, as is shown in Fig. 7.2. In this
depiction, we can directly see the connections between the time levels. Also, we
can see that zt contains some information about the history of the time series. The
neural network learns what information about the history of the time series should
be encoded in the state variable zt so that when combined with xt+1 via the weights
and biases, the correct value of yt+1 can be calculated. Notice that if the weight v is
zero, then this is just a standard feed-forward network.
Of course, it is possible to have more than one recurrent node in a neural network.
Adding further recursion allows the network to have several different variables store
information about the history of the time series. Adding more recurrent nodes is
Fig. 7.2 Unrolled version of a basic recurrent neural network to show three different times and
the connections between them. Note that the weights do not change with time
akin to making the network deeper in the time domain. There are now connections
between the different recursive nodes in the network. A RNN with two recursive
nodes, a single input, and a single output is shown in Fig. 7.3. In the figure, we
can see that the network gets more difficult to visualize as there are recursive
connections and connections between the recursive nodes at different time levels.
Mathematically, the values of the recursive nodes are
The fact that this network has two recursive nodes and they are connected to
each other gives the network the ability to encode two pieces of information about
the time history of the data, but also allows that information to interact with each
Fig. 7.3 Visualization of a recurrent neural network with a single input, x, and two recursive
nodes, z(1) and z(2) . In this case, each of the recursive nodes takes as input both values of z from
the previous time
other. This is indeed a very flexible network, and it becomes more flexible (and more
complicated to draw) as we add more recursive nodes.
It is also possible to create neural networks where recursive nodes are a piece of
a much larger network. For instance, we could have a deep feed-forward network
or a convolutional neural network between the inputs and the recursive nodes.
Additionally, we can have hidden layers between the recursive nodes and the
output layer, for instance, a softmax layer to solve a classification problem. These
additional levels of complexity give the network further flexibility in the types of
relations that the network can model at the potential cost of having a more difficult
model to train.
Training recursive neural networks can be done using back propagation in the
same way that feed-forward and convolutional neural networks are trained. In the
typical training scenario, a single training instance will have T values for xt , i.e.,
(x1 , x2 , . . . , xT ), and a known value for yT . In this case, we train the network with
the inputs from T times and have it produce a prediction for the dependent variable
at time T . If we unroll the RNN T times, then we can perform back propagation
from yT to get the derivative of the error with respect to each of the parameters.
To see how this works, we consider a single training point for the neural network
shown in Figs. 7.1 and 7.2 with linear activation functions for σ (u) and o(u). If we
consider the derivative of yT with respect to v, we get
\frac{\partial y_T}{\partial v} = w_o \frac{\partial z_T}{\partial v}   (7.4)
= w_o \frac{\partial}{\partial v} \left( v z_{T−1} + w x_T + b \right)
= w_o \left( v \frac{\partial z_{T−1}}{\partial v} + z_{T−1} \right)
= w_o \left( v \frac{\partial}{\partial v}\left( v z_{T−2} + w x_{T−1} + b \right) + \left( v z_{T−2} + w x_{T−1} + b \right) \right)
= w_o \left( v \left( v \frac{\partial}{\partial v}\left( v z_{T−3} + w x_{T−2} + b \right) + \left( v z_{T−3} + w x_{T−2} + b \right) \right) + \left( v z_{T−2} + w x_{T−1} + b \right) \right)
= \cdots

Carrying this expansion down to the start of the series (and taking z_0 = 0), the case T = 3 gives

\frac{\partial y_3}{\partial v} = w_o \bigl( 2v(b + w x_1) + b + w x_2 \bigr);   (7.6)

and T = 4 gives

\frac{\partial y_4}{\partial v} = w_o \bigl( 3 b v^2 + 2 b v + b + 3 v^2 w x_1 + 2 v w x_2 + w x_3 \bigr).   (7.7)
We need not continue further. What we observe here is that the derivative of the output with respect to v will have terms containing v^{T−2}. Therefore, as T gets large, that is, when we have long time series, the derivative will contain a large power of v. This means that if |v| > 1, then the magnitude of the derivative goes to infinity as T gets large. Similarly, if |v| < 1 then v^{T−2} goes to zero as T gets large. Typically, when
we train a neural network we start with values of the weights that are close to zero.
Therefore, when we evaluate the derivative of y with respect to v in training the
network, the influence of data at early times is functionally zero when T is large.
This is known as the vanishing gradients problem. It makes it very difficult for the
network to learn to propagate information from the time series from early times to
later times because the early times have little influence on the late time data when
T is large. We will discuss approaches to deal with the vanishing gradients problem
later in the chapter.
Vanishing Gradients
One drawback of RNNs is that the recursion behaves as though we have
added hidden layers to the network. Therefore, when using back propagation
to compute derivatives, the problem can arise that values early in the time
series cannot affect the later time values due to the phenomenon of vanishing
gradients. This causes RNN models that deal with time series at a large
number of points to be difficult to train.
where a and b are unknown constants for a given signal chosen from a standard normal distribution, and ε(t) is a noise term where for each time the value is drawn from a normal distribution with mean 0 and standard deviation of 0.01. To train the RNN, we create 2^20 ≈ 10^6 signals sampled at 10 equally spaced points between 0 and 2π with a spacing between points of Δ = π/9. This training data creates a matrix X of size 2^20 × 10. The dependent variable for this problem is a matrix Y of size 2^20 × 1, which contains the value of y(2π + Δt) without the noise term added. We also construct a test data set containing 2^18 series, about one-quarter of the training data.
The goal here is to train a model that can predict the next value of the series
by only having information about the current and past states of the signal. To do
this, it must learn the frequency of the signal, represented in the constant a, and
the shift in the signal given by b. The first model we build has identical structure
to that in Fig. 7.1: It has a single recursive unit followed by a single output neuron.
This network has 5 parameters that need to be trained. We train the model using
200 epochs and a batch size of 256. After training, the network has an RMS error of
about 0.3. It seems that this RNN is too simple to learn the behavior of these signals.
To better accomplish the task of this example, we try a more complex RNN. This
model has 3 recurrent neurons and a dense layer of 3 neurons after the recurrent
neurons. This network has similar structure to that of Fig. 7.3 except that there are 3
recurrent neurons (instead of 2) and there are 3 neurons between the output and the
recurrent layer. This network performs better than the simple network with an RMS
error of 0.109 on the test set.
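The two networks described above can be sketched with the Keras API (assumed here). The input array is assumed to have shape (number of signals, 10, 1), that is, 10 time points with one feature each; the activation of the intermediate dense layer is an assumption.

```python
# A sketch of the simple and more complex RNNs described above.
import tensorflow as tf
from tensorflow.keras import layers, models

# Simple RNN: one recurrent unit followed by one output neuron (5 parameters).
simple_rnn = models.Sequential([
    layers.Input(shape=(10, 1)),
    layers.SimpleRNN(1),
    layers.Dense(1),
])

# More complex RNN: 3 recurrent units and a dense layer of 3 neurons.
complex_rnn = models.Sequential([
    layers.Input(shape=(10, 1)),
    layers.SimpleRNN(3),
    layers.Dense(3, activation="relu"),   # activation is an assumption
    layers.Dense(1),
])

simple_rnn.compile(optimizer="adam", loss="mse")
complex_rnn.compile(optimizer="adam", loss="mse")
# simple_rnn.fit(X, Y, epochs=200, batch_size=256)   # training call as described above
```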
These networks were trained to predict the eleventh point in a sequence, but we
could use them to generate more points of the signal. For instance, we can use the
prediction of the model and last 9 points to predict the twelfth point in the sequence.
This could be repeated many times and compared with the true value of a signal. We
do this type of repeated prediction to see how the model behaves in extrapolating
from what it has seen into a continued signal. We use the model to predict the next
10 points in the sequence. In practice, this means that we pick values of a and b
and produce a signal at 10 points. We then feed these 10 points into the model to
predict point 11. Then points 2–11 are used to predict point 12, and this is repeated until we have predicted 10 points.
The results from this prediction exercise for a particular case are shown in
Fig. 7.4. In the figure, only the 10 predicted points from each RNN are plotted and
the true signal is also shown. We can see that the first prediction is somewhat close
to the correct value for both models, though it is closer for the more complicated
RNN model. As more predictions are made using the prediction from the network
at previous times, we see that the errors compound. The complex network does a better job: its predicted signal has the shape of a sine wave (though without the correct amplitude), while the simple network does not produce results that correctly capture the features of the true signal.
As noted above, RNNs cannot learn information from the distant past due to
the phenomenon of vanishing gradients. To demonstrate this, we retrain the more
Fig. 7.4 Comparison of a simple RNN with one recurrent neuron and a more complex RNN with
3 recurrent neurons and a dense layer of 3 neurons on predicting the next 10 points in a sequence
Fig. 7.5 Comparison of the more complicated RNN using 5 or 10 training points per signal on the
task of predicting the next 10 points
complex network with only 5 values from the true signal. That is, our training
data only contains half the number of values for each signal. The phenomenon of
vanishing gradients indicates that this network should have similar performance to
the network trained with 10 points per signal because the ability to predict a value
should not depend on values from the signal 10 time levels in the past. Indeed, this is
what we see. In terms of RMS error, the network using only 5 points in the training
has an error of size 0.0989. Furthermore, comparing the networks on the task of
predicting the next ten values shows that the network trained with 5 points performs
slightly better than that trained with 10 points, as shown in Fig. 7.5.
7.2 Long Short-Term Memory (LSTM) Networks
The basic RNN has the problem that via vanishing gradients it is difficult to carry
information from several steps in the past. To address this issue, long short-term
memory (LSTM) [1, 2] networks were developed. This network adds an additional
path of recursion through the network called the “state” that is designed to be passed
over time without being changed much. The network uses the state as well as the
output for the dependent variable in the recursive process so that the network can,
in a sense, keep different kinds of information propagating in time.
To demonstrate the LSTM, we will build up the network one step at a time for
a network with a single state and output. In this way, we can understand all the
connections in the network and the operations that are occurring. We begin with the
network knowing the state, Ct−1 , and output, zt−1 , from the previous step.
The first step uses a forget gate, f_t, computed from the input x_t and the previous output z_{t−1} (with weights w_f and v_f and a sigmoid activation σ_f so that 0 < f_t < 1). The forget gate sets how much of the previous state is retained:

C^f_{t−1} = f_t C_{t−1}.   (7.10)
The second step proposes a new state, C̃_t, which is computed from x_t and z_{t−1} using weights w_p and v_p,
where the subscript p denotes parameters used in the proposal of a new state. We
also denote a sigmoid function σ as a function taking values between -1 and 1 so
that the state can be positive or negative. We then decide how much to weigh this
state when adding it to C^f_{t−1} to get the new state. We call this weight ω and calculate it as

ω = σ_f(w_ω x_t + v_ω z_{t−1} + b_ω).   (7.12)
Once again we have used σf to assure that the weight ω is between 0 and 1. Using
this weight we create the new state by adding the weighted, proposed state to the
previous state after forgetting:

C_t = C^f_{t−1} + ω C̃_t.   (7.13)
The calculations in step 2 are added to the LSTM schematic in Fig. 7.7. In this
figure, we have denoted some of the connections with dashed lines for clarity. In
this step, we produce the new value for Ct . Here xt and zt−1 compute a weight, ω
that is combined via elementwise multiplication with the proposed state, C̃t , which
is also a function of xt and zt−1 . The proposed state is then added to the result from
the forget gate.
In Fig. 7.8 we can see the entire LSTM. The state only influences the future state
as well as the output value of zt . This is in contrast to the inputs xt and the output
from the previous step zt−1 : These influence the forget gate, the proposed new state,
and the output.
The number of parameters in the LSTM model is 4J + 4 + 4 = 4(J + 2). The
4J comes from the weight vectors in the forget gate, the proposal for the new state,
the weighting of the proposed state, and in the calculation of the new output. The
other two terms come from the 4 weights on zt−1 and the four biases.
In introducing the state, the LSTM network removes the vanishing gradients
problem. That is, the state can be largely unchanged by the network from time
level to time level without the value of the state going to zero exponentially with
the number of time levels. To see this, we consider a network where the weights are
such that ω = 0. In this case, the value of Ct will be
Ct = ft Ct−1 . (7.15)
Applying this relation repeatedly over k time levels gives

C_t = f_t f_{t−1} C_{t−2} = \cdots = C_{t−k} \prod_{k'=0}^{k−1} f_{t−k'}.   (7.16)
Therefore, we can compute the derivative of Ct with respect to Ct−k using the
product rule to get
\frac{\partial C_t}{\partial C_{t−k}} = \prod_{k'=0}^{k−1} f_{t−k'} + C_{t−k} \frac{\partial}{\partial C_{t−k}} \prod_{k'=0}^{k−1} f_{t−k'}.   (7.17)
The first term is just a product of the k forget gates between time levels t and t − k. These are not necessarily the same number; therefore, it is possible to train the network so that this term does not vanish for large k. This is in contrast to the basic
RNN where powers of a weight v appear so that terms of the form v k go to zero or
to an infinite magnitude as k gets large.
In the description above, there was a single LSTM “unit” in the network. In principle
there could be several units, each with its own state and output. In this case, the
inputs to the network would be modified so that zt and Ct would be vectors of the
same length as the number of units. This would require a vector of weights v for
each operation involving z, e.g. in the forget gate vf . The other operations would be
executed elementwise, i.e., σ (Ct ) would be a vector of σ applied to each entry in
Ct . When the number of units increases to N units, the number of free parameters
is then 4(JN + N² + N).
As with the basic RNN, we can connect other layers between the independent
variables and the LSTM units and/or place layers between the outputs zt and the
actual dependent variables. Additional layers added after the LSTM module will
not typically take the state, Ct , as an input.
There are two important variants of LSTM that we mention in passing. The first
is the peephole LSTM [3] where the Ct−1 is also an input to the forget gate and the
weight ω, and the value of the new state Ct is an input to the neuron in Eq. (7.14).
This allows the terms that construct the state and the output to have knowledge of
the state or to peek at the state and use it to influence the results. The other variant is
the gated recurrent unit (GRU) [4], which acts in many ways like an LSTM unit, but
only has a single output rather than adding the state. This output is recursive in a way
that uses a weight similar to the forget gate so that information can be propagated
for long times. Because the GRU does not add the extra degree of freedom of the
state, these networks are typically simpler than LSTM models.
We can apply LSTM models to the problem of estimating the functions of the form
of Eq. (7.8). We train a model that takes 10 data points as input and predicts the next
10 points. In this case, the LSTM model can be thought of as having 10 outputs to
predict. The model we build has a layer with 16 LSTM units. From these 16 units,
the zt are fed into a fully connected layer of 10 hidden neurons with the ReLU
activation function. These 10 nodes then connect to an output layer of 10 nodes,
one each for the 10 output values. We train this network using the same data as in
the RNN example. After training the model, we find that the mean-absolute error is
0.0323 on the training data and 0.0331 on the test data.
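To make this construction concrete, the following is a minimal sketch of how such a model could be assembled, assuming the Keras interface to TensorFlow (the text does not prescribe a particular library); the choice of the mean-absolute error as the training loss and the Adam optimizer are our assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# Sequence model sketch: 10 previous signal values in, the next 10 values out.
model = keras.Sequential([
    layers.Input(shape=(10, 1)),          # 10 time steps, one value per step
    layers.LSTM(16),                      # the layer of 16 LSTM units
    layers.Dense(10, activation="relu"),  # fully connected layer of 10 hidden neurons
    layers.Dense(10),                     # one output node per predicted time point
])
model.compile(optimizer="adam", loss="mae")
# X_train would have shape (n_cases, 10, 1) and y_train shape (n_cases, 10):
# model.fit(X_train, y_train, epochs=..., validation_split=0.2)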
In Fig. 7.9 we show 3 validation cases for the LSTM model. These results
indicate that the model is able to correctly predict the behavior of the signal, including
points where the function passes through a minimum, a property that the basic RNN
models had difficulty with. We also fit another model, this time using 15 values
from the sequence as input and predicting the next 5 values in the sequence as
output. RNN networks would be expected to do worse on this problem because
of the vanishing gradients problem. In the case of the LSTM, adding more points
improves the model performance, as shown in Fig. 7.9b.
It is clear that a neural network with LSTM nodes is able to solve this problem.
In future chapters, we will see that these techniques can be used in conjunction with
other advanced techniques to create powerful models for science and engineering
problems.
Fig. 7.9 Performance of the model with 16 LSTM units applied to the problem of predicting the next
values of a sequence given the first 10 or 15 values. (a) Length 10 input sequence; predict 10.
(b) Length 15 input sequence; predict 5
The system we want to model with recurrent neural networks is the cart-mounted
pendulum. We have a mass m at the end of a rigid pendulum. The other end of the
pendulum is attached to a cart of mass M that can move to the left or the right. The
cart is on an elevated track so that the mass can swing below the cart. As the mass
swings the cart moves as the center of mass of the system changes. This system is
shown in Fig. 7.10.
The goal of our model will be to take data from the pendulum system for the
position of the pivot point, x, and the angle of the pole θ , for a system with unknown
masses, M and m, and length ℓ, and to predict the future behavior of the pendulum.
To produce data to train our machine learning model, we will integrate the equations
of motion for this system.
Equations of Motion for the Pendulum
As derived in [5], the equations of motion for the cart-mounted pendulum system
are given by the second-order differential equations:
\ddot{x} = \frac{m \sin\theta \left( \ell \dot{\theta}^2 + g \cos\theta \right)}{M + m \sin^2\theta}, \qquad (7.18)
Fig. 7.10 Schematic for the inverted pendulum. A mass is connected to the cart by a rigid arm;
the cart rolls freely as the mass swings below it.

We define the pendulum angle to be in the range θ ∈ [−π, π]. The length of the
pendulum is ℓ. We use a numerical integrator to update the position and velocity of the
cart, (x, ẋ), and the angle of the pendulum and its angular velocity, (θ, θ̇), to get new
values of the system state at time t + Δt.
This allows us to virtually operate the system by specifying an applied force and
seeing what happens to the cart and pendulum. The masses and length, M, m,
and ℓ, were selected from a uniform distribution between 0.5 and 1.5 (in units of
kilograms or meters), and the acceleration due to gravity g was sampled from a normal
distribution with mean 9.81 m/s² and standard deviation of 0.005 m/s². The initial
states had θ uniformly sampled between −π and π, with x, ẋ, and θ̇ drawn from
uniform distributions between −1 and 1.
For training data for our model, we sampled 2^13 = 8192 different systems and
initial states. The state of each system was recorded at intervals of 0.1 s between t =
0 s and 10 s, for 100 data points per system. From these data, we created a training
set by randomly selecting a system and a starting point in that system's time series.
From that time point we select 60 values for the position x and angle θ, using the first
40 values of each as inputs and the last 20 time points of x and θ as the dependent
variables that we want our model to predict. We add to the dependent variables the
values of m, M, ℓ, and g to make the total number of dependent variables 44 =
2 × 20 + 4. We sample 2^21 = 2,097,152 such time series and system parameters for
training data (i.e., our training data has size 2^21 × 40 × 2 for inputs and 2^21 × 44 for
outputs) and 2^19 time series as test data.
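A sketch of how these training windows might be assembled from the stored trajectories is given below; the array names (states, params) and the use of NumPy are our own choices and are not specified in the text.

import numpy as np

# Hypothetical arrays: states has shape (8192, 100, 2), holding (x, theta) at the
# 100 recorded times for each simulated system; params has shape (8192, 4),
# holding (m, M, l, g) for each system.
rng = np.random.default_rng(0)

def sample_case(states, params, n_in=40, n_out=20):
    """Return a (40, 2) input window and the 44 dependent variables
    (20 values of x, 20 values of theta, and the 4 system parameters)."""
    i = rng.integers(states.shape[0])                        # pick a system at random
    t0 = rng.integers(states.shape[1] - (n_in + n_out) + 1)  # pick a start time
    window = states[i, t0:t0 + n_in + n_out]                 # 60 consecutive points
    x_in = window[:n_in]                                     # first 40 points as inputs
    y_out = np.concatenate([window[n_in:, 0],                # next 20 values of x
                            window[n_in:, 1],                # next 20 values of theta
                            params[i]])                      # m, M, l, g
    return x_in, y_out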
Fig. 7.11 Comparison of the LSTM model with the actual cart-mounted pendulum behavior as a
function of time for four test cases, (a) through (d). The first 40 time points are inputs to the model,
and the model predicts the next 20 points for both x and θ
The model we use consists of an LSTM layer with 16 units connected to another
LSTM layer with 16 units. The second LSTM layer is connected to an output layer
with 44 nodes corresponding to the 44 dependent variables (20 each for x and θ
and the 4 parameters m, M, ℓ, and g). Before training, we normalize the input data
by subtracting its mean and dividing by its standard deviation for each time point.
We do the same normalization for the 44 dependent variables. After 40 epochs of
training using the Adam optimizer, we obtain a mean-absolute error of 0.0714 on the
training data and 0.0703 on the test data. In terms of the normalized data, this means
that the model prediction is on average within about ±0.07 standard deviations from
the correct value.
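A sketch of how this two-layer LSTM model and the per-time-point normalization might look follows, again assuming Keras; stacking LSTM layers requires the first layer to return its full output sequence, and the use of the mean-absolute error as the loss is our assumption.

from tensorflow import keras
from tensorflow.keras import layers

# Two LSTM layers of 16 units and a 44-node output layer (20 x values,
# 20 theta values, and the four parameters m, M, l, g).
model = keras.Sequential([
    layers.Input(shape=(40, 2)),              # 40 time points of (x, theta)
    layers.LSTM(16, return_sequences=True),   # pass the full sequence to the next layer
    layers.LSTM(16),
    layers.Dense(44),
])
model.compile(optimizer="adam", loss="mae")

# Per-time-point normalization of the inputs (X has shape (n_cases, 40, 2)):
# X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
# and similarly for the 44 dependent variables.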
In Fig. 7.11 results for 4 different test cases are shown. These cases span a variety
of phenomena, and our model seems to be able to handle each of them (with varying
degrees of success). In the case in Fig. 7.11a, the pendulum makes several full
revolutions. This makes the value of θ discontinuous as it switches from −π to π
repeatedly. The model tries to smoothly match this transition, and does a reasonable
job, but this smoothed transition in θ causes the model to predict an incorrect response
in x. In hindsight, perhaps we could have tried to predict cos θ, for example, to have
a continuous variable. This does highlight how selection of input and dependent
variables can affect the model performance.
At the opposite extreme, we have Fig. 7.11d where the pendulum does not
oscillate much at all. In this case, the model correctly predicts that the position
of the cart continues along the linear trajectory established in the input. The most
common case in the training data is similar to the time series in Fig. 7.11b, c. These
results have the pendulum oscillating from positive to negative angles and pulling
the cart in one direction. For both of these cases, the model correctly predicts the
path of the cart and the frequency of the oscillation.
Another way to test the model is to apply it repeatedly to predict longer time
series. In this case, we feed the model 40 time points to predict the next 20 time
points. We could then use the 20 predicted points (along with 20 of the original
inputs) as inputs to the model to get time points 21 through 40. Repeating this
process, we could make the time series as long as we wanted. We would expect
a degradation of the model performance because the inputs to the subsequent model
evaluations will have errors in them. We can observe this phenomenon in Fig. 7.12.
The model can predict the overall behavior of the cart and pendulum, but as the time
gets farther away from the true input data, the errors in the prediction compound to
give noticeable discrepancy between the model and the true behavior. In this case,
it might be better to apply the approach from Sect. 2.5 and directly estimate the
equations of motion and use the derived model to extrapolate the behavior of the
system.
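A sketch of this repeated application is shown below; the model and array layout match the description above, and the names we introduce (rollout, first_window) are hypothetical.

import numpy as np

def rollout(model, first_window, n_blocks):
    """Repeatedly apply a model that maps 40 points of (x, theta) to the next 20.
    first_window has shape (40, 2); returns the predicted trajectory."""
    window = first_window.copy()
    blocks = []
    for _ in range(n_blocks):
        pred = model.predict(window[np.newaxis])[0]             # 44 outputs
        next20 = np.stack([pred[:20], pred[20:40]], axis=-1)    # (20, 2): x and theta
        blocks.append(next20)
        # Keep the last 20 original inputs and append the 20 predictions.
        window = np.concatenate([window[20:], next20])
    return np.concatenate(blocks)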
The topic of RNNs and other methods for time series could be an entire book-length
discussion in itself. There are techniques that originated in machine translation
called sequence-to-sequence (seq2seq). These models train an encoder and decoder
where the encoder takes the original sequence, converts it to a hidden representation
much like the LSTM, and then the decoder network takes that hidden representation
and converts it to the output sequence. The input and output sequences can be of
arbitrary length, and this is one of the benefits of these approaches. One variant of
the seq2seq approach that has found widespread use in machine translation is the
attention network [6] that allows the network to find what parts of the inputs are
most important to decoding a specific output. The discussion above of RNNs and
LSTMs is a good foundation for further exploration in this area.
Problems
7.1 Construct a training data set of 10^5 examples from the logistic map
x_{n+1} = r x_n (1 - x_n),
m \frac{d^2 y}{dt^2} + 200 \frac{dy}{dt} + 10000 \left(1 + 5000 y^2\right) y = 10000 \sin(2\pi \nu t),
with initial condition y(0) = 0 and ẏ(0) = 0. Using a numerical integrator, generate
2000 examples from this equation at 1000 times between t = 0 and 1 with m
randomly selected between 0.5 and 1.5 and ν randomly selected between 15 and
25. From these examples, select 10^5 training cases where one of the 2000
examples is selected randomly and a start time is selected randomly. From the start
time, collect 20 consecutive time points as the input and 10 time points as the output.
Build an LSTM neural network to predict the output as a function of the input.
7.4 Recurrent neural networks can be used with image data as well. Using the
Fashion MNIST data set, build an RNN or LSTM network that takes as input each
row from the image starting at the top. That is, this network will take in 28 inputs,
over 28 time steps, to have the entire image as input. The output of the network will
be a softmax layer that identifies which class the image belongs to. This problem
has the network reading in the image as though it were a time series to determine
the class it belongs to.
References
1. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
2. Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction
with LSTM. 1999.
3. Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent
neural network architectures for large scale acoustic modeling. 2014.
4. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation
of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555,
2014.
5. Russ Tedrake. Underactuated robotics: Learning, planning, and control for efficient and agile
machines course notes for MIT 6.832. 2009.
6. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
7. Joseph M Powers and Mihir Sen. Mathematical Methods in Engineering. Cambridge University
Press, 2015.
Chapter 8
Unsupervised Learning with Neural
Networks: Autoencoders
To this point we have only discussed neural networks for supervised learning
problems where we have a target dependent variable. In this chapter, we look at
how we can use neural networks for unsupervised learning problems to find reduced
representations of data or other structure. We discussed unsupervised learning
before in Chap. 4. In this chapter, we extend those ideas using neural networks
to get more powerful, though less explainable, methods. By less explainable, we
mean that the results from a neural network-based method may not have the direct
explanation that, for instance, the singular value decomposition has in terms of
linear combinations of variables. Nevertheless, the resulting reduction in the size
of the data can be impressive. We begin with a nonlinear analog to the singular
value decomposition: the autoencoder.
The idea of an autoencoder is simple: we try to learn an identity mapping, i.e.,
a function that returns its argument, f(x) = x, except that the function has reduced the
complexity of the input data in some form. For instance, if we consider x having N
components, a neural network that had a single hidden layer with N/2 nodes and
N output nodes would have effectively reduced the data by nearly a factor of 2.
This network (N inputs, N/2 hidden units, and N output nodes) is visualized in
Fig. 8.1. This figure shows a very simple autoencoder. The first part of the network
(going from N inputs to the N/2 hidden units) is called the encoding layer or
encoder. The second half (going from the N/2 hidden units to the N output nodes)
is called the decoder. When we train this network, we adjust the weights and biases
in the network so that the encoded values, that is, the values of the hidden units, give
back the original data when decoded. In this network, there are (N/2 + 1)N weight
and bias parameters in each of the encoding and decoding layers.
We call the values of nodes in the middle layer, z(x), the latent variables or the
codes. The basic idea is that if such a network can be trained we can store only
the codes for our data, and use the decoder when we need the full representation of
the data. In this sense, we have compressed the input data to the codes. After we
train such an autoencoder, assuming the error in the map f(x) = x is low, then we
only need to store the N/2 encoded values for our data x and can use the decoding
layer to reproduce the full vector when needed. Therefore, if we have M vectors of
data we need to store M ∗ N/2 numbers, that is, the codes for each case, plus the
(N/2 + 1)N weights and biases in the decoding layer. If M ≫ N, then we have
effectively reduced our storage requirement in half.
Of course, if a single hidden layer can be used as an autoencoder, we could add
further hidden layers to the network to try to get better compression. These networks
typically have an hourglass shape. In the encoding layers the width, or number of
nodes in a layer, reduces as one gets deeper. A typical pattern would be N input
nodes, N/2 nodes in the first hidden layer, N/4 in the subsequent hidden layer,
and a reduction by a factor of two in each additional layer. In the center of the
network will be a layer that contains the encoded variables. The decoder layer is
typically symmetric about the encoding layer with more nodes being added as one
moves from the encoded variables to the N output nodes. For instance, the decoding
layer could double the number of hidden units in each subsequent layer. There is no
requirement that the network must be symmetric or that there needs to be a constant
factor reduction/increase between the layers, but it has been observed that these
simple structures seem to work in practice.
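As a concrete illustration, a fully connected hourglass autoencoder of this type might be written as follows, assuming Keras; the value of N, the choice of ReLU activations, and the loss function are illustrative rather than taken from the text.

from tensorflow import keras
from tensorflow.keras import layers

N = 256  # size of each data vector; illustrative value

# Hourglass autoencoder: the width halves in each encoder layer and doubles in
# each decoder layer; the narrowest layer holds the codes.
inputs = keras.Input(shape=(N,))
h = layers.Dense(N // 2, activation="relu")(inputs)
h = layers.Dense(N // 4, activation="relu")(h)
codes = layers.Dense(N // 8, activation="relu", name="codes")(h)
h = layers.Dense(N // 4, activation="relu")(codes)
h = layers.Dense(N // 2, activation="relu")(h)
outputs = layers.Dense(N)(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mae")
# Train the network to reproduce its input: autoencoder.fit(X, X, ...)

# The encoder alone maps data to codes and can be extracted after training:
encoder = keras.Model(inputs, codes)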
A deep autoencoder with this symmetric structure is shown in Fig. 8.2. The
network reduces in the width of the hidden layers in the encoder until it reaches
the encoded variables that are the values of z_k^{(2)}(x). The network then expands back
Fig. 8.1 Schematic of an autoencoder with a single hidden layer. In this case, there are N inputs
that are encoded into K degrees of freedom in the hidden layer, which are then decoded back into
the N original inputs
to the size of the original data in the decoder. Though the network is symmetric, the
weights in the encoder and decoder are not necessarily the same.
Once we have built an autoencoder, we can use the encoded variables to gain
insight into the structure of the data. For example, we could look at those cases that
have extreme values in each of the encoded variables. This can give us an idea of
what the encoded variables mean, such as whether they are encoding particular features
of the data. We must be careful, however, because there are nonlinearities in the
encoding/decoding process that could make that inference difficult.
Fig. 8.2 Schematic of an autoencoder with multiple hidden layers. This particular network is
symmetric about the encoded variables in that the encoder and decoder have the same structure
In addition to the encoded values, we need to store the parameters in the decoder. We built two networks with ℓ = 2, 4. The
required storage of this information is then 12.1% and 12.2% of the full data set for
ℓ = 2 and ℓ = 4, respectively. The percentage did not change very much because
most of the storage is for the parameters in the decoder.
We trained the autoencoder over 500 epochs with a batch size of 32 for both
cases and used the entire data set for training. The trained model produced a mean-
absolute error for the data set of 0.0104 and 0.00557 for the ℓ = 2 and ℓ = 4 models.
If we look at some representative cases from the training data, see Fig. 8.3, we
observe that the autoencoded values are capturing the reflectance and transmittance
spectra for the leaves well. In the figure, we can see that the autoencoder with a
latent space of size ℓ = 2 does have trouble getting the correct level of some of the
features, but overall it captures the shape of the spectra. The autoencoder with latent
space of size ℓ = 4 does appear to have an overall better accuracy, as we would
expect from the mean-absolute error numbers.
From the autoencoder, we can try to interpret what the latent space variables
mean. In the case of a latent space of size two, we can make a scatter plot of all
of the spectra with the latent space variables denoted as z1 and z2 . This scatter
plot is shown in Fig. 8.4a where the points have the color associated with each
spectra, as discussed in Sect. 4.5. We can see there are two clusters that appear
in the scatter plot, with most of the points centered in the top left of the figure.
If we use the decoder, we can explore how changing a latent variable changes the
resulting spectrum. For instance, in Fig. 8.4b we see that increasing z1 increases
the overall spectrum level when we keep z2 fixed. However, we can also see that
the effect is nonlinear: The shape of each resulting spectrum is different and is not
just a scaling as we would see in a singular value decomposition of the data.
Figure 8.4c, d also display this nonlinear behavior when the values of z2 are varied
with z1 fixed or when both are varied together. This nonlinear behavior is what
allows an autoencoder to have such a small latent space that can still represent the
data well.
Fig. 8.3 Comparison between the autoencoder results using latent spaces of size 2 (dash-dot
lines), size 4 (dashed lines), and the original data (solid lines) for the leaf reflectance and
transmittance data. (a) Reflectance. (b) Transmittance
Fig. 8.4 The results from the autoencoder with latent space of size 2: (a) all the spectra (both
reflectance and transmittance) on a scatter plot based on the values of the two latent space variables.
The color of each point corresponds with the perceived color and the lines denote the path for
the other three plots. (b–d) Plots of resulting spectra from feeding into the decoder values in unit
increments along the lines shown in part (a)

In a strided convolution, we do not center the convolution at every pixel; rather, we skip some. The stride tells the convolution how many pixels
to move in each direction when applying the convolution. Therefore, a stride of (1,1)
does not skip any pixels, but a stride of (2,2) will skip every other row and column
in the image. This is shown in Fig. 8.6 where a 9 × 9 input is convolved with a 3 × 3
kernel and stride (2,2) with zero padding of the input. Every input pixel contributes
to the output, but every other row/column is skipped when moving the kernel over
the input. This results in a reduced size for the output, and, as a result, a reduction
of the data. The encoder combines several strided convolutional layers to compress
the image into a set of code images, or latent images, in the middle of the network.
To expand the data from the code images to the original image, we apply
convolutions in reverse through transposed convolutions. Transposed convolutions
act in the opposite way of a convolution: A convolution takes an image and
computes a linear combination of the pixel values using a kernel to get a new
value. The transposed convolution takes an image and distributes those values to
a set of new pixels with weights defined by the kernel. In this sense, the transposed
convolution works backwards to the standard convolution. Furthermore, we can use
a stride in the transposed convolution to generate a larger image in the output. In a
strided, transposed convolution, we skip some of the output pixels around which
the convolution is centered. By applying several layers of transposed convolutions, we
expand the code images back to the original size.
Fig. 8.5 Schematic of an autoencoder with multiple hidden layers. This particular network is
symmetric about the encoded variables in that the encoder and decoder have the same structure
As with the fully connected autoencoder, if we can train a network that can
reproduce the original data with acceptable accuracy, then we can store the coded
images and use the decoder to reproduce the original data as needed. This could be
useful in data transmission where rather than sending full images across a network,
we send the smaller coded images and the receiver uses the decoder to reproduce
the full image. Additionally, as we will see later, it is possible to use the codes as
input to another neural network for a supervised learning problem. Using the codes
as input reduces the input dimensionality to the problem and may make the network
easier to train.
Convolutional Autoencoder
Convolutional autoencoders use convolutions to compress data, typically
images, into a set of codes in a similar manner to the fully connected
autoencoder.
• The encoder uses strided convolutions that only center the convolution at a
fraction of the original inputs to reduce the data.
• Typically, while the size of the data is being reduced, extra channels are
added.
• The decoder takes the codes and performs strided, transposed convolutions
to project the reduced data contained in the codes back to the original size
of the data.
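The shape bookkeeping for a strided convolution and its transposed counterpart can be seen in a small sketch like the one below, assuming Keras layers and zero ("same") padding; the image size and channel counts are illustrative.

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# A 3x3 convolution with stride (2,2) and zero padding halves each image
# dimension; the transposed version doubles it back.
x = np.zeros((1, 28, 28, 1), dtype="float32")           # one 28 x 28 image, 1 channel

down = layers.Conv2D(8, kernel_size=3, strides=2, padding="same")
up = layers.Conv2DTranspose(1, kernel_size=3, strides=2, padding="same")

codes = down(x)     # shape (1, 14, 14, 8): smaller images, more channels
recon = up(codes)   # shape (1, 28, 28, 1): back to the original size
print(codes.shape, recon.shape)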
Fig. 8.6 Depiction of a 3 × 3 convolution with stride (2,2) using zero padding. The output is a
5 × 5 matrix whereas the input was 9 × 9
Fig. 8.8 Demonstration of the noise removal properties of the autoencoder. For each original item
(left image), we added random noise (middle image) and passed the noisy image through the
autoencoder to get the third image. We can see that the autoencoder successfully removes the noise
around the item and improves the image

We should step back and think about what this model is doing. For the encoder, it
is learning a set of convolutions to apply to the original image to create a smaller set
of images that contain all of the information in the original image. This is possible
because the original images have structure to them: The images are not just random
pixels. Then the decoder learns a set of nonlinear transformations to turn the codes
into the original image. Both of these tasks, encoding and decoding, are being
learned at the same time because our goal is to have the result of the decoder be
as close to the original image as possible.
Beyond compressing the data, autoencoders can remove noise from an image. In the
fashion MNIST example, we trained the network with images without noise. If we
pass a noisy image through the network, the resulting image can have less noise.
This is because the random noise in the image is removed by the encoding process.
We demonstrate this effect by adding Gaussian noise with mean zero and
standard deviation equal to 10% of the total pixel range to example images from
the fashion MNIST data set. We then pass the noisy image through the autoencoder.
As we see in Fig. 8.8, the result from the autoencoder effectively removes noise
from the image, especially the noise that corrupts the whitespace around the item.
This is a beneficial side effect of the autoencoder being trained to find only the
important features in the image for reconstruction. We could improve this noise
filtering property by training the autoencoder with noisy inputs but with target
images that do not have noise. This would be useful if one expects to use the
autoencoder with noisy data.
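A sketch of this test is shown below; images and autoencoder are hypothetical names for the Fashion MNIST images (scaled to the range 0 to 1) and a trained autoencoder.

import numpy as np

# images: array of shape (n, 28, 28, 1) with pixel values in [0, 1];
# autoencoder: a trained model as sketched earlier. Both are assumed to exist.
rng = np.random.default_rng(1)
noisy = images + rng.normal(0.0, 0.1, size=images.shape)   # noise at 10% of the pixel range
noisy = np.clip(noisy, 0.0, 1.0)
denoised = autoencoder.predict(noisy)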
Autoencoders have found use in reducing the dimensionality of scientific simulation
data so that the reduced representation can be used as input to other
machine learning models [1, 2]. We will demonstrate how this works using the
simulations of laser irradiated targets. The target, called a hohlraum, is made of
gold and has the shape given in Fig. 8.9. This shape is loosely based on the Apollo
hohlraum used in opacity experiments [3, 4].
Fig. 8.10 Scatter plots of the 64 values of the four parameters (overall scale, asymmetry, interior
scale, and laser pulse scale) defining the hohlraum geometry and the laser pulse for the simulation
data set
Fig. 8.11 Depiction of the radiation temperature in the cavity at several times showing the cavity
filling with radiation and then cooling after the laser turns off. The horizontal axis is the radial
coordinate and the vertical axis is z
From the 64 simulations and 500 time steps for each, we have 500 × 64 = 32 000
2-D arrays for the radiation temperature in the hohlraum. Looking at Fig. 8.11, we
can see that there is an obvious structure to the outputs from the simulation. We
would like to understand if there is a reduced representation of the data that can
be found with a convolutional autoencoder. If we are able to find a suitable latent
space representation of the data, we will use these latent variables as the inputs to
an LSTM network to predict the time history of the radiation temperature in a given
simulation.
The data that we use to train the model consists of 80% of the 32 000 time steps
from the simulation outputs. We also pad the images to be of size 128 × 80. This
is done so that we can repeatedly reduce the image in each dimension by a factor
of 2. Furthermore, we compute the maximum value of the radiation temperature in
each image and store this value. We then normalize each image by dividing by the
maximum radiation temperature. This makes every pixel of each image in our set
have a value between 0 and 1.
The autoencoder network we train has a similar structure to that used for
the Fashion MNIST data. The encoder portion of the network consists of 3
convolutional layers that compress the original image from 128 × 80 = 10 240
to 8 images of size 16 × 10 for a total code size of 8 × 16 × 10 = 1 280 or about
12.5% of the original size. The decoder portion of the network also has three layers
to increase from the codes back to the original image size.
In detail, the encoder portion of the network contains
1. An input layer of shape 128 × 80,
2. A convolutional layer with kernel size of 3 × 3, a stride of 2 in each direction,
and 32 channels. This outputs 32 images of size 64 × 40;
3. A convolutional layer with kernel size of 3 × 3, a stride of 2 in each direction,
and 16 channels. This outputs 16 images of size 32 × 20;
4. A convolutional layer with kernel size of 3 × 3, a stride of 2 in each direction,
and 8 channels. This outputs 8 images of size 16 × 10.
The result of this final layer is the encoded data. The size of this data is 16×10×8 =
1280; this is about 14% the size of the original 120 × 74 image. For the decoder, we
have
1. A transposed convolutional layer with kernel size of 3 × 3, a stride of 2 in each
direction, and 16 channels. This outputs 16 images of size 32 × 20;
2. A transposed convolutional layer with kernel size of 3 × 3, a stride of 2 in each
direction, and 32 channels. This outputs 32 images of size 64 × 40;
3. A transposed convolutional layer with kernel size of 3 × 3, a stride of 2 in each
direction, and 1 channel. This outputs an image of size 128 × 80.
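Under the assumption that the model is built with Keras, the encoder and decoder layers listed above might be assembled as follows; the "same" padding, the ReLU activations, and the loss function are our assumptions rather than details given in the text.

from tensorflow import keras
from tensorflow.keras import layers

# Encoder: 128 x 80 image -> 8 code images of size 16 x 10.
inputs = keras.Input(shape=(128, 80, 1))
h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # 64 x 40 x 32
h = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(h)       # 32 x 20 x 16
codes = layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(h)    # 16 x 10 x 8

# Decoder: codes -> 128 x 80 image.
h = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(codes)  # 32 x 20
h = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(h)      # 64 x 40
outputs = layers.Conv2DTranspose(1, 3, strides=2, padding="same")(h)                    # 128 x 80

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mae")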
One benefit to the codes from the autoencoder is that they contain the information
from the original data. Therefore, rather than using the original data as input to a
prediction model, we can use the codes as a reduced image. Given that the size of the
input data is reduced, it is likely that fewer training examples and/or simpler models
can be used too. In our case, we would like to use the results from a few initial time
steps to predict the later time behavior of the radiation temperature using an LSTM
model (see Chap. 7). Without using the autoencoder, we would need to attempt this
using the full data (the 120 × 74 images) from each time step, or we could try to use
domain expertise, such as knowledge of the physics of the simulation, to reduce the
data [10].

Fig. 8.12 Application of the convolutional autoencoder to an image from the hohlraum simulation
data. The autoencoder layers are arranged vertically: The original output is on top, followed by the
3 encoder layer outputs, and the 3 decoder layer outputs. The small block of 4 images are the
codes, or latent variables, for the autoencoder. The images are shown with a color scale where
black represents a value of 1 and white is a value of 0

Fig. 8.13 Demonstration of the noise removal properties of the autoencoder for the hohlraum
data. For each item we added random noise (middle image) and passed the noisy image through
the autoencoder to get the third image. We can see that the autoencoder successfully removes the
noise around the item and improves the image when compared to the original on the left. The
images are shown with a color scale where black represents a value of 1 and white is a value of 0
With the autoencoder we trained above, we can train an LSTM model to take a
sequence of the codes, each of size 8 × 16 × 10, and then predict future values for
the codes. Using the decoder, we can then convert the output of this model into
the radiation temperature at the full 120 × 74 size. The LSTM model
will also have to predict the maximum value for the radiation temperature so that
we can undo the normalization.
Our model is designed to take the codes corresponding to the simulation data and
the maximum value of the radiation temperature at ten time points, each spaced 20
time steps apart. It will predict the codes and the maximum values of the radiation
temperature at ten subsequent times, separated by 20 time steps. The total number
of inputs to the model is (16 × 10 × 8 + 1) × 10 = 12 810 and an equivalent number
of outputs. The model consists of two connected LSTM units each with 16 states.
The second LSTM unit is connected to two dense layers each with 10 neurons using
the ReLU activation function. The output layer has 12 810 units to give codes and
normalization constants at ten time levels. From this data, we can apply the decoder
to get the full image.
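A sketch of such a model, assuming Keras and assuming that the codes and the maximum temperature at each time point are flattened into a single vector of length 1281, is given below; the loss function is our assumption.

from tensorflow import keras
from tensorflow.keras import layers

# Each time point supplies the 8 x 16 x 10 = 1280 code values plus the maximum
# radiation temperature, flattened into a vector of length 1281; the model sees
# 10 such time points and predicts 10 more (12,810 output values in all).
model = keras.Sequential([
    layers.Input(shape=(10, 1281)),
    layers.LSTM(16, return_sequences=True),
    layers.LSTM(16),
    layers.Dense(10, activation="relu"),
    layers.Dense(10, activation="relu"),
    layers.Dense(12810),
])
model.compile(optimizer="adam", loss="mae")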
We select at random 10^5 starting times and use 80% of these as training values.
We train the model for 100 epochs with the Adam optimizer and obtain a mean-
absolute error of 0.0176 for the training data and 0.0177 for the test data. Predictions
from two different time series for the predicted and actual radiation temperatures
are shown in Fig. 8.14. In the figure, we can see that the model is able to predict the
behavior for different regimes of evolution. In the top case, the laser is still on and
the temperature is not decreasing in the early frames, and the bottom case depicts
after the laser has turned off and radiation is leaving the hohlraum.
One feature we notice in the results from the model is that the edge of the
hohlraum in the narrow part of the cavity has a noticeable amount of error.
This is especially noticeable in the first frame of the lower example in Fig. 8.14.
Nevertheless, it appears on the scale of the figure that away from the boundary there
is agreement between the predicted and actual values.
Figure 8.15 shows the values of the radiation temperature at the center of the
narrow part of the cavity for four different test cases. The behavior of the radiation
temperature has different shapes depending on the starting time, geometry, and laser
pulse length. We observe that the model is able to predict the shape of the behavior,
but there is some inaccuracy in the numeric values. These results demonstrate that the
model would be useful for analyzing hohlraum design and predicting the radiation
temperature based on the geometry and laser pulse parameters. Once a design was
selected, a full simulation could be performed to get more accurate numerical values
for the temperature.
Unsupervised learning with neural networks is a broader topic than just autoencoders.
Indeed, there are variations of autoencoders, called variational autoencoders,
that attempt to find not just latent variables but the distributions that govern
these variables. The background required to understand these tools would take us too
far afield for our study, but interested readers can see the tutorial by Doersch [11].
The benefit of a variational autoencoder is that one can sample from the distribution
of latent variables to produce new cases. In the case of Fashion MNIST, one could
generate new items of clothing.
Other types of unsupervised learning techniques include Boltzmann machines
and Hopfield networks. These approaches to understanding structure in data were
introduced many years ago and have not been as widely adopted as autoencoders.
The restricted Boltzmann machine has been used to create generative networks that
given a set of data can learn the probability distribution of its inputs. Restricted
Boltzmann machines can also be used to initialize autoencoders and speed up
training [12].
Fig. 8.14 Comparison of the actual and predicted values for the evolution over 10 time levels of
the radiation temperature in two cases. In each case, the top row contains the values predicted by
the LSTM model and passed through the decoder, and the lower row is the actual values of the
codes passed through the decoder model
Fig. 8.15 Values for the predicted and actual values for the radiation temperature in the middle of
the narrow part of the cavity for ten time values. There are four different time series shown

Problems
the original signal. Theoretically, the autoencoder should be able to have a middle
layer that only has two nodes because there are only 2 degrees of freedom in the
data you created. How close to this theoretical limit can you get?
8.2 This is a variation on the previous problem, but using convolution. Produce data
from the function sin(ar + b)/r, where r = \sqrt{x^2 + y^2}, for values of x and y given
by −0.95, −0.9, . . . , 1 and values of a and b selected at random between −1 and 1.
Each case will be an image of size 40 × 40. Apply a convolutional autoencoder and
see how well you can compress the data.
References
1. Kelli D Humbird, J Luc Peterson, and Ryan G McClarren. Parameter inference with deep
jointly informed neural networks. Statistical Analysis and Data Mining: The ASA Data Science
Journal, 12(6):496–504, 2019.
2. John E Herr, Kevin Koh, Kun Yao, and John Parkhill. Compressing physics with an autoen-
coder: Creating an atomic species representation to improve machine learning models in the
chemical sciences. The Journal of chemical physics, 151(8):084103, 2019.
3. Tana Cardenas, Derek William Schmidt, Evan S Dodd, Theodore Sonne Perry, Deanna
Capelli, T Quintana, John A Oertel, Dominic Peterson, Emilio Giraldez, and Robert F Heeter.
Design and fabrication of opacity targets for the national ignition facility. Fusion Science and
Technology, 73(3):458–466, 2018.
4. Evan S Dodd, Barbara Gloria DeVolder, ME Martin, Natalia Sergeevna Krasheninnikova,
Ian Lee Tregillis, Theodore Sonne Perry, RF Heeter, YP Opachich, AS Moore, John L Kline,
et al. Hohlraum modeling for opacity experiments on the national ignition facility. Physics of
Plasmas, 25(6):063301, 2018.
In these equations, fˆ is the applied force, which we allow to be one of the three
values −10, 0, or 10 N. Additionally, we fix the pendulum mass and the cart mass to
each be 1 kg (m = M = 1 kg), the length of the pendulum arm to be 1 m, and the
acceleration due to gravity to be g = 9.81 m/s².
The goal for our reinforcement learning problem is to learn a strategy for the
cart to begin with the pendulum at the bottom of its swing (θ = 0) and to swing
the pendulum to the vertical position of θ = ±π and balance it there for as long
as possible. This is a more difficult problem than it might sound because the cart
must accelerate in one direction and then rapidly accelerate in the other direction to
swing the pendulum to be completely vertical. In other words, to accomplish its task
our machine learning model must understand that there are times when it wants the
pendulum to go to the bottom in order to get the pendulum to a high height later.
The process of swinging the mass to be upright is illustrated in Fig. 9.1. When
the mass is at rest, the cart accelerates in one direction (right in the figure), which
causes the mass to swing up behind it (see panel 2 in the figure). Then when the
mass reaches its apex in panel 3, the cart accelerates in the other direction, giving
the mass more speed as it falls (panel 4). This gives it enough momentum to reach
an upright position (panels 5 and 6).
Fig. 9.1 Illustration of the strategy to swing the mass of the cart-mounted pendulum from rest to
the maximum height. The numbers in each figure indicate a step in the process; the arrow on the
cart indicates the force applied by the cart, and the arrow from the mass indicates direction it is
moving and its speed
Reinforcement Learning
Reinforcement learning aims to train a model to maximize a reward function.
It does not need to know the correct answer for training, unlike supervised
learning. In reinforcement learning with deep neural networks, we design a
training procedure that will find how to optimize the weights and biases in a
neural network to maximize the reward.
To fully state the problem, we need to define our objective function or reward. This
will encode the goal of our reinforcement learning problem that we wish to solve.
Given the goal that we have defined, for the reward at a given time, we evaluate the
following function:
\rho(\theta) = \frac{1}{(\pi - |\theta|)^4 + \epsilon} + b, \qquad (9.3)

where

b = \begin{cases} 20(y + 1) & y \ge -\frac{1}{2} \\ 0 & y < -\frac{1}{2}, \end{cases} \qquad (9.4)

y = \cos(\pi - \theta). \qquad (9.5)
[Fig. 9.2: the reward ρ as a function of θ, shown on a logarithmic vertical scale]
The reward function we have specified has two terms and is plotted in Fig. 9.2.
The first term rewards the system for getting the angle between the pendulum and
the cart to be close to ±π . The form of this function makes the reward increase
nonlinearly as the goal is approached. Moreover, once the pendulum gets above a
certain height, y, there is a “bonus” reward, b, that kicks in. When the height of
the pendulum mass is greater than −0.5 (or when |θ | ≥ 2π/3), there is a bonus
applied that increases the magnitude of the reward linearly in y. The purpose
of the bonus is to indicate that getting the pendulum to a high level will lead to higher
rewards. This will be discussed in more detail later. For now we can think of it as
analogous to a bonus in a video game when a certain threshold is crossed. If the cart
system were a video game, we could imagine the screen flashing the words “super
bonus” when the height of the pendulum gets high enough.
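For concreteness, the reward of Eqs. (9.3)-(9.5) can be evaluated with a few lines of Python; the value of the small constant in the denominator (written as ϵ above) is not given in the text, so the value used here is illustrative.

import numpy as np

def reward(theta, eps=1.0e-4):
    """Reward of Eq. (9.3); eps is the small constant in the denominator
    (its value is an assumption, not taken from the text)."""
    y = np.cos(np.pi - theta)                        # height of the pendulum mass
    bonus = 20.0 * (y + 1.0) if y >= -0.5 else 0.0   # the "super bonus" term
    return 1.0 / ((np.pi - abs(theta)) ** 4 + eps) + bonus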
Reward Functions
The reward function is where we encode the desired behavior of the system
numerically. It can be important that the reward function is designed so that
intermediate rewards exist that can be found via random sampling.
We use the same differential equation integrator from Sect. 7.3 to update the state
of the system given the current position of the cart, x, and its time derivative, ẋ, the
value of θ and its time derivative, θ̇, and the force applied. The integrator advances
the state forward in time by Δt = 1/32 = 0.03125 s, and the force applied
over this time is constant. Before each update, our controller function will be given
(x, ẋ, θ, θ̇ ) as input, and it must output the force to be applied, either −10, 0, or
10 N.
The controller will have 200 steps of size Δt to try to maximize the reward. If we
call one realization of these 200 steps a case, the total reward, R, for the case will
be the sum of the rewards at the end of each step:
R = \sum_{i=1}^{200} \rho\big(\theta(i \Delta t)\big). \qquad (9.6)
For the problem described above for the cart-mounted pendulum, or for any scenario
where we want to take actions to maximize a reward function, we consider the
problem of having a neural network make decisions about which actions to take.
The act of taking actions based on current conditions (i.e., the state of the system) is
called a policy. For example, a policy could always apply a force of the opposite sign
as the derivative of θ in the pendulum problem. In our case the neural network will
embody the policy. It takes in inputs and chooses a decision to make. This network
will output the probability that a given action should be taken by having the last
layer of the network be a softmax function applied to a number of hidden units
equal to the number of possible decisions. This is the policy function, f(x), where x
is a vector of the same size as the number of inputs and f = (f1 , f2 , . . . , fN ) is a
vector of N probabilities containing the probability of choosing a particular action.
For the cart-mounted pendulum problem, x is of length 4 containing (x, ẋ, θ, θ̇ ).
The policy function outputs 3 values: the probability of applying a force of -10 N,
the probability of applying a force of 0 N, and the probability of applying a force of
10 N.
In the application of the policy function provided by the neural network, we
evaluate f(x) to get the probabilities for each action. We then take the cumulative
sum of the probabilities:
s_\ell = \sum_{n=1}^{\ell} f_n, \qquad 1 \le \ell \le N. \qquad (9.7)
Selecting a random number ω between 0 and 1, the action taken is the smallest
value of ℓ such that ω < s_ℓ. For instance, if f = [0.1, 0.3, 0.6], the first action would
be selected for ω less than 0.1, the second action would be selected for 0.1 ≤ ω <
0.4, and the third action would be taken for ω ≥ 0.4.
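A sketch of this selection rule is given below; the function name and the use of NumPy are our own.

import numpy as np

def select_action(probs, rng=np.random.default_rng()):
    """Select an action index from the policy probabilities using the
    cumulative-sum rule of Eq. (9.7): draw omega in [0, 1) and take the
    smallest index whose cumulative probability exceeds omega."""
    s = np.cumsum(probs)          # s_l = f_1 + ... + f_l
    omega = rng.random()
    return int(np.searchsorted(s, omega, side="right"))

# Example from the text, f = [0.1, 0.3, 0.6]:
# omega < 0.1 -> action 0; 0.1 <= omega < 0.4 -> action 1; omega >= 0.4 -> action 2.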
Given that we have a network and a means to evaluate f(x) and then select an
action, the question remains how to train this neural network. We want to train
the network to maximize the reward for a case, but the function only outputs
probabilities for making a decision. How can we adjust the weights and biases of
the model to maximize the reward function? To do this, we will have to weigh the
outputs of the policy function with the reward.
To see how to train the model, we consider a problem with a reward function
ρ(n), where n is the action taken by the controller and fn (x) is the probability of
the policy choosing action n. The policy function will depend on the weights and
biases of the neural network; we denote the weights and biases as w. Given this
scenario, the expected value for the reward, E[ρ], is equal to the sum of the rewards
for each choice n times the probability of making the choice. In our case, we have
E[\rho] = \sum_{n=1}^{N} \rho(n) f_n(x). \qquad (9.8)
This equation says that the reward we can expect is equal to the weighted sum of the
potential rewards where the weight is given by the probability of selecting a given
action. To apply stochastic gradient descent to increase the reward, we need to know
what the gradient of the expected value of the reward is with respect to w. Taking
the derivative of the expectation with respect to a single weight, w, we get
\frac{\partial}{\partial w} E[\rho] = \frac{\partial}{\partial w} \sum_{n=1}^{N} \rho(n) f_n(x) = \sum_{n=1}^{N} \rho(n) \frac{\partial}{\partial w} f_n(x). \qquad (9.9)
Note that the derivative can be taken into the sum and that the reward function does
not depend on the weight. Now we use an identity for derivatives that says
\frac{\partial g}{\partial x} = \frac{g(x)}{g(x)} \frac{\partial g}{\partial x} = g(x) \frac{\partial}{\partial x} \log g, \qquad (9.10)
because the derivative of the natural logarithm is one divided by its argument. Using
this identity in Eq. (9.9), we get
\frac{\partial}{\partial w} E[\rho] = \sum_{n=1}^{N} \rho(n) f_n(x) \frac{\partial}{\partial w} \log f_n(x) = E\left[ \rho(n) \frac{\partial}{\partial w} \log f_n(x) \right]. \qquad (9.11)
This final equation tells us that the derivative of the expected reward is equal to the
expected value of the reward times the derivative of the logarithm of the policy function with respect
to the weight. We can compute the gradient of the logarithm of the policy function
with respect to its weight using back propagation, as previously discussed. This is
why the method is called policy gradients: we use the gradient of the logarithm of
the policy function to update the weights and biases in the neural network. When we
use the policy function to control our problem, only a single action will be selected
at each step, and we will know the reward of taking this action. Therefore, we will
know what ρ(n) is for the selected case.
This still does not complete the description of how we train the network. To see
how this is done, we consider a single decision point. We evaluate f(x) and select
an action, n̂. After selecting the action, we will know what the value of the reward
is. We then treat the “correct” example, as if we were doing supervised learning, as
n̂, and compute the cross-entropy loss (c.f. Eq. (2.25)). The loss function for policy
gradients is
L = -\sum_{i=1}^{I} \sum_{n=1}^{N} \rho(n)\, I(n, \hat{n}) \log \mathrm{softmax}\big(n, z^{(1)}, \ldots, z^{(N)}\big), \qquad (9.12)
where I(n, n̂) is the indicator function that takes on the value of 0 when its two
arguments are not identical and 1 when they are. This loss function has the same
form as that in Eq. (9.11). Therefore, we weight the loss by the reward obtained by
taking the action and train in the same manner as a standard classification problem.
This is the crucial step in policy gradients: conversion of the reinforcement learning
problem into a supervised learning problem where the cases are weighted by the
reward of the selected action.
When the optimizer attempts to minimize the weighted loss function, it will try to
increase the probability of selecting actions that lead to a good reward and decrease
the probability of selecting the actions that lead to low rewards. This is exactly what
we want our training to do: find a policy that maximizes the reward.
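A sketch of how this reward-weighted training can be set up is shown below, assuming Keras; the policy network's hidden layer size is illustrative, and the weighting of each case by its reward is done here through the sample_weight argument to fit, which scales each case's cross-entropy loss by its weight in the spirit of Eq. (9.12).

from tensorflow import keras
from tensorflow.keras import layers

# Policy network for the cart-mounted pendulum: 4 inputs, softmax over 3 forces.
# The hidden-layer size is illustrative; the text does not specify it.
policy = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# states: (n, 4) inputs seen during the sessions; actions: (n,) indices of the
# actions actually taken; rewards: (n,) total reward assigned to each case.
# policy.fit(states, actions, sample_weight=rewards, epochs=20)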
Policy Gradients
The policy gradients approach to reinforcement learning turns the reinforce-
ment learning problem into a modified supervised learning problem. This is
done by running the model to select an action, calling that action the “correct”
dependent variable from a supervised learning perspective. When computing
the loss function for supervised learning, we weight each case by the reward.
This will make the trained model reproduce the actions that give large rewards
and avoid the actions that give low rewards.
Cumulative Rewards
In policy gradients, and reinforcement learning in general, we want to
optimize the total reward over all the actions taken by the model. Otherwise,
if we only maximize the reward in the next step, the model will not select actions
that lead to delayed, but larger, rewards. In policy gradients, this means we
sum all of the rewards over a series of actions for the purpose of weighting
the loss function.
Exploration
In reinforcement learning there is a necessity to explore the possible solution
space to find higher rewards. This is in tension with the desire to use the
model’s estimate of what the best action to take is. Therefore, in training we
need to allow some random moves to be made that disregard the model’s
estimate. As the model is trained, we can reduce the number of moves that are
randomly selected as the reward space will be explored via random chance in
the early cases seen by the model.
We ran 250 sessions of the model. We ran 50 sessions where every move made
was randomly selected and then trained the model with the Adam optimizer for 20
epochs and a learning rate of 0.001. Then we ran 50 sessions where only 32% of the
moves were randomly selected, and the others were chosen based on the model’s
output probabilities, and repeated the training for 20 epochs. The final three groups
of sessions had 50 sessions each with 10%, 3.2%, and 1% probabilities of a random
force applied. After each set of 50, we train the model for 20 epochs. Before each
round of training, we normalize the rewards by subtracting the mean reward and
dividing by the standard deviation of the rewards.
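The schedule described above might be organized as in the sketch below; run_session is a hypothetical helper that plays one 200-step case, choosing a random force with probability p_random and otherwise sampling from the policy, and returns the states visited, the actions taken, and the reward assigned to each case.

import numpy as np

# policy is the model from the previous sketch; run_session is hypothetical.
for p_random in [1.0, 0.32, 0.10, 0.032, 0.01]:
    sessions = [run_session(policy, p_random) for _ in range(50)]
    states = np.concatenate([s[0] for s in sessions])
    actions = np.concatenate([s[1] for s in sessions])
    rewards = np.concatenate([s[2] for s in sessions])
    # Normalize the rewards before each round of training.
    rewards = (rewards - rewards.mean()) / rewards.std()
    policy.fit(states, actions, sample_weight=rewards, epochs=20, verbose=0)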
The resulting model is able to find a policy that swings the mass to its maximum
possible height. Because the moves are still selected based on the probabilities of
selecting a given force, running the model multiple times gives different outcomes.
The outcome of running the model for the cart-mounted pendulum is shown in
Fig. 9.3. In the figure we show the system at several different times, depicting the
force applied and using partial transparency to show several previous times to give
a sense of how the system is evolving. In Fig. 9.3, at t = 0.4 s, the cart moves to
the left, before applying a force in the opposite direction to give the mass further
angular momentum (see t = 0.9 s panel). This oscillation between directions of
applied force is repeated until eventually the mass has enough momentum to reach
its maximum height around 3.4 s.
For a different view of the strategy found by the neural network via reinforcement
learning, in Fig. 9.4 we show the evolution of the vertical and horizontal positions of
the pendulum and the force applied by the cart. In the top panel of the figure, we can
see that initially the pendulum has relatively modest swings that are successively
amplified until the third period where it passes through the maximum height, before
following a small-amplitude path that leads to a large swing again (this last swing is
occurring when the final step in the session is reached).
The second panel of Fig. 9.4 shows that initially the cart moves to the left,
before its velocity switches to the positive x-direction and back again to a negative
acceleration. Then at about t = 2 s, the cart accelerates in the positive direction
to accomplish the swing of the pendulum through the positive vertical, before
switching directions again at about t = 4.5 s. All of these changes in direction can be
seen in the third panel where the forces are displayed. In this part of Fig. 9.4, we can
also see how there is still some randomness in the behavior of the system due to the
model returning the probability of applying a force. Despite the randomness, there
is a pattern to these forces: initially negative and then positive, before switching to
negative for a short period of time, before a long period of positive force applied
starting around t = 2.5 s. One thing we notice in the force applied is that there
Fig. 9.3 Depiction of the strategy learned by reinforcement learning for the cart-mounted
pendulum problem. The cart and the pendulum are shown at equally spaced times. The system
at several previous frames is also shown as partially transparent to give a sense of how the system
is changing
are relatively few times where the force applied is 0. Many of these occur during
the time when the mass is near its maximum height. It is possible that the model
is trying to balance the pendulum at the peak height, but that the forces that it can
apply are too large for the fine control needed for this balancing.
Before moving on from the cart-mounted pendulum problem, we compare the
model’s strategy to the strategy from control theory. The swing-up of the pendulum
from rest requires a nonlinear control strategy, and balancing it can be dealt with
via linear controls [2]. This suggests that it might be beneficial to train a different
model for the balancing of the pendulum and apply that when the pendulum is near
vertical.
Fig. 9.4 The evolution of the pendulum height (top), cart position and velocity (middle), and the
force applied (bottom) as a function of time from the strategy employed by the machine learning
model
One step in the manufacturing of glass involves the cooling of the glass after it is
formed. In glass, different temperature levels promote different chemical reactions
and physical transformations, and temperature gradients can induce stresses in the
glass that lead to the development of cracks. Additionally, because the temperatures
of the medium during the cooling are high, in the 1000 K range, much of the heat
transfer occurs via thermal radiation [3–5]. In this case study we use reinforcement
learning to develop a control strategy for cooling of glass along a specified
temperature curve. Our problem is based on a problem presented in the control work
of Frank et al. [5].
We use a 1D diffusion model for the radiative transfer from the hot glass where
we assume that the glass is a sheet that is much larger in two dimensions and thin
in a single direction, x. The equation that governs the gray energy density of the
thermal radiation, E(x, t) in units of J/m3 , is
\frac{\partial E}{\partial t} - \frac{c}{\kappa} \frac{\partial^2 E}{\partial x^2} = -c \kappa \left( E - a T^4 \right), \qquad (9.13)
where c is the speed of light with units of m/s, κ is the absorption opacity with units
of m−1 , T (x, t) is the temperature in K, and
4σSB J
a= = 7.56438 × 10−16 3 4 (9.14)
c m K
is the radiation constant with σSB ≈ 5.67 × 10−8 W/(m2 K4 ) denoting the Stefan–
Boltzmann constant. We consider that the slab is symmetric about x = 0 and that at
x = L the boundary condition imposes that there is incoming radiation equivalent
to that being emitted by a blackbody at temperature u. The temperature of the slab
is governed by an equation that couples to Eq. (9.13):
\[
\rho c_m \frac{\partial T}{\partial t} - k \frac{\partial^2 T}{\partial x^2} = c\kappa\left(E - aT^4\right), \qquad (9.15)
\]
with cm the specific heat in J/(kg·K), ρ the glass density in kg/m³, and k the thermal
conductivity in W/(m·K). As with the radiation energy density equation, the glass
temperature is symmetric at x = 0. The boundary condition at x = L is a convective
boundary condition that provides
\[
\left. k\,\frac{\partial T}{\partial x}\right|_{x=L} = h(u - T), \qquad (9.16)
\]
where h is the convection coefficient in units of W/(m²·K). Note that these equations
are nonlinear in the temperature due to the T⁴ term that accounts for radiation
emission.
In this model we can control the temperature of the glass only by setting the
temperature outside the boundary of the glass, u. When u changes, the amount of
radiation entering the glass changes and the flow of heat from convection will also
change the temperature. Raising the value of u will increase the temperature of the
glass, but the added heat will have to diffuse in from the boundary over time. As
a result, there is a finite time for the changes of the temperature on the boundary
to propagate through the medium and cause the temperature to reach a new steady
state.
The values of the glass properties that we use for our case study are
• the half-thickness of the glass sheet is L = 0.2 m,
• the density of the glass is ρ = 2200 kg/m³,
• the specific heat of the glass is cm = 900 J/(kg·K),
• the thermal conductivity is k = 22.27 W/(m·K),
• the convection coefficient is h = 100 W/(m²·K), and
• the absorption opacity is κ = 100 m⁻¹.
In the case study, we desire to have the glass temperature follow the temperature
profile given in Fig. 9.5. This profile has the glass temperature start at 800 K and
then increase over several hours to 900 K, before a steep temperature rise to 1000 K
around 10 h, and staying there for about 1 h before rapidly cooling to 850 K. A
profile such as this could be designed to control chemical reactions or physical
changes that take place at different rates at 900, 1000, and 850 K. We also desire the
entire volume of glass to have the same temperature, but, as we explained above,
temperature changes at the boundary require a finite amount of time to propagate
through the glass.
Our goal, G, is to minimize the squared difference between the glass temperature
and the desired temperature integrated over the glass thickness:
\[
G(t) = \int_0^L \left(T(x, t) - T_{\mathrm{goal}}(t)\right)^2 dx. \qquad (9.17)
\]
We use this goal to define the reward function for the reinforcement learning
problem as the inverse of G(t):
\[
R(t) = \frac{1}{G(t) + \epsilon}, \qquad (9.18)
\]
where ε is a small positive constant that keeps the reward finite as G(t) approaches
zero.
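As an illustrative sketch of Eqs. (9.17) and (9.18), the goal and reward for a discretized temperature profile could be computed as follows; the grid, variable names, and the value of the small constant are assumptions rather than the implementation used in this case study.

import numpy as np

def reward(T, T_goal, x, eps=1.0e-6):
    """Reward from Eq. (9.18) as the inverse of the integrated squared error, Eq. (9.17).

    T      : glass temperatures at the grid points x (K)
    T_goal : desired temperature at the current time (K)
    x      : grid points spanning 0 <= x <= L (m)
    eps    : small constant (assumed value) that keeps the reward finite
    """
    G = np.trapz((T - T_goal) ** 2, x)  # integrate the squared error over the thickness
    return 1.0 / (G + eps)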
Fig. 9.6 Temperature at x = L/2 as a function of time produced by applying the reinforcement
learning model to control the temperature, the simple control, the desired temperature, and the
control temperature u selected by the reinforcement learning model
After training, we evaluate the model by running 10 cases and averaging the
results to see how the model performs on average; this smooths out the randomness
in the application of the model. We compare the result from
reinforcement learning with the intuitive, simple control of setting the value of u to
be the desired temperature at time t [5]. In Fig. 9.6, we show the temperature value
at x = L/2 over time using the reinforcement learning (RL) control of the boundary
temperature and compare that to the ideal temperature and the temperature resulting
from the simple control. This figure also shows the boundary temperature set by the
reinforcement learning model.
From Fig. 9.6, we see that on the whole the RL control is closer to the ideal
temperature than that resulting from the simple control. This is especially clear in
the initial ramp up to 900 K, the peak temperature reached, and the cooling to 850 K.
We can see that the model has figured out that to match the initial temperature ramp
it must raise the boundary temperature above the desired temperature because the
desired temperature is rising faster than the glass temperature can respond. At the
plateau at 900 K, the RL control raises and lowers the boundary temperature to
attempt to make the overall temperature settle near the desired temperature. We also
see that there are limitations to the control system we have. Given the slow response
of the glass temperature to the boundary control, the boundary temperature must
rise very quickly to go up to 1000 K and decrease very quickly to go down to 850 K.
We can further investigate the system behavior by looking at the temperature as
a function of space at a given time. The temperature as a function of space at time
t = 12 h and 50 min is shown in Fig. 9.7. In this figure, we can observe that the RL
control has adjusted the temperature so that it intersects the desired temperature. The
time depicted in the figure is soon after the quick cooling from 1000 K to 850 K. The
strategy used by the RL control is able to get to this lower temperature by setting
the boundary temperature far below the desired temperature.
A final way to judge the behavior of our control strategy is by looking at
the difference between the temperatures obtained from the RL control strategy
compared with the desired temperature over all time and over the entire volume.
These differences are depicted in Fig. 9.8 for the RL control and the simple control.
From the MAE and MSE of each temperature field, we see that the RL control is
superior to the simple control. Nevertheless, one feature we notice in the results
from the RL control is the rapid changes in the temperature near the boundary
at x = 0.2 m due to the rapid changing of the boundary temperature by the RL
control. Though the control makes the temperature difference close to zero, the rapid
changes would likely be less than desirable for the cooling of glass in an industrial
process. However, we did not specify in our reward function that the temperature
should not change rapidly over time. If we did desire this property, we could change
the reward and retrain the model.
In Fig. 9.8, we also see that the largest errors are at low values of x. This is due
to the fact that changes in the boundary temperature take time to reach the center
of the plate. Finally, we can also observe that the error in the RL control temperature is
larger in the rise to 1000 K between 10 and 11 h, but that it has a longer region of
small error than the simple control.
Conclusions from the Case Study
This case study involved the control of a system of partial differential equations to
obtain a given solution profile. The reinforcement learning model based on policy
gradients had no knowledge of the underlying equations for the temperature of
the glass; it only knew what it learned from changing the boundary temperature
and seeing how the reward changed. In this particular problem, solving the system
of equations was not too burdensome because the numerical solution was not
expensive. If each trial, that is, the simulation of the glass temperature over the
full cooling schedule, were expensive to compute, the many trials needed to train
the model would be a much greater concern.
Fig. 9.8 The difference between the desired temperature and the temperatures obtained by the
different controls at all times and over the entire volume for the RL control (top panel) and the
simple control (bottom panel). The mean-absolute error and mean-squared error for each of the
temperatures are also given
Policy gradients are only one type of reinforcement learning method that can be
applied in science and engineering. Coverage of the common approaches can be
found in the text by Sutton and Barto [6]. There is a sophisticated theory behind
the ideas of using randomness to explore the solution space (as we did in our cart-
mounted pendulum example), the structure of the rewards, and the number of cases
that must be used to train the model.
Problems
9.1 Repeat the cart-mounted pendulum problem, but start the pendulum at the top
of its swing with a small perturbation in θ. Use policy gradients to train a control
strategy for balancing the pendulum using the procedures outlined above.
9.2 A variant on the cart-mounted pendulum problem involves a mechanical gantry
that needs to move a mass from a starting position to another position in a minimum
amount of time. The time required to move the mass includes the time for the mass
to settle (i.e., any swinging to stop). Using policy gradients, train a control strategy
for a mechanical gantry where the reward is the inverse of the time required to reach
the endpoint, multiplied by the distance to the stopping point (to account for the fact
that longer distances will require more time). The inputs to the model are the current
position and the desired stopping position, and the model outputs the probabilities
that a force of −10, 0, or 10 N is applied. Say the mass of the gantry is 1 kg, the mass of the load is 1 kg,
and the arm length is 1 m. When training the model, you will need to give it various
values for the desired stopping position. Can your model accomplish the goal?
References
1. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
2. Russ Tedrake. Underactuated robotics: Learning, planning, and control for efficient and agile
machines course notes for MIT 6.832. 2009.
3. Guido Thömmes, René Pinnau, Mohammed Seaïd, Th. Götz, and Axel Klar. Numerical methods
and optimal control for glass cooling processes. Transport Theory and Statistical Physics, 31(4-
6):513–529, 2002.
4. Axel Klar, Jens Lang, and Mohammed Seaïd. Adaptive solutions of SPN-approximations to radiative
heat transfer in glass. International Journal of Thermal Sciences, 44(11):1013–1023, November
2005.
5. Martin Frank, Axel Klar, and René Pinnau. Optimal Control of Glass Cooling Using Simplified
PN Theory. Transport Theory and Statistical Physics, 39(2-4):282–311, March 2010.
6. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
Appendix A
Data and Implementation of the Examples and Case Studies
The data and code required to reproduce the examples and case studies in this mono-
graph are available at the author’s page on Github: https://github.com/DrRyanMc.
The files there are arranged chapterwise. Where possible, random number seeds
have been set to make the results reproducible, but this has not always been possible.
The codes are in the form of Jupyter notebooks, written in Python. The examples
below assume a basic knowledge of Python.
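For instance, seeds might be set at the top of a notebook along the following lines; the seed value itself is arbitrary and shown only as a sketch.

import numpy as np
import tensorflow as tf

np.random.seed(42)      # NumPy-based randomness (used by Scikit-learn by default)
tf.random.set_seed(42)  # TensorFlow weight initialization and data shuffling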
A.1 Scikit-Learn
Many of the examples from Part I were produced using the tools available in Scikit-
learn. Scikit-learn is a library of machine learning methods and algorithms that makes
the use of machine learning in Python rather straightforward. This library can be
found at https://scikit-learn.org/. Below is some discussion of the features and use
of Scikit-learn. It is not meant to be exhaustive, nor is it a replacement for the fine
and thorough documentation available on the Scikit-learn website.
Scikit-learn is structured with a common syntax for the different models that it
employs (e.g., linear regression, logistic regression, random forests, etc.) in that
there is a common way to initialize, train, and use the models. For instance, the
Python code to initialize and train a linear regression model with training data X and
y, for the independent and dependent variables, would be
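A minimal sketch of such a listing, using the LinearRegression estimator and the X and y variables described above, is:

from sklearn.linear_model import LinearRegression

model = LinearRegression()  # initialize the model
model.fit(X, y)             # train on the independent variables X and dependent variable y
yhat = model.predict(X)     # use the trained model to make predictions

The same initialize/fit/predict pattern carries over to the other estimators in the library, such as LogisticRegression or RandomForestRegressor.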
In Scikit-learn, there are functions that automate some common model assessment
tasks. For example, one can perform K-fold cross-validation using a single function:
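One way this call can look, sketched here with a linear regression model and five folds, is:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation; returns one score per fold (R^2 by default for regressors)
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean(), scores.std())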
Another benefit of Scikit-learn is that many data preprocessing tasks can be handled
with pre-defined routines. Scaling, normalizing, and other transformations to data
can be accomplished by the preprocessing module in Scikit-learn. Additionally,
the partitioning of data into training and test sets can be done using a built-in
function:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
In this code snippet, the independent variables in X and the dependent variables in
y are split into test and training sets where 20% of the data is randomly assigned to
the test set.
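Along the same lines, a standard scaling of the independent variables might look like the following sketch; the choice of scaler is illustrative.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xtrain_scaled = scaler.fit_transform(Xtrain)  # fit the scaling on the training data
Xtest_scaled = scaler.transform(Xtest)        # apply the same scaling to the test data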
Finally, the datasets module in Scikit-learn contains a variety of standard
datasets used in machine learning problems, such as the handwritten images data
set, and various other image, classification, or regression data sets. Additionally,
there are functions that can load data directly from existing ML data repositories.
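For example, a bundled dataset can be loaded directly, and named datasets can be fetched from the OpenML repository; the dataset identifiers below are illustrative.

from sklearn.datasets import load_digits, fetch_openml

digits = load_digits()             # handwritten digits data shipped with Scikit-learn
X, y = digits.data, digits.target

mnist = fetch_openml('mnist_784', version=1)  # fetched by name from OpenML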
A.2 Tensorflow
Tensorflow is a library for building deep neural networks (and related models) in
Python. It was originally developed by Google, but released in 2015 under a public
use license. The website for Tensorflow is https://www.tensorflow.org. The majority
of the neural networks used in this work can be constructed with the keras
module of Tensorflow. Keras is another open-source machine learning library, and
Tensorflow has its own implementation of the library. We refer to the Keras module
in Tensorflow as tf.keras.
In tf.keras we can construct a deep neural network sequentially. For instance,
we can define a feed-forward, fully connected neural network with 4 inputs, 2
outputs, and 3 hidden layers each with 6 hidden units and the ReLU activation
function as
1 import tensorflow as tf
2 model = tf.keras.Sequential()
3 model.add(tf.keras.layers.Dense(6, activation='relu',
4     input_shape=(4,)))
5 model.add(tf.keras.layers.Dense(6, activation='relu'))
6 model.add(tf.keras.layers.Dense(6, activation='relu'))
7 model.add(tf.keras.layers.Dense(2))
Line 2 of this code defines model as a tf.keras model that will be built
sequentially. The first item added to this model is a fully connected, or “dense,”
hidden layer of 6 nodes with the ReLU activation function that takes an input of
size 4, as defined in lines 3 and 4. Lines 5 and 6 add two subsequent hidden layers,
before the output layer of size 2 (i.e., with two outputs) finalizes the model in line
7. If an activation function is not specified, then the identity activation is used.
To train the model we first compile it by giving it training parameters, and then
pass it training data to train:
1 model.compile(loss='MSE',
2     optimizer='adam',
3     metrics=['MAE'])
4 model.fit(x=X, y=Y, epochs=20, batch_size=32)
When we compile the model, we specify the loss function (here the mean-squared
error) and the optimization method (Adam with its default parameters in this case).
We can also specify other metrics that we want to
track during training. Here we keep track of the mean-absolute error as well. Line 4
in the above performs the training by calling the fit method. There we specify the
inputs as x and the dependent variables for training as y. For these parameters we
specify that the independent variables are in a variable called X and the dependent
variables are in Y. Additionally, we specify how many epochs to train for and the
batch size.
When we have a trained model, we can use it to predict new dependent variables
using the predict method:
Yhat = model.predict(X)
Convolutional neural networks can be built in almost identical fashion, but with
different layer types. The model used to predict the class for the MNIST fashion
data set in Chap. 6 is defined by
1 model = tf.keras.Sequential()
2 model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=2,
3     padding='same', activation='relu',
4     input_shape=(28,28,1)))
5 model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
6 model.add(tf.keras.layers.Dropout(0.3))
7 model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=2,
8     padding='same', activation='relu'))
9 model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
10 model.add(tf.keras.layers.Dropout(0.3))
11 model.add(tf.keras.layers.Flatten())
12 model.add(tf.keras.layers.Dense(256, activation='relu'))
13 model.add(tf.keras.layers.Dropout(0.5))
14 model.add(tf.keras.layers.Dense(10, activation='softmax'))
As before, this model is built sequentially. The first layer we add is a 2-D
convolution of kernel size 2 that produces 64 channels (this is the number of filters
in tf.keras), padded to keep the same size, using the ReLU activation function,
and expecting an input of size 28×28. We then add max pooling and dropout layers.
The pool_size parameter will shrink the image by a factor of 2 in each dimension
(down to 14 × 14), and the dropout layer applies dropout with a probability of 30%
to the connections from the max pooling layer to the next layer, a 2-D convolution
producing 32 channels of size 14 × 14 each.
Eventually, the data is flattened (in line 11) into a vector that is fed into a fully
connected layer containing 256 hidden units (in line 12). This layer then has dropout
applied to it with a 50% probability of dropping a connection, before being passed
to an output layer with a softmax activation function to produce a set of probabilities
for each of the 10 classes. To train this model, one would compile and fit as done
for the fully connected example above.
There are many possible layers that can be included in a tf.keras model.
These include recurrent units, long short-term memory units, transposed convo-
lutions, and all the other types of units discussed in the text. The Tensorflow
documentation gives descriptions of how to use each of these.
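As one example, a small recurrent model with a single long short-term memory layer can be built in the same sequential fashion; the layer sizes and input shape below are illustrative.

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(16, input_shape=(None, 3)))  # 16 units; sequences with 3 features per step
model.add(tf.keras.layers.Dense(1))                         # a single regression output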
It is possible to construct models without building them sequentially. The
following code constructs a neural network with 50 independent variables as inputs
and 4 outputs by defining intermediate steps:
1 inpts = tf.keras.Input(shape=(50,1), name='inputs')
2 x = tf.keras.layers.Conv1D(2, kernel_size=2)(inpts)
3 x1 = tf.keras.layers.Conv1D(4, kernel_size=2, padding="same")(x)
4 x2 = tf.keras.layers.Conv1D(8, kernel_size=2, padding="same")(x1)
5 x3 = tf.keras.layers.Flatten()(x2)
6 x4 = tf.keras.layers.Dropout(rate=0.05)(x3)
7 outpts = tf.keras.layers.Dense(4, activation="softplus")(x4)
8 model = tf.keras.Model(inputs=inpts, outputs=outpts)
In this code inpts is defined as the input to a model in line 1. Then a succession
of variables, x, x1, x2, x3, and x4 are defined as the outputs from hidden layers.
Finally, we define a variable called outpts that outputs 4 numbers passed through
the softplus activation function. In line 8 the model is constructed by specifying the
inputs and outputs to the tf.keras model. This model can then be trained and
used to make predictions as discussed above.
With these simple instructions, we can build a variety of sophisticated deep learn-
ing models. Of course, this discussion only scratches the surface of the capabilities
in Tensorflow. Autoencoders, transfer learning, and reinforcement learning are all
possible using slight modifications to the models defined above. See the code for
the examples provided via Github for this book, or the Tensorflow documentation
for more information.
Index
J
Joachim of Fiore, 119

K
K-means, 94, 95, 97, 99, 108, 110, 111, 113, 114
    selecting number of clusters, 97, 99
Kullback–Leibler (KL) divergence, 102

L
Latin hypercube sampling, 209
Learning rate, 127, 133
Long short term memory (LSTM), 182, 184, 185, 212, 214, 215
    solving the vanishing gradient problem, 185
Loss function, 6, 7
    cross-entropy, 10, 39
    Gini index, 61
    logistic, 34
    mean-absolute error, 74
    mean-absolute percent error, 74
    squared-error, 6
    squared error with sensitivity, 6

M
Main effects, 75–77
Modest Mouse, 175
“Murders in the Rue Morgue”, 3

N
Neural networks
    back propagation, 129, 139
    convolutional (CNN), 164
    deep neural networks (DNN), 134, 135, 137, 138
    dropout, 141, 142
    feed-forward, 121
    need for data normalization, 130, 131

P
Pendulum
    cart-mounted, 188
Poe, Edgar Allen, 3
Pooling, 160, 161
    average, 161
    global, 162
    max, 161–163
    two-dimensional, 162, 163
Pound, Ezra, 25

R
R², 7
    maximizing by adding independent variables, 8
Random forests, 64, 65, 67
    built-in cross-validation, 66
    comparison with tree models, 67
    out-of-bag error, 66, 67, 69
    variable importances from, 66
Recurrent neural network (RNN), 176, 177
    vanishing gradients, 178
Regression, 5
    exponential transform, 30
    logistic, 33, 34
        comparing to the null model, 35
        divide by four rule, 36
        multinomial, 38
    power-law models, 32
    regularized
        elastic net, 45
        lasso, 43, 45, 141, 142
        ridge, 43, 141, 142
Reinforcement learning, 12, 13, 219, 222, 223
    exploration, 226
    policy gradients, 223–225
RGB color space, 110
Risk matrix, 64

T
t-distributed stochastic neighbor embedding (t-SNE), 100–103

W
West, Nathanael, 195