
17

Introduction to machine learning in
statistical mechanics

17.1 Machine learning in statistical mechanics: What and why?


The term artificial intelligence (AI) brings to mind the creation of computer systems
capable of mimicking the decision-making and problem-solving tasks of a human mind
by emulating its thought patterns. In a broad sense, machine learning is a pathway to
AI that uses statistical models and “training” algorithms to take in data, learn insights
and patterns in the data, and apply that learning to make new predictions without
additional input or programming. Viewed this way, it follows that machine learning is
best poised to provide reliable predictions when data are plentiful. For example, online
vendors track shoppers’ purchases and employ machine learning models to learn their
preferences so that they can recommend to shoppers specific items they might want to
purchase in the future; however, the very notion of “preference” implies a pattern in a
shopper’s purchases, which can only be discerned if a large number of purchases can
be analyzed for such patterns. Recommendations based only on one or a few purchases
may or may not meet a shopper’s needs or wants.
In many applications, machine learning is employed to perform one of several im-
portant tasks; these include regression, classification, data clustering, feature extrac-
tion and engineering, and dimensionality reduction, tasks that are particularly useful
in statistical mechanics. When molecular simulation approaches are applied to inves-
tigate a complex system, large amounts of data, such as time series, conformational
samples, energies, free energies, distributions, and so forth, are generated and need
to be analyzed. Most of the time, this data is of very high dimension and may con-
tain complex patterns that are difficult to discern using simple statistical analysis
tools. When this is the case, machine learning becomes a powerful tool for learning
these patterns and predicting trends in a system that might not have been explicitly
generated in a simulation. It can also happen that a simulation yields incomplete data about a system or a particular thermodynamic or dynamical process. In such instances, machine learning can often fill in some of the missing information.
In an enhanced conformational sampling simulation, for example, employing a method
such as d-AFED/TAMD (Section 8.10) or replica-exchange MD (Section 7.5), a sparse
sampling of points is produced as the simulation sweeps over a free-energy hypersur-
face. The density of points increases with the number of sweeps, and at some stage,
there will be enough points to fit or “train” a regression model that can subsequently
predict free energies of points not visited during the simulation while also providing a
smooth closed-form representation of the surface (Schneider et al., 2017; Zhang et al., 2018; Wang et al., 2021). The number of sampled points needed to generate an accu-
rate model may be far fewer than that needed to reach full convergence in a direct
simulation. In studying complex processes for which a suitable reaction coordinate
might not be known a priori, a classification model can be used to look for patterns in local structural information and use this information to create an order parameter
capable of parameterizing a pathway from one conformational basin to another. This
has been done, for example, to study a solid-solid phase transition in a metal (Rogal
et al., 2019), in which environmental descriptors are used as a means of classifying
individual atoms as belonging to one phase or another; the classification model ultimately allows free-energy barriers and mechanisms of the phase transition to be determined. Such an approach could also be applied to the prediction of pathways
of protein or nucleic acid folding (Senior et al., 2019; AlQuraishi, 2019). Finally, in
the field of active matter, where systems are driven to desired patterns and modes
of behavior by supplying them with energy and applying external stimuli, given an
appropriate amount of data and a rule-discovery algorithm, machine learning can be
leveraged to identify the underlying interactions and governing equations that lead to
the emergence of such complex behavior (Cichos et al., 2020).
The purpose of this chapter is to introduce basic concepts in machine learning,
including a selection of machine learning models particularly useful in statistical me-
chanics, and to show how these models can be applied to solve a range of specific
statistical mechanical problems. It is important to note that machine learning meth-
ods, in and of themselves, are not likely to reveal any new physical insights. However, because they are able to predict the outcomes of computationally intensive tasks with significantly greater efficiency than direct computation, their application can lead to such new insights by allowing accessible time and length scales to be significantly increased. Readers interested in a more comprehensive examination of the
machine learning field, which has evolved into an enormous discipline, or a broader
survey of the landscape of machine learning models, are referred to, for example, C.
M. Bishop, Pattern Recognition and Machine Learning or Hastie et al., The Elements
of Statistical Learning. Additional references include Neural Networks and Computing by T. W. S. Chow and S.-Y. Cho, and Understanding Machine Learning by S.
Shalev-Shwartz and S. Ben-David. Mathematical underpinnings of machine learning
are described in D. Simovici’s Mathematical Analysis for Machine Learning and Data
Mining.

17.2 Three key probability distributions


In our discussion of algorithms for training machine learning models, we will make
use of several standard probability distribution functions. These are the Gaussian,
Bernoulli, and categorical distributions, each of which we discuss in this section. The
Gaussian, or normal, distribution is a familiar and widely used function that models
the distribution of continuous variables. Thinking back to Section 7.2, the central limit
theorem guarantees the convergence of a Monte Carlo calculation by establishing that
estimators of averages approach a Gaussian distribution about the true average. Given
a continuous random variable x, the Gaussian distribution takes the form

P_G(x; \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-\mu)^2/2\sigma^2},    (17.2.1)

where μ and σ² are the mean and variance of the distribution. Here, μ determines the location of the peak of the distribution and σ determines its width, meaning that

\langle (x - \mu)^2 \rangle = \int_{-\infty}^{\infty} (x - \mu)^2\, P_G(x; \mu, \sigma)\, dx = \sigma^2.    (17.2.2)

The number of parameters needed to specify a Gaussian distribution of one variable is clearly two. If, instead of a single random variable, we have an n-dimensional vector x of random variables and a vector of mean values μ, the general form of the Gaussian distribution is

P_G(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} [\det(\Sigma)]^{1/2}}\, e^{-(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})/2},    (17.2.3)

where Σ is an n × n matrix known as the covariance matrix, related to averages of the distribution by

\langle (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \rangle = \Sigma.    (17.2.4)
Equation (17.2.4) makes clear that Σ is a symmetric matrix, which, therefore, has real eigenvalues. For eqn. (17.2.3) to be well defined, Σ must also be positive definite, i.e., its eigenvalues must all be positive. In any case, the number of parameters needed to specify a Gaussian distribution of x is n(n + 1)/2 + n = n(n + 3)/2. However, from eqn. (17.2.4), we see that if the components of x are independent random variables, then Σ becomes a diagonal matrix, \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \cdots, \sigma_n^2), such that

\langle (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \rangle_{ij} = \sigma_i^2 \delta_{ij},    (17.2.5)

which leaves us with 2n parameters to specify the distribution. Finally, if the components of x are isotropic, then all of the σ_i are the same, so that Σ reduces to Σ = σ²I, where I is the n × n identity matrix, which leaves just n + 1 parameters needed to specify the distribution.
The second distribution we will employ in our discussion of machine learning con-
cerns random variables that can take on discrete values only. For starters, suppose x
is a binary random variable that can take on the values 0 or 1. An example is a coin
toss where we assign tails = 0 and heads = 1. Let ν be the probability that x = 1
so that 1 − ν is the probability that x = 0. Then, the probability distribution of x is
a special case of a binomial distribution known as a Bernoulli distribution, which is
given by

P_B(x; \nu) = \nu^x (1 - \nu)^{1-x}.    (17.2.6)

With the convention used here, it can be easily shown that ⟨x⟩ = ν and ⟨x²⟩ − ⟨x⟩² = ν(1 − ν). The Bernoulli distribution requires that just one parameter, ν, be specified.
Suppose, next, that x can take on n values, which, for simplicity, we take to be the
integers 1, 2, 3, ..., n. An example is the roll of a six-sided die. The third distribution,
called the categorical distribution, generalizes the Bernoulli distribution to treat such
cases where n > 2. If the probabilities that x takes on each of n values are ν1 , ..., νn ≡ ν,
with ν1 + ν2 + · · · + νn = 1, then the categorical distribution takes the form
P_C(x; \boldsymbol{\nu}) = \prod_{i=1}^{n} \nu_i^{[x=i]}.    (17.2.7)

Here, the quantity [x = i] evaluates to 1 if x = i and to 0 if x ≠ i. It is straightforward to show that \langle x \rangle = \sum_{i=1}^{n} i\nu_i. An alternative formulation of the categorical distribution introduces an n-dimensional vector x having just one component with the value 1 and all other components with the value 0. There will be n such vectors, and if ν_i is the probability associated with the vector x_i whose single nonzero component is x_i, then the categorical distribution can be written as

P_C(\mathbf{x}; \boldsymbol{\nu}) = \prod_{i=1}^{n} \nu_i^{x_i}.    (17.2.8)

With this definition, it can be shown that ⟨x⟩ = ν. In either formulation, the categorical distribution reduces to the Bernoulli distribution for binary random variables.
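To make these distributions concrete, the following minimal Python sketch (our own illustration, not part of the formal development above; it assumes NumPy is available) evaluates each of the three distributions and numerically checks eqn. (17.2.2):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Eqn. (17.2.1): normalized Gaussian for a continuous variable x
    return np.exp(-(x - mu)**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)

def bernoulli_pmf(x, nu):
    # Eqn. (17.2.6): x is 0 or 1; <x> = nu and <x^2> - <x>^2 = nu(1 - nu)
    return nu**x * (1.0 - nu)**(1 - x)

def categorical_pmf(x, nus):
    # Eqn. (17.2.7): x is an integer label 1, ..., n; [x = i] picks out nu_x
    return nus[x - 1]

# Monte Carlo check of eqn. (17.2.2): the variance of Gaussian samples is sigma^2
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=100_000)
print(np.mean((samples - 2.0)**2))  # approximately 0.25
```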

17.3 Simple linear regression as a case study


In order to introduce the basic concepts of machine learning for a regression problem,
we will use the “toy” example of simple linear regression, but we will discuss it within
the framework and nomenclature of machine learning. Suppose we have a set of data
points (x1 , y1 ), (x2 , y2 ),...,(xn , yn ) that we assume lie on a line yi = w0 + w1 xi , where
w0 and w1 are the intercept and slope of the line, respectively. If the points come from
a measurement of some kind and have associated measurement error, they might not
satisfy a perfect linear relation. Therefore, our task is to find an optimal linear model
that best fits the given data. To this end, we take as a data model the linear form
y(x, w) = w0 + w1 x, (17.3.1)
where w = (w0 , w1 ) is a two-dimensional vector of parameters whose specific values
determine the optimal model. Figure 17.1 illustrates how the data might be distributed
around the assumed model in eqn. (17.3.1). To determine w, we seek an objective
function, expressing the deviation of the input data from the target model in eqn.
(17.3.1); this objective function, when minimized with respect to w, yields an optimal
realization of the model capable of predicting any new value of y given an input x.
Optimization of the linear model is achieved by a least-squares minimization procedure, in which the average squared deviation between y_i and the model prediction y(x_i, w) of eqn. (17.3.1) is minimized with respect to w. This procedure defines an objective function, which takes the form

E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} |y_i - y(x_i, \mathbf{w})|^2.    (17.3.2)

Fig. 17.1 Illustration of the process of linear regression: Seven points are distributed about the line y = w_0 + w_1 x, shown as a solid line. Dashed lines represent distances from each point (x_i, y_i) to the line y = w_0 + w_1 x at x = x_i.

The function E(w) in eqn. (17.3.2) has various names throughout the machine learning literature; depending on the source, it is referred to as the error function, cost function, or loss function. Here, we will refer to it as the loss function, which is the most commonly used terminology. The optimal value of w is that which minimizes E(w), i.e., it is the solution of the minimization problem

\frac{\partial E}{\partial \mathbf{w}} = 0.    (17.3.3)
In fact, the solution of the minimization problem is ultimately independent of the 1/N
prefactor in eqn. (17.3.2), so the choice of this prefactor, which makes E(w) an average
distance, is arbitrary. If eqn. (17.3.1) is substituted into eqn. (17.3.2) to give

E(w_0, w_1) = \frac{1}{N} \sum_{i=1}^{N} |y_i - w_0 - w_1 x_i|^2,    (17.3.4)

it becomes clear that eqn. (17.3.3), in this simple case, yields two equations ∂E/∂w0 =
0 and ∂E/∂w1 = 0 in the two unknowns w0 and w1 . From eqn. (17.3.4), these two
conditions yield the coupled equations

\langle y \rangle - w_0 - w_1 \langle x \rangle = 0
\langle xy \rangle - w_0 \langle x \rangle - w_1 \langle x^2 \rangle = 0,    (17.3.5)

where ⟨···⟩ indicates an average over the N data points. The solution to these equations is the familiar result

w_1 = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^2 \rangle - \langle x \rangle^2}, \qquad w_0 = \langle y \rangle - w_1 \langle x \rangle.    (17.3.6)
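Because eqn. (17.3.6) is a closed-form result, it can be verified directly in a few lines; the sketch below (the synthetic data and names are ours, for illustration only) fits noisy points scattered about a known line:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=x.size)  # noisy points about a line

# Eqn. (17.3.6), with the means taken over the N data points
w1 = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
w0 = np.mean(y) - w1 * np.mean(x)
print(w0, w1)  # should be close to the true intercept 1.5 and slope 2.0
```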

At this point, we note that the minimization of the loss function in eqn. (17.3.2) is tantamount to maximizing a Gaussian probability distribution of the form

P(\mathbf{y}; \mathbf{w}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} |y_i - y(x_i, \mathbf{w})|^2 \right],    (17.3.7)

since

E(\mathbf{w}) = -\frac{2\sigma^2}{N} \ln P_G(\mathbf{y}; \mathbf{w}, \sigma^2) - \sigma^2 \ln 2\pi\sigma^2,    (17.3.8)
where y = (y_1, y_2, ..., y_N) is a vector of the N y-values. Therefore, the optimal parameters of our data model are those that maximize a Gaussian probability distribution between the input data and the chosen model of that data. Once the parameter vector w_min that minimizes the loss function is determined, eqn. (17.3.2) can be used to compute a root-mean-square error (RMSE) \epsilon_{\mathrm{RMSE}} via \epsilon_{\mathrm{RMSE}} = \sqrt{E(\mathbf{w}_{\min})}. However, a more commonly used error is the mean absolute error (MAE), based on an L_1 norm and given by

\epsilon_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=1}^{N} |y_i - y(x_i, \mathbf{w}_{\min})|.    (17.3.9)
In order to establish a robust test of the quality of a data model, we can divide
the available data into two subsets: a training set of size Ntrain and a test set of size
Ntest . The size of the test set is assumed to remain fixed while the number of points
in the training set is allowed to vary. Thus, when Ntrain + Ntest < N , the rest of the
data are held in reserve to augment the size of the training set. Initially, the training
set size can be a small fraction of the total available data and is used to minimize the
loss function and determine optimal parameters. Then, the test set is used to evaluate
the accuracy of the model via an RMSE or MAE, evaluated over the points in the
test set. If the error over the test set is too large, more points can be added to the
training set from the reserve set and the process repeated until the magnitude of the
error is deemed acceptable. What is meant by an “acceptable” error is typically one that is lower than the intrinsic error in the original data. For example, if a data set is accurate to within chemical accuracy (1 kcal/mol), then the prediction error of the data model should be lower than 1 kcal/mol. A plot of the RMSE or MAE versus training set size,
known as a learning curve, will reveal how well the model learns from the training
data. Once an acceptable error is reached, the model is considered to be trained, and
it can subsequently be used to predict new y values from input x values that are new
to the model.
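A minimal sketch of this train/test protocol, using the closed-form fit of eqn. (17.3.6) and the MAE of eqn. (17.3.9) on synthetic data of our own choosing, might look as follows; printing the MAE against the training set size traces out the learning curve:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = rng.uniform(0.0, 5.0, N)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=N)

n_test = 200
x_test, y_test = x[:n_test], y[:n_test]   # test set: size held fixed
x_pool, y_pool = x[n_test:], y[n_test:]   # reserve pool that feeds the training set

for n_train in (10, 50, 100, 400, 800):
    xt, yt = x_pool[:n_train], y_pool[:n_train]
    # fit the linear model on the current training set via eqn. (17.3.6)
    w1 = (np.mean(xt * yt) - np.mean(xt) * np.mean(yt)) / (np.mean(xt**2) - np.mean(xt)**2)
    w0 = np.mean(yt) - w1 * np.mean(xt)
    # MAE over the fixed test set, eqn. (17.3.9); MAE vs. n_train is the learning curve
    mae = np.mean(np.abs(y_test - (w0 + w1 * x_test)))
    print(n_train, mae)
```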
In our simple linear regression example, we assumed that x is a scalar variable.
Suppose, instead, that we have N points of the form (y1 , x1 ), (y2 , x2 ),...,(yN , xN ),
where xi is an n-dimensional vector and the N points, therefore, exist in an (n + 1)-
dimensional space. If we again assume a linear relation between y and x, then the
generalization of the linear data model in eqn. (17.3.1) becomes

y(x, w) = w0 + w · x, (17.3.10)
where w is an n-dimensional vector of parameters w = (w_1, ..., w_n). As with the scalar linear regression model, an analytical solution for the n + 1 parameters (w_0, w) can be obtained by direct minimization of the loss function in eqn. (17.3.2). The analytical solution, in particular, is expressible in terms of the inverse of the covariance matrix C \equiv \langle \mathbf{x}\mathbf{x}^T \rangle - \langle \mathbf{x} \rangle \langle \mathbf{x} \rangle^T of the data vector x (see Problem 17.3).
As a final point, we note that, depending on the nature of the input data, certain
points may overly influence the regression. In particular, if some points are outliers
with respect to the posited model, a small subset of parameters might be favored as
the learning procedure attempts to fit these points. This is generally undesirable and
can lead to overfitting of outlier data. A control measure in the learning procedure can
be introduced that prevents undue bias of certain data points. Consider modifying the
loss function so that it reads

E(w_0, \mathbf{w}, \lambda) = \frac{1}{N} \sum_{i=1}^{N} (y_i - w_0 - \mathbf{w} \cdot \mathbf{x}_i)^2 + \frac{\lambda}{2}\, \mathbf{w} \cdot \mathbf{w}.    (17.3.11)

The last term (λ/2)w · w in eqn. (17.3.11) is known as a regularizer or ridge term
and involves a new parameter λ, which needs to be included in the optimization of the
model. However, if we were to include λ into the set (w0 , w) of optimizable parameters,
then the optimization problem would become a nonlinear one, which is more difficult
to solve analytically. Therefore, regularization parameters, such as λ, are typically
chosen a priori before the optimization is performed. Such a parameter is known as a
hyperparameter. The choice of λ is governed by its ability to lower the overall error
across the test set compared to what the error would be if λ = 0. Regularization terms
need not be restricted to quadratic forms. Other choices include a linear regularizer
of the form λ′ |w|, known as a lasso term, or a combination of linear and quadratic
regularizers, λ|w|2 /2 + λ′ |w|, known as an elastic net term. The elastic net regularizer
requires determination of two hyperparameters, λ and λ′ . Depending on the nature of
the data, a lasso or elastic net regularizer might be a better discriminator of the most
relevant parameters (w0 , w) weighting the input values x.
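For the quadratic (ridge) regularizer, minimization of eqn. (17.3.11) still admits a closed-form solution via the normal equations; the following sketch (a standard ridge solve on synthetic data of our own invention, with the constant factors of N and 1/2 absorbed into λ) illustrates the idea:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # X: (N, n) inputs. Prepend a column of 1s so w[0] plays the role of w0.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    reg = lam * np.eye(A.shape[1])
    reg[0, 0] = 0.0  # leave the bias w0 out of the ridge term, as in eqn. (17.3.11)
    # normal equations of eqn. (17.3.11); factors of N and 1/2 absorbed into lam
    return np.linalg.solve(A.T @ A + reg, A.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)
w = ridge_fit(X, y, lam=1e-2)
print(w)  # approximately (0.5, 1.0, -2.0, 0.0)
```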
Choosing hyperparameters in a machine learning model can be accomplished using
a robust scheme known as k-fold cross validation. In this approach, the training data
are divided into k subsets of equal size; of these subsets, (k − 1) are used for hyper-
parameter searching, and the remaining subset is used to validate the choice. This
process is repeated such that each of the k subsets acts as the validation set and the
validation error is retained for each of the k searches. Ultimately, the hyperparameters
that give the lowest error can be selected, or the hyperparameters can be averaged over
the k searches (in which case, the associated error will be an average error over the k
individual errors). As a final assessment, the quality of the choice of hyperparameters
should be evaluated against the test dataset. The use of k-fold cross validation ensures
that the hyperparameter search is not biased toward a single validation set.
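A bare-bones k-fold loop over a one-dimensional grid of hyperparameters might look like the following sketch; it reuses the illustrative ridge_fit and data (X, y) from the previous sketch and so is not standalone:

```python
import numpy as np

def kfold_error(X, y, lam, k=5):
    # Split the training data into k folds; each fold serves once as the
    # validation set while the remaining k - 1 folds are used for fitting.
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        w = ridge_fit(X[mask], y[mask], lam)
        A_val = np.hstack([np.ones((len(fold), 1)), X[fold]])
        errors.append(np.mean(np.abs(y[fold] - A_val @ w)))  # MAE per fold
    return np.mean(errors)  # average validation error over the k folds

# select the hyperparameter giving the lowest average validation error
lams = [1e-4, 1e-3, 1e-2, 1e-1]
errs = [kfold_error(X, y, lam) for lam in lams]
best_lam = lams[int(np.argmin(errs))]
```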

17.4 Kernel methods


Linear regression methods such as the n-dimensional generalization in eqn. (17.3.10)
are special cases of a more general approach to regression problems known as kernel
regression. In kernel regression methods, the vector x in eqn. (17.3.10) is replaced by a vector function \Phi : \mathbb{R}^n \to \mathbb{R}^{n_f}, where \mathbb{R}^{n_f} is a feature space of dimension n_f \geq n. More specifically, n_f is the number of parameters needed to define the data model. For linear regression, it is easy to see that n_f = n and Φ(x) = x. The more general expression for the data model is

y(x, w) = w0 + w · Φ(x), (17.4.1)

where now dim(w) = nf and Φ(x) is a nonlinear function of x. There is some vagueness
to the idea of a feature space of dimension nf , but the good news is that neither nf nor
Φ(x) need to be known explicitly, as we will show shortly. The power of eqn. (17.4.1)
is that it serves as a framework for obtaining different types of kernel methods. In this
context, a kernel or kernel matrix is an N × N matrix defined by

Kij ≡ Φ(xi ) · Φ(xj ). (17.4.2)

In addition, the parameters w are never explicitly defined but are treated as functions of the training data, which take the general form \mathbf{w} = \sum_{i=1}^{N} g(\mathbf{x}_i). This is called a support-vector expansion.
A framework that allows us to exploit these ideas is constrained optimization of
the data model. That is, we define a set of N quantities εi , which we will require to
equal yi − y(xi , w) via the constrained optimization procedure. To this end, we start
with the quadratic loss function

E(\mathbf{w}) = \frac{1}{2\lambda} \sum_{i=1}^{N} |\varepsilon_i|^2 + \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w},    (17.4.3)

which we minimize subject to the constraint that ε_i = y_i − y(x_i, w). Here, λ is a hyperparameter similar to a regularization parameter. Since i = 1, ..., N, we have N constraints, which we can enforce by introducing N Lagrange multipliers α_i and an extended loss function

\tilde{E}(\mathbf{w}) = \frac{1}{2\lambda} \sum_{i=1}^{N} |\varepsilon_i|^2 + \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{N} \alpha_i \left( \varepsilon_i - y_i + w_0 + \mathbf{w} \cdot \Phi(\mathbf{x}_i) \right).    (17.4.4)

In eqn. (17.4.4), ε_i, w_0, and w are all treated as optimization parameters. The minimization conditions then become

\frac{\partial \tilde{E}}{\partial w_0} = -\sum_{i=1}^{N} \alpha_i = 0

\frac{\partial \tilde{E}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i) = 0

\frac{\partial \tilde{E}}{\partial \varepsilon_i} = \frac{1}{\lambda}\, \varepsilon_i - \alpha_i = 0.    (17.4.5)

From these equations, we see that \sum_{i=1}^{N} \alpha_i = 0, \mathbf{w} = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i), and \varepsilon_i = \lambda \alpha_i.

Substituting these into the constraint condition ε_i − y_i + w_0 + w · Φ(x_i) = 0, we obtain an expression for each y_j value

y_j = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) + w_0 + \lambda \alpha_j.    (17.4.6)

Equations (17.4.6) can be written in matrix form as

\begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \lambda I \end{pmatrix} \begin{pmatrix} w_0 \\ \boldsymbol{\alpha} \end{pmatrix},    (17.4.7)

where K is the N × N kernel matrix, y is an N-dimensional vector (y_1 y_2 ··· y_N)^T, 1^T is an N-dimensional row vector of 1s, α is the vector (α_1 α_2 ··· α_N)^T, and I is the N × N identity matrix.
A number of different regression methods emerge from eqn. (17.4.6). For example,
if we set w0 = 0, we obtain the kernel ridge-regression model for a multidimensional
function y(x). In this model, the Lagrange multipliers α1 , ..., αN become the optimiza-
tion parameters, and the solution of eqn. (17.4.7) for the vector α is

\boldsymbol{\alpha} = (K + \lambda I)^{-1}\, \mathbf{y}.    (17.4.8)
The data model corresponding to this solution takes the form

y(\mathbf{x}, \boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}_i, \mathbf{x}),    (17.4.9)

where the kernel function K(\mathbf{x}_i, \mathbf{x}) \equiv \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}) gives the kernel matrix K_{ij} via K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j). In fact, eqn. (17.4.8) can be derived by substituting the kernel
ridge-regression data model into the least-squares loss function

E(\boldsymbol{\alpha}) = \sum_{i=1}^{N} (y_i - y(\mathbf{x}_i, \boldsymbol{\alpha}))^2 + \lambda\, \boldsymbol{\alpha}^T K \boldsymbol{\alpha},    (17.4.10)

where we see that λ becomes the ridge parameter. On the other hand, if we retain
the parameter w0 and rename it α0 for notational uniformity, then the corresponding
data model, known as a least-squares support-vector machine model, takes the form

y(\mathbf{x}, \boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}_i, \mathbf{x}) + \alpha_0.    (17.4.11)

Optimizing the corresponding loss function leads to the solution

\begin{pmatrix} \alpha_0 \\ \boldsymbol{\alpha} \end{pmatrix} = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \lambda I \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix}.    (17.4.12)
Note that in the kernel ridge and support-vector machine models presented here,
the function Φ(x) has disappeared. As we noted immediately below eqn. (17.4.1), we
do not need to know the form of Φ(x). Rather, we can bypass specification of Φ(x)
and introduce a kernel function K(x_i, x) directly. This replacement is referred to as the “kernel trick”, which allows flexibility in how we specify the kernel function. When leveraging the kernel trick, a common choice is a Gaussian kernel

K(\mathbf{x}_i, \mathbf{x}) = e^{-|\mathbf{x}_i - \mathbf{x}|^2/2\sigma^2}.    (17.4.13)

The Gaussian kernel option illustrates the advantage of the kernel trick, as eqn. (17.4.13) cannot be derived from a dot product of the form Φ(x_i) · Φ(x) in any finite-dimensional feature space. When
a Gaussian kernel is employed within the kernel ridge-regression model, for example,
the method is known as Gaussian kernel ridge regression. In this model, σ becomes
a hyperparameter to be chosen along with the ridge parameter λ. Other kernel func-
tions could be envisioned; however, the Gaussian kernel ridge regression model is both
simple and widely applicable.
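Combining eqns. (17.4.8), (17.4.9), and (17.4.13) gives a complete recipe; the sketch below is a minimal Gaussian kernel ridge regression of our own construction, with σ and λ set by hand rather than by cross validation:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    # K(x_i, x_j) = exp(-|x_i - x_j|^2 / 2 sigma^2), eqn. (17.4.13)
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_train(X, y, sigma, lam):
    # alpha = (K + lam*I)^(-1) y, eqn. (17.4.8)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_new, X_train, alpha, sigma):
    # y(x) = sum_i alpha_i K(x_i, x), eqn. (17.4.9)
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])      # stand-in for real training data
alpha = krr_train(X, y, sigma=0.5, lam=1e-8)
print(krr_predict(X[:3], X, alpha, sigma=0.5), y[:3])
```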
A downside of kernel methods is that very large data sets (large values of N )
require kernel matrices of size N 2 , which can lead to significant memory issues when
the matrix needs to be stored and inverted. This issue raises the question of whether
more compact and flexible data models might be possible, a topic that will be addressed
in the next section where neural network models are discussed.
Before leaving this section, we illustrate how kernel methods might be used in a
statistical mechanical application. Suppose we have performed an enhanced sampling
calculation to generate an n-dimensional free-energy surface A(s1 , ..., sn ) ≡ A(s) using
one of the techniques for generating high-dimensional free-energy surfaces such as
were discussed in Sections 8.10 and 8.11. The d-AFED/TAMD method, for example, generates global sweeps across the free-energy landscape, producing a scattering of N points s_i and corresponding free-energy values A_i ≡ A(s_i). In the early phase of a run,
the N points are sparsely distributed over the surface as illustrated in the left panel
of Fig. 17.2. The point distribution might not be dense enough to reveal features of
the surface to the naked eye. However, the points (si , Ai ) can be used to train one of
the regression models described in this section in order to fill in details of the surface
not easily identifiable by inspection, allowing its features to be discerned with greater
clarity. This is illustrated in the right panel of Fig. 17.2. As the calculation is carried
out further, more regions are sampled, the density of points (si , Ai ) increases, and the
model can be updated with additional training data. This is known as active learning.
Active learning allows the features of the surface predicted by the kernel model to
become sharper as the amount of training data increases. If a kernel ridge regression
model is used, then at any stage in the simulation, the explicit representation of the
free-energy surface A(s) is

A(\mathbf{s}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{s}_i, \mathbf{s}) + \alpha_0,    (17.4.14)

where si , i = 1, ..., N are the N training points generated in the simulation and K(si , s)
is the kernel function expressed in terms of the extended variables s that parameterize
the marginal probability distribution and the free-energy surface. Equation (17.4.14) allows the free energy at any point s to be evaluated. If the kernel employed is a Gaussian kernel, then eqn. (17.4.14) becomes

A(\mathbf{s}) = \sum_{i=1}^{N} \alpha_i e^{-|\mathbf{s}_i - \mathbf{s}|^2/2\sigma^2}.    (17.4.15)

Fig. 17.2 Generation of a free-energy surface via machine learning. (Left panel) Points sampled in a short d-AFED/TAMD simulation of the alanine dipeptide. (Right panel) Free-energy surface generated using a kernel ridge regression model. Dark blue are regions of low free energy; red indicates regions of high free energy (reproduced with permission from Cendagorta et al., J. Phys. Chem. B 124, 3647 (2020), copyright American Chemical Society).

Training is performed by optimizing the loss function with a ridge term, as in eqn.
(17.4.10). We defer discussion of learning curves and training protocols for this type of
application until Section 17.7, where we will compare kernel methods to other types
of machine learning models.

17.5 Neural networks


By far, among the most powerful machine learning models are neural networks. With
their enormous flexibility in allowable architectures and alluring mathematical struc-
ture, neural networks can be constructed for a broad range of learning tasks, including
speech and text recognition, image recognition and classification, customer preference
analysis, risk management, materials and chemical compound design and property
characterization, and computer simulation processing and enhancement. Here, we will
not endeavor to describe all possible network architectures and types, as the list is
simply too long; rather, we aim to introduce general concepts of neural networks and
provide examples of simple architectures that are useful in statistical mechanical ap-
plications.
Neural network models are mathematical constructs, derived from neuronal pat-
terns observed in the human brain, that take in input data and perform a series of
transformations—nonlinear or linear—on the data in order to produce a desired out-
put. One of the earliest examples of a neuron-based computational model is described
in the work of Warren S. McCulloch and Walter Pitts (1943) who proposed a “logical
calculus” of neuronal activity, relating the behavior of certain networks to specific psychological conditions based on assumptions about the inputs.^1 Fourteen years after the
work of McCulloch and Pitts, an important theorem would be established that pro-
vided a mathematical foundation for many widely used modern neural network archi-
tectures. This theorem is the Kolmogorov superposition theorem (Kolmogorov, 1957)

Downloaded from https://academic.oup.com/book/51892/chapter/420754141 by University College London user on 20 March 2024


after the mathematician Andrey N. Kolmogorov (1903-1987). Following a number of
subsequent refinements (Sprecher, 1965; Fridman, 1967; Lorentz, 2005), Kolmogorov’s
theorem asserts that a continuous scalar function y(x1 , ..., xn ) of n variables xi ∈ [0, 1]
can be expressed in terms of 2n+1 monotonically increasing functions φq (x) of a single
variable (φ : [0, 1] → [0, 1]) and a single continuous function g(x) (g : R → R) in the
following way:

y(x_1, ..., x_n) = \sum_{q=1}^{2n+1} g\!\left( \sum_{p=1}^{n} \lambda_p \phi_q(x_p) \right),    (17.5.1)

where the coefficients λ1 , ..., λn > 0. While a detailed proof of Kolmogorov’s theorem
is beyond the scope of this book (a sketch of the proof is outlined in the steps of
Problem 17.17), an existence proof involves establishing that the function g(x) ex-
ists; this can be done by constructing a series of approximants to y(x1 , ..., xn ) based
on the superposition principle and then showing that this series converges exactly to
y(x1 , ..., xn ) (Kahane, 1975). In 1987, R. Hecht-Nielsen connected Kolmogorov’s the-
orem to a type of neural network known as the feed-forward network (Hecht-Nielsen,
1987) to be discussed in this section. However, the feed-forward network construction
we will describe is based on a corollary to the Kolmogorov theorem, introduced in
1991 by Vera Kurkova (1991). Kurkova’s corollary is a more flexible formulation of
Kolmogorov’s theorem that allows the number of inner functions to be greater than
2n + 1 while still guaranteeing that y(x1 , ..., xn ) can be approximated to arbitrary
accuracy. Kurkova’s restatement of Kolmogorov’s theorem is
m n
!
X X
y(x1 , ..., xn ) = gq ψqp (xp ) , (17.5.2)
q=1 p=1

where ψqp (x) are continuous monotonically increasing functions on [0, 1], gq (x) is a
continuous function, and m > 0 is an integer. The structure of a feed-forward neural
network emerges by repeated application of eqn. (17.5.2) as we will now demonstrate.
In order to see how the mathematical structure of a neural network emerges from
eqn. (17.5.2), note that the argument of gq is, itself, a function of x1 , ..., xn for each
value of q. We denote this function as hq (x1 , ..., xn ). Applying Kurkova’s representation
to h_q, we obtain

h_q(x_1, ..., x_n) \equiv \sum_{p=1}^{n} \psi_{qp}(x_p) = \sum_{s=1}^{m} \gamma_{qs}\!\left( \sum_{r=1}^{n} \chi_{sr}(x_r) \right),    (17.5.3)

^1 In fact, in 1943, there was already considerable activity in the biophysics community to establish
a mathematical framework for neuronal networks. The novelty of the work of McCulloch and Pitts,
in addition to involving a collaboration between a neurophysiologist and logician, is its use of logic
and computation as a way to understand neural activity. For a deeper look at the work of McCulloch
and Pitts, see the historical and contextual analysis of G. Piccinini (2004).
where γ_qs(x) is a continuous function analogous to g_q(x). Let us now choose χ_sr(x_r) to be

\chi_{sr}(x_r) = w_{sr}^{(0)} x_r + a_{sr},    (17.5.4)

where w_{sr}^{(0)} and a_{sr} are constants. This function increases monotonically, as required, provided w_{sr}^{(0)} > 0. With this choice, we see that

\sum_{r=1}^{n} \chi_{sr}(x_r) = \sum_{r=1}^{n} w_{sr}^{(0)} x_r + \sum_{r=1}^{n} a_{sr} \equiv \sum_{r=1}^{n} w_{sr}^{(0)} x_r + w_{s0}^{(0)}.    (17.5.5)

We then set

\gamma_{qs}(x) = w_{qs}^{(1)} h(x) + b_{qs},    (17.5.6)

where h(x) is a continuous function, about which we will have more to say later in this section. Substituting eqn. (17.5.6) into eqn. (17.5.3), we obtain

\sum_{s=1}^{m'} \gamma_{qs}(x) = \sum_{s=1}^{m'} w_{qs}^{(1)} h(x) + \sum_{s=1}^{m'} b_{qs} \equiv \sum_{s=1}^{m'} w_{qs}^{(1)} h(x) + w_{q0}^{(1)}.    (17.5.7)

Finally, we set g_q(x) = w_q^{(2)} h(x) + c_q. If we now substitute eqns. (17.5.3) through (17.5.7) into eqn. (17.5.2), we obtain

y(x_1, ..., x_n) = \sum_{q=1}^{m} w_q^{(2)} h\!\left( \sum_{s=1}^{m'} w_{qs}^{(1)} h\!\left( \sum_{r=1}^{n} w_{sr}^{(0)} x_r + w_{s0}^{(0)} \right) + w_{q0}^{(1)} \right) + w^{(2)},    (17.5.8)

where w^{(2)} = \sum_{q=1}^{m} c_q. If we iterate the Kurkova theorem once again, we obtain

y(x_1, ..., x_n) = \sum_{q=1}^{m} w_q^{(3)} h\!\left( \sum_{s=1}^{m'} w_{qs}^{(2)} h\!\left( \sum_{t=1}^{m''} w_{st}^{(1)} h\!\left( \sum_{r=1}^{n} w_{tr}^{(0)} x_r + w_{t0}^{(0)} \right) + w_{s0}^{(1)} \right) + w_{q0}^{(2)} \right) + w^{(3)}.    (17.5.9)

Equations (17.5.8) and (17.5.9) are the mathematical representations of two-hidden-layer and three-hidden-layer feed-forward neural networks, respectively. The term “hidden layer” will be explained shortly. The parameters w_{p0}^{(l)} are known as biases. The operational interpretation of the feed-forward neural network in eqn. (17.5.9) is as follows: We input a set of values for the n variables x_1, ..., x_n. These are then transformed according to the linear function w_{tr}^{(0)} x_r + w_{t0}^{(0)}. The result is fed into the nonlinear function w_{st}^{(1)} h(x) + w_{s0}^{(1)}, and the result of this transformation is fed into the function w_{qs}^{(2)} h(x) + w_{q0}^{(2)}. This result is fed into the function w_q^{(3)} h(x) + w^{(3)} to produce the output y(x_1, ..., x_n). The procedure is illustrated in Fig. 17.3. In this figure, the network
is shown as a mathematical graph in which the set of nodes in the layer at the left represents the inputs x_1, ..., x_n. These nodes are then fully connected to a second set of nodes that represents the first transformation using the function h(x). These nodes are fully connected to the next set, and so on, until the final output node represents the output y(x_1, ..., x_n).

Fig. 17.3 Feed-forward neural network with three hidden layers represented as a mathematical graph. Quantities z_\sigma^{(j)} are defined in eqn. (17.5.10). Each set of parameters \{w^{(K)}\} determines weights of connections between hidden layers (shown with grey shading).

The quantities z_\sigma^{(j)} in Fig. 17.3 are defined as follows:
z_t^{(1)} = h\!\left( \sum_{r=1}^{n} w_{tr}^{(0)} x_r + w_{t0}^{(0)} \right)

z_s^{(2)} = h\!\left( \sum_{t=1}^{m''} w_{st}^{(1)} z_t^{(1)} \right)

z_q^{(3)} = h\!\left( \sum_{s=1}^{m'} w_{qs}^{(2)} z_s^{(2)} \right).    (17.5.10)

The resemblance of the graph in Fig. 17.3 to the connections between neurons in
the brain has led to the use of the neural network moniker for the models in eqns.
(17.5.8) and (17.5.9). Since neurons are activated when presented with input stimuli,
the functions h(x) are known as activation functions. The edges, or connections, in the graph denote the various parameters w_{ij}^{(K)}. These parameters are determined in the
training phase by optimizing a loss function. What happens to the data in the layers
that contain the activation functions h(x) is, in some sense, hidden from the user of
the feed-forward network, as these transformations are performed automatically by
the network. For this reason, these layers are called hidden layers. The numbers m′′ ,
m′ , and m determine the numbers of nodes in the first, second, and third hidden
layers, respectively. The input layer must contain n nodes for each input value, and,
in the example illustrated in Fig. 17.3, the output layer contains just one node that
contains the function y(x1 , ..., xn ). Feed-forward neural networks with a single hidden
layer are generally referred to as single-layer perceptron models. Network structures
with more than one hidden layer are referred to as deep neural networks. Although
deep networks usually outperform single-layer perceptron models, how deep they need
to be depends on the nature of the machine learning problem, and these networks may
range in depth from two to hundreds of hidden layers of different architectures.
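A minimal forward pass through such a network, here with two hidden layers as in eqn. (17.5.8), randomly initialized weights standing in for trained ones, and tanh as the activation, might be sketched as follows:

```python
import numpy as np

def feed_forward(x, weights, biases, h=np.tanh):
    # One pass through a fully connected network: each hidden layer applies a
    # linear map followed by the activation h; the output layer is linear,
    # cf. eqns. (17.5.8) and (17.5.10).
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = h(z @ W + b)
    return z @ weights[-1] + biases[-1]

rng = np.random.default_rng(5)
sizes = [4, 16, 8, 1]  # n = 4 inputs, hidden layers of 16 and 8 nodes, 1 output
weights = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
print(feed_forward(rng.normal(size=4), weights, biases))
```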
Importantly, since the activation functions h(x) are chosen a priori, representations of y(x_1, ..., x_n) like those in eqns. (17.5.8) and (17.5.9) are no longer exact, as would be guaranteed by the Kolmogorov or Kurkova theorems, because an h(x) chosen this way does not necessarily correspond to the correct choice of the functions g_q, γ_qs, ... for a given y(x_1, ..., x_n) ≡ y(x). Therefore, it is necessary to optimize the parameters w_{ij}^{(K)} of the network via the training process. For this reason, the representation of y(x) as a neural network should actually be expressed as y(x, w) to indicate that the representation depends on the full set of parameters w. Given N training points (x_1, y_1), ..., (x_N, y_N), we must optimize a loss function of the form

E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} |y_i - y(\mathbf{x}_i, \mathbf{w})|^2    (17.5.11)

by solving ∂E/∂w = 0 in order to determine the optimal parameter set w. The mathematical complexity of neural networks causes the calculation of derivatives of the network with respect to w to be nontrivial, a point to which we will return later in this section. First, we discuss the choice of the activation function h(x).
Activation functions serve several purposes in neural networks. First, they add non-
linearity into the data transformations that are performed; the corresponding weights
determine how much a particular neuron contributes to the process of learning pat-
terns in the data. Perhaps more importantly, activation functions need to effect a
reshaping of the data between input and output layers such that redundancies in the
data are eliminated, allowing true patterns to emerge. If we consider the input data as
having an intrinsic dimension, which could be lower than the dimension of the space
in which it is specified or embedded, then it has been suggested (Ansuini et al., 2019)
that as a data set propagates through the layers of a neural network, transformations
performed by these layers reduce this intrinsic dimension so that the data manifold
is “compressed” into a space that contains its essential, or non-redundant, features.
Although there is no systematic procedure for selecting an activation function for a
particular problem, sigmoid-shaped functions such as

h(x) = \frac{1}{1 + e^{-x}}, \qquad h(x) = \tanh(x)    (17.5.12)
are widely used. Sigmoid functions such as those in eqn. (17.5.12) effect a significant
compression of the data input to them, leading to a small change in the output for a
large change in the input. The result is a very small gradient at a large number of nodes,
a problem known as the vanishing gradient problem. This problem can be ameliorated by
the choice of a rectified linear unit or ReLU function as the activation function. The
ReLU function is defined as h(x) = max(0, x); a slight variant that does not return 0 is the leaky ReLU, defined as h(x) = bx for x < 0 and h(x) = x for x ≥ 0. Here, 0 < b < 1 is
a small positive number such as 0.01. However, the ReLU and leaky ReLU functions
are not differentiable at x = 0, which could be a problem during training when the
gradient of the network is needed. Therefore, alternatives to the ReLU function that are everywhere differentiable are the softplus function

h(x) = \ln(1 + e^x)    (17.5.13)

and the exponential linear unit, or ELU, function

h(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}    (17.5.14)

where α is a constant. Another differentiable activation function that does not suffer from vanishing gradients is the so-called “swish” activation function, defined by

h(x) = \frac{x}{1 + e^{-x}}.    (17.5.15)
Unlike the other activation functions presented here, the swish function is not mono-
tonically increasing. Although this violates the condition of monotonicity in the Kol-
mogorov and Kurkova theorems, empirical evidence suggests that this condition is
likely sufficient but not necessary. Consequently, it is also possible to take h(x) to
be a Gaussian or a Lorentzian function. There is also no requirement that the same
activation function be used in every layer of a neural network. Since different layers
can serve different purposes as concerns learning patterns from the data, there could
be advantages in using different activation functions in different layers, an issue we ad-
dress in Section 17.5.1 when we discuss classification problems. Note that it is possible
to tune the shape of a chosen activation function by replacing any of the h(x) functions
defined above with h(ax), where the constant a becomes another hyperparameter that
would need to be chosen at the outset, before training. The activation functions we
have introduced here are shown in Fig. 17.4.
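These activation functions translate directly into code; the following definitions are a straightforward sketch (the function names and default constants are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # first of eqns. (17.5.12)

def leaky_relu(x, b=0.01):
    return np.where(x < 0.0, b * x, x)         # ReLU variant with small slope b

def softplus(x):
    return np.log1p(np.exp(x))                 # eqn. (17.5.13)

def elu(x, alpha=1.0):
    return np.where(x > 0.0, x, alpha * np.expm1(x))  # eqn. (17.5.14)

def swish(x):
    return x / (1.0 + np.exp(-x))              # eqn. (17.5.15)
```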
With so many possible choices for activation functions with little guidance on how
to choose an optimal function for a given layer in a neural network, one might ask
whether the data, itself, could dictate the selection for a given learning problem. This
is, indeed, possible, via an approach known as self learning of activation functions. Self
learning of activation functions can be achieved by expanding h(x) in terms of a set of M basis functions as

h(x; \boldsymbol{\beta}) = \sum_{k=1}^{M} \beta_k \phi_k(x)    (17.5.16)

and including the set of M coefficients β_1, ..., β_M ≡ β, together with the parameters w, as parameters to be learned in the training phase. The neural network is then represented as y(x, w, β), and optimization of the loss function requires the two conditions ∂E(w, β)/∂w = 0 and ∂E(w, β)/∂β = 0. Examples of possible choices of φ_k(x) are simple polynomials φ_k(x) = x^{k-1} (Goyal et al., 2019) or sinc functions, φ_k(x) = sin(x − kd)/[π(x − kd)], where d defines a grid spacing for x.
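A polynomial-basis version of eqn. (17.5.16) can be sketched as follows; the basis size and the initial coefficients shown are placeholders, since in practice β is optimized jointly with w during training:

```python
import numpy as np

def h_learned(x, beta):
    # h(x; beta) = sum_k beta_k phi_k(x) with phi_k(x) = x^(k-1), eqn. (17.5.16)
    return sum(b * x**k for k, b in enumerate(beta))

# beta would be optimized alongside the network weights w;
# here it is merely initialized near the Taylor expansion of tanh(x)
beta = np.array([0.0, 1.0, 0.0, -1.0 / 3.0])
```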
From a technical standpoint, the most complex operation when employing neural
networks is the calculation of the derivatives needed to perform the parameter opti-
mization. Complexity arises from the deep nesting of layers between input and output.
Fig. 17.4 Activation functions in eqns. (17.5.12) (tanh(x)) (upper left panel), (17.5.13) (upper right panel), (17.5.14) (lower left panel), and (17.5.15) (lower right panel).

Because of this nesting, long products arising from the application of the chain rule
result when the derivatives are performed. To illustrate the structure of these prod-
ucts, consider a simple nested function g(w) = h(h(h(wx_0))), where x_0 is a constant. We can think of this function as representing a toy network having three hidden layers with one node in each layer. From the chain rule, the derivative of g with respect to w is g'(w) = [h'(h(h(x_0 w)))][h'(h(x_0 w))][h'(x_0 w)]x_0. From the pictorial representation
in Fig. 17.3, if this product is read from left to right, the first term in square brackets
is the derivative of the outermost layer, which produces the output result g(w), the
second term is the derivative of the layer just to the left of the previous layer, the
third term is the next layer to the left, and finally, the last term “x0 ” is the derivative
of the input layer. Thus, we see that the product is a propagation backward through
the layers of the network from the rightmost (output) layer back to the leftmost (in-
put) layer. Hence, the approach for computing derivatives of the nested functions that
comprise a feed-forward network via the chain rule is called back propagation.
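The toy three-layer example above can be checked numerically; the sketch below compares the back-propagated chain-rule product against a central finite difference:

```python
import numpy as np

h = np.tanh
def h_prime(x):
    return 1.0 - np.tanh(x)**2

x0, w = 1.3, 0.7
g = lambda w: h(h(h(w * x0)))

# back propagation: the chain-rule product, read outermost layer first
grad = h_prime(h(h(x0 * w))) * h_prime(h(x0 * w)) * h_prime(x0 * w) * x0

eps = 1e-6
fd = (g(w + eps) - g(w - eps)) / (2.0 * eps)  # central finite difference
print(grad, fd)  # the two values should agree to high precision
```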
Of course, computing the derivative of the loss function in eqn. (17.5.11) with a
complete feed-forward network, although straightforward in principle, requires consid-
erable bookkeeping to account for all of the terms that arise when the chain rule is
applied. Suppose the network y(x, w) has K hidden layers with m^{(i)} nodes in the ith layer. In order to derive the rules of back propagation, let us define a recursive variable

z_r^{(k)}(\mathbf{x}) = \begin{cases} \displaystyle\sum_{a=1}^{m^{(k-1)}} h\!\left( z_a^{(k-1)} \right) w_{ar}^{(k-1)} + w_{0r}^{(k-1)}, & k = 2, ..., K \\[6pt] \displaystyle\sum_{a=1}^{n} x_a w_{ar}^{(0)} + w_{0r}^{(0)}, & k = 1. \end{cases}    (17.5.17)

Here, k indexes the hidden layer and w_{ar}^{(k-1)} is the weight parameter that connects the ath node in layer k − 1 to the rth node in layer k. The output layer of the network can be written compactly as

y(\mathbf{x}, \mathbf{w}) = \sum_{a=1}^{m^{(K)}} h\!\left( z_a^{(K)}(\mathbf{x}) \right) w_a^{(K)} + w_0^{(K)}.    (17.5.18)

With the recursion in eqn. (17.5.17), the derivatives can also be defined recursively as a backward propagation through the layers of the network, from output back to input. Thus, the derivative of eqn. (17.5.11) with respect to w_{rs}^{(k)} can be written recursively as

\frac{\partial E(\mathbf{w})}{\partial w_{rs}^{(k)}} = \sum_{i=1}^{N} \sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)} \frac{\partial z_a^{(k+1)}(\mathbf{x}_i)}{\partial w_{rs}^{(k)}}

= \begin{cases} \displaystyle\sum_{i=1}^{N} \frac{\partial E(\mathbf{w})}{\partial z_s^{(k+1)}(\mathbf{x}_i)}\, h\!\left( z_r^{(k)}(\mathbf{x}_i) \right), & 0 < k \leq K \\[6pt] \displaystyle\sum_{i=1}^{N} \frac{\partial E(\mathbf{w})}{\partial z_s^{(k+1)}(\mathbf{x}_i)}\, x_{i,r}, & k = 0, \end{cases}    (17.5.19)

where x_{i,r} is the rth component of the ith input data point. The derivatives in eqn. (17.5.19) are expressed as

\frac{\partial E(\mathbf{w})}{\partial z^{(K+1)}(\mathbf{x}_i)} \equiv \frac{\partial E(\mathbf{w})}{\partial y_i} = \frac{1}{N}\left( y(\mathbf{x}_i, \mathbf{w}) - y_i \right)

\frac{\partial E(\mathbf{w})}{\partial z_s^{(k)}(\mathbf{x}_i)} = \begin{cases} \displaystyle\sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)}\, w_{sa}^{(k)}\, h'\!\left( z_s^{(k)}(\mathbf{x}_i) \right), & 1 \leq k \leq K \\[6pt] \displaystyle\sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)}\, w_{sa}^{(k)}, & k = 0. \end{cases}    (17.5.20)

As noted previously, the gradient G(w) = ∂E/∂w is needed to optimize the neural network, which requires solving G(w) = 0 for w. However, since a neural network is a highly nonlinear function of w, the optimization cannot be performed analytically as it can when using kernel methods. Therefore, a numerical optimization algorithm is needed. The simplest such algorithm is the steepest descent or gradient descent method. In the gradient descent algorithm, w is regarded as a function of an “evolution” or time-like parameter τ, and we solve the first-order differential equation dw/dτ = −G(w) numerically by discretizing τ into values τ_0, τ_1, τ_2, .... The gradient descent algorithm is then implemented by iterating

\mathbf{w}(\tau_{n+1}) = \mathbf{w}(\tau_n) - \eta\, \mathbf{G}(\mathbf{w}(\tau_n))    (17.5.21)

until the gradient is approximately zero. The parameter η determines the step size
and is known as the learning rate of the algorithm, which typically needs to be small
in much the same way that the time step ∆t in a molecular dynamics calculation
must be for numerical stability. We see from eqn. (17.5.21) that the gradient descent
algorithm requires the full gradient G(w) at each step, and this, in turn, needs the
full set of training points. The gradient descent algorithm can be used as specified in
eqn. (17.5.21) for optimization problems involving small training sets. Because eqn.
(17.5.21) optimizes the full parameter set w, it is known as a batch optimization
approach. Gradient descent methods are often slow to converge because a small value
of η is needed for stable optimization. Efficiency can be improved in batch schemes
by employing more sophisticated methods such as conjugate gradient or quasi-Newton
algorithms. These are standard numerical approaches and will not be discussed here. It
is important to note, however, that because machine learning problems often involve
very large training data sets containing hundreds of thousands or even millions of
points in some situations, batch methods will become inefficient because of the cost
of evaluating the full gradient vector G(w). Fortunately, it is possible to streamline
the optimization problem so that only subsets of the training data are needed for each
step of the iteration (LeCun et al., 1989).
Note that the loss function in eqn. (17.5.11) is expressible as a sum over each observation, i.e.,

E(\mathbf{w}) = \sum_{i=1}^{N} e_i(\mathbf{w})    (17.5.22)

and the gradient can be similarly expressed as

\mathbf{G}(\mathbf{w}) = \sum_{i=1}^{N} \mathbf{g}_i(\mathbf{w}).    (17.5.23)

Therefore, in the most extreme subdivision of the training data into individual ob-
servations, the gradient descent algorithm could be performed on each term ei (w)
according to
w(τn+1 ) = w(τn ) − ηgi (w(τn )). (17.5.24)
The update is iterated by cycling through the data, either sequentially or by choosing
points at random with replacement, until the full data set is exhausted. Such an
algorithm is known as stochastic gradient descent. Stochastic gradient descent must be iterated both through the evolution of w and through the data set until all of the
gradients gi are approximately 0. Of course, it is not necessary to work at such a fine-
grained level. Rather, individual observations can be grouped in subsets, possibly of different sizes, so as to reach a compromise between the cost of evaluating the gradients and the number of iterations needed to reach full convergence. Other approaches make
use of Langevin type equations (see Chapter 15) to search the parameter space for
optimal solutions (Leimkuhler et al., 2019).
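As an illustration, the sketch below applies the stochastic update of eqn. (17.5.24) to the simple linear model of Section 17.3, for which the per-observation gradient g_i(w) is available in closed form (the data are synthetic, and constant factors are absorbed into the learning rate η):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 5.0, 500)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=500)

w = np.zeros(2)   # (w0, w1) for the linear model y = w0 + w1*x
eta = 0.01        # learning rate
for epoch in range(50):
    for i in rng.permutation(len(x)):              # one data point per update
        resid = y[i] - (w[0] + w[1] * x[i])
        g_i = -2.0 * resid * np.array([1.0, x[i]])  # gradient of e_i(w)
        w -= eta * g_i                              # eqn. (17.5.24), 1/N in eta
print(w)  # approaches the least-squares result (1.5, 2.0)
```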
For some applications, data on the gradient of the function y(x) might be more
readily available than on the function itself, in which case training on the gradient
of y(x) would need to be performed. The N training points would be expressed as
(x1 , ∇y(x1 )), ..., (xN , ∇y(xN )) ≡ (x1 , ∇y1 ), ..., (xN , ∇yN ), and although we still seek
a neural network model y(x, w) for the function y(x) itself, the training would be
based on optimizing the gradient-based loss function

E_G(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} |\nabla y_i - \nabla y(\mathbf{x}_i, \mathbf{w})|^2.    (17.5.25)

Note that in eqn. (17.5.25), the analytical gradient of the neural network with respect to
xi is needed, which requires that the mathematical form of the network be everywhere
differentiable. When gradient training is used, the equations for back propagation
change somewhat, as illustrated in Problem 17.8.
In Section 17.4, we highlighted the example of using regression-based machine
learning, specifically kernel-based learning, to fill in missing points on a free-energy
surface generated by an enhanced sampling technique such as d-AFED/TAMD, which
generates a scattering of points over the surface with each full sweep. Neural networks
can be used in much the same way as kernel methods to perform this task (Schneider
et al., 2017; Zhang et al., 2018; Wang et al., 2021). In this case, the representation
of a free-energy surface A(s1 , ..., sn ) ≡ A(s) as a general feed-forward neural network
having K hidden layers would be

A(\mathbf{s}, \mathbf{w}) = \sum_{j_K=1}^{m_K} w_{j_K}^{(K)} h\!\left( \cdots \sum_{j_2=1}^{m_2} w_{j_3 j_2}^{(2)} h\!\left( \sum_{j_1=1}^{m_1} w_{j_2 j_1}^{(1)} h\!\left( \sum_{\alpha=1}^{n} w_{j_1 \alpha}^{(0)} s_\alpha + w_{j_1 0}^{(0)} \right) + w_{j_2 0}^{(1)} \right) + w_{j_3 0}^{(2)} \right) \cdots + w_0^{(K)}.    (17.5.26)

Clearly, a general feed-forward network allows for considerable flexibility in the design
of the architecture. For learning high-dimensional free-energy surfaces from enhanced
sampling, optimal architectures prove to be those where the early layers, those closest
to the input layer, have larger numbers of nodes than layers closer to the output.
That is, tapering the network such that m1 ≥ m2 ≥ m3 · · · ≥ mK tends to lead to
optimal network performance. This notion of tapering is the inspiration for a type of
network known as an autoencoder, in which a tapered network—the encoder—is used
to compress high-dimensional data into a lower dimensional representation or manifold,
and an expanding network—the decoder—is used to reconstruct the original data as


accurately as possible; tapering comes from the encoder phase of the autoencoder
architecture. Interestingly, it has been suggested (Zhang et al., 2018; Wang et al., 2021)
that a neural network representation of A(s), as in eqn. (17.5.26), can be employed as a bias, similar to the metadynamics bias in the unified free-energy dynamics approach
of eqns. (8.11.13). As the network becomes increasingly knowledgeable of the free-
energy surface, it becomes highly effective at helping the system escape free-energy
minima. Later in this chapter (see Section 17.7), we present several concrete examples
of free-energy reconstruction using neural networks.
17.5.1 Neural networks for classification
Until now, our discussion of machine learning methods has focused on regression prob-
lems. Another important use of machine learning techniques is the classification of an
input vector x into one of C classes ck , k = 1, ..., C. As an example, we might wish to
assign x to one of two options, such as “recommend” or “do not recommend” to an
online shopper, where x could denote an item available for purchase. In the statistical
mechanics of a phase transition, x could represent an atom or molecule that needs to
be assigned to one phase or another based on its local environment. These are binary
classification problems for which C = 2. However, it is easy to imagine situations
where an input needs to be assigned to a larger number of classes. In handwritten
digit recognition, an input image vector x would be classified as one of ten possible
digits 0, ..., 9, and in handwritten letter recognition, the number of classes would be
that in the corresponding alphabet. In our phase classification example, if an atomic
or molecular solid had p polymorphs, then atoms or molecules could be classified as
belonging to one of these p solid phases.
Just as machine learning models for regression have an associated error in their
ability to predict new values of a function, classification also has residual errors, and
it is generally not possible for even the best models to achieve 100% classification
accuracy. Consequently, it is worth examining classification probabilities associated
with machine learning models. Using notation from Chapter 7, let us define P (ck |x) as
the conditional probability that a given x will be assigned to class ck by a given machine learning model. We note that P (ck |x) ∈ [0, 1] and $\sum_{k=1}^{C} P(c_k|\mathbf{x}) = 1$. It should be clear that x will be assigned to class ck with greater likelihood if P (ck |x) > P (cj |x) for all j ≠ k. In addition to the conditional probability P (ck |x), we define p(x) as the
probability distribution of input vectors x, P (ck ) as the probability of the occurrence
of class ck in the machine learning model, and p(x|ck ) as the conditional probability
distribution of inputs x given the class ck . The detailed balance condition in eqn.
(7.3.23) allows us to relate these probabilities via
$$P(c_k|\mathbf{x})\,p(\mathbf{x}) = p(\mathbf{x}|c_k)\,P(c_k). \qquad (17.5.27)$$
Rearranging eqn. (17.5.27), we can write P (ck |x) as
$$P(c_k|\mathbf{x}) = \frac{p(\mathbf{x}|c_k)\,P(c_k)}{p(\mathbf{x})}, \qquad (17.5.28)$$
a result known as Bayes’ theorem after the English statistician, philosopher, and Pres-
byterian minister Thomas Bayes (1701–1761). The importance of Bayes’ theorem is
that it determines a key component of neural network architectures for classification
problems. The term P (ck |x) in Bayes’ theorem is known as the posterior distribution;
the conditional probability p(x|ck ) is referred to as the likelihood function, express-
ing a likelihood for obtaining the data x given parameters, i.e., class designations ck ;
finally, P (ck ) is known as the prior distribution, through which any available information about the class parameters ck can be incorporated. Noting that p(x)
is a marginal distribution obtained from the sum rule
$$p(\mathbf{x}) = \sum_{k=1}^{C} p(\mathbf{x}|c_k)\,P(c_k), \qquad (17.5.29)$$

we see that

$$P(c_k|\mathbf{x}) = \frac{p(\mathbf{x}|c_k)\,P(c_k)}{\sum_{k=1}^{C} p(\mathbf{x}|c_k)\,P(c_k)}, \qquad (17.5.30)$$
so that p(x) is a normalization for the product p(x|ck )P (ck ) in the numerator of the
theorem.
In the neural networks we derived for regression problems, we were able to express
the output layer as a linear sum of the activation functions for the penultimate hidden
layer as a consequence of Kurkova’s theorem. For classification problems, however, we
cannot assume this is possible, and consequently we must express the output layer
using a final activation function. Thus, eqn. (17.5.8) for a neural network with two hidden layers would become

$$y_k(\mathbf{x}) = H\!\left(\sum_{q=1}^{m'} w_q^{(2)}\, h\!\left(\sum_{s=1}^{m} w_{qs}^{(1)}\, h\!\left(\sum_{r=1}^{n} w_{sr}^{(0)} x_r + w_{s0}^{(0)}\right) + w_{q0}^{(1)}\right) + w^{(2)}\right), \qquad (17.5.31)$$

with a similar modification for the three-hidden-layer network in eqn. (17.5.9). Here,
H(x) is the outer activation function whose form we need to determine. Note that
yk (x), k = 1, ..., C, which replaced the continuous function y(x) in eqns. (17.5.8) and
(17.5.9), is now interpreted as a numerical label for membership in the kth class. If we
interpret yk (x) as the conditional probability P (ck |x), then it is clear that yk (x) ∈ [0, 1]
with $\sum_{k=1}^{C} y_k(\mathbf{x}) = 1$.
Bayes’ theorem can now be used to determine the form of H(x). The facts that
H(x) determines the output yk and that yk ∈ [0, 1] already restrict the type of activa-
tion function H(x) can be. What Bayes’ theorem accomplishes is a precise specification
of the particular functional form of H(x). Let us begin by considering a binary classifi-
cation with two classes c1 and c2 , and let z = z(x) denote the vector that results from
transforming x through all of the hidden layers of a classification neural network. The
output activation function H(x) determines the conditional probability P (ck |z) that a
particular class ck is assigned to z. We start by specifying a form for the distribution
p(z|ck ), which we might take to be an exponential construct
$$p(\mathbf{z}|c_k) = \exp\left[F(\boldsymbol{\theta}_k) + B(\mathbf{z},\boldsymbol{\phi}) + \boldsymbol{\theta}_k \cdot \mathbf{z}\right], \qquad (17.5.32)$$
where F (θk ) is a function of a set of parameters θk that vary with the class k, φ is
a set of universal parameters, and B(z, φ) is a function of z. This form is sufficiently
general to encompass the most commonly employed distribution functions such as the
Gaussian and Bernoulli (see Section 17.2), binomial, Poisson, and various other distri-
butions. For binary classification, using Bayes’ theorem, we can write the probability
for one of the two classes, say c1 , as



$$P(c_1|\mathbf{z}) = \frac{p(\mathbf{z}|c_1)P(c_1)}{p(\mathbf{z}|c_1)P(c_1) + p(\mathbf{z}|c_2)P(c_2)} = \frac{1}{1 + \dfrac{p(\mathbf{z}|c_2)P(c_2)}{p(\mathbf{z}|c_1)P(c_1)}} = \frac{1}{1 + e^{-a}}, \qquad (17.5.33)$$
where

$$a = \ln\!\left[\frac{p(\mathbf{z}|c_1)P(c_1)}{p(\mathbf{z}|c_2)P(c_2)}\right], \qquad (17.5.34)$$
which is a linear function of z of the form
$$a = \mathbf{w}\cdot\mathbf{z} + w_0, \qquad w_0 = F(\boldsymbol{\theta}_1) - F(\boldsymbol{\theta}_2) + \ln\!\left[\frac{P(c_1)}{P(c_2)}\right], \qquad \mathbf{w} = \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2. \qquad (17.5.35)$$
We only need to determine P (c1 |z), as we can determine P (c2 |z) from P (c2 |z) =
1 − P (c1 |z). When a is written this way, we see that the argument of the activation
function takes the expected form of a weighted linear combination of components of
z with the bias w0 . This analysis tells us that H(x) should be chosen as the logistic
sigmoid function H(x) = 1/(1 + exp(−x)) (eqn (17.5.12)). By the same analysis, if
there are C > 2 classes, then Bayes’ theorem along with eqn. (17.5.32) leads to
$$P(c_k|\mathbf{z}) = \frac{p(\mathbf{z}|c_k)P(c_k)}{\sum_{l=1}^{C} p(\mathbf{z}|c_l)P(c_l)} = \frac{e^{-a_k}}{\sum_{l=1}^{C} e^{-a_l}}, \qquad (17.5.36)$$
where

$$a_k = \mathbf{w}_k \cdot \mathbf{z} + w_{k0} \qquad (17.5.37)$$

with

$$\mathbf{w}_k = \boldsymbol{\theta}_k, \qquad w_{k0} = F(\boldsymbol{\theta}_k) + \ln P(c_k) \qquad (17.5.38)$$
(see Problem 17.12). These conditions require that H(x) be chosen as the softmax or
normalized exponential function
$$H(x;\boldsymbol{\beta}) = \frac{e^{-\beta_k x}}{\sum_{l=1}^{C} e^{-\beta_l x}}, \qquad (17.5.39)$$
which depends on a vector β of parameters.
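As a small illustration (ours, with arbitrary input values), the two output activations identified by this analysis can be written in a few lines; note the negative signs in the exponents, matching eqns. (17.5.33) and (17.5.36).

```python
import numpy as np

def logistic_sigmoid(a):
    # Binary classification output, eqn (17.5.33): P(c1|z) = 1/(1 + e^{-a})
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Multi-class output, eqn (17.5.36): P(ck|z) = e^{-ak} / sum_l e^{-al}
    # (shifting by the minimum of a guards against overflow)
    e = np.exp(-(a - a.min()))
    return e / e.sum()

a = np.array([0.5, 2.0, -1.0])  # activations a_k = w_k . z + w_k0 (illustrative)
p = softmax(a)
print(logistic_sigmoid(0.5), p, p.sum())  # probabilities; softmax sums to 1
```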
We now turn to the determination of the correct loss function for classification. Just
as the Gaussian distribution determined the loss function for regression, the Bernoulli
and categorical distributions in Section 17.2 determine the form of the loss function
for classification. Once again, we first consider the binary classification problem with
two classes c1 and c2 . If y1 (x, w) is the neural network used to determine membership
in class c1 , with y2 (x, w) = 1 − y1 (x, w), then we use eqn. (17.3.8), relating the loss
function to the negative logarithm of a probability distribution, to derive E(w) from
the negative logarithm of the Bernoulli distribution. Referring to the distribution in
eqn. (17.2.6), we interpret x as the N known input classifications $y^{(i)}$ corresponding to N input vectors $\mathbf{x}_i$. Then, since $-\ln P_B(x) = -\left[x\ln\nu + (1-x)\ln(1-\nu)\right]$, the correct loss
function E(w) for binary classification becomes
$$E(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_1^{(i)} \ln y_1(\mathbf{x}_i,\mathbf{w}) + y_2^{(i)} \ln y_2(\mathbf{x}_i,\mathbf{w})\right] = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_1^{(i)} \ln y_1(\mathbf{x}_i,\mathbf{w}) + \left(1 - y_1^{(i)}\right) \ln\left(1 - y_1(\mathbf{x}_i,\mathbf{w})\right)\right], \qquad (17.5.40)$$

which is known as a cross-entropy loss function. Training then proceeds by minimizing E(w) with respect to the network parameters w, just as in the regression problem.
When there are more than just two classes (C > 2), we use the categorical distribution in eqn. (17.2.7) to derive an appropriate loss function E(w). Taking the negative log of this distribution and substituting in the input training data $(\mathbf{x}_i, y_k^{(i)})$ and the neural networks $y_k(\mathbf{x},\mathbf{w})$, we obtain the multi-state cross-entropy loss function

$$E(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{C} y_k^{(i)} \ln y_k(\mathbf{x}_i,\mathbf{w}) \qquad (17.5.41)$$

with $y_k^{(i)} \in [0,1]$ and $\sum_{k=1}^{C} y_k^{(i)} = 1$.
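The following is a minimal sketch of the two cross-entropy losses, eqns. (17.5.40) and (17.5.41), evaluated on illustrative labels and predicted class probabilities (the numerical data are assumptions for demonstration only).

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Eqn (17.5.40): y_true[i] in {0, 1}; y_pred[i] = y1(x_i, w) in (0, 1)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

def cross_entropy(Y_true, Y_pred):
    # Eqn (17.5.41): rows are training points, columns the C classes;
    # each row of Y_true is a one-hot class label, each row of Y_pred
    # a vector of predicted class probabilities summing to 1
    return -np.mean(np.sum(Y_true * np.log(Y_pred), axis=1))

Y_true = np.array([[1, 0, 0], [0, 0, 1]])            # illustrative labels
Y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.8, 0.4])))
print(cross_entropy(Y_true, Y_pred))
```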

17.5.2 Convolution layers in neural networks


We have discussed only one type of data transformation in neural networks thus far,
specifically, activation of simple linear combinations in feed-forward networks. How-
ever, neural networks can incorporate other types of transformations in their effort to
enhance or blur certain input features. For example, image classification via neural
networks can be achieved by drawing out features capable of distinguishing one image
from another. This kind of feature enhancement (or its opposite, obfuscation) can be
achieved by means of filters applied to an input data stream. Filters are generally
applied by convolving them with the data using a discrete version of a convolution
operation, such as we encountered in Chapter 15 in our discussion of the generalized
Langevin equation.
Suppose the input to a neural network is not a vector but a matrix x of dimension
nr × nc . Such a matrix could hold information about the pixels in an image, for
example. Let F be a matrix of dimension Nr × Nc , which we refer to as a filter. Then,
we define the two-dimensional (2D) convolution of x with F as a matrix X of size $(n_r - N_r + 1) \times (n_c - N_c + 1)$ given by

$$X_{IJ} \equiv (\mathbf{x} \circ \mathbf{F})_{IJ} = \sum_{i=0}^{N_r-1} \sum_{j=0}^{N_c-1} x_{I+i,\,J+j}\, F_{ij}. \qquad (17.5.42)$$

Convolutions can be similarly defined for 1D arrays, 3D arrays, and, generally, tensors
of any dimension, depending on the number of indices needed to describe the input
data. As an example of a convolution, consider an input matrix x and filter F specified
as

$$\mathbf{x} = \begin{pmatrix} 1 & 2 & 3 & 1 \\ 4 & 5 & 6 & 1 \\ 7 & 8 & 9 & 1 \end{pmatrix}, \qquad \mathbf{F} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},$$

then

$$\mathbf{x} \circ \mathbf{F} = \begin{pmatrix} 12 & 16 & 11 \\ 24 & 28 & 17 \end{pmatrix}.$$
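This example can be checked with a few lines of code. The following minimal NumPy sketch implements eqn. (17.5.42) directly (a naive double loop, chosen for clarity rather than speed) and reproduces the matrix x ◦ F given above.

```python
import numpy as np

def conv2d(x, F):
    # Discrete 2D convolution of eqn (17.5.42): the output has shape
    # (nr - Nr + 1) x (nc - Nc + 1)
    nr, nc = x.shape
    Nr, Nc = F.shape
    X = np.zeros((nr - Nr + 1, nc - Nc + 1))
    for I in range(X.shape[0]):
        for J in range(X.shape[1]):
            X[I, J] = np.sum(x[I:I + Nr, J:J + Nc] * F)
    return X

x = np.array([[1, 2, 3, 1],
              [4, 5, 6, 1],
              [7, 8, 9, 1]])
F = np.ones((2, 2))
print(conv2d(x, F))  # [[12. 16. 11.], [24. 28. 17.]]
```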
In neural networks, convolution layers perform operations such as that in eqn. (17.5.42),
and it is the filter matrix that must be learned via training. In addition, since convolu-
tions are linear transformations, it is common to finalize the transformation of the layer
by running XIJ through an activation function to give a new matrix ZIJ = h(XIJ +b),
where b is a bias. Finally, it is also possible to train multiple filters in a convolution
layer by adding an additional index to the filter. Multiple filters are used to extract
multiple features from the input data. For the 2D convolution in eqn. (17.5.42), mul-
tiple filters would be included by modifying the definition to read

$$X_{IJk} = \sum_{i=0}^{N_r-1} \sum_{j=0}^{N_c-1} x_{I+i,\,J+j}\, F_{ijk}, \qquad (17.5.43)$$

where k = 1, ..., Nf indexes the number of desired filters.


When convolution layers are used in networks that are also partially feed-forward
networks, it is necessary to feed the output of a convolution layer into an activation
layer whose input is a one-dimensional vector. This is done via an intermediate flat-
tening layer in which a multidimensional layer ZIJ is converted into a one-dimensional
array shaped for input into an activation layer.

17.6 Weighted neighbor methods


In this section, we will briefly describe two additional machine learning models that
fall into a class of techniques known as weighted neighbor methods, specifically K-nearest neighbors and random forests. The idea behind weighted neighbor methods
is to predict unknown values of a function y at a point x using only the nearest
neighbors of x within the training set. This class of methods derives from the notion
of regression or decision trees, depending on the desired task, in which we begin by
partitioning the n-dimensional space Rn into M regions Ri , chosen such that the
Fig. 17.5 Illustration of a splitting scheme for two-dimensional data (a) with an associated
decision tree graph (b).

function y(x1 , ..., xn ) ≡ y(x) has a constant or nearly constant value wi within each
region. This subdivision allows us to create a model for y(x) given by
$$y(\mathbf{x},\mathbf{w}) = \sum_{i=1}^{M} w_i\, h(\mathbf{x} \in R_i), \qquad (17.6.1)$$

where h(x ∈ R) is an indicator function that is 1 if x ∈ R and 0 otherwise, and w1 , ..., wM ≡ w are parameters representing the value of y in each region. If we minimize a least-squares loss function in eqn. (17.3.2) over the training data (x1 , y1 ), ..., (xN , yN ), the result is

$$w_k = \frac{\frac{1}{N}\sum_{i=1}^{N} y_i\, h(\mathbf{x}_i \in R_k)}{\frac{1}{N}\sum_{i=1}^{N} h(\mathbf{x}_i \in R_k)}, \qquad (17.6.2)$$

which is just the average value of the target function in Rk over the training set.
Unfortunately, determining the regions Rk is nontrivial, as obtaining an optimized
splitting of R increases in complexity with the size of the training set and the dimension
of x. Figure 17.5(a) illustrates the region-splitting procedure for two-dimensional data.
The definitions of the regions Ri can be gleaned from the figure; for example, R1 is
the region for which x1 < t1 and x2 < t2 , R2 is the region for which x1 < t1 and
x2 > t2 , and so forth. Note that the splitting procedure can be represented in a graph
structure known as a decision tree. We will return to this decision tree graph shortly
when we discuss ensemble methods. First, we introduce an approximate, yet tractable,
protocol for approaching this splitting problem.
Equation (17.6.2) allows us to construct a model for y(x) in a local neighborhood
of the point x. The approximation takes the form
$$y(\mathbf{x}) \approx \sum_{i=1}^{K} W(\mathbf{x},\mathbf{x}_i)\, y_i, \qquad (17.6.3)$$
where W (x, xi ) is a non-negative weight for the ith training point within a cluster
of K neighbors of the point x.² Each machine learning model of this form will have
a different set of associated weights. The following choice for W (x, xi ) defines the
K-nearest neighbors model:

$$W(\mathbf{x},\mathbf{x}_i) = \begin{cases} \dfrac{1}{K d_i} & i = 1, \ldots, K \\[4pt] 0 & \text{otherwise}. \end{cases} \qquad (17.6.4)$$

Here, di is an additional parameter whose value depends on the specific K-nearest neighbor algorithm employed. For example, we would set di = 1 if we could assume
that all neighboring points carried equal weight in determining y(x) in the neighbor-
hood of x. This parameter could also be based on a distance metric that weights closer
neighbors more heavily than more distant neighbors. In practice, K is a hyperparame-
ter chosen to result in the lowest error in a cross-validation procedure. The K-nearest
neighbors model results in an approximation yK−NN (x) for y(x) that takes the form
$$y_{K\text{-NN}}(\mathbf{x}) = \sum_{i=1}^{K} \frac{y_i}{K d_i}. \qquad (17.6.5)$$

Clearly, values of y(x) can only be accurately predicted in regions for which the model
has been trained.
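A minimal sketch of the K-nearest neighbors prediction of eqn. (17.6.5) with di = 1 follows; the two-dimensional toy target function is an illustrative assumption, not data from the text.

```python
import numpy as np

def knn_predict(x, X_train, y_train, K):
    # Eqn (17.6.5) with d_i = 1: average of y over the K nearest neighbors
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return np.mean(y_train[nearest])

rng = np.random.default_rng(2)
X_train = rng.uniform(-2, 2, size=(200, 2))
y_train = np.sin(X_train[:, 0]) * np.cos(X_train[:, 1])  # toy target

x = np.array([0.5, -0.3])
print(knn_predict(x, X_train, y_train, K=5),
      np.sin(0.5) * np.cos(-0.3))  # prediction vs. exact value
```

In practice, K would be chosen by cross-validation, as noted above.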
Figure 17.5(b) depicts the decision tree corresponding to the splitting of R2 in
Fig. 17.5(a). In general, splitting is performed according to a set of rules defined by
logical functions that partition data as evenly as possible into the different regions Ri .
As Fig. 17.5(b) illustrates, a decision tree consists of a root node and internal nodes
set by the splitting rules, which ultimately dictate the path from the root node to a
set of terminal or decision nodes, also referred to as leaves. In regression problems,
splitting rules are determined by minimization of the relative errors (or variances) at
each split until the tree grows to a pre-specified cutoff or until the data can no longer
be split. The weight associated with each point in the training set is given by
 1
 i = 1, ..., K
W̃ (x, xi ) = K (17.6.6)

0 otherwise.

The distinction between W (x, xi ) and W̃ (x, xi ) will become clear by the end of
this paragraph. Note that W̃ (x, xi ) also represents the weight of each point for a
single decision tree. Here, K is the number of points within the same leaf at the target
point x. Unlike K-nearest neighbors, the number of neighbors within each leaf can
vary among leaves in a tree. The difficulty with the use of a single decision tree in
² Weighted neighbor methods define a directed, weighted graph structure on a data set, in which
nodes are represented by the data points {(xi , yi )} and edges are directed from point i to point j,
assuming that j is among the K neighbors of i. The weight of each edge is given by the weight function
W connecting points i and j.
applying eqn. (17.6.1) for regression is that one tree has a tendency to overfit the
training data. An approach by which this overfitting problem can be avoided when
using decision trees is to divide the training data into random subsets, an approach
known as bootstrap aggregation or bagging, and to create a decision tree for each
subset. Each decision tree is created with a different set of splitting rules, and the
collection of all decision trees forms an ensemble known as a random forest. In order
to determine the weights for a random forest, we average the weights in eqn. (17.6.6) over all trees in the forest (ensemble). Thus, if there are m trees in the forest, then $W(\mathbf{x},\mathbf{x}_i) = (1/m)\sum_{j=1}^{m} \widetilde{W}_j(\mathbf{x},\mathbf{x}_i)$, where $\widetilde{W}_j$ is the weight given in eqn. (17.6.6) for
the jth decision tree. These average weights are then used to produce a model yRF (x)
using eqn. (17.6.3). A random forest generates an ensemble of decision trees, which
reduces the overall variance in predictions of y(x) without overfitting the training data.

17.7 Demonstrating machine learning in free-energy simulations


In this section, we examine the performance of the different machine learning models
in this chapter for their ability to represent free-energy surfaces accurately, generate
observables from these surfaces, and drive rare-event simulations. Our first demon-
stration involves the regression of high-dimensional free-energy surfaces produced by
enhanced sampling calculations. The second demonstration illustrates the use of clas-
sification neural networks to design collective variables for use in enhanced-sampling
simulations, applied here to a solid-solid phase transition in a bulk metallic system.

17.7.1 Regression of free-energy surfaces


As noted at the end of Section 17.4, regression of high-dimensional free-energy sur-
faces is relatively straightforward within methods such as d-AFED/TAMD or replica-
exchange Monte Carlo (or replica-exchange molecular dynamics). These methods scat-
ter points over the free-energy surface that can be used to train a machine learning
model. The longer the simulation runs, the more points will be generated and the
better the training will be. Since the goal of leveraging a machine learning model is
to represent the function A(s1 , ..., sn ) ≡ A(s), the input needed to train the model is
a set of M values of s, sk , k = 1, ..., M , and the corresponding free-energy values A(sk ). Once
trained, the model provides a compact, smooth, closed-form representation A(s) of
the free-energy surface that can be evaluated at any desired point s. This representa-
tion can then be analyzed for its landmark points (minima and saddle points) (Chen
et al., 2015), fed back into a simulation as a bias to accelerate it further (Zhang et al.,
2018; Wang et al., 2021), and employed to generate observable properties of interest
via evaluation of integrals over Boltzmann factors exp(−βA(s)) (Cendagorta et al.,
2020). In particular, if a(r) is a coordinate-dependent function, then we can obtain a
canonical average ⟨a⟩ of a(r) from an enhanced-sampling simulation as follows:

$$\langle a \rangle = \frac{\int d\mathbf{s}\, \langle a \rangle_{\mathbf{r}}(\mathbf{s})\, e^{-\beta A(\mathbf{s})}}{\int d\mathbf{s}\, e^{-\beta A(\mathbf{s})}}, \qquad (17.7.1)$$

where $\langle a \rangle_{\mathbf{r}}$ is given by

$$\langle a \rangle_{\mathbf{r}} = \frac{\int d\mathbf{r}\, a(\mathbf{r})\, e^{-\beta U(\mathbf{r})} \prod_{\alpha=1}^{n} \delta\!\left(f_\alpha(\mathbf{r}) - s_\alpha\right)}{\int d\mathbf{r}\, e^{-\beta U(\mathbf{r})} \prod_{\alpha=1}^{n} \delta\!\left(f_\alpha(\mathbf{r}) - s_\alpha\right)} \qquad (17.7.2)$$

(cf. eqn. (8.6.6)). Here, fα (r) is a set of collective variables and U (r) is the potential
energy of the system. Although enhanced-sampling simulations deliver sampled values
of A(s), in order to apply eqns. (17.7.1) and (17.7.2), we need an analytical represen-
tation of the free-energy surface, which is what the machine learning model provides.
Thus, in applying eqn. (17.7.1), the function A(s) is replaced by the machine-learned
model, which we denote as AML (s), and averages are computed from the integral us-
ing either molecular dynamics or a Monte Carlo algorithm. If an observable of interest
is a function only of collective variables, then eqn. (17.7.2) is not needed. Although
we will only consider classical free-energy surfaces in this section, machine learning
models can be applied equally well to quantum free-energy surfaces generated from
path-integral simulations (see Section 12.7).
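The following is a minimal sketch of one such Monte Carlo evaluation: a Metropolis average of an observable over exp(−βA(s)), as in eqn. (17.7.1), with an illustrative double-well function standing in for a machine-learned surface AML(s). Because the observable here depends only on the collective variables, eqn. (17.7.2) is not needed; all functions and parameter values are assumptions for demonstration.

```python
import numpy as np

def metropolis_average(A_ML, obs, s0, beta, n_steps, step_size, rng):
    """Estimate <obs> = Int ds obs(s) e^{-beta A(s)} / Int ds e^{-beta A(s)}
    by Metropolis sampling on a (machine-learned) surface A_ML(s)."""
    s = np.array(s0, dtype=float)
    A = A_ML(s)
    total = 0.0
    for _ in range(n_steps):
        s_trial = s + rng.uniform(-step_size, step_size, size=s.shape)
        A_trial = A_ML(s_trial)
        dA = A_trial - A
        # Metropolis acceptance rule on the free-energy change
        if dA <= 0.0 or rng.random() < np.exp(-beta * dA):
            s, A = s_trial, A_trial
        total += obs(s)
    return total / n_steps

# Illustrative stand-ins: a double-well model surface and a CV observable
A_model = lambda s: (s[0]**2 - 1.0)**2 + 0.5 * s[1]**2
obs = lambda s: s[0]**2
rng = np.random.default_rng(4)
print(metropolis_average(A_model, obs, [0.0, 0.0], beta=1.0,
                         n_steps=100000, step_size=0.5, rng=rng))
```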

Fig. 17.6 Four molecules employed to test machine learning regression of free-energy land-
scapes: (a) alanine dipeptide, (b) alanine tripeptide, (c) met-enkephalin oligopeptide (amino
acid sequence Tyr-Gly-Gly-Phe-Met), (d) zwitterionic alanine pentapeptide (reproduced with
permission from Cendagorta et al., J. Phys. Chem. B 124, 3647 (2020), copyright American
Chemical Society).

For this comparative study, we will focus on a set of small peptides, commonly used
as benchmark cases, and the corresponding conformational free-energy landscapes as a
function of their backbone Ramachandran dihedral (φ, ψ) angles, which are used as col-
lective variables. The four systems are: the alanine dipeptide, the alanine tripeptide,
and the oligopeptide met-enkaphalin (amino acid sequence Tyr-Gly-Gly-Phe-Met),
which are studied in vacuum; and the alanine pentapeptide, which is studied in zwit-
terionic form in aqueous solution. These molecules are pictured in Fig. 17.6. For the
alanine dipeptide, there are just two Ramachandran angles; for the alanine tripeptide,
the number of angles used is four. For met-enkephalin, ten angles are needed, and
for the solvated alanine pentapeptide, the inner three residues and corresponding six
Ramachandran angles are selected, as these are the same as have been used in experi-
mental studies (Feng et al., 2016). The gas-phase simulations are performed using the
CHARMM22 force field (MacKerell et al., 1998) while the solvated alanine pentapep-
tide is simulated using the OPLS-AA force field (Jorgensen et al., 1996). All of the
training data for the machine learning models are generated from d-AFED/TAMD
simulations (see Section 8.10). The simulation parameters are set as follows: for the di- and tripeptides, Ts = 1500 K and µα = 168.0 amu·Å²/rad²; for met-enkephalin, Ts = 400 K and µα = 2.8 amu·Å²/rad²; for the alanine pentapeptide, Ts = 1000 K and µα = 168.0 amu·Å²/rad². In all simulations, the harmonic coupling between the collective variables and the coarse-grained variables is 2.78×10³ kcal/mol·rad².
Training and test data. For the alanine dipeptide, 9×10⁴ values of (s1 , s2 ) and A(s1 , s2 ) are generated as a training set. These values cover the entire free-energy surface, which is partitioned into a 300 by 300 grid of evenly spaced bins. For the alanine tripeptide, a training set of 2×10⁵ points on the four-dimensional free-energy surface is randomly selected from the d-AFED/TAMD trajectory, and free-energy values A(s1 , s2 , s3 , s4 ) are obtained from a Gaussian fit to the histogram corresponding to these points (Chen et al., 2012). For met-enkephalin, the 1081 minima and 1431 index-1 saddle points and their corresponding free-energy values identified by Chen et al. (2015) on the ten-dimensional surface are employed as the training set. Finally, for the aqueous alanine pentapeptide, a 1 ms simulation is performed and 10⁶ free-energy
points are randomly selected from a Gaussian fit to the histogram as the training
set. For all systems, an additional 50,000 points are randomly generated in separate
d-AFED/TAMD runs, and these points are used as a test set. The training and test
sets are carefully checked to ensure there is no overlap between these two sets. For
complete consistency, all machine learning models used in the comparison are trained
on the same training sets for each system and tested using the same test set.
Machine learning model details. Each of the machine learning models employed in this comparative study involves hyperparameters that must be chosen. For Gaussian-
based kernel methods, there are two hyperparameters, specifically, the Gaussian width
σ and the regularization or ridge parameter λ. The random forest and K-nearest neigh-
bor models have a number of hyperparameters. For K-nearest neighbors, if we choose
di = 1 (as we do here), then the only hyperparameter that needs to be determined
is the value of K. For the random-forest model, the key parameters are the number
of trees in the ensemble or forest and the number of input variables to place into
each random subset. For feed-forward neural networks, the hyperparameters are the
number of hidden layers and the number of nodes in each layer. In this comparison, a
ten-fold cross validation is used to perform the hyperparameter search. The resulting
neural network for the alanine di- and tripeptides consists of two hidden layers with
20 nodes in each layer for the dipeptide and 40 nodes in each layer for the tripeptide.
For met-enkephalin, three hidden layers are employed with 100, 50, and 50 nodes in
each of the three layers, respectively. For the aqueous alanine pentapeptide, three hid-
den layers are employed with 60, 30, and 30 nodes in each layer, respectively. For the
kernel, K-nearest neighbors and random-forest models, the resulting hyperparameters
depend on the size of the training set, and as the learning curves are generated as a
function of the training set size, the number of parameters determined is quite large
and can be found in tables in the supporting information document accompanying the
work of Cendagorta et al. (2020); the interested reader is encouraged to study these
tables. It is worth noting that for larger training set sizes used with random forests,
the number of trees is more than 200 for the di- and tripeptides and approximately 50
for met-enkephalin and the alanine pentapeptide. The learning curves are performed
with respect to the test set using the L2 error formula

$$L_2 = \sqrt{\frac{1}{N_{\text{test}}} \sum_{j=1}^{N_{\text{test}}} \left(A_{\text{ML}}(\mathbf{s}_j) - A_{\text{test}}(\mathbf{s}_j)\right)^2}, \qquad (17.7.3)$$

where Ntest is the number of points in the test set (here, 50,000) and Atest (sj ) is the
(known) free energy at the jth test point.
Generating observables. In order to test the ability of the trained machine learning models to generate observables from eqn. (17.7.1), we select different types of observables for each system. For the alanine tripeptide, we study the following “observable”:

$$O(\{\phi,\psi\}) = \sqrt{\frac{1}{2n} \sum_{i=1}^{n} \left[\left(\phi_i - \phi_i^{(\text{min})}\right)^2 + \left(\psi_i - \psi_i^{(\text{min})}\right)^2\right]}, \qquad (17.7.4)$$

where n is the number of Ramachandran angle pairs used to generate the free-energy surface (n = 2 for the tripeptide). The angles $\phi_i^{(\text{min})}$ and $\psi_i^{(\text{min})}$ are the angles at the
global minimum of the free-energy surface. Although this is not a physical observ-
able, it is a sensitive test of the ability of the machine learning model to generate
an observable that depends on the full set of collective variables. For met-enkephalin,
we compute the average of the HN Hα nuclear magnetic resonance (NMR) J-couplings,
which characterizes the indirect interaction between the nuclear spins of the Cα hydro-
gen and the amide hydrogen. These J-couplings can be computed using the Karplus
equation (Karplus, 1959)

$$J(\phi) = A\cos^2(\phi - \phi_0) + B\cos(\phi - \phi_1) + C, \qquad (17.7.5)$$

where φ is the Ramachandran angle, A = 7.09 Hz, B = 1.42 Hz, C = 1.55 Hz, and the
constant angles φ0 and φ1 are both 60◦ . J(φ) is computed for each amino acid residue
in the oligopeptide. Finally, for the alanine pentapeptide, we focus on the propensities
for different secondary structural motifs, specifically, α helix, β sheet, and the left-
handed polyproline II helix (ppII). These are defined by simple indicators that are
functions of φ and ψ and define specific regions in the φ-ψ plane for each alanine
residue. The definitions are as follows:

$$\begin{aligned}
\alpha:\;& -160^\circ < \phi < -20^\circ \ \text{and}\ -120^\circ < \psi < 50^\circ\\
\beta:\;& -180^\circ < \phi < -90^\circ \ \text{and}\ 50^\circ < \psi < 180^\circ, \ \text{or}\\
& -180^\circ < \phi < -90^\circ \ \text{and}\ -180^\circ < \psi < -120^\circ, \ \text{or}\\
& 160^\circ < \phi < 180^\circ \ \text{and}\ 110^\circ < \psi < 180^\circ\\
\text{ppII}:\;& -90^\circ < \phi < -20^\circ \ \text{and}\ 50^\circ < \psi < 180^\circ, \ \text{or}\\
& -90^\circ < \phi < -20^\circ \ \text{and}\ -180^\circ < \psi < 120^\circ.
\end{aligned} \qquad (17.7.6)$$
For the OPLS-AA force field used here, the populations of α, β, and ppII are 14%,
48%, and 37%, respectively. The remaining 1% of structures are characterized simply
as random coil.
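The regions of eqn. (17.7.6) translate directly into a simple indicator function. The following sketch transcribes the ranges verbatim from the equation; because the printed regions overlap slightly, the checks are applied in the order α, β, ppII. The uniform sampling in the usage example is purely illustrative (real populations would be Boltzmann-weighted).

```python
import numpy as np

def motif(phi, psi):
    # Classify a residue's (phi, psi) pair into a secondary-structure
    # motif using the regions of eqn (17.7.6); angles in degrees.
    # Order matters: alpha is tested first, then beta, then ppII.
    if -160 < phi < -20 and -120 < psi < 50:
        return "alpha"
    if (-180 < phi < -90 and 50 < psi < 180) or \
       (-180 < phi < -90 and -180 < psi < -120) or \
       (160 < phi < 180 and 110 < psi < 180):
        return "beta"
    if (-90 < phi < -20 and 50 < psi < 180) or \
       (-90 < phi < -20 and -180 < psi < 120):
        return "ppII"
    return "coil"

rng = np.random.default_rng(5)
angles = rng.uniform(-180, 180, size=(10000, 2))  # illustrative sampling
labels = [motif(phi, psi) for phi, psi in angles]
for m in ("alpha", "beta", "ppII", "coil"):
    print(m, labels.count(m) / len(labels))
```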
Results. All data for Figures 17.7, 17.8, and 17.9 are taken from Cendagorta et al. (2020). Figure 17.7 depicts the learning curves for the four systems. For the sim-
ple alanine dipeptide system, we see that all models learn at approximately the same
rate as a function of the number of training points. However, the kernel methods—
kernel ridge regression and the least-squares support-vector machine—achieve the low-
est overall error. The errors associated with the neural network and weighted–neighbor
methods are roughly the same. In all cases, however, the error is only a fraction of
a kcal/mol. Turning to the alanine tripeptide, for which the dimensionality of the
free-energy surface is increased to four, we see that the neural network outperforms
the weighted neighbor methods, while the kernel methods still achieve the lowest er-
ror overall. The weighted neighbor methods learn at a slower rate and are not able
to achieve as low an error as the other models. For the ten-dimensional restricted
free-energy surface of met-enkephalin, we see that the two kernel methods now per-
form rather differently, with the support-vector machine outperforming kernel ridge
regression, while the neural network performs similarly to the support-vector machine.
Note, also, that the learning rates are different for the different methods, and that the
weighted-neighbor methods reach errors just below 1 kcal/mol, although it appears
that with more training data, they might be able to achieve a lower error. Finally,
for the alanine pentapeptide, for which we have generated the full six-dimensional
free-energy surface, the neural network, kernel ridge regression, and support-vector
machine models reach roughly the same error, which is lower than 1 kcal/mol; how-
ever, the neural network reaches this error with fewer training points. Once again, we
see that the weighted neighbor methods underperform compared to the kernel and
neural network models.
Our comparison shows that dimensionality and sample set influence the perfor-
mance of different machine learning models in capturing the full free-energy surface.
By contrast, accurate calculation of observables depends on how well the low free-
energy regions are described by the machine learning model due to the exp(−βA(s))
factor in the integrand of eqn. (17.7.1). As we will now demonstrate, this weighting
changes the comparison and highlights which methods exhibit the best performance
in representing these regions.
The protocol for calculating the ensemble averages in eqn. (17.7.1) is to replace A(s)
with the representation AML (s) of the free-energy surface associated with a particular
machine learning model and then perform the averages using a Metropolis Monte
Carlo algorithm (see Section 7.3.3) in which trial moves of s are generated from a
uniform distribution and the change in AML (s) is used to determine whether the trial
move is accepted. In Fig. 17.8, we show the convergence of the observables in eqns.
(17.7.4) for the alanine tripeptide, (17.7.5) for met-enkephalin, and (17.7.6) for the
alanine-pentapeptide as a function of the training set size. For the RMSD observable
in eqn. (17.7.4), we see that all models perform well for large training sets, with the
neural network and least-squares support-vector machine outperforming the others in
Fig. 17.7 Free-energy surface learning curves for the four peptide molecules studied.

a manner consistent with the learning curve in Fig. 17.7. Interestingly, for smaller
training set sizes, we see that the random-forest method performs marginally better
than the neural network and the least-squares support-vector machine, suggesting that
the random forest is learning the low free-energy regions with fewer data points than
the kernel and neural network models. For the conformational populations of the
alanine pentapeptide, we see that the neural network generates the most accurate
averages across the three populations, consistent with the learning curve in Fig. 17.7.
For met-enkephalin and the calculation of the average J-couplings for each of the
five amino acid residues, we see from Fig. 17.8 that the neural network exhibits the
lowest overall error in generating converged averages, outperforming the least-square
support-vector machine. This is somewhat surprising given that the latter achieved
better overall accuracy of the global free-energy surface, as reflected in the learning
curve. More surprisingly, perhaps, are the accurate averages generated by the random
forest for both small and large training set sizes for all residues except Phe.
For insight into the performance of the various methods for met-enkephalin, we
show, in Fig. 17.9, a scatter plot of 5000 randomly selected points from the test set on
models trained using 10⁵ training points. The plot shows the difference between free-
Fig. 17.8 Dependence of observables on training set size: root-mean square deviation of
Ramachandran angles from the global minimum (cf. eqn. (17.7.4)) for the alanine tripep-
tide, NMR J-couplings for each of the five residues in met-enkephalin (cf. eqn. (17.7.5)),
and conformational populations (α, β, PPII) of the alanine pentapeptide (cf. eqn. (17.7.6)).
Horizontal lines indicate the fully converged values of each observable. The line types and
symbols correspond to the legend given in Fig. 17.7.

energy values predicted by the least-squares support-vector machine, neural network,
and random-forest machine learning models and direct simulation of the free-energy
surface. The figure shows that the neural network and support-vector machine mod-
els predict the low free-energy regions accurately but then systematically underpre-
dict larger free-energy values. This is particularly true for the support-vector machine
model. By contrast, the random-forest model has roughly the same error across the
full range of free-energy values shown in the figure; however, the differences are sym-
metric about zero, suggesting that the accuracy of the random-forest model may be
due to a fortuitous error cancellation. The conclusion of this comparative study is that the most accurate predictions of the free-energy surfaces and observables using eqn. (17.7.1) are obtained with the feed-forward neural network model, with the qualification that other
Fig. 17.9 Difference in free energy between the least-squares support-vector machine, neural
network, and random-forest machine learning models and direct simulation of the free-energy
surface for met-enkephalin.

models perform well in specific cases.

17.7.2 Collective variables from classification neural networks

In this section, we discuss leveraging classification neural networks for the design of
collective variables that can describe rare-event processes for use in the enhanced
sampling methods described in Chapters 7 and 8. We will apply classification to study
a solid-solid phase transition in a bulk atomic crystal.
Suppose the crystal has p solid phases. If a sample of this bulk material contains
some amount of thermal disorder, we seek to employ machine learning to classify this
sample as one of the p phases. Beyond this, if there are regions in the sample where
multiple phases coexist, the machine learning model should be able to identify all of
these phases in such a region. A machine learning model trained to perform these
classification tasks could be used to design a collective variable capable of driving
transitions between different phases. In order to devise such a classification neural
network, we require suitable descriptors as input functions. These descriptors need
to represent the local environment of each atom in the system, which will depend
on distances between an atom and its nearest neighbors as well as angles between
the vectors joining the atom to its neighbors. Descriptor functions that capture these
features should satisfy a number of criteria: first, they must be invariant with respect
to rotations, translations, and exchanges between atoms of the same chemical element;
second, they need to be smooth, differentiable functions of the atomic coordinates; and
third, they should be short-ranged in order to capture only nearest neighbors. Ideally,
we prefer to work with a small number of relatively simple functions.
One possible choice of descriptors is a set of functions known as symmetry func-
tions, originally introduced by Behler and Parrinello (2007) (see, also, Behler (2011)),
for the development of neural network potential energy functions (see Appendix C)
and suggested by Geiger and Dellago (2013) as useful descriptors of atomic environ-
ments. These functions, being evaluated within a spherical region around an atom,
start with a simple cutoff function fc (r). Some choices of this function are a Fermi
function
$$f_c(r) = \begin{cases} \dfrac{1}{1 + e^{\alpha_c (r - r_c + \varepsilon_c)}} & r < r_c \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (17.7.7)$$



or a shifted, scaled cosine function (Rogal et al., 2019):

$$f_c(r) = \begin{cases} 1 & r \le r_{\min} \\[4pt] \dfrac{1}{2}\left[\cos\!\left(\pi\,\dfrac{r - r_{\min}}{r_c - r_{\min}}\right) + 1\right] & r_{\min} < r \le r_c \\[8pt] 0 & r > r_c \end{cases} \qquad (17.7.8)$$

Here, rc is a cutoff radius that defines the spherical region within which neighbors
are considered. From an appropriately defined cutoff function, we build up a series
of symmetry functions that capture different features of the environment around an
atom at position ri . If the system has N atoms at positions r1 , ..., rN ≡ r, then in
terms of these positions, the simplest such function is
$$G_1^{(i)}(\mathbf{r}) = \sum_{j \ne i} f_c(|\mathbf{r}_{ij}|), \qquad (17.7.9)$$

where rij = ri − rj . The sum in eqn. (17.7.9) is, in principle, taken over all j; however,
because of the short-range nature of fc (r), the sum only involves neighbors of atom
i within the cutoff radius rc . Moreover, because G1 is defined purely in terms of
fc (r), these neighbors are given roughly equal weight. Other symmetry functions give
different weights to these neighbors. For example, the function
$$G_2^{(i)}(\mathbf{r}) = \sum_{j \ne i} e^{-\eta\left(|\mathbf{r}_{ij}| - R_s\right)^2} f_c(|\mathbf{r}_{ij}|) \qquad (17.7.10)$$

weights neighbors whose distances from ri are close to the distance parameter Rs
more than those whose distances are significantly different from Rs . The inverse width
parameter η determines how quickly this weight decays to zero. Different choices of Rs
and η define different G2 symmetry function choices. In practice, we might use a range
of values of Rs and η to capture different features of the local environment. Another
such symmetry function employs a cosine weighting, i.e.,
$$G_3^{(i)}(\mathbf{r}) = \sum_{j \ne i} \cos\left(\kappa |\mathbf{r}_{ij}|\right) f_c(|\mathbf{r}_{ij}|). \qquad (17.7.11)$$

Here, the parameter κ modulates the periodicity of the cosine function such that neighbors whose distances from atom i satisfy κ|rij | = nπ will have large positive or negative weights, depending on the value of n, and zero weight if κ|rij | = (2n − 1)π/2. The
symmetry functions G1 , G2 , and G3 depend only on the distances between neighbors
of atom i. Other symmetry functions incorporate angular dependence between the
vectors rij and rik . An example of such a symmetry function is
$$G_4^{(i)}(\mathbf{r}) = \frac{1}{2^{\xi}}\sum_{j\ne i}\sum_{k\ne i} \left(1 + \lambda\cos\theta_{ijk}\right)^{\xi} e^{-\eta\left(|\mathbf{r}_{ij}|^2 + |\mathbf{r}_{ik}|^2 + |\mathbf{r}_{jk}|^2\right)} f_c(|\mathbf{r}_{ij}|)\, f_c(|\mathbf{r}_{ik}|)\, f_c(|\mathbf{r}_{jk}|) \qquad (17.7.12)$$

and the closely related

$$G_5^{(i)}(\mathbf{r}) = \frac{1}{2^{\xi}}\sum_{j\ne i}\sum_{k\ne i} \left(1 + \lambda\cos\theta_{ijk}\right)^{\xi} e^{-\eta\left(|\mathbf{r}_{ij}|^2 + |\mathbf{r}_{ik}|^2\right)} f_c(|\mathbf{r}_{ij}|)\, f_c(|\mathbf{r}_{ik}|), \qquad (17.7.13)$$

which does not restrict the distance between neighbors j and k of i. In eqns. (17.7.12)
and (17.7.13), the parameter λ is either 1 or −1, while the parameter ξ modulates
the angular resolution. Apart from the symmetry functions, other useful descriptors
capable of capturing angular information in the local environment are the Steinhardt
bond-order parameters (Steinhardt et al., 1983). These are defined as
$$G_{q_l}^{(i)}(\mathbf{r}) = \sqrt{\frac{4\pi}{2l+1}\sum_{m=-l}^{l} \left|q_{lm}^{(i)}(\mathbf{r})\right|^2} \qquad (17.7.14)$$

where

$$q_{lm}^{(i)}(\mathbf{r}) = \frac{\sum_{j\ne i} Y_{lm}(\theta_{ij},\phi_{ij})\, f_c(|\mathbf{r}_{ij}|)}{\sum_{j\ne i} f_c(|\mathbf{r}_{ij}|)}. \qquad (17.7.15)$$

Here, Ylm (θ, φ) is a spherical harmonic, and θij and φij are the polar and azimuthal
angles of the vector rij . The combination of symmetry functions and spherical har-
monics reduces the number of descriptors needed to describe local environments in an
atomic crystal.
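A minimal sketch of two of these descriptors follows: the cutoff of eqn. (17.7.8) and the radial function G2 of eqn. (17.7.10). The parameter values mirror those quoted below for molybdenum, but the perturbed cubic fragment and the absence of periodic boundary conditions are simplifying, illustrative assumptions.

```python
import numpy as np

def f_cut(r, r_min=3.8, r_c=4.0):
    # Shifted, scaled cosine cutoff of eqn (17.7.8); distances in angstroms
    if r <= r_min:
        return 1.0
    if r > r_c:
        return 0.0
    return 0.5 * (np.cos(np.pi * (r - r_min) / (r_c - r_min)) + 1.0)

def G2(i, R, eta, Rs):
    # Radial symmetry function of eqn (17.7.10) for atom i
    total = 0.0
    for j in range(len(R)):
        if j == i:
            continue
        r = np.linalg.norm(R[i] - R[j])
        total += np.exp(-eta * (r - Rs)**2) * f_cut(r)
    return total

# Illustrative configuration: a perturbed 3x3x3 cubic fragment, 3.1 A spacing
rng = np.random.default_rng(6)
grid = 3.1 * np.array([[x, y, z] for x in range(3)
                       for y in range(3) for z in range(3)], dtype=float)
R = grid + 0.05 * rng.normal(size=grid.shape)
print(G2(13, R, eta=20.0, Rs=2.8))  # central atom of the fragment
```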
We will now apply these descriptors to the specific case of the transformation
between the metastable A15 phase in solid molybdenum to the stable BCC (body-
centered cubic) phase. A snapshot showing the coexistence between these two phases
in a single simulation cell is shown in Fig. 17.10. The transition occurs via the migration
of the interface between the two phases to the right, which transforms each layer of
the A15 phase (on the right) to the BCC phase (on the left).
In order to classify both pure and mixed phases in solid molybdenum, we only
need eleven radial symmetry functions of the G2 and G3 type and three Steinhardt parameters corresponding to l = 6, 7, and 8. The parameters Rs range between 2.8 Å and 6.0 Å with η fixed at 20 Å⁻² for G2, while κ ranges from 3.5 Å⁻¹ to 7.0 Å⁻¹ for
G3 . The cutoff function in eqn. (17.7.8) is employed with rmin = 3.8 Å, and rc = 4.0
Å. With these, we can distinguish four solid phases, A15, BCC, FCC (face-centered
cubic), and HCP (hexagonal close-packed) that can exist in the system, as well as
disordered or “liquid-like” phases and mixtures of these various phases.
We now proceed to describe the training procedure of the classification neural
network. Because the descriptors take in raw atomic coordinates and transform them
into translationally and rotationally invariant local environment variables, the only
input data we need for training are system configurations, which can be generated
Fig. 17.10 Snapshot of a simulation cell of a system of molybdenum atoms with an interface
between stable BCC phase (left region) and metastable A15 phase (right region) (reproduced
with permission from Rogal et al., Phys. Rev. Lett. 123, 254701 (2019), copyright American
Physical Society).

from molecular dynamics or Monte Carlo calculations. Training proceeds by using


molecular dynamics simulations at temperatures of 300 K, 450 K, 600 K, 1000 K, 3000
K, and 4000 K to generate 176,000 pure atomic environments as well as interfacial
configurations. An additional set of 125,000 atomic environments is generated as a
test set. The learning curve in Fig. 17.11 shows that bulk environments are correctly
classified with better than 99% accuracy, while interfacial environments containing
A15, BCC, and liquid phases are correctly classified with better than 93% accuracy.
The output of the neural network is a five-component classification probability vector
qi (r) for each atom.
Once complete training of the network is achieved, a collective variable capable of
driving the transition is constructed. We start by defining a global classifier vector
$$\mathbf{Q}(\mathbf{r}) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{q}_i(\mathbf{r}). \qquad (17.7.16)$$

The global classifier serves as a reporter on the extent to which the entire system is in
one phase or the other. In Fig. 17.10, the value of $Q^{\text{bcc}} = 0.20$ while $Q^{\text{A15}} = 0.52$.
In order to drive the transition, the collective variable we employ is expressed
as a path in the vector space in which the global classifier Q exists. The reason for
working in this space is that it avoids the need to choose physical configurations be-
tween the A15 and BCC phases in order to construct a physical path (Branduardi
et al., 2007). Such a physical path could be biased by preconceived notions of how
the transition should occur. Working in classifier space allows the neural network to
decide what configurations, including pure and mixed phases, exist during the transi-
tion, which is likely to be quite complex and involve multiple ordered and disordered
local environments. Thus, let Q1 , ..., QP be a set of P nodal points along a puta-
tive path between the phases. This putative path exists in the two-dimensional space
(Qbcc , QA15 ) constructed from the BCC and A15 components of the Q vector. In
this particular example, we start with an interface already present in the system such
Fig. 17.11 Learning curves for each of the pure phases (BCC, A15, FCC, HCP, liq-
uid/disordered phases) and mixed phases/interfaces.

that $\mathbf{Q}_1 = (Q_1^{\text{bcc}}, Q_1^{\text{A15}}) = (0.2, 0.5)$ and create a path with P = 10 points, where $\mathbf{Q}_{10} = (Q_{10}^{\text{bcc}}, Q_{10}^{\text{A15}}) = (0.65, 0.05)$. The path collective variable, inspired by the phys-
ical path form of Branduardi et al. (2007), allows for fluctuations around these nodal
points and takes the form
$$f(\mathbf{Q}(\mathbf{r})) = \frac{1}{P-1}\, \frac{\sum_{k=1}^{P} (k-1) \exp\left[-\lambda |\mathbf{Q}(\mathbf{r}) - \mathbf{Q}_k|^2\right]}{\sum_{k=1}^{P} \exp\left[-\lambda |\mathbf{Q}(\mathbf{r}) - \mathbf{Q}_k|^2\right]}, \qquad (17.7.17)$$
where λ is a parameter roughly determined by the inverse square distance between
consecutive nodal points. An illustration of this path collective variable is given in
Fig. 17.12(a). We see that f (Q(r)) increases smoothly from 0 to 1 as the fraction of
the BCC phase increases and that of the A15 phase decreases. Use of eqn. (17.7.17)
alone can lead to large fluctuations around the nodal points, and, therefore, it is often
useful to add a second collective variable that restricts these excursions. In classifier
space, this collective variable takes the form
$$z(\mathbf{Q}(\mathbf{r})) = -\frac{1}{\lambda} \ln\!\left( \sum_{k=1}^{P} \exp\left[-\lambda |\mathbf{Q}(\mathbf{r}) - \mathbf{Q}_k|^2\right] \right). \qquad (17.7.18)$$

This collective variable is illustrated in Fig. 17.12(b). The function in eqn. (17.7.18) can
be used either as an additional collective variable in an enhanced-sampling simulation
or to construct a restraining potential (Cuendet et al., 2018; Rogal et al., 2019)
$$U_r(\mathbf{r}) = \frac{1}{2} \kappa_z \left(z(\mathbf{Q}(\mathbf{r}))\right)^2, \qquad (17.7.19)$$
Fig. 17.12 (Left) Path collective variable f (Q(r)) in the Qbcc -QA15 plane. Red points indi-
cate the path along the nodal points while lines are foliations of the path. (Right) Same for
z(Q(r)) (reproduced with permission from Rogal et al., Phys. Rev. Lett. 123, 254701 (2019),
copyright American Physical Society).

where the parameter κz determines the tightness of the restraint. If such a restraint is
used in a simulation, then the bias must be removed, which requires reweighting with
a factor exp(βUr (r)), in order to obtain final results.
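To make the path-variable construction concrete, here is a minimal NumPy sketch of f(Q(r)) from eqn. (17.7.17), z(Q(r)) from eqn. (17.7.18), and the restraint of eqn. (17.7.19). The nodal endpoints match the values quoted above, while the evenly spaced interpolation between them and the value of κz are illustrative assumptions; λ is set from the inverse square distance between consecutive nodes, as suggested in the text.

```python
import numpy as np

def path_cvs(Q, nodes, lam):
    # Path collective variables of eqns (17.7.17) and (17.7.18) in
    # global-classifier space; nodes has shape (P, dim)
    P = len(nodes)
    d2 = np.sum((nodes - Q)**2, axis=1)      # |Q - Q_k|^2 for all k
    w = np.exp(-lam * d2)
    f = np.sum(np.arange(P) * w) / ((P - 1) * np.sum(w))
    z = -np.log(np.sum(w)) / lam
    return f, z

# Ten nodal points between (Q_bcc, Q_A15) = (0.2, 0.5) and (0.65, 0.05)
nodes = np.linspace([0.20, 0.50], [0.65, 0.05], 10)
lam = 1.0 / np.sum((nodes[1] - nodes[0])**2)  # ~ inverse square node spacing

f, z = path_cvs(np.array([0.4, 0.3]), nodes, lam)
kappa_z = 100.0                  # illustrative restraint strength
U_r = 0.5 * kappa_z * z**2       # restraining potential, eqn (17.7.19)
print(f, z, U_r)
```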
The path collective variable in eqn. (17.7.17) can now be used in an enhanced
sampling simulation, such as d-AFED/TAMD or metadynamics, in order to drive
the structural phase transition via migration of an interface created between the two
phases. Note that using a neural network in this way incorporates machine learning
directly into the enhanced sampling procedure rather than merely using it as a post-
processing tool. In particular, the neural network performs a classification “on the fly”
as each configuration generated by the simulation is fed into it, and it immediately
outputs a classification for each atom in the system at that instant in the simulation
from which Q(r) and f (Q(r)) can be determined. For use in molecular dynamics
simulations, it is critical that the neural network employed be everywhere smoothly
differentiable, which restricts the choice of activation functions.
If an enhanced sampling simulation is performed in a canonical ensemble at 300
K with fixed volume, as is shown in Fig. 17.13 for (a) d-AFED/TAMD and (b) meta-
dynamics simulations (Rogal et al., 2019), then the metastability of the A15 phase
is not revealed. The reason is that the two phases have different lattice parameters,
and one simulation box size cannot accommodate both phases. Nevertheless, there is
a clear free-energy barrier revealed in both profiles, which is approximately 0.5 eV
≈ 48.2 kJ/mol and which agrees with previous independent computational studies
performed on the same system (Duncan et al., 2016). This free-energy barrier corre-
sponds to the thermodynamic loss of converting each layer in the A15 crystal to the
BCC structure under the constant-volume conditions. If we switch from the canonical
to the isothermal-isobaric ensemble at 1 atm, then, as is revealed in Fig. 17.13(c), the
Fig. 17.13 (a) Free-energy profile at 300 K and constant volume from d-AFED/TAMD
using f (Q(r)) as the collective variable. (b) Same for metadynamics. (c) Free-energy profile
at 300 K and 1 atm pressure showing the metastability of the A15 phase relative to BCC
(reproduced with permission from Rogal et al., Phys. Rev. Lett. 123, 254701 (2019), copyright
American Physical Society).

free-energy profile acquires a negative slope, indicating that it is thermodynamically


“downhill” from the A15 to the BCC phase, which manifestly reveals the metastability
of the A15 phase relative to the BCC phase. However, the 0.5 eV barrier for each layer
transition is retained, giving the profile a kind of staircase-like character.³

17.7.3 Reaction coordinates from regression neural networks


In Section 8.12, we alluded to the challenge of determining a proper reaction coordi-
nate to describe a particular process. We introduced the committor distribution pB (r)
between two stable states A and B in configuration space and the relationship between
the committor and a putative reaction coordinate q(r) capable of fully characterizing
the transition from A to B. Problem 8.13 asked the reader to rationalize a model of
the dependence of pB (r) on q(r) (Peters et al., 2007):

³ The staircase-like profile is sometimes referred to as a “Galton staircase” after Sir Francis Galton
(1822–1911), inventor of the Galton board, which is used to demonstrate normal distributions. The
Galton staircase can be modeled by the functional form A(s) = A0 cos(αs) − λs. As shown by
Liu and Tuckerman (2000), this type of function is a particularly challenging one for deterministic
thermostatting techniques.
$$\pi_B(q(\mathbf{r})) = \frac{1 + \tanh(q(\mathbf{r}))}{2}. \qquad (17.7.20)$$
We also presented an algorithm for computing a committor distribution, which, though
useful, does not provide a closed-form expression for pB (r), either directly or indirectly,
because a closed-form for q(r) that can be inserted into eqn. (17.7.20) is not specified.
By employing machine learning, we can provide such a compact representation of both
q(r) and the committor distribution. There are various ways this can be achieved, but
we will assume, here, that a reaction coordinate q(r) can be obtained from a large set
of possibly redundant collective variables, the space of which has been well sampled
by an enhanced sampling algorithm and conformational basins on the corresponding
free-energy hypersurface have been identified. What we seek is a reaction coordinate
capable of describing the mechanism whereby the system transitions from one free-
energy basin to another on this surface. Let the set of n collective variables be denoted
fα (r), α = 1, ..., n. We then propose a machine learning model that expresses q(r) in
terms of f1 (r), ..., fn (r). For example, as was suggested by Mori et al. (2020), a possible
model for q(r) is a simple linear combination of the collective variables
$$q(\mathbf{r},\mathbf{w}) = w_0 + \sum_{\alpha=1}^{n} w_\alpha f_\alpha(\mathbf{r}). \qquad (17.7.21)$$

Apart from this simple linear model, any of the machine learning models, such as a
feed-forward neural network or a kernel model, could be employed to represent q(r).
Once a model is chosen, we train it by optimizing the parameters w. Mori et
al. suggest the use of a binary classification scheme to achieve the required training.
Within such a scheme, the machine learning model q(r, w) is substituted into eqn.
(17.7.20) and the loss function (cf. eqn. (17.5.40))
$$E(\mathbf{w}) = -\sum_{k=1}^{M} p_B^*(\mathbf{r}^{(k)}) \ln \pi_B\!\left(q(\mathbf{r}^{(k)},\mathbf{w})\right) - \sum_{k=1}^{M} \left(1 - p_B^*(\mathbf{r}^{(k)})\right) \ln\!\left(1 - \pi_B\!\left(q(\mathbf{r}^{(k)},\mathbf{w})\right)\right), \qquad (17.7.22)$$

where M is the number of training points, is used to perform the optimization. In eqn.
(17.7.22), we interpret r(k) as a point in configuration space from which a trajectory is
initiated that can either end in state A or state B. If the trajectory ends in A, then the
target committor value p∗B (r(k) ) = 0, and if it ends in B, then p∗B (r(k) ) = 1. One way
to generate the trajectories needed to obtain the training data is to use the techniques
in Section 7.7, such as aimless shooting. It is also helpful to add a regularization term
into the loss function in eqn. (17.7.22) in order to avoid overfitting.
An alternative scheme for predicting reaction coordinates is via regression learning
with a least-squares loss function. Suppose we have generated enough trajectories from
a point r(k) to obtain a converged committor distribution value pB (r(k) ) corresponding
to r(k) . Then, we can obtain a value q (k) for the reaction coordinate corresponding to
r(k) by inverting eqn. (17.7.20), as
$$q^{(k)} = \tanh^{-1}\left(2\, p_B(\mathbf{r}^{(k)}) - 1\right). \qquad (17.7.23)$$

Given a model q(r, w) for the reaction coordinate, we then train the machine learning
model using these committor values via the least-squares loss function

$$E(\mathbf{w}) = \frac{1}{2M} \sum_{k=1}^{M} \left(q(\mathbf{r}^{(k)},\mathbf{w}) - q^{(k)}\right)^2 + \mathrm{Reg}(\mathbf{w}). \qquad (17.7.24)$$

Here, Reg(w) is a regularization term, which could be a standard ridge form, a lasso
form, or an elastic net form. An optimal choice of the regularization term will depend
on the choice of the machine learning model q(r, w). Finally, one could represent the
committor distribution pB (r) as a linear combination of collective variables as
$$p_B(\mathbf{r},\boldsymbol{\omega}) = \omega_0 + \sum_{\alpha=1}^{n} \omega_\alpha f_\alpha(\mathbf{r}), \qquad (17.7.25)$$

and train the coefficients ω on target committor values p∗B (r(k) ) using either a cross-
entropy or least-squares loss function.
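A minimal sketch of the binary-classification training described above follows: the linear model of eqn. (17.7.21) is fit to target committor values by gradient descent on a (mean) cross-entropy loss of the form of eqn. (17.7.22), using the identity πB(q) = (1 + tanh q)/2 = σ(2q). The synthetic data, learning rate, and omission of a regularization term are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_rc(F, p_target, n_steps=5000, lr=0.1):
    """Fit q(r, w) = w0 + sum_a w_a f_a(r) (eqn (17.7.21)) to target
    committor values by minimizing a mean cross-entropy loss
    (cf. eqn (17.7.22)) with pi_B(q) = (1 + tanh q)/2 = sigmoid(2q)."""
    M, n = F.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(n_steps):
        q = w0 + F @ w
        resid = sigmoid(2.0 * q) - p_target   # dE/dq = 2 (pi_B - p*) / M
        w0 -= lr * 2.0 * np.mean(resid)
        w -= lr * 2.0 * (F.T @ resid) / M
    return w0, w

# Synthetic example: committor governed by the first of three CVs
rng = np.random.default_rng(7)
F = rng.normal(size=(400, 3))                 # collective variables f_a(r_k)
p_target = (rng.random(400) < sigmoid(2.0 * F[:, 0])).astype(float)
w0, w = train_rc(F, p_target)
print(w0, w)  # the first weight should dominate
```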

17.8 Clustering algorithms


The machine learning models and procedures we have discussed thus far constitute
examples of supervised learning strategies. Data in supervised learning approaches
are termed “labeled data”, meaning that they are specified as inputs xi and corre-
sponding outputs yi . In this final section, we introduce unsupervised learning, which
is designed to handle unlabeled data. In particular, we present two examples from a
class of algorithms known as clustering methods, which classify unlabeled data into
categories, or “clusters”, based on similarities between data points. Clustering reveals
features that separate collections of data points from each other. Various strategies
exist for performing clustering and assigning points to different groups based on par-
ticular attributes in the data. In this section, we describe two such methods: The
popular K-means approach (MacQueen, 1967; Kanungo et al., 2002) and the density-
peaks scheme.
K-means clustering. An important feature of any cluster is its “centroid”, which
is the location of the average or mean over all of the data points in the cluster. The
term K-means refers to a strategy whereby n data points x1 , ..., xn are sorted into K
clusters with centroids µ1 , ..., µK . The K-means clustering algorithm proceeds first
by choosing the centroids µ1 , ..., µK randomly and then assigning the n data points
x1 , ..., xn to each of the K clusters based on their distance to each centroid. Thus, for
each data point xi , we compute the K distances diγ = |xi − µγ |, where γ = 1, ..., K
indexes the K centroids and corresponding clusters. The value of γ for which diγ is
minimal determines the cluster membership of xi :
Cluster index of x_i = \arg\min_{\gamma} d_{i\gamma}.    (17.8.1)

Note that different clusters will have different numbers mγ of points, such that m1 +
m2 + · · · + mK = n. When all n points have been assigned to clusters in this way,
a new set of cluster centroids µ1 , ..., µK is generated by computing the average over
the data points in each cluster. Once the new centroids have been determined, the
n data points are reassigned to clusters by computing new distances diγ and using
eqn. (17.8.1) to determine new cluster membership. Most likely, the assignments will
change, and, consequently, the numbers m1 , ..., mK of data points in each cluster will
change as well. The procedure is repeated as many times as needed until the cluster
assignments no longer change. The K-means procedure is illustrated in Fig. 17.14 for
a two-dimensional data set with two clusters.

Fig. 17.14 Example of K-means clustering on two clusters (circles and triangles) in two
dimensions. Following the arrows, an initial choice of centroids, shown as the “+” and “×”
symbols for the two clusters, respectively, is followed by three reassignments leading to a
converged assignment of cluster membership. Dashed lines divide the two clusters. In each
iteration, some circles and triangles interchange according to new cluster membership
assignments.
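
A bare-bones implementation of this iteration, offered as a sketch rather than a production algorithm, follows the description above exactly: random initial centroids, the assignment rule of eqn. (17.8.1), and mean-position updates repeated until the memberships stop changing. The seeding strategy is an arbitrary choice, not part of the method itself.

import numpy as np

def k_means(x, K, max_iter=100, seed=0):
    # x: (n, dim) array of data points; K: number of clusters.
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Distances d_ig = |x_i - mu_g| and the assignment of eqn. (17.8.1).
        dist = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                            # memberships have converged
        labels = new_labels
        for g in range(K):                   # new centroid = mean of members
            if np.any(labels == g):
                mu[g] = x[labels == g].mean(axis=0)
    return labels, mu
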
K-means clustering is both efficient and straightforward to implement. However,
the parameter K must be determined a priori. This can be done by running the
algorithm for different values of K, and for each K, calculating the average distance of
all data points to their assigned cluster centroids. When plotted as a function of K, this
average distance should fall off suddenly at some value of K; this value is the optimal
one. Additionally, K-means clustering tends to assign outlier data points inaccurately
because such points are difficult to assign to clusters and can pull centroid positions
away from regions of high data population. As a final note, the K-means method can
lead to inaccurate assignment of any data point that sits at a boundary between two
clusters; this problem can be treated using fuzzy clustering approaches (Gustafson
and Kessel, 1978; Bezdek et al., 1984; Corsini et al., 2005; Tzanov et al., 2014), and
although we will not describe these methods in detail here, the basic idea of fuzzy
clustering is to assign points to multiple clusters with a weight for membership in each
cluster. Points at boundary regions are often assigned to each cluster with weights
close to 0.5 in each.
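
The criterion for choosing K described above can be scripted directly on top of the k_means sketch given earlier: run the algorithm over a range of K values, record the average point-to-centroid distance, and look for the K at which the curve drops sharply and then flattens. The synthetic three-blob data below are placeholders.

import numpy as np

def mean_centroid_distance(x, labels, mu):
    # Average distance of every point to its assigned centroid.
    return np.mean(np.linalg.norm(x - mu[labels], axis=1))

# Three synthetic Gaussian blobs as placeholder data; k_means is the
# sketch from the previous listing.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0.0, 0.0), (2.0, 2.0), (4.0, 0.0))])
curve = [mean_centroid_distance(x, *k_means(x, K)) for K in range(1, 9)]
# 'curve' should drop sharply up to K = 3 and flatten beyond it.
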
Density-based clustering. As the preceding discussion makes clear, K-means clus-
tering assigns cluster membership based on a distance of data points to the cluster
centroids, with an iterative algorithm to refine the cluster centroid locations. An al-
ternative approach considers the density of data points near a cluster centroid by
assuming that the density peaks at cluster centroids, which are surrounded by neigh-
boring data points with a lower local density, and that the centroids are far from other
points with high local density. The approach we will discuss here was introduced by
Rodriguez and Laio (2014) and employs two quantities for each data point xi : its local
density ρi and its distance di from points of higher density. Let rij = |xi − xj | be the
distance between data points xi and xj . Then, the local density ρi is defined as
\rho_i = \sum_{j=1}^{n} \theta(r_c - r_{ij}),    (17.8.2)

where θ(z) is the Heaviside step function. The quantity rc is a cutoff distance. Accord-
ing to eqn. (17.8.2), ρi simply counts the number of points that are within a distance
rc of xi . In order to obtain di , we use the definition

d_i = \min_{\{j \;\mathrm{s.t.}\; \rho_j > \rho_i\}} r_{ij}.    (17.8.3)


That is, we compute the smallest distance between the point xi and any other point
of higher density. If xi is already the point of highest density, then we can compute
di = maxj rij . Cluster centroids are now recognized as points xi for which di is anoma-
lously large. The algorithm is illustrated in Fig. 17.15. Once the cluster centroids are
determined, cluster membership of each remaining point is determined by assigning
it to the same cluster as its nearest neighbor of higher density. Thus, the assignment
begins with the cluster centroids, themselves, as these are points of maximum local
density.

Fig. 17.15 Illustration of density-based clustering. Two clusters are shown in the left panel
as circles and squares. Values of d and ρ are plotted in the right panel. The two centroids
emerge as points of anomalously high d values (see right panel).
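
The two decision quantities and the final assignment sweep can likewise be sketched compactly, again in Python/NumPy. The cutoff r_c and the number of centroids are inputs that, in practice, would be read off a decision graph such as the right panel of Fig. 17.15; for simplicity, the centroids here are simply the points with the largest d.

import numpy as np

def density_peaks(x, r_c, n_centers):
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)   # r_ij
    # Local density of eqn. (17.8.2); the self term j = i is dropped,
    # which shifts every rho_i by the same constant and changes nothing.
    rho = (r < r_c).sum(axis=1) - 1
    n = len(x)
    order = np.argsort(-rho)           # point indices by decreasing density
    d = np.empty(n)
    d[order[0]] = r[order[0]].max()    # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        # Eqn. (17.8.3): distance to the nearest point of higher density.
        d[i] = r[i, order[:rank]].min()
    # Centroids: the n_centers points with anomalously large d (the densest
    # point always carries the largest d, so it is among them).
    labels = np.full(n, -1)
    labels[np.argsort(-d)[:n_centers]] = np.arange(n_centers)
    # Descend through the density ranking; each unassigned point inherits
    # the cluster of its nearest neighbor of higher density.
    for rank in range(1, n):
        i = order[rank]
        if labels[i] == -1:
            higher = order[:rank]
            labels[i] = labels[higher[np.argmin(r[i, higher])]]
    return labels, rho, d
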

17.9 Intrinsic dimension of a data manifold


We conclude this chapter by showing how the neighbors of a data point can be used to
estimate the dimensionality of a data manifold. Data sets used in learning protocols
often lie in a low-dimensional space embedded in a higher dimensional one, and the
lower dimension of the data set might not be obvious under such embedding. An
example of such an embedding might be a Swiss roll embedded in a rectangular box
(see Fig. 17.16). This lower dimensionality is known as the intrinsic dimension of
the data manifold or space that contains the data points. Although the embedding
dimension of the box in Fig. 17.16 is three, the intrinsic dimension of the Swiss-roll
data set is two.

Fig. 17.16 A Swiss roll data manifold embedded in a rectangular box.

It was suggested in Section 17.5 that the change in intrinsic dimension through the
layers of a neural network can be an important metric for evaluating the performance
of the network. Facco et al. (2017) introduced an approach for estimating this intrinsic
dimension using only the nearest and second nearest neighbors of each point on the
data manifold. Given a set of n data points x1 , ..., xn , let r1 , ..., rk be the k nearest
neighbors of a point xi in the data set. If these neighbors are arranged in ascending
order such that r1 < r2 < r3 · · · < rk , then r1 and r2 correspond to the nearest
and second-nearest neighbors of xi , respectively. Introduce the ratio µ = r2 /r1 . Since
r2 > r1 , it follows that µ ∈ [1, ∞). If the ratio µ is computed for every point in the data
set, then n values of µ, µ1, ..., µn, will be obtained, and a histogram of µ values can be
generated. This histogram represents a probability distribution f(µ) whose analytical
form can be shown to be

f(\mu) = d\, \mu^{-d-1}\, \theta(\mu - 1),    (17.9.1)
where d is the intrinsic dimension of the data set. If we now integrate f (µ), we obtain
the cumulative probability P (µ) of µ values
P(\mu) = \int_{-\infty}^{\mu} f(y)\, dy = d \int_{1}^{\mu} y^{-d-1}\, dy = \left( 1 - \mu^{-d} \right) \theta(\mu - 1).    (17.9.2)

Equation (17.9.2) is now solved for the intrinsic dimension d to give

d = -\frac{\ln\left( 1 - P(\mu) \right)}{\ln \mu}.    (17.9.3)
Equation (17.9.3) prescribes a straightforward approach for calculating the intrinsic di-
mension of a data manifold: We only need to compute the probability P(µ) of different
values of µ and feed these µ and P(µ) into eqn. (17.9.3); the result will be the intrinsic
dimension d of the data set. Applying eqn. (17.9.3) to the sampled data set for the
alanine dipeptide, shown in Fig. 17.17, reveals that the intrinsic dimension of the data
used to obtain the free-energy surface from a d-AFED/TAMD simulation using the
Ramachandran angles as collective variables (see Fig. 8.5) is two, as expected. Since
this is a data set from an enhanced sampling simulation, the plot is somewhat noisier
than what we would expect to observe for the synthetic data in Facco et al. (2017);
however, a trend toward the value of two is clear. If the dimensionality of the data
were not known a priori, this type of analysis would be capable of revealing it.

Fig. 17.17 Intrinsic dimension as a function of the size of the data set for the alanine
dipeptide. The data set is taken from a d-AFED/TAMD run that uses the two Ramachandran
dihedral angles as collective variables.
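
A compact estimator along these lines can be sketched as follows: compute µ = r2/r1 for every point, form the empirical cumulative distribution P(µ), and extract d as the slope of −ln(1 − P) versus ln µ, which is eqn. (17.9.3) recast as a straight-line fit through the origin. The synthetic test data, a tilted plane embedded in three dimensions, are an illustration of our own rather than the alanine dipeptide set of Fig. 17.17.

import numpy as np

def two_nn_dimension(x):
    # Pairwise distances; after sorting, row i reads 0, r1, r2, ...
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    r.sort(axis=1)
    mu = np.sort(r[:, 2] / r[:, 1])     # mu = r2/r1 for every point
    n = len(mu)
    P = np.arange(1, n + 1) / n         # empirical cumulative distribution
    # Eqn. (17.9.3) rearranged: -ln(1 - P) = d ln(mu). Fit the slope d by
    # least squares through the origin; the last point (P = 1) is dropped
    # to keep the logarithm finite.
    lnmu = np.log(mu[:-1])
    lnP = -np.log(1.0 - P[:-1])
    return lnmu @ lnP / (lnmu @ lnmu)

# Synthetic check: points on a tilted plane embedded in three dimensions
# should give an intrinsic dimension close to two.
rng = np.random.default_rng(2)
u = rng.uniform(size=(2000, 2))
x = np.column_stack([u[:, 0], u[:, 1], 0.5 * u[:, 0] - 0.2 * u[:, 1]])
print(two_nn_dimension(x))              # expect a value near 2
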

17.10 Problems

17.1. Verify the following properties for a Bernoulli distribution:

\sum_{x=0}^{1} P_B(x; \nu) = 1

\langle x \rangle = \nu

\langle x^2 \rangle - \langle x \rangle^2 = \nu (1 - \nu).

17.2. Recall that the Shannon entropy S[f] for a probability distribution f(x) is

S[f] = -\int dx\, f(x) \ln f(x).

a. By performing a maximization of S[f] over all distributions f(x) subject
to the three constraints

\int dx\, f(x) = 1
\int dx\, x f(x) = \mu
\int dx\, (x - \mu)(x - \mu)^T f(x) = \Sigma,

show that the resulting distribution is the Gaussian distribution of eqn.
(17.2.3).
b. Show that the entropy of the Gaussian distribution in eqn. (17.2.3) is
given by

S[P_G] = \frac{1}{2} \ln \det(\Sigma) + \frac{n}{2} \left( 1 + \ln(2\pi) \right).

17.3. a. Derive eqns. (17.3.5).


b. For the linear regression model in eqn. (17.3.10), find the analytical so-
lution for w0 and w by optimizing the loss function in eqn. (17.3.11).
Examine your solution in the limit that λ = 0.

17.4. Starting from the data model in eqn. (17.4.9), optimize the loss function in
eqn. (17.4.10) and show that the analytical solution is given by eqn. (17.4.8).

17.5. For each of the diagrams shown in Fig. 17.18, write explicit expressions for the
corresponding feed-forward neural networks, based either on eqn. (17.5.9) or on
eqn. (17.5.31), depending on which general form applies.

Fig. 17.18 Diagrams for problem 17.5.

17.6. Generalize eqn. (17.5.9) to a feed-forward neural network having M hidden
layers with m1, ..., mM nodes in the respective hidden layers.


17.7. Derive eqns. (17.5.19) and (17.5.20).


17.8. Gradient training of a neural network requires optimization of the loss func-
tion given in eqn. (17.5.25). Derive the back propagation scheme for this loss
function.

17.9. Consider the following application of Bayes’ theorem to a regression neural
network. Let eqn. (17.3.7) be used to construct a likelihood function for the
network in terms of its loss function. We now define a prior distribution

p(w) = \left( \frac{\lambda}{\pi} \right)^{n/2} e^{-\lambda\, w \cdot w},

where dim(w) = n. If we additionally define a new loss function as the negative
logarithm of the posterior probability distribution, show that the new loss function
is given by eqn. (17.5.11) but with an additional ridge regularization term.

17.10. The following ten data points are assumed to lie approximately along the
curve y(x) = sin(2πx): (0, 0.30), (1/9, 0.86), (2/9, 1.00), (1/3, 0.98), (4/9,
0.10), (5/9, 0.06), (2/3, −0.90), (7/9, −0.40), (8/9, −0.50), (1, 0.29). The
points are plotted on the accompanying figure along with the function y(x) =
sin(2πx). Write a program to train a regression neural network with a single
hidden layer on these ten points. In particular, train three such networks,
the first with one node in the hidden layer, the second with three nodes
in the hidden layer, and the third with ten nodes in the hidden layer. For
this problem, the use of the gradient descent algorithm in eqn. (17.5.21) is
relatively straightforward; however, more advanced readers might wish to try
the stochastic gradient descent approach. Plot the three functions that result
from each trained network along with the ten training points. Is there an
optimal number of nodes in the hidden layer? What happens if the number
of nodes in the hidden layer is too large?

Fig. 17.19 Plot for problem 17.10.

17.11. Consider the matrix x and filter F given below:

x = \begin{pmatrix} 1 & 3 & 5 & 2 \\ 5 & 7 & 9 & 2 \\ 2 & 4 & 5 & 1 \end{pmatrix}, \qquad F = \begin{pmatrix} 1 & 2 \\ 2 & -1 \end{pmatrix}.

Determine the convolution matrix X = x ◦ F.



17.12. Consider a classification neural network with C > 2 classes. Derive eqns.
(17.5.36) through (17.5.38) and show that the final activation function H(x)
should be the softmax function given in eqn. (17.5.39). If the neural network
has no hidden layers, to what explicit form does the learning model simplify?

17.13. Determine how the back propagation approach needs to be modified for a
classification neural network with C > 2 classes. Hint: Do not forget to take
into account the constraint that \sum_{k=1}^{C} y_k^{(i)} = 1.

17.14. The phase classification example in Section 17.7.2 employs molecular dynam-
ics based enhanced sampling in order to generate the free-energy profiles in
Fig. 17.13. The forces needed by these methods require derivatives of the form
∂f (Q(r))/∂ri on atom i. These derivatives are computed using the chain rule,
which means that products of the form

\frac{\partial f}{\partial Q} \cdot \frac{\partial Q}{\partial G_k} \, \frac{\partial G_k}{\partial r_i}

need to be computed. Here, Gk is one of the symmetry functions discussed


in Section 17.7.2. Discuss qualitatively how each term is most efficiently eval-
uated. Some of the approaches in Appendix C might be useful for your dis-
cussion. Derive an explicit expression for the force on atom i for a neural
network that contains one hidden layer with M nodes, whose output classi-
fies p phases, and whose input consists only of G2 descriptors with a single
value of η and n different values of Rs .

17.15. Let P (x) and Q(x) be two normalized probability distributions. The rela-
tive Shannon entropy between P and Q with respect to P is known as the
Kullback-Leibler (KL) divergence and is defined by
 
\mathrm{KL}(P \| Q) = -\int dx\, P(x) \ln\left( \frac{Q(x)}{P(x)} \right).

The KL divergence is a measure of a statistical “distance” between P and Q.


a. If P (x) is a Gaussian distribution of mean µ and width σ, and Q(x) is a
one-dimensional Gaussian distribution of mean ν and width λ, calculate
the KL divergence between P and Q.

b. Repeat for multivariate Gaussian distributions of a vector x of dimension
n, assuming that P (x) has a mean vector µ and covariance matrix Σ and
that Q(x) has a mean vector ν and covariance matrix Λ.

17.16. Derive a probability density corresponding to the K-nearest neighbors learn-


ing model, and show that this distribution cannot be normalized. That is,
show that the integral of this distribution over all space is divergent.

17.17.∗∗ Define the L2 norm of a function f(x1, ..., xn) of n variables over an n-
dimensional volume Ω as

||f(x_1, ..., x_n)|| = \left[ \int_\Omega dx_1 \cdots dx_n \, \left( f(x_1, ..., x_n) \right)^2 \right]^{1/2}.

Let x1, ..., xn be variables such that xi ∈ [0, 1], i = 1, ..., n, and let φi(x)
be monotonically increasing functions φi : [0, 1] → [0, 1]. Let ε and δ, with
0 < ε < 1 and 0 < δ < 1, be ordinary numbers. Finally, let γ1(x) be a function
such that γ1 : R → R, and suppose we can choose γ1 such that ||γ1|| ≤ ||f||,

\left\| f(x_1, ..., x_n) - \sum_{q=1}^{2n+1} \gamma_1\!\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right) \right\| \le (1 - \epsilon)\, ||f||,

and ||γ1|| = δ||f||. Let us now define a series of functions γj : R → R and
hj : [0, 1]^n → R such that

h_j(x_1, ..., x_n) = \sum_{q=1}^{2n+1} \gamma_j\!\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right).

With these definitions, note that ||f − h1|| = (1 − ε)||f|| and ||γ1|| = δ||f||.
a. Show that this series of functions leads to an approximation to f(x1, ..., xn)
such that

\left\| f - \sum_{j=1}^{r} h_j \right\| \le (1 - \epsilon)^r\, ||f||

and

||\gamma_r|| = \delta\, (1 - \epsilon)^{r-1}\, ||f||.

b. Now let r → ∞. Show that

\lim_{r \to \infty} \left\| f - \sum_{j=1}^{r} h_j \right\| \le \lim_{r \to \infty} (1 - \epsilon)^r\, ||f|| = 0

and

\lim_{r \to \infty} ||\gamma_r|| = 0.

c. Finally, show that

f(x_1, ..., x_n) = \sum_{q=1}^{2n+1} \sum_{j=1}^{\infty} \gamma_j\!\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right) \equiv \sum_{q=1}^{2n+1} g\!\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right)

as in eqn. (17.5.1).
d. Can this procedure provide guidance on how to construct a feed-forward
neural network for regression of a function? Explain.

17.18. In Problem 8.12, we considered the effect of an invertible transformation on the
partition function of a complex system. Machine learning models can be trained
to learn such transformations. Let x ∈ R^n be a vector, and let f : R^n → R^n be an
invertible, smooth mapping with inverse f^{−1}. Let z = f(x) and x = f^{−1}(z), and
let P(x) be a probability distribution function of x.
a. Show that under the transformation z = f(x), the probability distribution
P(x) transforms as

\tilde{P}(z) = P(x) \left| \det\left( \frac{\partial f^{-1}}{\partial z} \right) \right| = P(x) \left| \det\left( \frac{\partial f}{\partial x} \right) \right|^{-1},

where |det(∂f /∂x)| is the determinant of the transformation.


b. Suppose we now have a sequence of invertible transformations from z0 ≡ x
to zK : z1 = f1 (z0 ), z2 = f2 (z1 ),...,zK = fK (zK−1 ). If P0 (z0 ) is a probability
distribution of z0 and PK (zK ) is a distribution of zK , show that
\ln P_K(z_K) = \ln P_0(z_0) - \sum_{k=1}^{K} \ln\left| \det\left( \frac{\partial f_k}{\partial z_{k-1}} \right) \right|.

c. The sequence of random variables {z_k = f_k(z_{k−1})} is known as a flow, and
the sequence of distributions {P_k(z_k)} is known as a normalizing flow. Based
on the discussion of neural networks given in Section 17.5, discuss how a
normalizing flow can be formulated and learned as a feed-forward neural
network model.
