2018; Wang et al., 2021). The number of sampled points needed to generate an accu-
rate model may be far fewer than that needed to reach full convergence in a direct
simulation. In studying complex processes for which a suitable reaction coordinate
might not be known a priori, a classification model can be used to look for patterns
$$ P_G(x;\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-\mu)^2/2\sigma^2}, \tag{17.2.1} $$
where µ and σ² are the mean and variance of the distribution. Here, µ determines the
which leaves us with 2n parameters to specify the distribution. Finally, if the components
of x are isotropic, then all of the σi are the same, so that Σ reduces to Σ = σ²I,
where I is the n × n identity matrix, which leaves just n + 1 parameters needed to
specify the distribution.
The second distribution we will employ in our discussion of machine learning con-
cerns random variables that can take on discrete values only. For starters, suppose x
is a binary random variable that can take on the values 0 or 1. An example is a coin
toss where we assign tails = 0 and heads = 1. Let ν be the probability that x = 1
so that 1 − ν is the probability that x = 0. Then, the probability distribution of x is
a special case of a binomial distribution known as a Bernoulli distribution, which is
given by
$$ P_B(x;\nu) = \nu^{x}(1-\nu)^{1-x}. \tag{17.2.6} $$
With the convention used here, it can be easily shown that ⟨x⟩ = ν and ⟨x²⟩ − ⟨x⟩² =
ν(1 − ν). The Bernoulli distribution requires that just one parameter, ν, be specified.
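The two moments quoted above can be checked by direct summation over the two allowed values of x. The following is a minimal sketch; the value ν = 0.3 and the function name `bernoulli_pmf` are illustrative assumptions, not part of the text.

```python
import numpy as np

def bernoulli_pmf(x, nu):
    """P_B(x; nu) = nu**x * (1 - nu)**(1 - x) for x in {0, 1}."""
    return nu**x * (1.0 - nu) ** (1 - x)

nu = 0.3                      # illustrative probability that x = 1
values = np.array([0, 1])     # the two values x can take
probs = bernoulli_pmf(values, nu)

mean = np.sum(values * probs)                     # <x>
variance = np.sum(values**2 * probs) - mean**2    # <x^2> - <x>^2

print(mean, variance, nu * (1.0 - nu))
```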
Suppose, next, that x can take on n values, which, for simplicity, we take to be the
integers 1, 2, 3, ..., n. An example is the roll of a six-sided die. The third distribution,
called the categorical distribution, generalizes the Bernoulli distribution to treat such
Simple linear regression 743
cases where n > 2. If the probabilities that x takes on each of n values are ν1 , ..., νn ≡ ν,
with ν1 + ν2 + · · · + νn = 1, then the categorical distribution takes the form
$$ P_C(x;\boldsymbol{\nu}) = \prod_{i=1}^{n} \nu_i^{[x=i]}. \tag{17.2.7} $$
With this definition, it can be shown that ⟨x⟩ = ν. In either formulation, the categorical
distribution reduces to the Bernoulli distribution for binary random variables.
The function E(w) in eqn. (17.3.2) has various names throughout the machine learning
literature; depending on the source, it is referred to as the error function, cost function,
744 Machine learning
or loss function. Here, we will refer to it as the loss function, which is the most
commonly used terminology. The optimal value of w is that which minimizes E(w),
i.e., it is the solution of the minimization problem
$$ \frac{\partial E}{\partial \mathbf{w}} = 0. \tag{17.3.3} $$
In fact, the solution of the minimization problem is ultimately independent of the 1/N
prefactor in eqn. (17.3.2), so the choice of this prefactor, which makes E(w) an average
distance, is arbitrary. If eqn. (17.3.1) is substituted into eqn. (17.3.2) to give
$$ E(w_0,w_1) = \frac{1}{N}\sum_{i=1}^{N} |y_i - w_0 - w_1 x_i|^2, \tag{17.3.4} $$
it becomes clear that eqn. (17.3.3), in this simple case, yields two equations ∂E/∂w0 =
0 and ∂E/∂w1 = 0 in the two unknowns w0 and w1 . From eqn. (17.3.4), these two
conditions yield the coupled equations
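$$ \begin{aligned} \langle y\rangle - w_0 - w_1\langle x\rangle &= 0 \\ \langle xy\rangle - w_0\langle x\rangle - w_1\langle x^2\rangle &= 0, \end{aligned} \tag{17.3.5} $$
where ⟨· · ·⟩ indicates an average over the N data points. The solution to these equations
is the familiar result
$$ w_1 = \frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}{\langle x^2\rangle - \langle x\rangle^2} $$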
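As a numerical illustration of the coupled conditions in eqn. (17.3.5), the sketch below computes the slope from the averages and obtains the intercept from the first condition, w0 = ⟨y⟩ − w1⟨x⟩. The synthetic data and the name `fit_line` are assumptions made for the example.

```python
import numpy as np

def fit_line(x, y):
    """Solve the two conditions of eqn. (17.3.5) for (w0, w1) using data averages."""
    mx, my = x.mean(), y.mean()
    w1 = ((x * y).mean() - mx * my) / ((x**2).mean() - mx**2)
    w0 = my - w1 * mx          # first condition: <y> - w0 - w1 <x> = 0
    return w0, w1

# Synthetic data (illustrative): a line with a small amount of Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.01 * rng.normal(size=x.size)

w0, w1 = fit_line(x, y)
print(w0, w1)
```

The result can be compared against a library least-squares fit such as `np.polyfit(x, y, 1)`, which returns the slope followed by the intercept.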
At this point, we note that the minimization of the loss function in eqn. (17.3.2) is
tantamount to maximizing a Gaussian probability distribution of the form
since
$$ E(\mathbf{w}) = -\frac{2\sigma^2}{N}\ln P_G(\mathbf{y};\mathbf{w},\sigma^2) - \sigma^2\ln\left(2\pi\sigma^2\right), \tag{17.3.8} $$
where y = (y1 , y2 , ..., yN ) is a vector of the N y-values. Therefore, the optimal param-
eters of our data model are those that maximize a Gaussian probability distribution
between the input data and the chosen model of that data. Once the parameter vector
$\mathbf{w}_{\min}$ that minimizes the loss function is determined, eqn. (17.3.2) can be used to
compute a root-mean-square error (RMSE) $\epsilon_{\mathrm{RMSE}}$ via $\epsilon_{\mathrm{RMSE}} = \sqrt{E(\mathbf{w}_{\min})}$. However,
a more commonly used error is the mean absolute error (MAE), based on an L1 norm
and given by
$$ \epsilon_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N} |y_i - y(x_i,\mathbf{w}_{\min})|. \tag{17.3.9} $$
In order to establish a robust test of the quality of a data model, we can divide
the available data into two subsets: a training set of size Ntrain and a test set of size
Ntest . The size of the test set is assumed to remain fixed while the number of points
in the training set is allowed to vary. Thus, when Ntrain + Ntest < N , the rest of the
data are held in reserve to augment the size of the training set. Initially, the training
set size can be a small fraction of the total available data and is used to minimize the
loss function and determine optimal parameters. Then, the test set is used to evaluate
the accuracy of the model via an RMSE or MAE, evaluated over the points in the
test set. If the error over the test set is too large, more points can be added to the
training set from the reserve set and the process repeated until the magnitude of the
error is deemed acceptable. What is meant by an “acceptable” error is typically one
that is lower than the intrinsic error in the original data. For example, if a data set is
within chemical accuracy of 1 kcal/mol, then the prediction error in the data model
should be lower than 1 kcal/mol. A plot of the RMSE or MAE versus training set size,
known as a learning curve, will reveal how well the model learns from the training
data. Once an acceptable error is reached, the model is considered to be trained, and
it can subsequently be used to predict new y values from input x values that are new
to the model.
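The training protocol just described can be sketched as follows: a fixed test set, a reserve pool from which the training set grows, and the RMSE and MAE recorded at each training set size to form a learning curve. The synthetic linear data, the noise level, and the particular training set sizes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, N_test = 400, 100
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 - 2.0 * x + 0.1 * rng.normal(size=N)   # synthetic data, noise level 0.1

# Fixed test set; the remaining points are the reserve that feeds the training set.
x_test, y_test = x[:N_test], y[:N_test]
x_pool, y_pool = x[N_test:], y[N_test:]

def fit_line(xs, ys):
    """Least-squares line; np.polyfit returns the slope first, then the intercept."""
    w1, w0 = np.polyfit(xs, ys, 1)
    return w0, w1

learning_curve = []
for n_train in (5, 10, 20, 50, 100, 300):
    w0, w1 = fit_line(x_pool[:n_train], y_pool[:n_train])
    resid = y_test - (w0 + w1 * x_test)
    rmse = np.sqrt(np.mean(resid**2))   # root-mean-square error over the test set
    mae = np.mean(np.abs(resid))        # mean absolute error over the test set
    learning_curve.append((n_train, rmse, mae))

for n_train, rmse, mae in learning_curve:
    print(n_train, rmse, mae)
```

In this toy setting the test error should level off near the noise in the data, which plays the role of the intrinsic error discussed above.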
In our simple linear regression example, we assumed that x is a scalar variable.
Suppose, instead, that we have N points of the form (y1 , x1 ), (y2 , x2 ),...,(yN , xN ),
where xi is an n-dimensional vector and the N points, therefore, exist in an (n + 1)-
dimensional space. If we again assume a linear relation between y and x, then the
generalization of the linear data model in eqn. (17.3.1) becomes
y(x, w) = w0 + w · x, (17.3.10)
The last term (λ/2)w · w in eqn. (17.3.11) is known as a regularizer or ridge term
and involves a new parameter λ, which needs to be included in the optimization of the
model. However, if we were to include λ in the set (w0 , w) of optimizable parameters,
then the optimization problem would become a nonlinear one, which is more difficult
to solve analytically. Therefore, regularization parameters, such as λ, are typically
chosen a priori before the optimization is performed. Such a parameter is known as a
hyperparameter. The choice of λ is governed by its ability to lower the overall error
across the test set compared to what the error would be if λ = 0. Regularization terms
need not be restricted to quadratic forms. Other choices include a linear regularizer
of the form λ′ |w|, known as a lasso term, or a combination of linear and quadratic
regularizers, λ|w|2 /2 + λ′ |w|, known as an elastic net term. The elastic net regularizer
requires determination of two hyperparameters, λ and λ′ . Depending on the nature of
the data, a lasso or elastic net regularizer might be a better discriminator of the most
relevant parameters (w0 , w) weighting the input values x.
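For a fixed hyperparameter λ, the ridge-regularized linear model can still be solved in closed form. The sketch below assumes a loss of the form E = (1/N) Σ_i (y_i − w0 − w · x_i)² + (λ/2) w · w, which is our reading of the regularized loss; the prefactors, the synthetic data, and the function names are assumptions. Setting the derivatives to zero gives w0 = ⟨y⟩ − w · ⟨x⟩ and a linear system for w in terms of centered data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N) sum_i (y_i - w0 - w.x_i)^2 + (lam/2) w.w  (assumed form of the loss)."""
    N, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                  # centering eliminates w0
    w = np.linalg.solve(Xc.T @ Xc + 0.5 * N * lam * np.eye(n), Xc.T @ yc)
    w0 = y_mean - x_mean @ w
    return w0, w

def ridge_gradient(X, y, w0, w, lam):
    """Gradient of the assumed loss with respect to w0 and w."""
    r = y - w0 - X @ w
    return -2.0 * r.mean(), -2.0 * (X.T @ r) / len(y) + lam * w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.0]) + 0.05 * rng.normal(size=200)

w0, w = ridge_fit(X, y, lam=0.1)
g0, gw = ridge_gradient(X, y, w0, w, lam=0.1)   # both should vanish at the minimum
print(w0, w, g0, gw)
```

Comparing against `ridge_fit(X, y, 0.0)` shows the expected shrinkage of the weight vector when λ > 0.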
Choosing hyperparameters in a machine learning model can be accomplished using
a robust scheme known as k-fold cross validation. In this approach, the training data
are divided into k subsets of equal size; of these subsets, (k − 1) are used for hyper-
parameter searching, and the remaining subset is used to validate the choice. This
process is repeated such that each of the k subsets acts as the validation set and the
validation error is retained for each of the k searches. Ultimately, the hyperparameters
that give the lowest error can be selected, or the hyperparameters can be averaged over
the k searches (in which case, the associated error will be an average error over the k
individual errors). As a final assessment, the quality of the choice of hyperparameters
should be evaluated against the test dataset. The use of k-fold cross validation ensures
that the hyperparameter search is not biased toward a single validation set.
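A minimal sketch of k-fold cross validation for the ridge parameter is given below. It uses the variant in which the validation error is averaged over the k folds for each candidate λ and the λ with the lowest average is selected; the candidate grid, the synthetic data, and the assumed form of the ridge loss (1/N prefactor on the squared error, λ/2 on w · w) are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centered data (assumed loss: mean squared error + (lam/2) w.w)."""
    N, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + 0.5 * N * lam * np.eye(n), Xc.T @ yc)
    return y_mean - x_mean @ w, w

def kfold_errors(X, y, lam, k=5):
    """Validation RMSE for each of the k folds at a fixed lam."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]                                               # validation subset
        trn = np.concatenate([folds[j] for j in range(k) if j != i]) # remaining k - 1 subsets
        w0, w = ridge_fit(X[trn], y[trn], lam)
        resid = y[val] - (w0 + X[val] @ w)
        errors.append(np.sqrt(np.mean(resid**2)))
    return np.array(errors)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + 0.3 * rng.normal(size=100)

lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0]           # hypothetical search grid
cv_error = {lam: kfold_errors(X, y, lam).mean() for lam in lambdas}
best_lam = min(cv_error, key=cv_error.get)
print(cv_error, best_lam)
```

The selected value would then be assessed once more against a separate test set, as described above.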
where now dim(w) = nf and Φ(x) is a nonlinear function of x. There is some vagueness
to the idea of a feature space of dimension nf , but the good news is that neither nf nor
Φ(x) needs to be known explicitly, as we will show shortly. The power of eqn. (17.4.1)
is that it serves as a framework for obtaining different types of kernel methods. In this
context, a kernel or kernel matrix is an N × N matrix defined by
$$ E(\mathbf{w}) = \frac{1}{2\lambda}\sum_{i=1}^{N} |\varepsilon_i|^2 + \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w}, \tag{17.4.3} $$
In eqn. (17.4.4), εi , w0 , and w are all treated as optimization parameters. The mini-
mization conditions then become
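$$ \begin{aligned}
\frac{\partial \tilde{E}}{\partial w_0} &= -\sum_{i=1}^{N}\alpha_i = 0 \\
\frac{\partial \tilde{E}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{i=1}^{N}\alpha_i \Phi(\mathbf{x}_i) = 0 \\
\frac{\partial \tilde{E}}{\partial \varepsilon_i} &= \frac{1}{\lambda}\varepsilon_i - \alpha_i = 0.
\end{aligned} \tag{17.4.5} $$
From these equations, we see that $\sum_{i=1}^{N}\alpha_i = 0$, $\mathbf{w} = \sum_{i=1}^{N}\alpha_i \Phi(\mathbf{x}_i)$, and $\varepsilon_i = \lambda\alpha_i$.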
where the kernel function K(xi , x) ≡ Φ(xi ) · Φ(x) and gives the kernel matrix Kij
via Kij = K(xi , xj ). In fact, eqn. (17.4.8) can be derived by substituting the kernel
ridge-regression data model into the least-squares loss function
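$$ E(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \left(y_i - y(\mathbf{x}_i,\boldsymbol{\alpha})\right)^2 + \lambda\,\boldsymbol{\alpha}^{\mathrm{T}}\mathbf{K}\boldsymbol{\alpha}, \tag{17.4.10} $$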
where we see that λ becomes the ridge parameter. On the other hand, if we retain
the parameter w0 and rename it α0 for notational uniformity, then the corresponding
data model, known as a least-squares support-vector machine model, takes the form
$$ y(\mathbf{x},\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}_i,\mathbf{x}) + \alpha_0. \tag{17.4.11} $$
Note that in the kernel ridge and support-vector machine models presented here,
the function Φ(x) has disappeared. As we noted immediately below eqn. (17.4.1), we
do not need to know the form of Φ(x). Rather, we can bypass specification of Φ(x)
and introduce a kernel function K(xi , x) directly. This replacement is referred to as
The Gaussian kernel option illustrates the advantage of the kernel trick, as eqn.
(17.4.13) cannot be derived from a dot product Φ(xi ) · Φ(x) of finite-dimensional feature vectors. When
a Gaussian kernel is employed within the kernel ridge-regression model, for example,
the method is known as Gaussian kernel ridge regression. In this model, σ becomes
a hyperparameter to be chosen along with the ridge parameter λ. Other kernel func-
tions could be envisioned; however, the Gaussian kernel ridge regression model is both
simple and widely applicable.
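A minimal sketch of Gaussian kernel ridge regression follows. It assumes the Gaussian kernel K(xi, x) = exp(−|xi − x|²/2σ²) and uses the fact that minimizing the loss in eqn. (17.4.10) with y(x, α) = Σ_i αi K(xi, x) leads to the linear system (K + λI)α = y. The target function, the hyperparameter values, and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(a, b) = exp(-|a - b|^2 / (2 sigma^2)) for every row a of A and row b of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam I) alpha = y for the coefficients alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """y(x) = sum_i alpha_i K(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Illustrative one-dimensional target: y = sin(x) sampled at random points.
rng = np.random.default_rng(4)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.sin(X[:, 0])

alpha = krr_fit(X, y, sigma=1.0, lam=1e-6)
X_new = np.linspace(-2.5, 2.5, 11)[:, None]
y_new = krr_predict(X, alpha, X_new, sigma=1.0)
print(np.max(np.abs(y_new - np.sin(X_new[:, 0]))))
```

Note that the N × N matrix K is formed and solved explicitly, which is the memory and cost bottleneck for large N.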
A downside of kernel methods is that very large data sets (large values of N )
require kernel matrices of size N², which can lead to significant memory issues when
the matrix needs to be stored and inverted. This issue raises the question of whether
more compact and flexible data models might be possible, a topic that will be addressed
in the next section where neural network models are discussed.
Before leaving this section, we illustrate how kernel methods might be used in a
statistical mechanical application. Suppose we have performed an enhanced sampling
calculation to generate an n-dimensional free-energy surface A(s1 , ..., sn ) ≡ A(s) using
one of the techniques for generating high-dimensional free-energy surfaces such as
were discussed in Sections 8.10 and 8.11. The d-AFED/TAMD method, for example,
generates global sweeps across the free-energy landscape, producing a scattering of N
points si and corresponding free-energy values Ai ≡ A(si ). In the early phase of a run,
the N points are sparsely distributed over the surface as illustrated in the left panel
of Fig. 17.2. The point distribution might not be dense enough to reveal features of
the surface to the naked eye. However, the points (si , Ai ) can be used to train one of
the regression models described in this section in order to fill in details of the surface
not easily identifiable by inspection, allowing its features to be discerned with greater
clarity. This is illustrated in the right panel of Fig. 17.2. As the calculation is carried
out further, more regions are sampled, the density of points (si , Ai ) increases, and the
model can be updated with additional training data. This is known as active learning.
Active learning allows the features of the surface predicted by the kernel model to
become sharper as the amount of training data increases. If a kernel ridge regression
model is used, then at any stage in the simulation, the explicit representation of the
free-energy surface A(s) is
$$ A(\mathbf{s}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{s}_i,\mathbf{s}) + \alpha_0, \tag{17.4.14} $$
where si , i = 1, ..., N are the N training points generated in the simulation and K(si , s)
is the kernel function expressed in terms of the extended variables s that parameterize
the marginal probability distribution and the free-energy surface. Equation (17.4.14)
allows the free energy at any point s to be evaluated. If the kernel employed is a
Gaussian kernel, then eqn. (17.4.14) becomes
$$ A(\mathbf{s}) = \sum_{i=1}^{N} \alpha_i\, e^{-|\mathbf{s}_i-\mathbf{s}|^2/2\sigma^2}. \tag{17.4.15} $$
Training is performed by optimizing the loss function with a ridge term, as in eqn.
(17.4.10). We defer discussion of learning curves and training protocols for this type of
application until Section 17.7, where we will compare kernel methods to other types
of machine learning models.
chological conditions based on assumptions about the inputs.¹ Fourteen years after the
work of McCulloch and Pitts, an important theorem would be established that pro-
vided a mathematical foundation for many widely used modern neural network archi-
tectures. This theorem is the Kolmogorov superposition theorem (Kolmogorov, 1957)
where the coefficients λ1 , ..., λn > 0. While a detailed proof of Kolmogorov’s theorem
is beyond the scope of this book (a sketch of the proof is outlined in the steps of
Problem 17.17), an existence proof involves establishing that the function g(x) ex-
ists; this can be done by constructing a series of approximants to y(x1 , ..., xn ) based
on the superposition principle and then showing that this series converges exactly to
y(x1 , ..., xn ) (Kahane, 1975). In 1987, R. Hecht-Nielsen connected Kolmogorov’s the-
orem to a type of neural network known as the feed-forward network (Hecht-Nielsen,
1987) to be discussed in this section. However, the feed-forward network construction
we will describe is based on a corollary to the Kolmogorov theorem, introduced in
1991 by Vera Kurkova (1991). Kurkova’s corollary is a more flexible formulation of
Kolmogorov’s theorem that allows the number of inner functions to be greater than
2n + 1 while still guaranteeing that y(x1 , ..., xn ) can be approximated to arbitrary
accuracy. Kurkova’s restatement of Kolmogorov’s theorem is
$$ y(x_1,...,x_n) = \sum_{q=1}^{m} g_q\!\left(\sum_{p=1}^{n} \psi_{qp}(x_p)\right), \tag{17.5.2} $$
where ψqp (x) are continuous monotonically increasing functions on [0, 1], gq (x) is a
continuous function, and m > 0 is an integer. The structure of a feed-forward neural
network emerges by repeated application of eqn. (17.5.2) as we will now demonstrate.
In order to see how the mathematical structure of a neural network emerges from
eqn. (17.5.2), note that the argument of gq is, itself, a function of x1 , ..., xn for each
value of q. We denote this function as hq (x1 , ..., xn ). Applying Kurkova’s representation
to hq , we obtain
$$ h_q(x_1,...,x_n) \equiv \sum_{p=1}^{n} \psi_{qp}(x_p) = \sum_{s=1}^{m'} \gamma_{qs}\!\left(\sum_{r=1}^{n} \chi_{sr}(x_r)\right), \tag{17.5.3} $$
¹ In fact, in 1943, there was already considerable activity in the biophysics community to establish
a mathematical framework for neuronal networks. The novelty of the work of McCulloch and Pitts,
in addition to involving a collaboration between a neurophysiologist and logician, is its use of logic
and computation as a way to understand neural activity. For a deeper look at the work of McCulloch
and Pitts, see the historical and contextual analysis of G. Piccinini (2004).
where γqs (x) is a continuous function analogous to gq (x). Let us now choose χsr (xr )
to be
$$ \chi_{sr}(x_r) = w_{sr}^{(0)} x_r + a_{sr}, \tag{17.5.4} $$
We then set
$$ \gamma_{qs}(x) = w_{qs}^{(1)} h(x) + b_{qs}, \tag{17.5.6} $$
where h(x) is a continuous function, about which we will have more to say later in
this section. Substituting eqn. (17.5.6) into eqn. (17.5.3), we obtain
$$ \sum_{s=1}^{m'} \gamma_{qs}(x) = \sum_{s=1}^{m'} w_{qs}^{(1)} h(x) + \sum_{s=1}^{m'} b_{qs} \equiv \sum_{s=1}^{m'} w_{qs}^{(1)} h(x) + w_{q0}^{(1)}. \tag{17.5.7} $$
Finally, we set $g_q(x) = w_q^{(2)} h(x) + c_q$. If we now substitute eqns. (17.5.3) through
(17.5.7) into eqn. (17.5.2), we obtain
$$ y(x_1,...,x_n) = \sum_{q=1}^{m} w_q^{(2)}\, h\!\left(\sum_{s=1}^{m'} w_{qs}^{(1)}\, h\!\left(\sum_{r=1}^{n} w_{sr}^{(0)} x_r + w_{s0}^{(0)}\right) + w_{q0}^{(1)}\right) + w^{(2)}, \tag{17.5.8} $$
where $w^{(2)} = \sum_{q=1}^{m} c_q$. If we iterate the Kurkova theorem once again, we obtain
$$ y(x_1,...,x_n) = \sum_{q=1}^{m} w_q^{(3)}\, h\!\left(\sum_{s=1}^{m'} w_{qs}^{(2)}\, h\!\left(\sum_{t=1}^{m''} w_{st}^{(1)}\, h\!\left(\sum_{r=1}^{n} w_{tr}^{(0)} x_r + w_{t0}^{(0)}\right) + w_{s0}^{(1)}\right) + w_{q0}^{(2)}\right) + w^{(3)}. \tag{17.5.9} $$
are fully connected to the next set, and so on, until the final output node represents
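the output y(x1 , ..., xn ). The quantities $z_\sigma^{(j)}$ in Fig. 17.3 are defined as follows:
$$ \begin{aligned}
z_t^{(1)} &= h\!\left(\sum_{r=1}^{n} w_{tr}^{(0)} x_r + w_{t0}^{(0)}\right) \\
z_s^{(2)} &= h\!\left(\sum_{t=1}^{m''} w_{st}^{(1)} z_t^{(1)} + w_{s0}^{(1)}\right) \\
z_q^{(3)} &= h\!\left(\sum_{s=1}^{m'} w_{qs}^{(2)} z_s^{(2)} + w_{q0}^{(2)}\right).
\end{aligned} \tag{17.5.10} $$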
The resemblance of the graph in Fig. 17.3 to the connections between neurons in
the brain has led to the use of the neural network moniker for the models in eqns.
(17.5.8) and (17.5.9). Since neurons are activated when presented with input stimuli,
the functions h(x) are known as activation functions. The edges, or connections, in the
graph denote the various parameters $w_{ij}^{(K)}$. These parameters are determined in the
training phase by optimizing a loss function. What happens to the data in the layers
that contain the activation functions h(x) is, in some sense, hidden from the user of
the feed-forward network, as these transformations are performed automatically by
the network. For this reason, these layers are called hidden layers. The numbers m′′ ,
m′ , and m determine the numbers of nodes in the first, second, and third hidden
layers, respectively. The input layer must contain n nodes, one for each input value, and,
in the example illustrated in Fig. 17.3, the output layer contains just one node that
contains the function y(x1 , ..., xn ). Feed-forward neural networks with a single hidden
layer are generally referred to as single-layer perceptron models. Network structures
with more than one hidden layer are referred to as deep neural networks. Although
deep networks usually outperform single-layer perceptron models, how deep they need
to be depends on the nature of the machine learning problem, and these networks may
range in depth from two to hundreds of hidden layers of different architectures.
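The nested structure of eqn. (17.5.9) maps directly onto a loop over layers. The sketch below evaluates a three-hidden-layer network of this form; the choice of tanh for h(x), the layer sizes, and the random parameter values are assumptions made purely for illustration.

```python
import numpy as np

def feed_forward(x, weights, biases, h=np.tanh):
    """Evaluate a network of the form of eqn. (17.5.9).

    Each hidden layer applies h to a weighted sum plus bias; the output is a
    linear combination of the last hidden layer plus a final bias.
    """
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = h(W @ z + b)                 # hidden-layer values z^(j)
    return weights[-1] @ z + biases[-1]  # output node y(x_1, ..., x_n)

# Hypothetical sizes: n inputs and m'', m', m nodes in hidden layers 1, 2, 3.
n, m_pp, m_p, m = 4, 6, 5, 3
sizes = [n, m_pp, m_p, m]

rng = np.random.default_rng(5)
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
weights.append(rng.normal(size=m))                       # output weights w_q^(3)
biases = [rng.normal(size=s) for s in sizes[1:]] + [rng.normal()]

x = rng.normal(size=n)
y = feed_forward(x, weights, biases)
print(y)
```

Adding or removing entries in `weights` and `biases` changes the depth without changing the evaluation loop.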
Importantly, since the activation functions h(x) are chosen a priori, representations
gradient of the network is needed. Therefore, alternatives to the ReLU function that
are everywhere differentiable are the softplus function
$$ h(x) = \ln\left(1 + e^{x}\right) \tag{17.5.13} $$
where α is a constant. Another differentiable activation function that does not suffer
from vanishing gradients is the so-called “swish” activation function, defined by
$$ h(x) = \frac{x}{1 + e^{-x}}. \tag{17.5.15} $$
Unlike the other activation functions presented here, the swish function is not monotonically
increasing. Although this violates the condition of monotonicity in the Kolmogorov
and Kurkova theorems, empirical evidence suggests that this condition is
likely sufficient but not necessary. Consequently, it is also possible to take h(x) to
be a Gaussian or a Lorentzian function. There is also no requirement that the same
activation function be used in every layer of a neural network. Since different layers
can serve different purposes as concerns learning patterns from the data, there could
be advantages in using different activation functions in different layers, an issue we ad-
dress in Section 17.5.1 when we discuss classification problems. Note that it is possible
to tune the shape of a chosen activation function by replacing any of the h(x) functions
defined above with h(ax), where the constant a becomes another hyperparameter that
would need to be chosen at the outset, before training. The activation functions we
have introduced here are shown in Fig. 17.4.
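The monotonicity statements above are easy to check numerically for the two activation functions whose definitions are given explicitly, the softplus function of eqn. (17.5.13) and the swish function of eqn. (17.5.15). The grid range and spacing in this sketch are arbitrary choices.

```python
import numpy as np

def softplus(x):
    """ln(1 + e^x), evaluated stably via logaddexp(0, x)."""
    return np.logaddexp(0.0, x)

def swish(x):
    """x / (1 + e^(-x))."""
    return x / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 1201)

# A function is monotonically increasing on the grid if all successive differences are positive.
softplus_increasing = bool(np.all(np.diff(softplus(x)) > 0))
swish_increasing = bool(np.all(np.diff(swish(x)) > 0))

# Location of the shallow minimum of the swish function on the grid.
x_min = x[np.argmin(swish(x))]
print(softplus_increasing, swish_increasing, x_min)
```

The swish function dips slightly below zero for negative x before rising, which is why it fails the monotonicity test.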
With so many possible choices for activation functions with little guidance on how
to choose an optimal function for a given layer in a neural network, one might ask
whether the data, itself, could dictate the selection for a given learning problem. This
is, indeed, possible, via an approach known as self learning of activation functions. Self
learning of activation functions can be achieved by expanding h(x) in terms of a set
of M basis functions as
$$ h(x;\boldsymbol{\beta}) = \sum_{k=1}^{M} \beta_k \phi_k(x) \tag{17.5.16} $$
and including the set of M coefficients β1 , ..., βM ≡ β, together with the parameters
w, as parameters to be learned in the training phase. The neural network is then
represented as y(x, w, β), and optimization of the loss function requires the two conditions
∂E(w, β)/∂w = 0 and ∂E(w, β)/∂β = 0. Examples of possible choices of
φk (x) are simple polynomials $\phi_k(x) = x^{k-1}$ (Goyal et al., 2019) or sinc functions,
φk (x) = sin(x − kd)/[π(x − kd)], where d defines a grid spacing for x.
From a technical standpoint, the most complex operation when employing neural
networks is the calculation of the derivatives needed to perform the parameter opti-
mization. Complexity arises from the deep nesting of layers between input and output.
Because of this nesting, long products arising from the application of the chain rule
result when the derivatives are performed. To illustrate the structure of these prod-
ucts, consider a simple nested function g(w) = h(h(h(wx0 ))), where x0 is a constant.
We can think of this function as representing a toy network having three hidden layers
with one node in each layer. From the chain rule, the derivative of g with respect to
w is g′(w) = [h′(h(h(x0 w)))][h′(h(x0 w))][h′(x0 w)]x0 . From the pictorial representation
in Fig. 17.3, if this product is read from left to right, the first term in square brackets
is the derivative of the outermost layer, which produces the output result g(w), the
second term is the derivative of the layer just to the left of the previous layer, the
third term is the next layer to the left, and finally, the last term “x0 ” is the derivative
of the input layer. Thus, we see that the product is a propagation backward through
the layers of the network from the rightmost (output) layer back to the leftmost (in-
put) layer. Hence, the approach for computing derivatives of the nested functions that
comprise a feed-forward network via the chain rule is called back propagation.
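The chain-rule product for the toy function g(w) = h(h(h(wx0))) can be verified against a finite-difference derivative. In this sketch h(x) is taken to be tanh and the values of w and x0 are arbitrary; both are assumptions for illustration only.

```python
import numpy as np

h = np.tanh                      # assumed activation function for the toy network

def dh(u):
    """Derivative of tanh."""
    return 1.0 - np.tanh(u) ** 2

def g(w, x0):
    """Toy network: three hidden layers with one node each."""
    return h(h(h(w * x0)))

def dg(w, x0):
    """Chain-rule product, read from the output layer back to the input."""
    a1 = w * x0        # argument of the innermost h
    a2 = h(a1)         # argument of the middle h
    a3 = h(a2)         # argument of the outermost h
    return dh(a3) * dh(a2) * dh(a1) * x0

w, x0 = 0.7, 1.3
eps = 1e-6
fd = (g(w + eps, x0) - g(w - eps, x0)) / (2.0 * eps)   # central finite difference
print(dg(w, x0), fd)
```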
Of course, computing the derivative of the loss function in eqn. (17.5.11) with a
complete feed-forward network, although straightforward in principle, requires consid-
erable bookkeeping to account for all of the terms that arise when the chain rule is
applied. Suppose the network y(x, w) has K hidden layers with m(i) nodes in the ith
layer. In order to derive the rules of back propagation, let us define a recursive variable
Neural networks 757
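$$ z_r^{(k)}(\mathbf{x}) = \begin{cases} \displaystyle\sum_{a=1}^{n} x_a\, w_{ar}^{(0)} + w_{0r}^{(0)}, & k = 1 \\[2ex] \displaystyle\sum_{a=1}^{m^{(k-1)}} h\!\left(z_a^{(k-1)}(\mathbf{x})\right) w_{ar}^{(k-1)} + w_{0r}^{(k-1)}, & k = 2,...,K. \end{cases} \tag{17.5.17} $$
Here, k indexes the hidden layer and $w_{ar}^{(k-1)}$ is the weight parameter that connects the
ath node in layer k − 1 to the rth node in layer k. The output layer of the network
can be written compactly as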
$$ y(\mathbf{x},\mathbf{w}) = \sum_{a=1}^{m^{(K)}} h\!\left(z_a^{(K)}(\mathbf{x})\right) w_a^{(K)} + w_0^{(K)}. \tag{17.5.18} $$
With the recursion in eqn. (17.5.17), the derivatives can also be defined recursively as
a backward propagation through the layers of the network, from output back to input.
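Thus, the derivative of eqn. (17.5.11) with respect to $w_{rs}^{(k)}$ can be written recursively
as
$$ \frac{\partial E(\mathbf{w})}{\partial w_{rs}^{(k)}} = \sum_{i=1}^{N}\sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)}\, \frac{\partial z_a^{(k+1)}(\mathbf{x}_i)}{\partial w_{rs}^{(k)}} = \begin{cases} \displaystyle\sum_{i=1}^{N} \frac{\partial E(\mathbf{w})}{\partial z_s^{(k+1)}(\mathbf{x}_i)}\, h\!\left(z_r^{(k)}(\mathbf{x}_i)\right), & 0 < k \le K \\[2ex] \displaystyle\sum_{i=1}^{N} \frac{\partial E(\mathbf{w})}{\partial z_s^{(k+1)}(\mathbf{x}_i)}\, x_{i,r}, & k = 0, \end{cases} \tag{17.5.19} $$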
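where $x_{i,r}$ is the rth component of the ith input data point. The derivatives in eqn.
(17.5.19) are expressed as
$$ \frac{\partial E(\mathbf{w})}{\partial z^{(K+1)}(\mathbf{x}_i)} \equiv \frac{\partial E(\mathbf{w})}{\partial y_i} = \frac{1}{N}\left(y(\mathbf{x}_i,\mathbf{w}) - y_i\right) $$
$$ \frac{\partial E(\mathbf{w})}{\partial z_s^{(k)}(\mathbf{x}_i)} = \begin{cases} \displaystyle\sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)}\, w_{sa}^{(k)}\, h'\!\left(z_s^{(k)}(\mathbf{x}_i)\right), & 1 \le k \le K \\[2ex] \displaystyle\sum_{a=1}^{m^{(k+1)}} \frac{\partial E(\mathbf{w})}{\partial z_a^{(k+1)}(\mathbf{x}_i)}\, w_{sa}^{(k)}, & k = 0. \end{cases} \tag{17.5.20} $$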
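The recursions in eqns. (17.5.17) through (17.5.20) can be sketched compactly with matrix operations. The example below treats each z as the weighted sum entering a layer, stores the weight connecting node r of layer k to node s of layer k + 1 in `W[k][r, s]`, and assumes a loss E = (1/2N) Σ_i (y(x_i, w) − y_i)², which is the form implied by the output-layer derivative in eqn. (17.5.20). The tanh activation, layer sizes, and random data are illustrative assumptions. One weight derivative is checked against a finite difference.

```python
import numpy as np

h = np.tanh                                  # assumed activation function

def dh(u):
    return 1.0 - np.tanh(u) ** 2

def forward(X, W, b):
    """Return [X, z^(1), ..., z^(K+1)], where z^(K+1) is the network output."""
    zs, a = [X], X
    for k in range(len(W)):
        z = a @ W[k] + b[k]                  # weighted sum entering layer k + 1
        zs.append(z)
        a = h(z)                             # activation passed to the next layer
    return zs

def loss(X, Y, W, b):
    y = forward(X, W, b)[-1][:, 0]
    return 0.5 * np.mean((y - Y) ** 2)       # assumed loss: (1/2N) sum (y - y_i)^2

def backprop(X, Y, W, b):
    zs = forward(X, W, b)
    delta = ((zs[-1][:, 0] - Y) / len(Y))[:, None]     # dE/dz^(K+1)
    gW, gb = [None] * len(W), [None] * len(W)
    for k in reversed(range(len(W))):
        act = X if k == 0 else h(zs[k])      # x_{i,r} for k = 0, h(z_r^(k)) otherwise
        gW[k] = act.T @ delta                # sum over data points, as in eqn. (17.5.19)
        gb[k] = delta.sum(axis=0)            # derivative with respect to the bias
        if k > 0:
            delta = (delta @ W[k].T) * dh(zs[k])       # recursion of eqn. (17.5.20)
    return gW, gb

rng = np.random.default_rng(6)
sizes = [3, 4, 2, 1]                         # n inputs, two hidden layers, one output
W = [rng.normal(size=(sizes[k], sizes[k + 1])) for k in range(3)]
b = [rng.normal(size=sizes[k + 1]) for k in range(3)]
X, Y = rng.normal(size=(10, 3)), rng.normal(size=10)

gW, gb = backprop(X, Y, W, b)

# Finite-difference check of one derivative, dE/dw_{01}^{(1)}.
eps = 1e-6
Wp = [w.copy() for w in W]; Wp[1][0, 1] += eps
Wm = [w.copy() for w in W]; Wm[1][0, 1] -= eps
fd = (loss(X, Y, Wp, b) - loss(X, Y, Wm, b)) / (2.0 * eps)
print(gW[1][0, 1], fd)
```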
As noted previously, the gradient G(w) = ∂E/∂w is needed to optimize the neural
network, which requires solving G(w) = 0 for w. However, since a neural network is
a highly nonlinear function of w, the optimization cannot be performed analytically
as it can when using kernel methods. Therefore, a numerical optimization algorithm
until the gradient is approximately zero. The parameter η determines the step size
and is known as the learning rate of the algorithm, which typically needs to be small
in much the same way that the time step ∆t in a molecular dynamics calculation
must be small for numerical stability. We see from eqn. (17.5.21) that the gradient descent
algorithm requires the full gradient G(w) at each step, and this, in turn, needs the
full set of training points. The gradient descent algorithm can be used as specified in
eqn. (17.5.21) for optimization problems involving small training sets. Because each
step of eqn. (17.5.21) uses the full training set, it is known as a batch optimization
approach. Gradient descent methods are often slow to converge because a small value
of η is needed for stable optimization. Efficiency can be improved in batch schemes
by employing more sophisticated methods such as conjugate gradient or quasi-Newton
algorithms. These are standard numerical approaches and will not be discussed here. It
is important to note, however, that because machine learning problems often involve
very large training data sets containing hundreds of thousands or even millions of
points in some situations, batch methods will become inefficient because of the cost
of evaluating the full gradient vector G(w). Fortunately, it is possible to streamline
the optimization problem so that only subsets of the training data are needed for each
step of the iteration (LeCun et al., 1989).
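The contrast between a batch update and an update that uses only a subset of the training data at each step can be sketched on a simple linear least-squares problem. The update rule is assumed to have the standard form w ← w − ηG(w); the data, learning rates, subset size, and step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true                      # noise-free synthetic data

def gradient(w, Xb, yb):
    """Gradient of (1/2M) sum_i (w . x_i - y_i)^2 over the M points supplied."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def batch_descent(eta=0.1, steps=500):
    """Every step uses the gradient over the full training set."""
    w = np.zeros(2)
    for _ in range(steps):
        w = w - eta * gradient(w, X, y)
    return w

def subset_descent(eta=0.05, steps=2000, batch=10):
    """Every step uses the gradient over a small random subset of the data."""
    w = np.zeros(2)
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        w = w - eta * gradient(w, X[idx], y[idx])
    return w

w_batch = batch_descent()
w_subset = subset_descent()
print(w_batch, w_subset)
```

Each subset step costs a small fraction of a batch step, which is the motivation for such schemes when N is very large.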
Note that the loss function in eqn. (17.5.11) is expressible as a sum over each
observation, i.e.,
$$ E(\mathbf{w}) = \sum_{i=1}^{N} e_i(\mathbf{w}) \tag{17.5.22} $$
and the gradient can be similarly expressed as
$$ \mathbf{G}(\mathbf{w}) = \sum_{i=1}^{N} \mathbf{g}_i(\mathbf{w}). \tag{17.5.23} $$
Therefore, in the most extreme subdivision of the training data into individual ob-
servations, the gradient descent algorithm could be performed on each term ei (w)
according to
$$ \mathbf{w}(\tau_{n+1}) = \mathbf{w}(\tau_n) - \eta\,\mathbf{g}_i(\mathbf{w}(\tau_n)). \tag{17.5.24} $$
The update is iterated by cycling through the data, either sequentially or by choosing
points at random with replacement, until the full data set is exhausted. Such an
Note that in eqn. (17.5.25), the analytical gradient of the neural network with respect to
xi is needed, which requires that the mathematical form of the network be everywhere
differentiable. When gradient training is used, the equations for back propagation
change somewhat, as illustrated in Problem 17.8.
In Section 17.4, we highlighted the example of using regression-based machine
learning, specifically kernel-based learning, to fill in missing points on a free-energy
surface generated by an enhanced sampling technique such as d-AFED/TAMD, which
generates a scattering of points over the surface with each full sweep. Neural networks
can be used in much the same way as kernel methods to perform this task (Schneider
et al., 2017; Zhang et al., 2018; Wang et al., 2021). In this case, the representation
of a free-energy surface A(s1 , ..., sn ) ≡ A(s) as a general feed-forward neural network
having K hidden layers would be
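$$ A(\mathbf{s},\mathbf{w}) = \sum_{j_K=1}^{m_K} w_{j_K}^{(K)}\, h\!\left(\cdots \sum_{j_2=1}^{m_2} w_{j_3 j_2}^{(2)}\, h\!\left(\sum_{j_1=1}^{m_1} w_{j_2 j_1}^{(1)}\, h\!\left(\sum_{\alpha=1}^{n} w_{j_1\alpha}^{(0)} s_\alpha + w_{j_1 0}^{(0)}\right) + w_{j_2 0}^{(1)}\right) + w_{j_3 0}^{(2)} \cdots\right) + w_0^{(K)}. \tag{17.5.26} $$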
Clearly, a general feed-forward network allows for considerable flexibility in the design
of the architecture. For learning high-dimensional free-energy surfaces from enhanced
sampling, optimal architectures prove to be those where the early layers, those closest
to the input layer, have larger numbers of nodes than layers closer to the output.
That is, tapering the network such that m1 ≥ m2 ≥ m3 · · · ≥ mK tends to lead to
optimal network performance. This notion of tapering is the inspiration for a type of
network known as an autoencoder, in which a tapered network—the encoder—is used
to compress high-dimensional data into a lower dimensional representation or manifold,
we see that
$$ P(c_k|\mathbf{x}) = \frac{p(\mathbf{x}|c_k)P(c_k)}{\sum_{l=1}^{C} p(\mathbf{x}|c_l)P(c_l)}, \tag{17.5.30} $$
so that p(x) is a normalization for the product p(x|ck )P (ck ) in the numerator of the
theorem.
In the neural networks we derived for regression problems, we were able to express
the output layer as a linear sum of the activation functions for the penultimate hidden
layer as a consequence of Kurkova’s theorem. For classification problems, however, we
cannot assume this is possible, and consequently we must express the output layer
using a final activation function. Thus, eqn. (17.5.8) for a neural network with two
hidden layers would become
$$ y_k(\mathbf{x}) = H\!\left(\sum_{q=1}^{m} w_q^{(2)}\, h\!\left(\sum_{s=1}^{m'} w_{qs}^{(1)}\, h\!\left(\sum_{r=1}^{n} w_{sr}^{(0)} x_r + w_{s0}^{(0)}\right) + w_{q0}^{(1)}\right) + w^{(2)}\right), \tag{17.5.31} $$
with a similar modification for the three-hidden-layer network in eqn. (17.5.9). Here,
H(x) is the outer activation function whose form we need to determine. Note that
yk (x), k = 1, ..., C, which replaces the continuous function y(x) in eqns. (17.5.8) and
(17.5.9), is now interpreted as a numerical label for membership in the kth class. If we
interpret yk (x) as the conditional probability P (ck |x), then it is clear that yk (x) ∈ [0, 1]
with $\sum_{k=1}^{C} y_k(\mathbf{x}) = 1$.
Bayes’ theorem can now be used to determine the form of H(x). The facts that
H(x) determines the output yk and that yk ∈ [0, 1] already restrict the type of activa-
tion function H(x) can be. What Bayes’ theorem accomplishes is a precise specification
of the particular functional form of H(x). Let us begin by considering a binary classifi-
cation with two classes c1 and c2 , and let z = z(x) denote the vector that results from
transforming x through all of the hidden layers of a classification neural network. The
output activation function H(x) determines the conditional probability P (ck |z) that a
particular class ck is assigned to z. We start by specifying a form for the distribution
p(z|ck ), which we might take to be an exponential construct
p(z|ck ) = exp [F (θk ) + B(z, φ) + θk · z] , (17.5.32)
where F (θk ) is a function of a set of parameters θk that vary with the class k, φ is
a set of universal parameters, and B(z, φ) is a function of z. This form is sufficiently
general to encompass the most commonly employed distribution functions such as the
Gaussian and Bernoulli (see Section 17.2), binomial, Poisson, and various other distri-
butions. For binary classification, using Bayes’ theorem, we can write the probability
for one of the two classes, say c1 , as
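$$ P(c_1|\mathbf{z}) = \frac{p(\mathbf{z}|c_1)P(c_1)}{p(\mathbf{z}|c_1)P(c_1) + p(\mathbf{z}|c_2)P(c_2)} = \frac{1}{1 + e^{-a}}, \tag{17.5.33} $$
where
$$ a = \ln\frac{p(\mathbf{z}|c_1)P(c_1)}{p(\mathbf{z}|c_2)P(c_2)}, \tag{17.5.34} $$
which is a linear function of z of the form
$$ \begin{aligned}
a &= \mathbf{w}\cdot\mathbf{z} + w_0 \\
w_0 &= F(\boldsymbol{\theta}_1) - F(\boldsymbol{\theta}_2) + \ln\frac{P(c_1)}{P(c_2)} \\
\mathbf{w} &= \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2.
\end{aligned} \tag{17.5.35} $$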
We only need to determine P (c1 |z), as we can determine P (c2 |z) from P (c2 |z) =
1 − P (c1 |z). When a is written this way, we see that the argument of the activation
function takes the expected form of a weighted linear combination of components of
z with the bias w0 . This analysis tells us that H(x) should be chosen as the logistic
sigmoid function H(x) = 1/(1 + exp(−x)) (eqn. (17.5.12)). By the same analysis, if
there are C > 2 classes, then Bayes' theorem along with eqn. (17.5.32) leads to
$$ P(c_k|\mathbf{z}) = \frac{p(\mathbf{z}|c_k)P(c_k)}{\sum_{l=1}^{C} p(\mathbf{z}|c_l)P(c_l)} = \frac{e^{a_k}}{\sum_{l=1}^{C} e^{a_l}}, \tag{17.5.36} $$
where
$$ a_k = \mathbf{w}_k\cdot\mathbf{z} + w_{k0} \tag{17.5.37} $$
with
$$ \mathbf{w}_k = \boldsymbol{\theta}_k, \qquad w_{k0} = F(\boldsymbol{\theta}_k) + \ln P(c_k) \tag{17.5.38} $$
(see Problem 17.12). These conditions require that H(x) be chosen as the softmax or
normalized exponential function
$$ H(x;\boldsymbol{\beta}) = \frac{e^{\beta_k x}}{\sum_{l=1}^{C} e^{\beta_l x}} \tag{17.5.39} $$
which depends on a vector β of parameters.
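The connection between Bayes' theorem and the sigmoid and softmax forms can be checked numerically for a concrete member of the exponential family in eqn. (17.5.32). The sketch below assumes isotropic Gaussian class-conditional distributions with a common variance σ², for which θk = µk/σ² and F(θk) = −|µk|²/2σ², and takes ak to be the logarithm of p(z|ck)P(ck) up to a term independent of the class. The class means, priors, and the point z are arbitrary illustrative values.

```python
import numpy as np

def softmax(a):
    """Normalized exponential, shifted by max(a) for numerical stability."""
    e = np.exp(a - a.max())
    return e / e.sum()

sigma2 = 0.5                                              # common variance (assumed)
mus = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 2.0]])     # class means (illustrative)
priors = np.array([0.5, 0.3, 0.2])                        # P(c_k)
z = np.array([0.4, 0.9])

# Posterior from Bayes' theorem with two-dimensional Gaussian p(z|c_k).
lik = np.exp(-np.sum((z - mus) ** 2, axis=1) / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)
posterior = lik * priors / np.sum(lik * priors)

# Same posterior from a_k = theta_k . z + F(theta_k) + ln P(c_k).
theta = mus / sigma2
F = -np.sum(mus**2, axis=1) / (2.0 * sigma2)
a = theta @ z + F + np.log(priors)
posterior_softmax = softmax(a)

# Binary case for classes 1 and 2 only: logistic sigmoid of a = a_1 - a_2.
p1_bayes = lik[0] * priors[0] / (lik[0] * priors[0] + lik[1] * priors[1])
p1_sigmoid = 1.0 / (1.0 + np.exp(-(a[0] - a[1])))
print(posterior, posterior_softmax, p1_bayes, p1_sigmoid)
```

The z-dependent factor B(z, φ) cancels between numerator and denominator, which is why it never appears in ak.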
We now turn to the determination of the correct loss function for classification. Just
as the Gaussian distribution determined the loss function for regression, the Bernoulli
and categorical distributions in Section 17.2 determine the form of the loss function
for classification. Once again, we first consider the binary classification problem with
with $y_k^{(i)} \in [0, 1]$, and $\sum_{k=1}^{C} y_k^{(i)} = 1$.
Convolutions can be similarly defined for 1D arrays, 3D arrays, and, generally, tensors
of any dimension, depending on the number of indices needed to describe the input
data. As an example of a convolution, consider an input matrix x and filter F specified
as
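\[
\mathbf{x} = \begin{pmatrix} 1 & 2 & 3 & 1 \\ 4 & 5 & 6 & 1 \\ 7 & 8 & 9 & 1 \end{pmatrix}, \qquad
\mathbf{F} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\]
then
\[
\mathbf{x} \circ \mathbf{F} = \begin{pmatrix} 12 & 16 & 11 \\ 24 & 28 & 17 \end{pmatrix}.
\]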
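The worked example above can be reproduced with a few lines of Python. The sketch below assumes a stride of one, no padding, and no flipping of the filter (the filter here is symmetric, so flipping would not change the result); the function name conv2d_valid is hypothetical.

```python
def conv2d_valid(x, F):
    """Slide the filter F over the matrix x and sum the elementwise products."""
    rows, cols = len(x), len(x[0])
    f_rows, f_cols = len(F), len(F[0])
    out = []
    for I in range(rows - f_rows + 1):        # top-left row of the window
        out_row = []
        for J in range(cols - f_cols + 1):    # top-left column of the window
            total = sum(x[I + i][J + j] * F[i][j]
                        for i in range(f_rows) for j in range(f_cols))
            out_row.append(total)
        out.append(out_row)
    return out

x = [[1, 2, 3, 1],
     [4, 5, 6, 1],
     [7, 8, 9, 1]]
F = [[1, 1],
     [1, 1]]
print(conv2d_valid(x, F))  # [[12, 16, 11], [24, 28, 17]]
```

A 3 × 4 input and a 2 × 2 filter give a 2 × 3 output, one entry for each position at which the filter fits entirely inside the input.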
In neural networks, convolution layers perform operations such as that in eqn. (17.5.42),
and it is the filter matrix that must be learned via training. In addition, since convolu-
tions are linear transformations, it is common to finalize the transformation of the layer
by running XIJ through an activation function to give a new matrix ZIJ = h(XIJ +b),
where b is a bias. Finally, it is also possible to train multiple filters in a convolution
layer by adding an additional index to the filter. Multiple filters are used to extract
multiple features from the input data. For the 2D convolution in eqn. (17.5.42), mul-
tiple filters would be included by modifying the definition to read
r −1 N
NX X c −1
function y(x1 , ..., xn ) ≡ y(x) has a constant or nearly constant value wi within each
region. This subdivision allows us to create a model for y(x) given by
\[
y(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{M} w_i\, h(\mathbf{x} \in R_i), \tag{17.6.1}
\]
which is just the average value of the target function in Rk over the training set.
Unfortunately, determining the regions Rk is nontrivial, as obtaining an optimized
splitting of R increases in complexity with the size of the training set and the dimension
of x. Figure 17.5(a) illustrates the region-splitting procedure for two-dimensional data.
The definitions of the regions Ri can be gleaned from the figure; for example, R1 is
the region for which x1 < t1 and x2 < t2 , R2 is the region for which x1 < t1 and
x2 > t2 , and so forth. Note that the splitting procedure can be represented in a graph
structure known as a decision tree. We will return to this decision tree graph shortly
when we discuss ensemble methods. First, we introduce an approximate, yet tractable,
protocol for approaching this splitting problem.
Equation (17.6.2) allows us to construct a model for y(x) in a local neighborhood
of the point x. The approximation takes the form
\[
y(\mathbf{x}) \approx \sum_{i=1}^{K} W(\mathbf{x}, \mathbf{x}_i)\, y_i, \tag{17.6.3}
\]
where W (x, xi ) is a non-negative weight for the ith training point within a cluster
of K neighbors of the point x (see footnote 2). Each machine learning model of this form will have
a different set of associated weights. The following choice for W (x, xi ) defines the
K-nearest neighbors model:
Clearly, values of y(x) can only be accurately predicted in regions for which the model
has been trained.
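A minimal Python sketch of a weighted-neighbor prediction of the form in eqn. (17.6.3) is given below. It assumes uniform weights W = 1/K on the K training points closest to x in Euclidean distance and zero weight on all others; the function name and the toy data are hypothetical.

```python
import math

def knn_predict(x, train_x, train_y, K):
    """Average the targets of the K training points nearest to x."""
    # Pair each distance with its target value and sort by distance.
    by_distance = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_x, train_y))
    nearest = by_distance[:K]
    # Uniform weights 1/K on the K nearest points (an assumed choice).
    return sum(yi for _, yi in nearest) / K

# Toy one-dimensional training set sampled from y = x.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict((1.1,), train_x, train_y, K=3))
```

Because the prediction is an average over training targets, a query far from all training points, such as x = 6 in this toy set, still returns an average of distant values, which illustrates why predictions are only reliable in regions covered by the training data.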
Figure 17.5(b) depicts the decision tree corresponding to the splitting of R2 in
Fig. 17.5(a). In general, splitting is performed according to a set of rules defined by
logical functions that partition data as evenly as possible into the different regions Ri .
As Fig. 17.5(b) illustrates, a decision tree consists of a root node and internal nodes
set by the splitting rules, which ultimately dictate the path from the root node to a
set of terminal or decision nodes, also referred to as leaves. In regression problems,
splitting rules are determined by minimization of the relative errors (or variances) at
each split until the tree grows to a pre-specified cutoff or until the data can no longer
be split. The weight associated with each point in the training set is given by
\[
\tilde{W}(\mathbf{x}, \mathbf{x}_i) = \begin{cases} \dfrac{1}{K} & i = 1, ..., K \\ 0 & \text{otherwise.} \end{cases} \tag{17.6.6}
\]
The distinction between W (x, xi ) and W̃ (x, xi ) will become clear by the end of
this paragraph. Note that W̃ (x, xi ) also represents the weight of each point for a
single decision tree. Here, K is the number of points within the same leaf at the target
point x. Unlike K-nearest neighbors, the number of neighbors within each leaf can
vary among leaves in a tree. The difficulty with the use of a single decision tree in
applying eqn. (17.6.1) for regression is that one tree has a tendency to overfit the
training data. One way to avoid this overfitting problem when using decision trees
is to divide the training data into random subsets, an approach known as bootstrap
aggregation or bagging, and to create a decision tree for each

2 Weighted neighbor methods define a directed, weighted graph structure on a data set, in which
nodes are represented by the data points {(x_i, y_i)} and edges are directed from point i to point j,
assuming that j is among the K neighbors of i. The weight of each edge is given by the weight function
W connecting points i and j.
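\[
\langle a \rangle = \frac{\int d\mathbf{s}\, \langle a \rangle_{\mathbf{r}}(\mathbf{s})\, e^{-\beta A(\mathbf{s})}}{\int d\mathbf{s}\, e^{-\beta A(\mathbf{s})}}, \tag{17.7.1}
\]
\[
\langle a \rangle_{\mathbf{r}} = \frac{\int d\mathbf{r}\, a(\mathbf{r})\, e^{-\beta U(\mathbf{r})} \prod_{\alpha=1}^{n} \delta\left(f_\alpha(\mathbf{r}) - s_\alpha\right)}{\int d\mathbf{r}\, e^{-\beta U(\mathbf{r})} \prod_{\alpha=1}^{n} \delta\left(f_\alpha(\mathbf{r}) - s_\alpha\right)} \tag{17.7.2}
\]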
(cf. eqn. (8.6.6)). Here, fα (r) is a set of collective variables and U (r) is the potential
Fig. 17.6 Four molecules employed to test machine learning regression of free-energy land-
scapes: (a) alanine dipeptide, (b) alanine tripeptide, (c) met-enkephalin oligopeptide (amino
acid sequence Tyr-Gly-Gly-Phe-Met), (d) zwitterionic alanine pentapeptide (reproduced with
permission from Cendagorta et al., J. Phys. Chem. B 124, 3647 (2020), copyright American
Chemical Society).
For this comparative study, we will focus on a set of small peptides, commonly used
as benchmark cases, and the corresponding conformational free-energy landscapes as a
function of their backbone Ramachandran dihedral (φ, ψ) angles, which are used as col-
lective variables. The four systems are: the alanine dipeptide, the alanine tripeptide,
and the oligopeptide met-enkephalin (amino acid sequence Tyr-Gly-Gly-Phe-Met),
which are studied in vacuum; and the alanine pentapeptide, which is studied in zwit-
terionic form in aqueous solution. These molecules are pictured in Fig. 17.6. For the
alanine dipeptide, there are just two Ramachandran angles; for the alanine tripeptide,
the number of angles used is four. For met-enkephalin, ten angles are needed, and
for the solvated alanine pentapeptide, the inner three residues and corresponding six
Ramachandran angles are selected, as these are the same as have been used in experi-
mental studies (Feng et al., 2016). The gas-phase simulations are performed using the
CHARMM22 force field (MacKerell et al., 1998) while the solvated alanine pentapep-
tide is simulated using the OPLS-AA force field (Jorgensen et al., 1996). All of the
training data for the machine learning models are generated from d-AFED/TAMD
simulations (see Section 8.10). The simulation parameters are set as follows: for the
tables. It is worth noting that for larger training set sizes used with random forests,
the number of trees is more than 200 for the di- and tripeptides and approximately 50
for met-enkephalin and the alanine pentapeptide. The learning curves are computed
with respect to the test set using the L2 error formula
where Ntest is the number of points in the test set (here, 50,000) and Atest (sj ) is the
(known) free energy at the jth test point.
Generating observables. In order to test the ability of the trained machine learning
models to generate observables from eqn. (17.7.1), we select different types of observ-
ables for each system. For the alanine tripeptide, we study the following “observable”:
\[
O(\{\phi, \psi\}) = \sqrt{\frac{1}{2n} \sum_{i=1}^{n} \left[ \left(\phi_i - \phi_i^{(\min)}\right)^2 + \left(\psi_i - \psi_i^{(\min)}\right)^2 \right]}, \tag{17.7.4}
\]
where n is the number of Ramachandran angle pairs used to generate the free-energy
surface (n = 2 for the tripeptide). The angles $\phi_i^{(\min)}$ and $\psi_i^{(\min)}$ are the angles at the
global minimum of the free-energy surface. Although this is not a physical observ-
able, it is a sensitive test of the ability of the machine learning model to generate
an observable that depends on the full set of collective variables. For met-enkephalin,
we compute the average of the HN Hα nuclear magnetic resonance (NMR) J-couplings,
which characterize the indirect interaction between the nuclear spins of the Cα hydrogen
and the amide hydrogen. These J-couplings can be computed using the Karplus
equation (Karplus, 1959)
\[
J(\phi) = A \cos^2(\phi - \phi_0) + B \cos(\phi - \phi_1) + C, \tag{17.7.5}
\]
where φ is the Ramachandran angle, A = 7.09 Hz, B = 1.42 Hz, C = 1.55 Hz, and the
constant angles φ0 and φ1 are both 60°. J(φ) is computed for each amino acid residue
in the oligopeptide. Finally, for the alanine pentapeptide, we focus on the propensities
for different secondary structural motifs, specifically, α helix, β sheet, and the left-
handed polyproline II helix (ppII). These are defined by simple indicators that are
functions of φ and ψ and define specific regions in the φ-ψ plane for each alanine
residue. The definitions are as follows:
\[
-90^\circ < \phi < -20^\circ \quad \text{and} \quad -180^\circ < \psi < 120^\circ. \tag{17.7.6}
\]
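To make the use of the Karplus relation in eqn. (17.7.5) concrete, the following Python sketch evaluates J(φ) with the constants quoted above and forms a Boltzmann-weighted average over a grid of angles, which is a discrete analog of eqn. (17.7.1) for an observable that depends only on a collective variable. The one-dimensional free-energy profile, the value of β, and the function names are hypothetical choices made only for illustration.

```python
import math

A_HZ, B_HZ, C_HZ = 7.09, 1.42, 1.55   # Karplus constants quoted in the text (Hz)
PHI0_DEG = PHI1_DEG = 60.0            # both reference angles are 60 degrees

def karplus_j(phi_deg):
    """J-coupling (Hz) from eqn. (17.7.5) for a Ramachandran angle in degrees."""
    c0 = math.cos(math.radians(phi_deg - PHI0_DEG))
    c1 = math.cos(math.radians(phi_deg - PHI1_DEG))
    return A_HZ * c0 ** 2 + B_HZ * c1 + C_HZ

def boltzmann_average(values, free_energies, beta):
    """Grid estimate of sum_s a(s) exp(-beta A(s)) / sum_s exp(-beta A(s))."""
    a_min = min(free_energies)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-beta * (a - a_min)) for a in free_energies]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical one-dimensional free-energy profile with a minimum at phi = -80 degrees.
grid = [-180.0 + 5.0 * k for k in range(72)]
profile = [2.0 * (1.0 - math.cos(math.radians(phi + 80.0))) for phi in grid]
j_avg = boltzmann_average([karplus_j(phi) for phi in grid], profile, beta=1.0)
print(j_avg)
```

In an actual application, the toy profile would be replaced by the free-energy surface predicted by the trained machine learning model on a grid or set of sampled points.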
For the OPLS-AA force field used here, the populations of α, β, and ppII are 14%,
48%, and 37%, respectively. The remaining 1% of structures are characterized simply
a manner consistent with the learning curve in Fig. 17.7. Interestingly, for smaller
training set sizes, we see that the random-forest method performs marginally better
than the neural network and the least-squares support-vector machine, suggesting that
the random forest is learning the low free-energy regions with fewer data points than
the kernel and neural network models. For the conformational populations of the
alanine pentapeptide, we see that the neural network generates the most accurate
averages across the three populations, consistent with the learning curve in Fig. 17.8.
For met-enkephalin and the calculation of the average J-couplings for each of the
five amino acid residues, we see from Fig. 17.8 that the neural network exhibits the
lowest overall error in generating converged averages, outperforming the least-squares
support-vector machine. This is somewhat surprising given that the latter achieved
better overall accuracy of the global free-energy surface, as reflected in the learning
curve. More surprising, perhaps, are the accurate averages generated by the random
forest for both small and large training set sizes for all residues except Phe.
For insight into the performance of the various methods for met-enkephalin, we
show, in Fig. 17.9, a scatter plot of 5000 randomly selected points from the test set on
models trained using 10^5 training points. The plot shows the difference between free-
In this section, we discuss leveraging classification neural networks for the design of
collective variables that can describe rare-event processes for use in the enhanced
sampling methods described in Chapters 7 and 8. We will apply classification to study
a solid-solid phase transition in a bulk atomic crystal.
Suppose the crystal has p solid phases. If a sample of this bulk material contains
some amount of thermal disorder, we seek to employ machine learning to classify this
sample as one of the p phases. Beyond this, if there are regions in the sample where
multiple phases coexist, the machine learning model should be able to identify all of
these phases in such a region. A machine learning model trained to perform these
classification tasks could be used to design a collective variable capable of driving
transitions between different phases. In order to devise such a classification neural
network, we require suitable descriptors as input functions. These descriptors need
to represent the local environment of each atom in the system, which will depend
on distances between an atom and its nearest neighbors as well as angles between
the vectors joining the atom to its neighbors. Descriptor functions that capture these
features should satisfy a number of criteria: first, they must be invariant with respect
to rotations, translations, and exchanges between atoms of the same chemical element;
second, they need to be smooth, differentiable functions of the atomic coordinates; and
third, they should be short-ranged in order to capture only nearest neighbors. Ideally,
we prefer to work with a small number of relatively simple functions.
One possible choice of descriptors is a set of functions known as symmetry func-
tions, originally introduced by Behler and Parrinello (2007) (see, also, Behler (2011)),
for the development of neural network potential energy functions (see Appendix C)
and suggested by Geiger and Dellago (2013) as useful descriptors of atomic environ-
ments. These functions, being evaluated within a spherical region around an atom,
start with a simple cutoff function fc (r). Some choices of this function are a Fermi
function
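\[
f_c(r) = \begin{cases} \dfrac{1}{1 + e^{\alpha_c (r - r_c + \varepsilon_c)}} & r < r_c \\ 0 & \text{otherwise} \end{cases} \tag{17.7.7}
\]
and a cosine function
\[
f_c(r) = \begin{cases} 1 & r \le r_{\min} \\ \dfrac{1}{2}\left[\cos\left(\pi \dfrac{r - r_{\min}}{r_c - r_{\min}}\right) + 1\right] & r_{\min} < r \le r_c \\ 0 & r > r_c. \end{cases} \tag{17.7.8}
\]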
Here, rc is a cutoff radius that defines the spherical region within which neighbors
are considered. From an appropriately defined cutoff function, we build up a series
of symmetry functions that capture different features of the environment around an
atom at position ri . If the system has N atoms at positions r1 , ..., rN ≡ r, then in
terms of these positions, the simplest such function is
\[
G_1^{(i)}(\mathbf{r}) = \sum_{j \ne i} f_c(|\mathbf{r}_{ij}|), \tag{17.7.9}
\]
where rij = ri − rj . The sum in eqn. (17.7.9) is, in principle, taken over all j; however,
because of the short-range nature of fc (r), the sum only involves neighbors of atom
i within the cutoff radius rc . Moreover, because G1 is defined purely in terms of
fc (r), these neighbors are given roughly equal weight. Other symmetry functions give
different weights to these neighbors. For example, the function
\[
G_2^{(i)}(\mathbf{r}) = \sum_{j \ne i} e^{-\eta\left(|\mathbf{r}_{ij}| - R_s\right)^2} f_c(|\mathbf{r}_{ij}|) \tag{17.7.10}
\]
weights neighbors whose distances from ri are close to the distance parameter Rs
more than those whose distances are significantly different from Rs . The inverse width
parameter η determines how quickly this weight decays to zero. Different choices of Rs
and η define different G2 symmetry function choices. In practice, we might use a range
of values of Rs and η to capture different features of the local environment. Another
such symmetry function employs a cosine weighting, i.e.,
\[
G_3^{(i)}(\mathbf{r}) = \sum_{j \ne i} \cos\left(\kappa |\mathbf{r}_{ij}|\right) f_c(|\mathbf{r}_{ij}|). \tag{17.7.11}
\]
Here, the parameter κ modulates the periodicity of the cosine function such that neighbors
whose distances from atom i satisfy κ|r_ij| = nπ will have large positive or negative
weights, depending on the value of n, and zero weight if κ|r_ij| = (2n − 1)π/2. The
symmetry functions G1 , G2 , and G3 depend only on the distances between neighbors
of atom i. Other symmetry functions incorporate angular dependence between the
vectors rij and rik . An example of such a symmetry function is
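\[
G_4^{(i)}(\mathbf{r}) = \frac{1}{2^{\xi}} \sum_{j \ne i} \sum_{k \ne i} \left(1 + \lambda \cos\theta_{ijk}\right)^{\xi} e^{-\eta\left(|\mathbf{r}_{ij}|^2 + |\mathbf{r}_{ik}|^2 + |\mathbf{r}_{jk}|^2\right)} \cdots
\]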
which does not restrict the distance between neighbors j and k of i. In eqns. (17.7.12)
and (17.7.13), the parameter λ is either 1 or −1, while the parameter ξ modulates
the angular resolution. Apart from the symmetry functions, other useful descriptors
capable of capturing angular information in the local environment are the Steinhardt
bond-order parameters (Steinhardt et al., 1983). These are defined as
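\[
G_{q_l}^{(i)}(\mathbf{r}) = \sqrt{\frac{4\pi}{2l + 1} \sum_{m=-l}^{l} \left| q_{lm}^{(i)}(\mathbf{r}) \right|^2} \tag{17.7.14}
\]
where
\[
q_{lm}^{(i)}(\mathbf{r}) = \frac{\sum_{j \ne i} Y_{lm}(\theta_{ij}, \phi_{ij})\, f_c(|\mathbf{r}_{ij}|)}{\sum_{j \ne i} f_c(|\mathbf{r}_{ij}|)}. \tag{17.7.15}
\]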
Here, Ylm (θ, φ) is a spherical harmonic, and θij and φij are the polar and azimuthal
angles of the vector rij . The combination of symmetry functions and spherical har-
monics reduces the number of descriptors needed to describe local environments in an
atomic crystal.
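The radial symmetry functions in eqns. (17.7.9) through (17.7.11), built on the cosine cutoff of eqn. (17.7.8), can be sketched in Python as follows. This sketch assumes an isolated cluster of atoms without periodic boundary conditions; the function names and the small test configuration are hypothetical.

```python
import math

def cutoff(r, r_min, r_c):
    """Cosine cutoff of eqn. (17.7.8): 1 inside r_min, smooth decay to 0 at r_c."""
    if r <= r_min:
        return 1.0
    if r <= r_c:
        return 0.5 * (math.cos(math.pi * (r - r_min) / (r_c - r_min)) + 1.0)
    return 0.0

def radial_symmetry_functions(i, positions, r_min, r_c, eta, R_s, kappa):
    """Return (G1, G2, G3) for atom i, eqns. (17.7.9)-(17.7.11)."""
    g1 = g2 = g3 = 0.0
    for j, r_j in enumerate(positions):
        if j == i:
            continue
        r_ij = math.dist(positions[i], r_j)   # |r_ij|, no periodic images (assumption)
        fc = cutoff(r_ij, r_min, r_c)
        g1 += fc
        g2 += math.exp(-eta * (r_ij - R_s) ** 2) * fc
        g3 += math.cos(kappa * r_ij) * fc
    return g1, g2, g3

# Hypothetical configuration (distances in Angstrom); the last atom lies beyond r_c.
positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.9, 0.0), (10.0, 0.0, 0.0)]
print(radial_symmetry_functions(0, positions, r_min=3.8, r_c=4.0,
                                eta=20.0, R_s=3.0, kappa=3.5))
```

Because only interatomic distances enter, the values are unchanged by a rigid translation or rotation of the configuration and by a relabeling of the neighbors, as required of the descriptors.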
We will now apply these descriptors to the specific case of the transformation
from the metastable A15 phase in solid molybdenum to the stable BCC (body-
centered cubic) phase. A snapshot showing the coexistence between these two phases
in a single simulation cell is shown in Fig. 17.10. The transition occurs via the migration
of the interface between the two phases to the right, which transforms each layer of
the A15 phase (on the right) to the BCC phase (on the left).
In order to classify both pure and mixed phases in solid molybdenum, we only
need eleven radial symmetry functions of the G2 and G3 type and three Steinhardt
parameters corresponding to l =6, 7, and 8. The parameters Rs range between 2.8 Å
and 6.0 Å with η fixed at 20 Å−2 for G2 while κ ranges from 3.5 Å−1 to 7.0 Å−1 for
G3 . The cutoff function in eqn. (17.7.8) is employed with rmin = 3.8 Å, and rc = 4.0
Å. With these, we can distinguish four solid phases, A15, BCC, FCC (face-centered
cubic), and HCP (hexagonal close-packed) that can exist in the system, as well as
disordered or “liquid-like” phases and mixtures of these various phases.
We now proceed to describe the training procedure of the classification neural
network. Because the descriptors take in raw atomic coordinates and transform them
into translationally and rotationally invariant local environment variables, the only
input data we need for training are system configurations, which can be generated
The global classifier serves as a reporter on the extent to which the entire system is in
one phase or the other. In Fig. 17.10, the value of Qbcc = 0.20 while QA15 = 0.52.
In order to drive the transition, the collective variable we employ is expressed
as a path in the vector space in which the global classifier Q exists. The reason for
working in this space is that it avoids the need to choose physical configurations be-
tween the A15 and BCC phases in order to construct a physical path (Branduardi
et al., 2007). Such a physical path could be biased by preconceived notions of how
the transition should occur. Working in classifier space allows the neural network to
decide what configurations, including pure and mixed phases, exist during the transi-
tion, which is likely to be quite complex and involve multiple ordered and disordered
local environments. Thus, let Q1 , ..., QP be a set of P nodal points along a puta-
tive path between the phases. This putative path exists in the two-dimensional space
(Qbcc , QA15 ) constructed from the BCC and A15 components of the Q vector. In
this particular example, we start with an interface already present in the system such
This collective variable is illustrated in Fig. 17.12(b). The function in eqn. (17.7.18) can
be used either as an additional collective variable in an enhanced-sampling simulation
or to construct a restraining potential (Cuendet et al., 2018; Rogal et al., 2019)
\[
U_r(\mathbf{r}) = \frac{1}{2} \kappa_z \left(z(\mathbf{Q}(\mathbf{r}))\right)^2, \tag{17.7.19}
\]
where the parameter κz determines the tightness of the restraint. If such a restraint is
used in a simulation, then the bias must be removed, which requires reweighting with
a factor exp(βUr (r)), in order to obtain final results.
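The reweighting step can be illustrated with a short Python sketch. It assumes that we have stored, for each sampled configuration, the value of an observable and the value of the restraint energy Ur, and that unbiased averages are estimated as a ratio of sums weighted by exp(βUr); the function name and sample values are hypothetical.

```python
import math

def reweighted_average(values, restraint_energies, beta):
    """Unbiased average: sum_k a_k exp(beta U_k) / sum_k exp(beta U_k)."""
    u_max = max(restraint_energies)   # shift to avoid overflow; cancels in the ratio
    weights = [math.exp(beta * (u - u_max)) for u in restraint_energies]
    return sum(w * a for w, a in zip(weights, values)) / sum(weights)

# Two hypothetical samples: the second was suppressed by the restraint
# by a factor of 3, so it receives three times the weight.
values = [1.0, 3.0]
restraint_energies = [0.0, math.log(3.0)]
print(reweighted_average(values, restraint_energies, beta=1.0))  # approximately 2.5
```

Configurations with a large restraint energy are visited rarely in the biased simulation and are therefore given correspondingly large weights, so a very tight restraint can make the reweighted estimate statistically noisy.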
The path collective variable in eqn. (17.7.17) can now be used in an enhanced
sampling simulation, such as d-AFED/TAMD or metadynamics, in order to drive
the structural phase transition via migration of an interface created between the two
phases. Note that using a neural network in this way incorporates machine learning
directly into the enhanced sampling procedure rather than merely using it as a post-
processing tool. In particular, the neural network performs a classification “on the fly”
as each configuration generated by the simulation is fed into it, and it immediately
outputs a classification for each atom in the system at that instant in the simulation
from which Q(r) and f (Q(r)) can be determined. For use in molecular dynamics
simulations, it is critical that the neural network employed be everywhere smoothly
differentiable, which restricts the choice of activation functions.
If an enhanced sampling simulation is performed in a canonical ensemble at 300
K with fixed volume, as is shown in Fig. 17.13 for (a) d-AFED/TAMD and (b) meta-
dynamics simulations (Rogal et al., 2019), then the metastability of the A15 phase
is not revealed. The reason is that the two phases have different lattice parameters,
and one simulation box size cannot accommodate both phases. Nevertheless, there is
a clear free-energy barrier revealed in both profiles, which is approximately 0.5 eV
≈ 48.2 kJ/mol and which agrees with previous independent computational studies
performed on the same system (Duncan et al., 2016). This free-energy barrier corre-
sponds to the thermodynamic loss of converting each layer in the A15 crystal to the
BCC structure under the constant-volume conditions. If we switch from the canonical
to the isothermal-isobaric ensemble at 1 atm, then, as is revealed in Fig. 17.13(c), the
3 The staircase-like profile is sometimes referred to as a “Galton staircase” after Sir Francis Galton
(1822–1911), inventor of the Galton board, which is used to demonstrate normal distributions. The
Galton staircase can be modeled by the functional form A(s) = A0 cos(αs) − λs. As shown by
Liu and Tuckerman (2000), this type of function is a particularly challenging one for deterministic
thermostatting techniques.
\[
\pi_B(q(\mathbf{r})) = \frac{1 + \tanh(q(\mathbf{r}))}{2}. \tag{17.7.20}
\]
We also presented an algorithm for computing a committor distribution, which, though
Apart from this simple linear model, any of the machine learning models, such as a
feed-forward neural network or a kernel model, could be employed to represent q(r).
Once a model is chosen, we train it by optimizing the parameters w. Mori et
al. suggest the use of a binary classification scheme to achieve the required training.
Within such a scheme, the machine learning model q(r, w) is substituted into eqn.
(17.7.20) and the loss function (cf. eqn. (17.5.40))
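\[
E(\mathbf{w}) = -\sum_{k=1}^{M} p_B^*(\mathbf{r}^{(k)}) \ln \pi_B(q(\mathbf{r}^{(k)}, \mathbf{w})) - \sum_{k=1}^{M} \left[1 - p_B^*(\mathbf{r}^{(k)})\right] \ln\left[1 - \pi_B(q(\mathbf{r}^{(k)}, \mathbf{w}))\right], \tag{17.7.22}
\]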
where M is the number of training points, is used to perform the optimization. In eqn.
(17.7.22), we interpret $\mathbf{r}^{(k)}$ as a point in configuration space from which a trajectory is
initiated that can end in either state A or state B. If the trajectory ends in A, then the
target committor value $p_B^*(\mathbf{r}^{(k)}) = 0$, and if it ends in B, then $p_B^*(\mathbf{r}^{(k)}) = 1$. One way
to generate the trajectories needed to obtain the training data is to use the techniques
in Section 7.7, such as aimless shooting. It is also helpful to add a regularization term
into the loss function in eqn. (17.7.22) in order to avoid overfitting.
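The binary classification scheme can be sketched in Python for the simple linear model, in which q is a linear combination of collective-variable values. The sketch below minimizes the loss in eqn. (17.7.22) by plain gradient descent without a regularization term; it uses the identity dπB/dq = 2πB(1 − πB), which follows from eqn. (17.7.20) and gives ∂E/∂q = 2(πB − p*B) for each training point. The function names, learning rate, and toy data are hypothetical.

```python
import math

def pi_B(q):
    """Model committor of eqn. (17.7.20)."""
    return 0.5 * (1.0 + math.tanh(q))

def q_linear(features, w):
    """Linear reaction-coordinate model: w[0] + sum_alpha w[alpha] * f_alpha."""
    return w[0] + sum(wa * fa for wa, fa in zip(w[1:], features))

def cross_entropy(w, data, eps=1e-12):
    """Loss of eqn. (17.7.22); data holds (features, target committor) pairs."""
    E = 0.0
    for features, p_star in data:
        p = min(max(pi_B(q_linear(features, w)), eps), 1.0 - eps)  # keep logs finite
        E -= p_star * math.log(p) + (1.0 - p_star) * math.log(1.0 - p)
    return E

def gradient(w, data):
    """Analytical gradient of the loss using dE/dq = 2 (pi_B - p_star)."""
    g = [0.0] * len(w)
    for features, p_star in data:
        dE_dq = 2.0 * (pi_B(q_linear(features, w)) - p_star)
        g[0] += dE_dq
        for a, fa in enumerate(features, start=1):
            g[a] += dE_dq * fa
    return g

def train(data, n_features, rate=0.05, steps=500):
    """Plain gradient descent starting from w = 0 (an assumed optimizer)."""
    w = [0.0] * (n_features + 1)
    for _ in range(steps):
        w = [wi - rate * gi for wi, gi in zip(w, gradient(w, data))]
    return w

# Toy data: one collective variable; trajectories from f < 0 end in A, f > 0 in B.
data = [((-1.0,), 0.0), ((-0.5,), 0.0), ((0.5,), 1.0), ((1.0,), 1.0)]
w_fit = train(data, n_features=1)
print(w_fit, cross_entropy(w_fit, data))
```

For perfectly separable data such as this toy set, the weights keep growing as training proceeds, which is one practical reason for adding a regularization term to the loss.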
An alternative scheme for predicting reaction coordinates is via regression learning
with a least-squares loss function. Suppose we have generated enough trajectories from
a point $\mathbf{r}^{(k)}$ to obtain a converged committor distribution value $p_B(\mathbf{r}^{(k)})$ corresponding
to $\mathbf{r}^{(k)}$. Then, we can obtain a value $q^{(k)}$ for the reaction coordinate corresponding to
$\mathbf{r}^{(k)}$ by inverting eqn. (17.7.20), as
\[
q^{(k)} = \tanh^{-1}\left(2 p_B(\mathbf{r}^{(k)}) - 1\right). \tag{17.7.23}
\]
Given a model q(r, w) for the reaction coordinate, we then train the machine learning
model using these committor values via the least-squares loss function
Here, Reg(w) is a regularization term, which could be a standard ridge form, a lasso
form, or an elastic net form. An optimal choice of the regularization term will depend
on the choice of the machine learning model q(r, w). Finally, one could represent the
committor distribution pB (r) as a linear combination of collective variables as
\[
p_B(\mathbf{r}, \boldsymbol{\omega}) = \omega_0 + \sum_{\alpha=1}^{n} \omega_\alpha f_\alpha(\mathbf{r}), \tag{17.7.25}
\]
and train the coefficients ω on target committor values $p_B^*(\mathbf{r}^{(k)})$ using either a cross-
entropy or least-squares loss function.
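The regression route can be sketched in Python as follows: converged committor values are inverted through eqn. (17.7.23), and a linear model in a single collective variable is then fit to the resulting q values by least squares. The sketch assumes a ridge penalty applied to the slope only, with the intercept left unpenalized; this choice, the clipping of committor values near 0 and 1, and the function names are assumptions made for illustration.

```python
import math

def committor_to_q(p_B, eps=1e-12):
    """Invert eqn. (17.7.20): q = atanh(2 p_B - 1), eqn. (17.7.23)."""
    p = min(max(p_B, eps), 1.0 - eps)   # keep atanh finite at p_B = 0 or 1
    return math.atanh(2.0 * p - 1.0)

def fit_linear_q(f_values, q_values, ridge=0.0):
    """Least-squares fit of q = w0 + w1 * f with a ridge penalty on w1 only."""
    n = len(f_values)
    f_bar = sum(f_values) / n
    q_bar = sum(q_values) / n
    s_fq = sum((f - f_bar) * (q - q_bar) for f, q in zip(f_values, q_values))
    s_ff = sum((f - f_bar) ** 2 for f in f_values)
    w1 = s_fq / (s_ff + ridge)
    w0 = q_bar - w1 * f_bar
    return w0, w1

# Synthetic committor values generated from a known linear reaction coordinate.
f_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
p_values = [0.5 * (1.0 + math.tanh(0.3 + 1.2 * f)) for f in f_values]
q_values = [committor_to_q(p) for p in p_values]
print(fit_linear_q(f_values, q_values))   # close to (0.3, 1.2)
```

Committor values very close to 0 or 1 map to large values of |q| and can dominate a least-squares fit, so in practice the training points are most informative in the transition region.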
Note that different clusters will have different numbers mγ of points, such that m1 +
m2 + · · · + mK = n. When all n points have been assigned to clusters in this way,
a new set of cluster centroids µ1 , ..., µK is generated by computing the average over
the data points in each cluster. Once the new centroids have been determined, the
n data points are reassigned to clusters by computing new distances diγ and using
eqn. (17.8.1) to determine new cluster membership. Most likely, the assignments will
change, and, consequently, the numbers m1 , ..., mK of data points in each cluster will
change as well. The procedure is repeated as many times as needed until the cluster
assignments no longer change. The K-means procedure is illustrated in Fig. 17.14 for
a two-dimensional data set with two clusters.
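The K-means iteration described above can be sketched in Python as follows. The sketch assumes that each point is assigned to the centroid at the smallest Euclidean distance and that initial centroids are supplied by the caller; the function name and the toy data are hypothetical.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda g: math.dist(p, centroids[g]))
            for p in points
        ]
        if new_assignments == assignments:
            break                       # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its member points.
        for g in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == g]
            if members:                 # leave an empty cluster's centroid unchanged
                centroids[g] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Two well-separated groups; both initial centroids start inside the first group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
print(centroids, assignments)
```

Even with a poor initial guess, the centroids in this example migrate to the two groups within a few iterations; for less well-separated data, the result can depend on the initial centroids.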
K-means clustering is both efficient and straightforward to implement. However,
the parameter K must be determined a priori. This can be done by running the
algorithm for different values of K, and for each K, calculating the average distance of
all data points to their assigned cluster centroids. When plotted as a function of K, this
average distance should fall off suddenly at some value of K; this value is the optimal
one. Additionally, K-means clustering tends to assign outlier data points inaccurately
because such points are difficult to assign to clusters and can pull centroid positions
away from regions of high data population. As a final note, the K-means method can
lead to inaccurate assignment of any data point that sits at a boundary between two
clusters; this problem can be treated using fuzzy clustering approaches (Gustafson
and Kessel, 1978; Bezdek et al., 1984; Corsini et al., 2005; Tzanov et al., 2014), and
although we will not describe these methods in detail here, the basic idea of fuzzy
clustering is to assign points to multiple clusters with a weight for membership in each
where θ(z) is the Heaviside step function. The quantity rc is a cutoff distance. Accord-
ing to eqn. (17.8.2), ρi simply counts the number of points that are within a distance
rc of xi . In order to obtain di , we use the definition
That is, we compute the smallest distance between the point xi and any other point
of higher density. If xi is already the point of highest density, then we can compute
di = maxj rij . Cluster centroids are now recognized as points xi for which di is anoma-
lously large. The algorithm is illustrated in Fig. 17.15. Once the cluster centroids are
determined, cluster membership of each remaining point is determined by assigning
it to the same cluster as its nearest neighbor of higher density. Thus, the assignment
begins with the cluster centroids, themselves, as these are points of maximum local
density.
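The density-based procedure can be sketched in Python as follows. The sketch assumes Euclidean distances, excludes the point itself from its own density count, and takes the centroid indices as given (in practice they would be chosen by inspecting the points with anomalously large di). It also assumes that every non-centroid point has at least one neighbor of strictly higher density. The function names and toy data are hypothetical.

```python
import math

def density_peaks(points, r_c):
    """Return local densities rho_i and distances d_i to the nearest denser point."""
    n = len(points)
    r = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # rho_i: number of other points within the cutoff distance r_c of point i.
    rho = [sum(1 for j in range(n) if j != i and r[i][j] < r_c) for i in range(n)]
    d = []
    for i in range(n):
        higher = [r[i][j] for j in range(n) if rho[j] > rho[i]]
        # For the densest point, use the largest distance to any other point.
        d.append(min(higher) if higher else max(r[i]))
    return rho, d

def assign_clusters(points, rho, centroids):
    """Give each point the label of its nearest neighbor of higher density."""
    n = len(points)
    labels = [None] * n
    for label, c in enumerate(centroids):
        labels[c] = label
    # Visit points from high to low density so denser neighbors are labeled first.
    for i in sorted(range(n), key=lambda k: rho[k], reverse=True):
        if labels[i] is None:
            higher = [j for j in range(n) if rho[j] > rho[i]]
            nearest = min(higher, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return labels

# Two small groups, each with a central point surrounded by close neighbors.
points = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1)]
rho, d = density_peaks(points, r_c=0.12)
labels = assign_clusters(points, rho, centroids=[0, 5])
print(rho, d, labels)
```

In this toy set, the two central points have values of di that are roughly seventy times larger than those of all other points, which is the signature used to recognize them as cluster centroids.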
It was suggested in Section 17.5 that the change in intrinsic dimension through the
layers of a neural network can be an important metric for evaluating the performance
of the network. Facco et al. (2017) introduced an approach for estimating this intrinsic
dimension using only the nearest and second nearest neighbors of each point on the
data manifold. Given a set of n data points x1, ..., xn, let r1, ..., rk be the distances from
a point xi to its k nearest neighbors in the data set. If these distances are arranged in ascending
order such that r1 < r2 < r3 < · · · < rk, then r1 and r2 are the distances to the nearest
and second-nearest neighbors of xi, respectively. Introduce the ratio µ = r2/r1. Since
r2 > r1 , it follows that µ ∈ [1, ∞). If the ratio µ is computed for every point in the data
set, then n values of µ, µ1 , ..., µn will be obtained, and a histogram of µ values can be
\[
d = -\frac{\ln\left(1 - P(\mu)\right)}{\ln \mu}. \tag{17.9.3}
\]
Equation (17.9.3) prescribes a straightforward approach for calculating the intrinsic
dimension of a data manifold: We only need to compute the probability P (µ) of different
values of µ and feed these µ and P (µ) into eqn. (17.9.3); the result will be the intrinsic
dimension d of the data set. The result of applying eqn. (17.9.3) to the sampled data
set for the alanine dipeptide, shown in Fig. 17.17, shows that the intrinsic dimension of
data used to obtain the free-energy surface from a d-AFED/TAMD simulation using
the Ramachandran angles as collective variables (see Fig. 8.5) is two, as expected. Since
this is a data set from an enhanced sampling simulation, the plot is somewhat noisier
than what we would expect to observe for the synthetic data in Facco et al. (2017);
however, a trend toward the value of two is clear. If the dimensionality of the data
were not known a priori, this type of analysis would be capable of revealing it.
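A minimal Python sketch of this estimate is given below. It assumes that P(µ) is approximated by the empirical cumulative distribution of the sorted µ values, that the largest 10% of the µ values are discarded to reduce noise in the tail, and that d is obtained as the slope of a straight line through the origin fit to −ln(1 − P(µ)) versus ln µ. These choices, the function name, and the synthetic data are assumptions made for illustration.

```python
import math
import random

def two_nn_dimension(points, discard_fraction=0.1):
    """Estimate the intrinsic dimension from nearest and second-nearest distances."""
    mus = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mus.append(dists[1] / dists[0])      # mu = r2 / r1 for point i
    mus.sort()
    n = len(mus)
    keep = int(n * (1.0 - discard_fraction)) # drop the largest mu values
    xs = [math.log(mus[k]) for k in range(keep)]
    ys = [-math.log(1.0 - (k + 1) / n) for k in range(keep)]  # empirical P = (k+1)/n
    # Slope of the least-squares line through the origin, cf. eqn. (17.9.3).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Synthetic data: points on a two-dimensional plane embedded in three dimensions.
rng = random.Random(0)
plane = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
d_est = two_nn_dimension(plane)
print(d_est)
```

Although the points are specified by three coordinates, the estimate should be close to two, because only the ratio of neighbor distances on the underlying manifold enters the calculation.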
\[
\langle x \rangle = \nu
\]
\[
\langle x^2 \rangle - \langle x \rangle^2 = \nu (1 - \nu).
\]
17.2. Recall that the Shannon entropy S[f ] for a probability distribution f (x) is
\[
S[f] = -\int dx\, f(x) \ln f(x).
\]
17.4. Starting from the data model in eqn. (17.4.9), optimize the loss function in
eqn. (17.4.10) and show that the analytical solution is given by eqn. (17.4.8).
17.5. For each of the diagrams shown above, write explicit expressions for the cor-
*17.7. Derive eqns. (17.5.19) and (17.5.20).
*17.8. Gradient training of a neural network requires optimization of the loss func-
tion given in eqn. (17.5.25). Derive the back propagation scheme for this loss
function.
17.10. The following ten data points are assumed to lie approximately along the
curve y(x) = sin(2πx): (0,0.30), (1/9, 0.86), (2/9, 1.00), (1/3, 0.98), (4/9,
17.12. Consider a classification neural network with C > 2 classes. Derive eqns.
(17.5.36) through (17.5.38) and show that the final activation function H(x)
should be the softmax function given in eqn. (17.5.39). If the neural network
has no hidden layers, to what explicit form does the learning model simplify?
17.14. The phase classification example in Section 17.7.2 employs molecular dynam-
ics based enhanced sampling in order to generate the free-energy profiles in
Fig. 17.13. The forces needed by these methods require derivatives of the form
∂f (Q(r))/∂ri on atom i. These derivatives are computed using the chain rule,
which means that products of the form
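\[
\frac{\partial f}{\partial \mathbf{Q}} \cdot \frac{\partial \mathbf{Q}}{\partial G_k} \frac{\partial G_k}{\partial \mathbf{r}_i}
\]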
17.15. Let P (x) and Q(x) be two normalized probability distributions. The rela-
tive Shannon entropy between P and Q with respect to P is known as the
Kullback-Leibler (KL) divergence and is defined by
\[
\mathrm{KL}(P\|Q) = -\int dx\, P(x) \ln \frac{Q(x)}{P(x)}.
\]
**17.17. Define the L2 norm of a function f (x1 , ..., xn ) of n variables over an n-
dimensional volume Ω as
\[
\|f(x_1, ..., x_n)\| = \left[ \int dx_1 \cdots dx_n \left(f(x_1, ..., x_n)\right)^2 \right]^{1/2}.
\]
Let x1 , ..., xn be variables such that xi ∈ [0, 1], i = 1, ..., n, and let φi (x)
be monotonically increasing functions φi : [0, 1] → [0, 1]. Let ε and δ, with
0 < ε < 1 and 0 < δ < 1, be ordinary numbers. Finally, let γ1 (x) be a function
such that γ1 : R → R, and suppose we can choose γ1 such that ||γ1 || ≤ ||f ||,
\[
\left\| f(x_1, ..., x_n) - \sum_{q=1}^{2n+1} \gamma_1\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right) \right\| \le (1 - \epsilon)\|f\|,
\]
and ||γ1 || = δ||f ||. Let us now define a series of functions γj : R → R and
hj : [0, 1]n → R such that
\[
h_j(x_1, ..., x_n) = \sum_{q=1}^{2n+1} \gamma_j\left( \sum_{i=1}^{n} \lambda_i \phi_q(x_i) \right).
\]
With these definitions, note that ||f − h1 || = (1 − ε)||f || and ||γ1 || = δ||f ||.
a. Show that this series of functions leads to an approximation to f (x1 , ..., xn )
such that
\[
\left\| f - \sum_{j=1}^{r} h_j \right\| \le (1 - \epsilon)^{r} \|f\|
\]
and
\[
\|\gamma_r\| = \delta (1 - \epsilon)^{r-1} \|f\|.
\]
and
\[
\lim_{r \to \infty} \|\gamma_r\| = 0.
\]
as in eqn. (17.5.1).
d. Can this procedure provide guidance on how to construct a feed-forward
neural network for regression of a function? Explain.