Bishop 1994
Rev. Sci. Instrum. 65 (6), June 1994    0034-6748/94/65(6)/1803/30/$6.00    © 1994 American Institute of Physics    1803
majority of practical applications, and therefore represent the
models which are likely to be of most direct interest to the
present audience. They form part of a general class of net-
work models known as feedforward networks, which have
been the subject of considerable research in recent years. A
guide to the neural computing literature, given at the end of
this review, should provide the reader with some suggested
starting points for learning about other models.
Much of the research on neural network applications reported in the literature appeals to ad hoc ideas, or loose analogies to biological systems. Here we shall take a "principled" view of neural networks, based on well established theoretical and statistical foundations. Such an approach frequently leads to considerably improved performance from neural network systems, as well as providing greater insight. A more extensive treatment of neural networks, from this principled perspective, can be found in the book Neural Networks for Statistical Pattern Recognition.1

FIG. 1. Schematic illustration of two biological neurons. The dendrites act as inputs, and when a neuron fires an action potential propagates along its axon in the direction shown by the arrow. Interaction between neurons takes place at junctions called synapses.

A. Overview of neural networks

The conventional approach to computing is based on an explicit set of programmed instructions, and dates from the work of Babbage, Turing, and von Neumann. Neural networks represent an alternative computational paradigm in which the solution to a problem is learned from a set of examples. The inspiration for neural networks comes originally from studies of the mechanisms for information processing in biological nervous systems, particularly the human brain. Indeed, much of the current research into neural network algorithms is focused on gaining a deeper understanding of information processing in biological systems. However, the basic concepts can also be understood from a purely abstract approach to information processing.1,2 For completeness we give a brief overview of biological neural networks later in this section. However, our focus in this review will be primarily on artificial networks for practical applications.

A feedforward neural network can be regarded as a non-linear mathematical function which transforms a set of input variables into a set of output variables. The precise form of the transformation is governed by a set of parameters called weights, whose values can be determined on the basis of a set of examples of the required mapping. The process of determining these parameter values is often called learning or training, and may be a computationally intensive undertaking. Once the weights have been fixed, however, new data can be processed by the network very rapidly. We shall find it convenient at several points in this review to draw an analogy between artificial neural networks and the standard technique of curve fitting using polynomial functions. A polynomial can be regarded as a mapping from a single input variable to a single output variable. The coefficients in the polynomial are analogous to the weights in a neural network, and the determination of these coefficients (by minimizing a sum-of-squares error) corresponds to the process of network training.

As well as offering high processing speed, neural networks have the important capability of learning a general solution to a problem from a set of specific examples. For many applications this circumvents the need to develop a first-principles model of the underlying physical processes, which can often prove difficult or impossible to find.

The principal disadvantages of neural networks stem from the need to provide a suitable set of example data for network training, and the potential problems which can arise if a network is required to extrapolate to new regions of the input space which are significantly different from those corresponding to the training data. In many practical applications these problems will not be relevant, while in other cases various techniques can be used to mitigate their worst effects.3

The advantages and limitations of neural networks are often complementary to those of conventional data processing techniques. Broadly speaking, neural networks should be considered as possible candidates to solve those problems which have some, or all, of the following characteristics: (i) there is ample data for network training; (ii) it is difficult to provide a simple first-principles or model-based solution which is adequate; (iii) new data must be processed at high speed, either because a large volume of data must be analyzed, or because of some real-time constraint; (iv) the data processing method needs to be robust to modest levels of noise on the input data.

B. Biological neural networks

The human brain is the most complex structure known, and understanding its operation represents one of the most difficult and exciting challenges faced by science. Biological neural networks provide a driving force behind a great deal of research into artificial network models, which is complementary to the desire to build better pattern recognition and information processing systems. For completeness we give here a simplified outline of biological neural networks.

The human brain contains around 10^11 electrically active cells called neurons. These exist in a large variety of different forms, although most have the common features indicated in Fig. 1. The branching tree of dendrites provides a set of inputs to the neuron, while the axon acts as an output. Communication between neurons takes place at junctions called synapses. Each neuron typically makes connec- […]
FIG. 2. The McCulloch-Pitts model of a single neuron forms a weighted sum of the inputs x_1,...,x_d given by a = Σ_i w_i x_i and then transforms this sum using a non-linear activation function g( ) to give a final output z = g(a).

FIG. 3. A selection of typical activation functions: (a) linear, (b) threshold, (c) threshold linear, (d) sigmoidal. The multilayer perceptron network makes use of sigmoidal units to give network mapping functions which are both non-linear and differentiable.

[…]apses simultaneously leads to an effective processing power which greatly exceeds that of present day supercomputers. It also leads to a high degree of fault tolerance, with many neurons dying each day with little adverse effect on performance.
Many neurons act in an all-or-nothing manner, and when they "fire" they send an electrical impulse (called an action potential) which propagates from the cell body along the axon. When this signal reaches a synapse it triggers the release of chemical neurotransmitters which cross the synaptic junction to the next neuron. Depending on the type of synapse, this can either increase (excitatory synapse) or decrease (inhibitory synapse) the probability of the subsequent neuron firing. Each synapse has an associated strength (or weight) which determines the magnitude of the effect of an impulse on the post-synaptic neuron. Each neuron thereby computes a weighted sum of the inputs from other neurons, and, if this total stimulation exceeds some threshold, the neuron fires. As we shall see later, networks of such neurons have very general information processing capabilities.

A key property of both real and artificial neural systems is their ability to modify their responses as a result of exposure to external signals. This is generally referred to as learning, and occurs primarily through changes in the strengths of the synapses.

The above, grossly simplified, picture of biological neural systems provides a convenient starting point for a discussion of artificial network models. Unfortunately, lack of space prevents a more comprehensive overview, and the interested reader is referred to Refs. 4-6 for more information.

C. Artificial neural networks

A simple mathematical model of a single neuron was introduced in a seminal paper by McCulloch and Pitts in 1943,7 and takes the form indicated in Fig. 2. It can be regarded as a non-linear function which transforms a set of input variables x_i (i = 1,...,d) into an output variable z. Note that from now on we shall refer to an artificial model of a neuron as a processing unit, or simply unit, to distinguish it from its biological counterpart.

In the McCulloch-Pitts model, the signal x_i at input i is first multiplied by a parameter w_i known as a weight (which is analogous to the synaptic strength in a biological network) and is then added to all the other weighted input signals to give a total input to the unit of the form

a = Σ_{i=1}^{d} w_i x_i + w_0,   (1)

where the offset parameter w_0 is called a bias (and corresponds to the firing threshold in a biological neuron). Formally, the bias can be regarded as a special case of a weight from an extra input whose value x_0 is permanently set to +1. Thus we can write Eq. (1) in the form

a = Σ_{i=0}^{d} w_i x_i,   (2)

where x_0 = 1. Note that the weights (and the bias) can be of either sign, corresponding to excitatory or inhibitory synapses. The output z of the unit (which may loosely be regarded as analogous to the average firing rate of a neuron) is then given by operating on a with a non-linear activation function g( ) so that

z = g(a).   (3)

Some possible forms for the function g( ) are shown in Fig. 3. The original McCulloch-Pitts model used the threshold function shown in Fig. 3(b). Most networks of practical interest make use of sigmoidal (meaning S-shaped) activation functions of the kind shown in Fig. 3(d).
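As an illustration of Eqs. (1)-(3), the following is a minimal sketch of a single processing unit; the choice of a logistic sigmoid for g( ), and the function names, are ours rather than taken from the text.

```python
import math

def unit_output(x, w, w0):
    """McCulloch-Pitts style unit: form the weighted sum
    a = sum_i w_i * x_i + w0 of Eq. (1), then apply a logistic
    sigmoid activation g(a) = 1/(1 + exp(-a)), as in Fig. 3(d)."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1.0 / (1.0 + math.exp(-a))

def unit_output_threshold(x, w, w0):
    """Original McCulloch-Pitts form: threshold activation, Fig. 3(b)."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1.0 if a >= 0.0 else 0.0
```

With weights of either sign the unit realizes excitatory and inhibitory inputs; for example, with weights (1, 1) and bias -1.5 the threshold unit computes a logical AND of two binary inputs.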
As we shall see, this simple model of the neuron forms the basic mathematical element in many artificial neural network models. By linking together many such simple processing elements it is possible to construct a very general class of non-linear mappings, which can be applied to a wide range of practical problems. Adaptation of the weight values, according to an appropriate training algorithm, can allow networks to learn in response to external data.

Although we have introduced this mathematical model of the neuron as a representation of the behavior of biological neurons, precisely the same ideas also arise when we consider optimal approaches to the solution of problems in statistical pattern recognition. In this context, expressions such as Eqs. (2) and (3) are known as linear discriminants.

D. A brief history of neural computing

The origins of neural networks, or neural computing (sometimes also called neurocomputing or connectionism), lie in the 1940's with the paper of McCulloch and Pitts7 discussed above. They showed that networks of model neurons are capable of universal computation, in other words they can in principle emulate any general-purpose computing machine.

The next major step was the publication in 1949 of the book The Organization of Behaviour by Hebb,8 in which he proposed a specific mechanism for learning in biological neural networks. He suggested that learning occurs through modifications to the strengths of the synaptic interconnections between neurons, such that if two neurons tend to fire together then the synapse between them should be strengthened. This learning rule can be made quantitative, and forms the basis for learning in some simple neural network models (which will not be considered in this review).

During the late 1950's the first hardware neural network system was developed by Rosenblatt.9,10 Known as the perceptron, this was based on McCulloch-Pitts neuron models of the form given in Eqs. (2) and (3). It had an array of photoreceptors which acted as external inputs, and used banks of motor-driven potentiometers to provide adaptive synaptic connections which could retain a learned setting. Adjustments to the potentiometers were made using the perceptron learning algorithm.11 In many circumstances the perceptron could learn to distinguish between characters or shapes presented to the inputs as pixellated images. Rosenblatt also demonstrated theoretically the remarkable result that, if a given problem was soluble in principle by a perceptron, then the perceptron learning algorithm was guaranteed to find the solution in a finite number of steps. Similar networks were also studied by Widrow, who developed the ADALINE (ADAptive LINear Element) network11 and a corresponding training procedure called the Widrow-Hoff learning rule.12 These network models are reviewed in Ref. 13. The underlying algorithm is still in routine use for echo cancellation on long distance telephone cables.

The 1960's saw a great deal of research activity in neural networks, much of it characterized by a lack of rigor, sometimes bordering on alchemy, as well as excessive claims for the capability and near-term potential of the technology. Despite initial successes, however, momentum in the field began to diminish towards the end of the 1960's as a number of difficult problems emerged which could not be solved by the algorithms then available. In addition, neural computing suffered fierce criticism from proponents of the field of Artificial Intelligence (which tries to formulate solutions to pattern recognition and similar problems in terms of explicit sets of rules), centering around the book Perceptrons14 by Minsky and Papert. Their criticism focused on a class of problems, called linearly non-separable, which could not be solved by networks such as the perceptron and ADALINE. The field of neural computing fell into disfavor during the 1970's, with only a handful of researchers remaining active.

A dramatic resurgence of interest in neural networks began in the early 1980's and was driven in large part by the work of the physicist Hopfield,15,16 who demonstrated a close link between a class of neural network models and certain physical systems known as spin glasses. A second major development was the discovery of learning algorithms based on error backpropagation17 (to be discussed at length in Sec. III), which overcame the principal limitations of earlier neural networks such as the simple perceptron. During this period, many researchers developed an interest in neural computing through the books Parallel Distributed Processing by Rumelhart et al.6,18,19 An additional important factor was the widespread availability by the 1980's of cheap powerful computers which had not been available 20 years earlier. The combination of these factors, coupled with the failure of Artificial Intelligence to live up to many of its expectations, led to an explosion of interest in neural computing. The early 1990's have been characterized by a consolidation of the theoretical foundations of the subject, as well as the emergence of widespread successful applications. Neural networks can even be found now in consumer electronics and domestic appliances, for applications varying from sophisticated autoexposure on video cameras to "intelligent" washing machines.

Many of the historically important papers from the field of neural networks have been collected together and reprinted in two volumes in Refs. 20 and 21.

II. MULTIVARIATE NON-LINEAR MAPPINGS

In this review we shall restrict our attention primarily to feedforward networks, which can be regarded as general purpose non-linear functions for performing mappings between two sets of variables. As we indicated earlier, such networks form the basis for most present day applications. In addition, a sound understanding of such networks provides a good basis for the study of more complex network architectures. Figure 4 shows a schematic illustration of a non-linear function which takes d independent variables x_1,...,x_d and maps them onto c dependent variables y_1,...,y_c. In the terminology of neural computing, the x's are called input variables and the y's are called output variables. As we shall see, a wide range of practical applications can be cast in this framework.

As a specific example, consider the problem of analyzing a Doppler-broadened spectral line. The x's might represent the observed amplitudes of the spectrum at various wavelengths, and the y's might represent the amplitude,
width, and central wavelength of the spectral line. A suitably trained neural network can then provide a direct mapping from the observed data onto the required spectral line parameters. Practical applications of neural networks to spectral analysis problems of this kind can be found in Refs. 22 and 23, and will be discussed further in Sec. VIII.

FIG. 4. Schematic illustration of a general non-linear functional mapping from a set of input variables x_1,...,x_d to a set of output variables y_1,...,y_c. Each of the y_k can be an arbitrary non-linear function of the inputs.

It is sometimes convenient to gather the input and output variables together to form input and output vectors which we shall denote by x = (x_1,...,x_d) and y = (y_1,...,y_c). The precise form of the function which maps x to y is determined both by the internal structure (i.e., the topology and choice of activation functions) of the neural network, and by the values of a set of weight parameters w_1,...,w_W. Again, the weights (and biases) can conveniently be grouped together to form a weight vector w = (w_1,...,w_W). We can then write the network mapping in the form y = y(x; w), which denotes that y is a function of x which is parameterized by w.

In this review we shall consider two of the principal neural network architectures. The first is called the multilayer perceptron (MLP) and is currently the most widely used neural network model for practical applications. The second model is known as the radial basis function (RBF) network, which has also been used successfully in a variety of applications, and which has a number of advantages, as well as limitations, compared with the MLP. Although this by no means exhausts the range of possible models (which now number many hundreds), these two models together provide the most useful tools for many applications. In Sec. IX we shall give an overview of some of the other major models which have been developed and indicate their potential uses. Some of these models do more than provide static non-linear mappings, as the networks themselves have dynamical properties.

A. Analogy with polynomial curve fitting

We shall find it convenient at several points in this review to draw an analogy between the training of neural networks and the problem of curve fitting using simple polynomials. Consider for instance the mth order polynomial given by

y = w_m x^m + ... + w_1 x + w_0 = Σ_{j=0}^{m} w_j x^j.   (4)

This can be regarded as a non-linear mapping which takes x as an input variable and produces y as an output variable. The precise form of the function y(x) is determined by the values of the parameters w_0,...,w_m, which are analogous to the weights in a neural network [strictly, w_0 is analogous to a bias parameter, as in Eq. (1)]. Note that the polynomial can be written as a functional mapping in the form y = y(x; w) as was done for more general non-linear mappings above.

There are two important ways in which neural networks differ from such simple polynomials. First, a neural network can have many input variables x_i and many output variables y_k, as compared with the one input variable and one output variable of the polynomial. Second, a neural network can approximate a very large class of functions very efficiently. In fact, a sufficiently large network can approximate any continuous function, for a finite range of values of the inputs, to arbitrary accuracy.24-29 Thus, neural networks provide a general purpose set of mathematical functions for representing non-linear transformations between sets of variables. Note that, although in principle multi-variate polynomials would satisfy the same property, they would require extremely (exponentially) large numbers of adjustable coefficients. In practice, neural networks can achieve similar results using far fewer parameters, and so offer a practical approach to the representation of general non-linear mappings in many variables.

B. Error functions and network training

The problem of determining the values for the weights in a neural network is called training, and is most easily introduced using our analogy of fitting a polynomial curve through a set of n data points. We shall label a particular data point with the index q = 1,...,n. Each data point consists of a value of x, denoted by x^q, and a corresponding desired value for the output y, which we shall denote by t^q. These desired output values are called target values in the neural network context. (Note that data points are sometimes also referred to as patterns.) In order to find suitable values for the coefficients in the polynomial, it is convenient to consider the error between the desired output value t^q, for a particular input x^q, and the corresponding value predicted by the polynomial function given by y(x^q; w). Standard curve fitting procedures involve minimizing the square of this error, summed over all data points, given by

E = (1/2) Σ_{q=1}^{n} {y(x^q; w) - t^q}^2.   (5)

We can regard E as being a function of w, and so the curve can be fitted to the data by choosing a value for w which minimizes E. Note that the polynomial (4) is a linear function of the parameters w and so Eq. (5) is a quadratic function of w. This means that the minimum of E can be found in terms of the solution of a set of linear algebraic equations.

It should be noted that the standard sum-of-squares error, introduced here from a heuristic viewpoint, can be derived from the principle of maximum likelihood on the assumption that the noise on the target data has a Gaussian distribution.1 Even when this assumption is not satisfied, however, the sum-of-squares error function remains of great practical importance. We shall discuss some of its properties in later sections.
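Because the polynomial (4) is linear in w, minimizing the sum-of-squares error (5) reduces to solving a set of linear "normal equations". A self-contained sketch (the function names and the elimination routine are ours, not from the text):

```python
def fit_polynomial(xs, ts, m):
    """Minimize the sum-of-squares error of Eq. (5) for the m-th order
    polynomial of Eq. (4).  Setting dE/dw_k = 0 gives the linear system
    sum_j A[k][j] w_j = b[k], with A[k][j] = sum_q x_q**(j+k) and
    b[k] = sum_q t_q * x_q**k, solved here by Gaussian elimination."""
    n = m + 1
    A = [[sum(x ** (j + k) for x in xs) for j in range(n)] for k in range(n)]
    b = [sum(t * x ** k for x, t in zip(xs, ts)) for k in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                             # back substitution
    for j in reversed(range(n)):
        w[j] = (b[j] - sum(A[j][k] * w[k] for k in range(j + 1, n))) / A[j][j]
    return w  # w[j] is the coefficient of x**j

def poly(w, x):
    return sum(wj * x ** j for j, wj in enumerate(w))
```

Fitting noise-free samples of 1 - 2x + 3x^2 with m = 2 recovers the coefficients to within rounding, illustrating that the quadratic error surface of Eq. (5) has a single minimum.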
where w̃_kj denotes a weight in the second layer connecting hidden unit j to output unit k. Note that we have introduced an extra hidden unit with activation z_0 = 1 to provide a bias for the output units. The bias terms (for both the hidden and output units) play an important role in ensuring that the network can represent general non-linear mappings. We can combine Eqs. (7) and (8) to give the complete expression for the transformation represented by the network in the form

y_k = g̃( Σ_{j=0}^{m} w̃_kj g( Σ_{i=0}^{d} w_ji x_i ) ).   (9)
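The two-layer mapping of Eq. (9) is compact to state in code. The sketch below is illustrative rather than the author's implementation: we store one row of weights per unit with the bias as the weight from the fixed inputs x_0 = 1 and z_0 = 1, and use tanh hidden units with linear outputs (the combination used for Fig. 9).

```python
import math

def mlp_forward(x, W1, W2):
    """Evaluate Eq. (9): hidden activations z_j = g(sum_i w_ji x_i) with
    g = tanh, then linear outputs y_k = sum_j w~_kj z_j.  Biases enter as
    weights from the fixed extra inputs x_0 = 1 and z_0 = 1."""
    xb = [1.0] + list(x)                       # augment with x_0 = 1
    z = [math.tanh(sum(w * xi for w, xi in zip(row, xb))) for row in W1]
    zb = [1.0] + z                             # augment with z_0 = 1
    return [sum(w * zj for w, zj in zip(row, zb)) for row in W2]
```

For example, with W1 = [[0, 1, 0], [0, 0, 1]] and W2 = [[1, 1, 0]] the network computes y = 1 + tanh(x_1), a simple instance of the hidden units acting as adaptive basis functions.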
FIG. 8. Plot of the response z(x_1, x_2) of a unit with a sigmoidal activation function, as a function of its two input variables x_1 and x_2.

The response of a single unit with a logistic sigmoidal activation function, as a function of the input variables for the case of 2 dimensions, is plotted in Fig. 8.

With sigmoidal hidden units, the universal approximation properties of the network hold even if the output units have linear activation functions [so that g(a) = a and in effect no activation function is applied]. For interpolation problems, in which we wish to generate mappings whose outputs represent smoothly varying quantities, it is convenient and sufficient to choose the output unit activation functions to be linear. For classification problems, however, it is often convenient to apply a logistic sigmoidal activation function of the form (12) to the output units, as this ensures that the network outputs will lie in the range (0,1), which assists in the interpretation of network outputs as probabilities. (Note that in most applications we should also arrange for the outputs to sum to unity, and this can be achieved by using other forms of activation function.1) The use of sigmoidal activation functions on the network outputs would be inappropriate for many interpolation problems since we do not in general want to restrict the range of possible network outputs.

Unlike the single-layer network in Eq. (7), a 2-layer network of the form (9) has very general capabilities for function approximation. It has been shown that, provided the number m of hidden units is sufficiently large, such a network can represent any continuous mapping, defined over a finite range of the input variables, to arbitrary accuracy.24-29 As a simple illustration of this "universal" capability, consider a mapping from a single input variable x to a single output variable y. In Fig. 9 we see four examples of function approximation using a network having 5 hidden units. The circles show data obtained by sampling various functions at equally spaced values of x, and the curves show the network functions obtained by training the network using techniques to be described later. We see that the same network function, with the weights suitably chosen, can indeed represent a wide range of functional forms.

The multilayer perceptron structure which we have considered has a particularly simple topology consisting of two
FIG. 9. Four examples of functions learned by a multilayer perceptron with one input unit, 5 hidden units with "tanh" activation functions, and 1 linear output unit. In each case the network function (after training using 1000 cycles of the BFGS quasi-Newton algorithm) is shown by the solid curve. The circles show the data points used for training, which were obtained by sampling the following functions: (a) x^2, (b) sin(2πx), (c) |x|, and (d) the Heaviside step function H(x). We see that the same network can be used to approximate a wide range of different functions, simply by choosing different values for the weight and bias parameters. (From Ref. 1.)
FIG. 10. Schematic illustration of the error function E(w) seen as a surface over weight space (the space spanned by the values of the weight and bias parameters w = {w_1,...,w_W}). The weight vector ŵ corresponds to the global minimum of the error function, while the weight vector w̃ corresponds to a local minimum. Network training by the gradient descent algorithm begins with a random choice of weight vector and then proceeds by making small changes to the weight vector so as to move it in the direction of the negative of the error function gradient ∇E, until the weight vector reaches a local or global minimum. (From Ref. 1.)

FIG. 11. An illustration of how backpropagation of error signals is used to evaluate derivatives of the error function with respect to the weight (and bias) parameters in the first layer of a 2-layer network. The error signal δ_j at hidden unit j is obtained by summing error signals δ̃_k from the output units k = 1,...,c after first multiplying them by the corresponding weights w̃_kj. The derivative of the error function with respect to a weight w_ji is then given by the product of the error signal δ_j at hidden unit j with the activation x_i of input unit i. (From Ref. 1.)
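The gradient descent procedure described in the caption of Fig. 10 amounts to the update w ← w − η∇E(w), where η is a small step size. A minimal illustration on a simple quadratic error surface (the function and parameter names here are ours):

```python
def gradient_descent(grad_E, w, eta=0.1, steps=200):
    """Repeatedly move the weight vector a small step in the direction of
    the negative error gradient, w <- w - eta * grad_E(w), so that it
    settles towards a local (or global) minimum of the error surface."""
    for _ in range(steps):
        g = grad_E(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Quadratic error surface E(w) = (w_1 - 1)**2 + (w_2 + 2)**2,
# whose unique global minimum lies at (1, -2).
w_min = gradient_descent(lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)], [5.0, 5.0])
```

For a quadratic surface this converges for small enough η; on the highly non-linear error surfaces of multilayer networks the same update can instead settle into a local minimum, which motivates the more sophisticated algorithms discussed later.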
layers of weights, with full connectivity between inputs and hidden units and between hidden units and output units. In principle, there is no need to consider other architectures, since the 2-layer network already has universal approximation capabilities. In practice, however, it is often useful to consider more general topologies of neural network. One important motivation for this is to allow additional information (called prior knowledge) to be built into the form of the mapping. This will be discussed further in Sec. VI, and a simple example will be given in Sec. VIII. An example of a more complex network structure (having 4 layers of weights), used for fast recognition of postal codes, can be found in Ref. 30. In each case there is a direct correspondence between the network diagram and the corresponding non-linear mapping function.

B. Network training

As we have already discussed, the fitting of a network function to a set of data (network training) is performed by seeking a set of values for the weights which minimizes some error function, often chosen to be the sum-of-squares error given by Eq. (6). The error function can be regarded geometrically as an error surface sitting over weight space, as indicated schematically in Fig. 10. The problem of network training corresponds to the search for the minimum of the error surface. An absolute minimum of the error function, indicated by the weight vector ŵ in Fig. 10, is called a global minimum. There may, however, also exist other higher minima, such as the one corresponding to the weight vector w̃ in Fig. 10, which are referred to as local minima.

For single-layer networks with linear activation functions, the sum-of-squares error function is a generalized quadratic, as was the case for polynomial curve fitting. It has no local minima, and its global minimum is easily found by solution of a set of linear equations. For multilayer networks, however, the error function is a highly non-linear function of the weights,31 and the search for the minimum generally proceeds in an iterative fashion, starting from some randomly chosen point in weight space. Some algorithms will find the nearest local minimum, while others are able to escape local minima and offer the possibility of finding a global minimum. In general, the error surface will be extremely complex, and for many practical applications a good local minimum may be sufficient to achieve satisfactory results.

Many of the algorithms for performing the error function minimization make use of the derivatives of the error function with respect to the weights in the network. These derivatives form the components of the gradient vector ∇E(w) of the error function, which, at any given point in weight space, gives the gradient of the error surface, as indicated in Fig. 10. Since there is considerable benefit to the training procedure from making use of this gradient information, we begin with a discussion of techniques for evaluating the derivatives of E.

One of the important features of the class of non-linear mapping functions given by the multilayer perceptron is that there exists a computationally efficient procedure for evaluating the derivatives of the error function, based on the technique of error backpropagation.1 Here we consider the problem of finding the error derivatives for a network having a single hidden layer, as given by the expression in Eq. (9), for the case of a sum-of-squares error function given by Eq. (6). In principle this is very straightforward since, by substituting Eq. (9) into Eq. (6), we obtain the error as an explicit function of the weights, which can then be differentiated using the usual rules of differential calculus. However, if some care is taken over how this calculation is set out, it leads to a procedure which is both computationally efficient and which is readily extended to feedforward networks of arbitrary topology. This same technique is easily generalized to other error functions which can be expressed explicitly as functions of the network outputs. It can also be used to evaluate the elements of the Jacobian matrix (the matrix of derivatives of output values with respect to input values), which can be used to study the effects on the outputs of small changes in the input values.1 Similarly, it can be extended to the evaluation of the second derivatives of the error with
Rev. Sci. Instrum., Vol. 65, No. 6, June 1994 Neural networks 1811
respect to the weights (the elements of the Hessian matrix) of the weight times the activation of the hidden unit at the
which play an important role in a number of advanced net- other end of the weight. The derivative of the error with
work algorithms.32 respect to any weight in a multilayer perceptron network (of
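The idea of treating the error as an explicit function of the weights can be sketched in a few lines of NumPy. The network sizes, the tanh activation, and the random data below are illustrative assumptions only, and biases are omitted for brevity:

```python
import numpy as np

def forward(x, W1, W2):
    """Single-hidden-layer network: tanh hidden units, linear outputs."""
    return W2 @ np.tanh(W1 @ x)

def sum_of_squares_error(X, T, W1, W2):
    """E = 1/2 sum_q sum_k {y_k(x^q; w) - t_k^q}^2, cf. Eq. (6)."""
    return 0.5 * sum(np.sum((forward(x, W1, W2) - t) ** 2) for x, t in zip(X, T))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))    # 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))    # 3 hidden units -> 1 output
X, T = rng.normal(size=(5, 2)), rng.normal(size=(5, 1))
E = sum_of_squares_error(X, T, W1, W2)   # error as an explicit function of W1, W2
```

Because E is an ordinary differentiable function of the weight arrays, it can in principle be differentiated directly; the backpropagation procedure described next organizes that calculation efficiently.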
First note that the total sum-of-squares error function (6) can be written as a sum over all patterns of an error function for each pattern separately

    E = \sum_{q=1}^{n} E^q,    E^q = \frac{1}{2} \sum_k \{ y_k(x^q; w) - t_k^q \}^2,    (14)

where the network outputs are given by

    y_k = g(a_k),    a_k = \sum_{j=0}^{m} w_{kj} z_j.    (15)

The derivatives with respect to the final-layer weights can then be written in the form

    \frac{\partial E^q}{\partial w_{kj}} = \frac{\partial E^q}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}}.    (16)

We now introduce the definition

    \delta_k \equiv \frac{\partial E^q}{\partial a_k}.    (17)

Then, by making use of Eq. (15), we can write the derivative in the form

    \frac{\partial E^q}{\partial w_{kj}} = \delta_k z_j.    (18)

We can find an expression for δ_k by using Eqs. (14), (15), and (17) to give

    \delta_k = g'(a_k) \{ y_k - t_k^q \}.    (19)

Because δ_k is proportional to the difference between the network output and the desired value, it is sometimes referred to as an error. Note that, for the sigmoidal activation functions discussed earlier, the derivative g'(a) is easily re-expressed in terms of g(a), as in Eqs. (11) and (13). This provides a small computational saving in a numerical implementation of the algorithm. Note also that the expression for the derivative with respect to a particular weight, given by Eq. (18), takes the simple form of the product of the error at the output end of the weight times the activation of the hidden unit at the other end of the weight. The derivative of the error with respect to any weight in a multilayer perceptron network (of arbitrary topology) can always be written in a form analogous to Eq. (18).

In order to find a corresponding expression for the derivatives with respect to weights in the first layer, we start by writing the activations of the hidden units in the form

    z_j = g(a_j),    a_j = \sum_{i=0}^{d} w_{ji} x_i.    (20)

Following the same steps as before, we write

    \frac{\partial E^q}{\partial w_{ji}} = \frac{\partial E^q}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}},    (21)

introduce the definition

    \delta_j \equiv \frac{\partial E^q}{\partial a_j},    (22)

and thus obtain

    \frac{\partial E^q}{\partial w_{ji}} = \delta_j x_i.    (23)

Note that this has the same form as the derivative for a second-layer weight given by Eq. (18), so that the derivative for a given weight connecting an input to a hidden unit is given by the product of the δ for the hidden unit and the value of the input variable.

Finally, we need to find an expression for the δ's. This is easily obtained by using the chain rule for partial derivatives

    \delta_j = \sum_k \frac{\partial E^q}{\partial a_k} \frac{\partial a_k}{\partial a_j}.    (24)

By making use of Eqs. (15), (17), and (20) we obtain

    \delta_j = g'(a_j) \sum_k w_{kj} \delta_k.    (25)

The expression in Eq. (25) can be interpreted in terms of the network diagram as a propagation of error signals, given by δ_k, backwards through the network along the second-layer weights. This is illustrated in Fig. 11, and is the origin of the term error backpropagation.

It is worth summarizing the various steps involved in evaluation of the derivatives for a multilayer perceptron network:

(1) For each pattern in the data set in turn, evaluate the activations of the hidden units using Eq. (20) and of the output units using Eq. (15). This corresponds to the forward propagation of signals through the network.
(2) Evaluate the individual errors δ_k for the output units using Eq. (19).
(3) Evaluate the errors δ_j for the hidden units using Eq. (25). This is the error backpropagation step.
(4) Evaluate the derivatives of the error function for this particular pattern using Eqs. (18) and (23).
(5) Repeat steps 1 to 4 for each pattern in the data set and then sum the derivatives to obtain the derivative of the complete error function.
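The five steps above can be sketched directly in NumPy. The choice of tanh hidden units and linear output units is an illustrative assumption (so g' = 1 at the outputs), and biases are omitted for brevity:

```python
import numpy as np

def backprop_gradients(X, T, W1, W2):
    """Return dE/dW1, dE/dW2 for E = 1/2 sum_q ||y(x^q) - t^q||^2,
    following steps (1)-(5): tanh hidden units, linear outputs."""
    g1, g2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, t in zip(X, T):
        z = np.tanh(W1 @ x)                       # (1) forward propagation
        y = W2 @ z
        delta_k = y - t                           # (2) output errors, Eq. (19) with g'=1
        delta_j = (1 - z ** 2) * (W2.T @ delta_k) # (3) backpropagation, Eq. (25)
        g2 += np.outer(delta_k, z)                # (4) Eq. (18)
        g1 += np.outer(delta_j, x)                #     Eq. (23)
    return g1, g2                                 # (5) summed over all patterns

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
X, T = rng.normal(size=(6, 3)), rng.normal(size=(6, 2))
g1, g2 = backprop_gradients(X, T, W1, W2)
```

A useful check on such an implementation is to compare each returned derivative against a finite-difference estimate of the same quantity.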
An important feature of this approach to the calculation of derivatives is its computational efficiency. Since the number of weights is generally much larger than the number of units, the dominant contribution to the cost of a forward or a backward propagation comes from the evaluation of the weighted sums (with the evaluation of the activation functions being negligible by comparison). Suppose the network has a total of W weights, and we wish to know how the cost of evaluating the derivatives scales with W. Since the error function E^q(w) for pattern q is a function of all of the weights, a single evaluation of E^q will take O(W) steps (i.e., the number of numerical steps needed to evaluate E^q will grow like W). Similarly, the direct evaluation of any one of the derivatives of E^q with respect to a weight would also take O(W) steps. Since there are W such derivatives we might expect that a total of O(W²) steps would be needed to evaluate all of the derivatives. However, the technique of backpropagation allows all of the derivatives to be evaluated using a single forward propagation, followed by a single backward propagation, followed by the use of the formulas (18) and (23). Each of these requires O(W) operations and so all of the derivatives can be evaluated in O(W) steps. For a data set of n patterns the derivatives of the complete error function E = Σ_q E^q can therefore be found in O(nW) steps, as compared with the O(nW²) steps that would be needed by a direct evaluation of the separate derivatives. In a typical application W may range from a few hundred to many thousands, and the saving of computational effort is therefore significant. Since, even with the use of backpropagation to evaluate error derivatives, the training of a multilayer perceptron is computationally demanding, the importance of this result is clear. In this respect, error backpropagation is analogous to the fast Fourier transform (FFT) technique which allows the evaluation of Fourier transforms to be reduced from O(N²) to O(N log₂ N), where N is the number of Fourier components.

FIG. 12. Schematic illustration of the contours of a quadratic error surface in a 2-dimensional weight space in the neighborhood of a minimum, for which the curvature along the e₁ direction is much less than the curvature along the e₂ direction. Simple gradient descent, which takes successive steps in the direction of the negative of the error surface gradient, Δw = −η∇E, suffers from oscillations across the direction of the valley if the value of the learning rate parameter η is too large.

The simplest procedure for iteratively reducing the error, once the derivatives are available, is gradient descent, in which each step is taken in the direction of the negative gradient of the error surface

    \Delta w^{(\tau)} = -\eta \nabla E,    (26)

where τ denotes the step number in the iteration, and the parameter η is called the learning rate and in the simplest scheme is set to a fixed value chosen by guesswork.

Provided the value of η is sufficiently small then Eq. (26) will lead to a decrease in the value of E (assuming the gradient is not already zero by virtue of the weight vector being at a minimum of E). Increasing the value of η can lead to a more substantial reduction of E at each step and thus can speed up the training process. However, too great a value for η can lead to instability. A further problem with this simple approach is that the optimum value for η will typically change with each step.

One of the main problems with simple gradient descent, however, arises when the error surface has a curvature along one direction e₁ in weight space which is substantially smaller than the curvature along a second direction e₂, as illustrated schematically in Fig. 12. The learning rate parameter then has to be very small in order to prevent divergent oscillations along the e₂ direction, and this leads to very slow progress along the e₁ direction for which the gradient is small. Ideally, the learning rate should be larger for components of the weight change vector along directions of low curvature than for directions of high curvature. One simple way to try to achieve this involves the introduction of a momentum term into the learning equations. The weight update formula is modified to give

    \Delta w^{(\tau)} = -\eta \nabla E + \mu \Delta w^{(\tau-1)},    (27)

where μ is called the momentum parameter.
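The effect of the momentum term in Eq. (27) can be sketched on a made-up quadratic valley whose curvatures differ by a factor of 100 along the two axes (the coefficients, learning rate, and step count below are illustrative assumptions):

```python
import numpy as np

def descend(grad, w0, eta, mu=0.0, steps=200):
    """Iterate Eq. (27): dw = -eta * grad(w) + mu * dw_prev (mu = 0 gives Eq. (26))."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eta * grad(w) + mu * dw
        w = w + dw
    return w

# Quadratic valley E = (w1^2 + 100 w2^2)/2: curvature 1 along e1, 100 along e2.
grad = lambda w: np.array([w[0], 100.0 * w[1]])
w_plain = descend(grad, [1.0, 1.0], eta=0.015)            # eta limited by the e2 curvature
w_momentum = descend(grad, [1.0, 1.0], eta=0.015, mu=0.9) # same eta, with momentum
```

With the learning rate held just below the stability limit set by the high-curvature direction, the momentum run ends far closer to the minimum at the origin, reflecting the increased effective learning rate along the low-curvature direction.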
FIG. 13. As in Fig. 12, but showing the effect of introducing a momentum term, so that the weight updates are given by Δw^(τ) = −η∇E + μΔw^(τ−1), where τ denotes the iteration step number. This leads to an increase in the effective value of η in the direction of low curvature e₁ of the error surface.

FIG. 14. As in Fig. 12, but showing the effect of using a conjugate gradients or a quasi-Newton algorithm, which can find the minimum of a quadratic error function exactly in a fixed number of steps, even though the curvatures may be very different along different directions.
in the neighborhood of the minimum) in terms of its inverse. For W weights, this requires O(W²) storage, which is generally not a problem except for very large networks having thousands of weights. Again, the algorithm will find the minimum of a quadratic error function exactly in a fixed number of steps, as in Fig. 14. One advantage of such algorithms is that the line searches do not need to be done as precisely as with conjugate gradients. The results plotted in Fig. 9 were obtained using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton method.35 Algorithms have also been developed which try to combine the best features of conjugate gradient and quasi-Newton algorithms. One such algorithm is the limited memory BFGS method.38,39

Although the error surface may be far from quadratic, these methods are generally very robust, and typically give at least an order of magnitude improvement in convergence speed compared with simple gradient descent with momentum. A disadvantage is that they are somewhat more complex to implement in software than gradient descent, and also they have no built-in mechanism to escape from local minima. Also they are intrinsically batch (rather than sequential) methods, and so cannot deal effectively with redundancy in the training set, as discussed earlier. They can, however, prove particularly useful for problems where high precision is required, as is the case in many instrumentation applications.

FIG. 15. Intuitively we expect that an arbitrary continuous function y(x) can be approximated by a linear combination of localized bump-like functions φ_j(x). This concept leads to the radial basis function neural network model.

    y_k = \sum_{q=1}^{n} w_{kq} \phi_q(x),    (30)

where φ_q(x) is a radially symmetric function centered on the qth data point. There are many possible forms for the basis functions, of which a common choice is the Gaussian, given by

    \phi_q(x) = \exp\left( -\frac{\| x - x^q \|^2}{2\sigma^2} \right),    (31)
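The exact-interpolation construction of Eqs. (30) and (31), with one Gaussian centered on every data point, can be sketched as follows (the one-dimensional data and the width σ = 0.2 are made-up illustrative choices):

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Matrix of basis activations phi_q(x) = exp(-||x - x^q||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(8, 1))        # 8 one-dimensional data points
t = np.sin(2 * np.pi * X[:, 0])           # target values at those points
Phi = gaussian_design(X, X, sigma=0.2)    # square n x n matrix: centers = data points
w = np.linalg.solve(Phi, t)               # weights giving an exact fit, Eq. (30)
y = gaussian_design(X, X, 0.2) @ w        # outputs reproduce the targets at the data
```

Because the Gaussian matrix built from distinct points is nonsingular, the linear system has a unique solution and the resulting curve passes exactly through every data point, which is precisely the overfitting-prone behavior discussed later in Sec. V.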
presented with input vector x. Again, a bias for the output units has been included, and this has been represented as an extra "basis function" φ₀ whose activation is fixed to be φ₀ = 1. For most applications the basis functions are chosen to be Gaussian, so that we have

    \phi_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2\sigma_j^2} \right),    (34)

where μ_j is a vector representing the center of the jth basis function. Note that each basis function is given its own width parameter σ_j. A plot of the response of a Gaussian unit as a function of 2 input variables is shown in Fig. 16. Note that this is localized in the input space, unlike the ridge-like response of a sigmoidal unit shown in Fig. 8.

FIG. 16. Plot of the activation z(x₁, x₂) of a Gaussian hidden unit as used in a radial basis function network, as a function of two input variables x₁ and x₂. This plot should be compared with the sigmoid shown in Fig. 8.

The RBF network can be represented by a network diagram as shown in Fig. 17. Each of the hidden units corresponds to one of the basis functions, and the lines connecting the inputs to hidden unit j represent the elements of the vector μ_j. Instead of a bias parameter, each unit now has a parameter σ_j which describes the width of the Gaussian basis function. The second layer of the network is identical to that of a multilayer perceptron in which the output units have linear activation functions.

FIG. 17. Architecture of a radial basis function neural network having d inputs x₁,...,x_d and c outputs y₁,...,y_c. Each of the m basis functions φ_j computes a localized (often Gaussian) function of the input vector. The lines connecting the inputs to the basis function φ_j represent the elements of the vector μ_j which describes the location of the center (in input space) of that basis function. The second layer of the network, connecting the basis functions with the output units, is identical to that of the multilayer perceptron shown in Fig. 7. (From Ref. 1.)

Again, it can be shown formally that such a structure is capable of approximating essentially arbitrary continuous functions to arbitrary accuracy provided a sufficiently large number of hidden units (basis functions) is used and provided the network parameters (centers μ_j, widths σ_j, and second-layer weights w_kj) are suitably chosen.44,45

As with the multilayer perceptron, we seek a least-squares solution for the network parameters, obtained by minimizing a sum-of-squares error of the form given in Eq. (6). Since the network mapping is an analytic function of the network parameters, this could be done by simply optimizing all of the weights in the network together using one of the standard algorithms discussed earlier. Such an approach would, however, offer little advantage over the MLP network.

A much faster approach to training is based on the fact that the hidden units have a localized response, that is, each unit only produces an output which is significantly different from zero over a limited region of input space. This leads to a two-stage training procedure in which the basis function parameters (μ_j and σ_j) are optimized first, and then, subsequently, the final-layer weights {w_kj} are determined.

B. Choosing the basis function parameters

In the use of radial basis functions for exact interpolation, a basis function was placed over every data point. In the case of an RBF neural network we can adopt a similar strategy of placing basis functions in the regions of input space where the training data are located. Various heuristic procedures exist for achieving this, and we shall limit our discussion to two of the simplest. We shall also discuss a more systematic approach based on maximum likelihood.

The fastest and most straightforward approach to choosing the centers μ_j of the basis functions is to set them equal to some subset (usually chosen randomly) of the input vectors from the training set. This only sets the basis function centers, and the width parameters σ_j must be set using some other heuristic. For instance, we can choose all the σ_j to be equal and to be given by the average distance between the basis function centers. This ensures that the basis functions overlap to some degree and hence give a relatively smooth representation of the distribution of training data. Such an approach to the choice of μ_j and σ_j is very fast, and allows an RBF network to be set up very quickly. The subset of input vectors to be used as basis function centers can instead be chosen from a more principled approach based on orthogonal least squares,46 which also determines the second-layer weights at the same time. In this case, the width parameters σ_j are fixed and are chosen at the outset.

A slightly more elaborate approach is based on the K-means algorithm.47 The goal of this technique is to associate each basis function with a group of input pattern vectors, such that the center of the basis function is given by the mean of the vectors in the group, and such that the basis function center in each group is closer to each pattern in the group than is any other basis function center. In this way, the
data points are grouped into "clusters" with one basis function center acting as the representative vector for each cluster. This is achieved by an iterative procedure as follows. First, the basis function centers are initialized (for instance by setting them to a subset of the pattern vectors). Then each pattern vector is assigned to the basis function with the nearest center μ_j, and the centers are recomputed as the means of the vectors in each group. This process is then repeated until there is no further change in the grouping.

A more systematic approach is to regard the basis functions as defining a model of the probability density of the input data, of the form

    p(x) = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{(2\pi\sigma_j^2)^{d/2}} \phi_j(x),    (35)

where the prefactor in front of φ_j(x) is chosen to ensure that the probability density function integrates to unity: ∫ p(x) dx = 1. If the input vectors from the training set are drawn independently from this distribution function, then the likelihood of this data set is given by the product

    L = \prod_{q=1}^{n} p(x^q).    (36)

The basis function parameters can then be set by maximizing this likelihood. Since the likelihood is an analytic non-linear function of the parameters {μ_j, σ_j}, this maximization can be achieved by standard optimization methods (such as the conjugate gradients and quasi-Newton methods described earlier). It can also be done using re-estimation methods based on the EM algorithm.49 Such methods are relatively fast and allow values for the parameters {μ_j, σ_j} to be obtained reasonably quickly. In contrast to the MLP, the hidden units in this case have a particularly simple interpretation as the components in a mixture model for the distribution of input data. The sum of their activations (suitably normalized) then provides a quantitative measure of p(x), which can play an important role in validating the outputs of the network.3

C. Choosing the second-layer weights

We shall suppose that the basis function parameters (centers and widths) have been chosen and fixed. As usual, the sum-of-squares error can be written as

    E = \frac{1}{2} \sum_{q=1}^{n} \sum_k \{ y_k(x^q) - t_k^q \}^2.    (37)

We note that, since y_k is a linear function of the final-layer weights, E is a quadratic function of these weights. Substituting Eq. (33) into Eq. (37), we can minimize E with respect to these weights explicitly by differentiation, to give

    0 = \sum_{q=1}^{n} \left\{ \sum_{j'} w_{kj'} \phi_{j'}^q - t_k^q \right\} \phi_j^q,    (38)

where φ_j^q ≡ φ_j(x^q). In matrix notation these linear equations become

    (\Phi^T \Phi) W^T = \Phi^T T,    (39)

with formal solution

    W^T = \Phi^{\dagger} T,    (40)

where Φ† denotes the pseudo-inverse of the matrix Φ, given by

    \Phi^{\dagger} \equiv (\Phi^T \Phi)^{-1} \Phi^T.    (41)

(Note that this formula for the pseudo-inverse assumes that the relevant inverse matrix exists. If it does not, then the pseudo-inverse can still be uniquely defined by an appropriate limiting process.) In a practical implementation, the weights are found by solving the linear equations (39) using singular value decomposition35 to allow for possible numerical ill-conditioning. Thus the final-layer weights can be found explicitly in closed form. Note, however, that the optimum value for these weights, given by Eq. (40), depends on the values of the basis function parameters {μ_j, σ_j} via the quantities φ_j^q. Once these parameters have been determined, the second-layer weights can then be set to their optimal values.

Note that the matrix Φ has dimensions n × m, where n is the number of patterns and m is the number of hidden units. If there is one hidden unit per pattern, so that m = n, then the matrix Φ becomes square and the pseudo-inverse reduces to the usual matrix inverse. In this case the network outputs equal the target values exactly for each pattern, and the error function is reduced to zero. This corresponds precisely to the exact interpolation method discussed above. As we shall see later, this is generally not a desirable situation, as it leads to the network having poor performance on unseen data, and in practice m is typically much less than n. The crucial issue of how to optimize m will be discussed at greater length in the next section.

V. LEARNING AND GENERALIZATION

So far we have discussed the representational capabilities of two important classes of neural network model, and we have shown how network parameters can be determined on the basis of a set of training data. As a consequence of the great flexibility of neural network mappings, it is often easy to arrange for the network to represent the training data set with reasonable accuracy, particularly if the size of the data set is relatively small. A much more important issue, however, is how well the network performs when presented with new data which did not form part of the training set. This is called generalization, and is often much more difficult to achieve than simple memorization of the training data.
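The two-stage RBF training procedure of Secs. B and C above can be sketched as follows: random-subset centers with the average-distance width heuristic, then a least-squares (SVD-based) solve for the second-layer weights as in Eqs. (39)-(41). The data, number of basis functions, and targets below are made-up illustrations:

```python
import numpy as np

def train_rbf(X, T, m, rng):
    """Two-stage RBF training sketch: centers from a random subset of the data,
    a common width from the average center distance, then Eq. (40) weights."""
    centers = X[rng.choice(len(X), size=m, replace=False)]        # stage 1: centers
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    sigma = d[d > 0].mean()                                       # width heuristic
    Phi = np.exp(-((X[:, None] - centers[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)                   # stage 2: SVD solve
    return centers, sigma, W

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 2))                  # 40 two-dimensional patterns
T = np.sin(3 * X[:, :1]) + 0.1 * rng.normal(size=(40, 1))
centers, sigma, W = train_rbf(X, T, m=10, rng=rng)    # m = 10 hidden units (m < n)
```

Here `np.linalg.lstsq` plays the role of the pseudo-inverse solution, using singular value decomposition internally to cope with possible ill-conditioning of Φ.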
other quantities (called loss criteria) other than misclassification rate to be minimized. This is important if different misclassifications have different consequences and should therefore carry different penalties.57 It also provides a principled way to combine the outputs of different networks to build a modular solution to a complex problem. These topics are discussed further in Refs. 1 and 51.

B. Generalization

The above analysis made two central assumptions: (i) there is an infinite supply of training data, (ii) the network has unlimited flexibility to represent arbitrary functional forms. In practice we must inevitably deal with finite data sets and, as we shall see, this forces us to restrict the flexibility of the network in order to achieve good performance.

By using a very large network, and a small data set, it is generally easy to arrange for the network to learn the training data reasonably accurately. It must be emphasized, however, that the goal of network training is to produce a mapping which captures the underlying trends in the training data in such a way as to produce reliable outputs when the network is presented with data which do not form part of the training set. If there is noise on the data, as will be the case for most practical applications, then a network which achieves too good a fit to the training data will have learned the details of the noise on that particular data set. Such a network will perform poorly when presented with new data which do not form part of the training set. Good performance on new data, however, requires a network with the appropriate degree of flexibility to learn the trends in the data, yet without fitting to the noise.

These central issues in network generalization are most easily understood by returning to our earlier analogy with polynomial curve fitting. In particular, consider the problem of fitting a curve through a set of noise-corrupted data points, as shown earlier for the case of a cubic polynomial in Fig. 5. The results of fitting polynomials of various orders are shown in Fig. 19. If the order m of the polynomial is too low, as indicated for m = 1 in Fig. 19(a), then the resulting curve gives only a poor representation of the trends in the data. When the value of y is predicted using new values of x the results will be poor. If the order of the polynomial is increased, as shown for m = 3 in Fig. 19(b), then a much closer representation of the data trend is obtained. However, if the order of the polynomial is increased too far, as shown in Fig. 19(c), the phenomenon of overfitting occurs, which gives a very small (in this case zero) error with respect to the training data, but which again gives a poor representation of the underlying trend in the data and which therefore gives poor predictions for new data. Figure 20 shows a plot of the sum-of-squares error versus the order of the polynomial for two data sets. The first of these is the training data set which is used to determine the coefficients of the polynomial, and the second is an independent test set which is generated in the same way as the training set, except for the noise contribution which is independent of that on the training data. The test set therefore simulates the effects of applying new data to the "trained" polynomial. The order of the polynomial controls the number of degrees of freedom in the function, and we see that there is an optimum number of degrees of freedom (for a particular data set) in order to obtain the best performance with new data.

FIG. 19. Examples of curve fitting using polynomials of successively higher order, using the same data as was used to plot Fig. 5. (a) was obtained using a first order (linear) polynomial, and is seen to give a rather poor representation of the data. By using a cubic polynomial, as shown in (b), a much better representation of the data is obtained. (This figure is identical to Fig. 6, and is reproduced here for ease of comparison.) If a 10th order polynomial is used, as shown in (c), a perfect fit to the data is obtained (since there are 11 data points and a 10th order polynomial has 11 degrees of freedom). In this case, however, the large oscillations which are needed to fit the data mean that the polynomial gives a poor representation of the underlying generator of the data, and so will make poor predictions of y for new values of x. (From Ref. 1.)

A similar situation occurs with neural network mappings. Here the weights in the network are analogous to the coefficients in a polynomial, and the number of degrees of freedom in the network is controlled by the number of weights, which in turn is determined by the number of hidden units. (Note that the effective number of degrees of freedom in a neural network is generally less than the number of weights and biases. For a discussion see Ref. 58.) Again we can consider two independent data sets which we call training and test sets. We can then use the training data to train
several networks, differing in the number m of hidden units, and plot a graph of the residual value of the error E after training as a function of m. This would be expected to yield a monotonically decreasing function, as indicated schematically in Fig. 21, since the addition of extra degrees of freedom should not result in any increase in error, and will generally allow the error to be smaller. We can also present the test set data to the trained networks, and evaluate the corresponding values of E. These would be expected to show a decrease with m at first as the network acquires greater flexibility, but then start to increase as the problem of overfitting sets in. A common beginner's mistake in applying neural networks is to use too large a network and thereby obtain apparently very good results (small training error at large values of m in Fig. 21). We see, however, that this typically leads to very poor performance on new data, corresponding to large values of the test set error in Fig. 21.

A network function which has too little flexibility is said to have a large bias, while one which fits the noise on the data is said to have a large variance. One of the main goals in applying a neural network is to achieve a good tradeoff between bias and variance.59 Instead of restricting the number of weights in the network, an alternative approach to controlling bias and variance is to add penalty terms to the error function to encourage the network mapping to have appropriate smoothness properties. This is called regularization, and a detailed discussion lies beyond the scope of this review.60-65

FIG. 20. A plot of the residual value of the root-mean-square error for polynomial curve fitting versus the order of the polynomial. Here the training data are the same as used to plot Fig. 19, and are the data used to determine the coefficients of the polynomial by minimizing the sum-of-squares error. The residual rms error for the training data is seen to decrease monotonically as the order of the polynomial is increased, eventually falling to zero for the 10th order polynomial which fits the training data exactly, as shown in Fig. 19(c). The test data set is generated in the same way as the training data, and also consists of 11 points with the same x values, but with different values for the random additive noise. It is seen that the error on the test data (which measures the ability of the polynomial to "generalize") is smallest for a cubic polynomial. For polynomials with more degrees of freedom than a cubic, the error for the test data is actually larger, even though the training data error is smaller. (From Ref. 1.)

FIG. 21. A schematic plot of the residual error with respect to the training set, and the error with respect to a separate test set, as a function of the number m of hidden units in a neural network. As with polynomial curve fitting, there is an optimum number of hidden units (shown here by m = m̂) which gives the smallest test set error, and hence the best generalization performance.

C. Determination of network topology

In almost all applications, the goal of network training is to find a network mapping function which makes the best possible predictions for new data. This corresponds to the network having the minimum test error, given by m = m̂ in Fig. 21. It is this requirement which drives the selection of the network topology, in other words the number of hidden units for the networks considered in this review.

In a practical application, the simplest approach to optimizing the number of hidden units is to partition the available data at random into a training and a test set and then to plot a graph of the form shown in Fig. 21. The best network of those trained is then determined by the minimum in the test set error. In a practical application the curves which are obtained from such an exercise do not always exhibit precisely this behavior. This is a consequence of the fact that network training corresponds to a non-linear optimization problem which can suffer from local minima, as already described in Sec. III. In addition, the effects of using a finite size test set also lead to departures from the smooth behavior depicted in Fig. 21. Thus, the training error curve might not decrease monotonically, or the test error curve might have several minima. An example of the curves obtained with real data is shown in Fig. 22. Since the test set has itself been used as part of the network optimization process, the final performance of the network should, strictly speaking, be checked against a third independent set of data.

A more sophisticated approach, and one which is particularly useful when the quantity of available data is limited, is that of cross-validation.66 Here the data set is partitioned randomly into S equal sized sections. Each network is trained on S − 1 of the sections and its performance tested on the remaining section, which acts like a test set. This is repeated for the S possible choices for the section omitted from training, and the results are then averaged. This is repeated for all of the networks under consideration. The network having the smallest error on data not used for training is then selected. In effect, all of the data is used for both training and testing. The disadvantage of this approach is its greater computational demands. Again, a third independent data set should be used to confirm the final performance of the selected network. In practice, a value of S = 10 is a typical choice, although if data are in very limited supply, a value of S = n (the number of data points) can be used, giving rise to the procedure known as leave-one-out.
FIG. 22. Training and test set errors plotted against the number of hidden units (0 to 20).
lem with the dimensionality of the space of input variables.

Consider a d-dimensional input space, and suppose that the region of interest corresponds to the unit hypercube x ∈ [0,1]^d. We can specify the value of any one of the input variables x_i by dividing the corresponding axis into N segments and stating in which segment the value of the variable lies. As N increases, so we can specify the variable with increasing precision. With each variable specified in this way, the unit hypercube has been subdivided into N^d small hypercubes. In general, to specify a mapping from the input space to a single output variable, we must provide N^d items of information, representing the value of the output for the corresponding input hypercube. Thus, the size of the training set required to specify a mapping would in general grow exponentially with the dimensionality of the input space. This phenomenon is known as the curse of dimensionality.68

In practice, there are two reasons why the amount of data needed may be much less than this argument would suggest. First, correlations between the input variables mean that the data are effectively confined to a sub-space of the input space which might have much lower effective dimensionality. Adding extra input variables which are strongly correlated with the existing inputs would not lead to a significant increase in the effective dimensionality of the space occupied by the data. Second, there is generally significant structure in the data so that, for instance, the output variable may vary smoothly with the input variables. Thus, knowledge of the outputs for several input vectors allows the outputs for new inputs to be predicted by an interpolation process. It is this second property of data sets which makes generalization possible.

Notwithstanding these two mitigating effects, the quantity of data needed to specify a mapping can still grow rapidly with dimensionality. As a result, when tackling a practical problem involving a finite size data set, the performance of a neural network system can actually improve when the input dimensionality is reduced by preprocessing, even though information may be lost in the dimensionality reduction process. The fixed quantity of data is better able to specify the mapping in the lower dimensional space, and this can more than compensate for the loss of information.

Dimensionality reduction plays a particularly important role in problems for which the input dimensionality is large. In applications such as the interpretation of images, for instance, the number of pixels may be many thousands. Direct presentation of the data to a network having large numbers of input units (one per pixel) and consequently large numbers of degrees of freedom will typically give very poor results.

Note that there is an additional benefit from dimensionality reduction in the form of reduced training times, arising from the fact that there are now fewer adjustable parameters in the network (since the number of input units has been reduced, and so there are fewer weights in the first layer).

B. Linear rescaling

In addition to reducing dimensionality, there are other motivations for preprocessing the data. One of the most common forms of preprocessing involves a simple linear rescaling of the input variables, and possibly also of the output variables. In a typical application the different input variables may represent very different physical quantities. For instance, one input might represent a magnetic field value, and be O(1), while another might represent a frequency, and be O(10^10). This would typically cause difficulties in network training, since the optimal values for the weights would need to span a range of O(10^10), and it would prove difficult to discover such a solution using standard training algorithms. The problem can be resolved by performing a linear rescaling of the data to ensure that each input variable has zero mean and unit standard deviation over the training set. Thus the inputs to the network x̃_i are obtained from the raw input data x_i by the following transformation:

x̃_i^q = (x_i^q − x̄_i) / σ_i.   (49)

If we set

x̄_i = (1/n) Σ_{q=1}^{n} x_i^q,   σ_i^2 = (1/(n−1)) Σ_{q=1}^{n} (x_i^q − x̄_i)^2,

then the rescaled inputs will all have zero mean and unit variance with respect to the training data set. Once the network is trained, the same rescaling (49), using the same values for the coefficients, must be applied to all future data presented to the network. A similar rescaling is often applied also to the target data for interpolation problems, and this rescaling must be inverted to post-process data obtained from the trained network. The rescaling in Eq. (49) treats the input variables as independent. A more sophisticated linear rescaling, known as whitening,54 takes account also of correlations between the input variables.

C. Feature extraction

The simple rescaling of input variables described above represents an invertible transformation in which no information is lost. As we have already indicated, however, in many applications involving large numbers of input variables it can be very advantageous to reduce the dimensionality of the input vector, even though this represents a non-invertible process in which the amount of available information is potentially diminished.

One of the simplest approaches to dimensionality reduction is to discard a subset of the input variables. Techniques for doing this generally involve finding some ranking of the relative importance of different inputs and then omitting the least significant. In principle, the relative importance of the input variables depends on what kind of mapping function will be employed, and so strictly a neural network should be trained for each possible subset of the input variables. Since in practice this is likely to be computationally prohibitive, a simpler system which can be trained very quickly (such as a linear model) is often used to order the inputs. An appropriate subset is then used for training the more flexible non-linear network.
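The standardization of Eq. (49), with the training-set coefficients stored for application to future data, can be sketched as follows (the two-input data set is illustrative only):

```python
import numpy as np

def fit_rescaling(x_train):
    """Compute per-input mean and standard deviation over the training set."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0, ddof=1)   # 1/(n-1) convention
    return mean, std

def rescale(x, mean, std):
    """Apply Eq. (49) using the stored training-set coefficients."""
    return (x - mean) / std

rng = np.random.default_rng(0)
# Two inputs of very different magnitude: an O(1) field and an O(1e10) frequency.
x_train = np.column_stack([rng.normal(1.0, 0.1, 200),
                           rng.normal(1e10, 1e9, 200)])
mean, std = fit_rescaling(x_train)
x_tilde = rescale(x_train, mean, std)
# Each rescaled input now has zero mean and unit variance.
```

The same `mean` and `std` must be reused, unchanged, when rescaling any data presented to the trained network later.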
Other forms of dimensionality reduction make no use of the target data but simply look at the statistical properties of the input data alone. The most common such technique is principal components analysis,69 in which a linear dimensionality reducing transformation is sought which maximizes the variance of the transformed data. While easy to apply, such techniques run the risk of being significantly sub-optimal since they take no account of the target data.

More generally, the goal of preprocessing is to find a number of (usually non-linear) combinations of the input variables, known as features, which are designed to make the task of the neural network as easy as possible. By selecting fewer features than input variables, a dimensionality reduction is achieved. The optimum choice of features is very problem dependent, and yet can have a strong influence on the final performance of the network system. It is here that the skill and experience of the developer count a great deal.

D. Prior knowledge

One of the most important, and most powerful, ways in which the performance of neural network systems can be improved is through the incorporation of additional information, known as prior knowledge, into the network development and training procedure, in addition to using the information provided by the training data set. Prior knowledge can take many forms, such as invariances which the network transformation must respect, or the expected frequencies of occurrence of different classes in a classification problem.

One way of exploiting prior knowledge is to build it into the data preprocessing stage. If the desired outputs from the network are known to be invariant under some set of transformations, then features can be extracted which exhibit this property, thereby ensuring that the network outputs will automatically show the same behavior. The technique of regularization, discussed in Sec. V, also implicitly involves incorporating prior knowledge into the network training process. For example, a regularization term which penalizes high curvature in the network mapping function reflects prior knowledge that the function should be smooth.65 Prior knowledge can also be used to configure the topology of the network itself. For instance, the postal code recognition system described in Ref. 30 uses a system of local "receptive fields" with "shared weights" to achieve approximate invariance to translations of the characters within the input image.

The inclusion of explicit invariance to some set of transformations is an important use of preprocessing. It leads to significant advantages compared with having the network learn the invariance property by example: (i) the invariance property is satisfied exactly, whereas it would only be learned approximately from examples; (ii) a smaller training set can be used, since any set of patterns which differ only by the transformation can be represented by a single pattern in the training set; (iii) the network is able to extrapolate to new input vectors provided these differ from the training data primarily by virtue of the invariant transformation.

We shall describe some further examples of the use of prior knowledge when we review a number of case studies in Sec. VIII.

VII. IMPLEMENTATION OF NEURAL NETWORKS

So far we have discussed neural networks as abstract mathematical functions. In a practical application, it is necessary to provide an implementation of the neural network. At present, the great majority of research projects in neural networks, as well as most practical applications, makes use of simulations of the networks written in conventional software and running on standard computer platforms. While this is adequate for many applications, it is also possible to implement networks in various forms of special-purpose hardware. This takes advantage of the intrinsic parallelism of neural network models and can lead to very high processing speeds. We begin, however, with a discussion of software implementation.

A. Software implementation

Most applications of neural networks use software implementations written in high level languages such as C, PASCAL, and FORTRAN. The neural network algorithms themselves are generally relatively straightforward to implement, and much of the effort is often devoted to application-specific tasks such as data preprocessing and the user interface. Neural networks are well suited to implementation in object oriented languages such as C++, which allow a network to be treated as an object, with methods to implement the basic operations of forward propagation, saving and retrieving weight vectors, etc.

There are now numerous neural network software packages available, ranging from simple demonstration software provided on disk with introductory books, through to large commercial packages supporting a range of network architectures and training algorithms and having sophisticated graphical interfaces. The latter kind of software can be very useful for quick prototyping, and provides an easy way to gain hands-on experience with neural networks without requiring a heavy investment in software development. It is important to emphasize, however, that such software cannot be treated as a black box solution to problems since, as we have seen, there are numerous subtle issues which must be addressed if satisfactory performance is to be obtained. Some of these, such as the incorporation of prior knowledge, can sometimes be highly problem-specific, and do not readily lend themselves to inclusion in commercial software. Also, the fact that such software does not usually provide direct access to source code can significantly limit its applicability to complex real-world problems.

B. Hardware implementation

One of the potential advantages of neural network techniques compared with many conventional alternatives is that of speed. There are in fact two quite distinct reasons why neural networks can prove to be significantly faster than conventional methods. The first applies to software simulations as well as hardware implementations and stems from the fact that, once trained, a neural network operating in feedforward mode can perform a multivariate non-linear transformation in a fixed (and generally very small) number of operations.
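The network-as-object idea, and the fixed cost of a feedforward pass, can be sketched in a few lines. This is a minimal illustration in Python rather than C++; the class name and layer sizes are arbitrary.

```python
import numpy as np

class MLP:
    """Minimal two-layer perceptron treated as an object, with methods
    for forward propagation and for saving the weight vectors."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))   # first-layer weights
        self.b1 = np.zeros(n_hidden)                     # hidden biases
        self.W2 = rng.normal(0, 0.1, (n_out, n_hidden))  # second-layer weights
        self.b2 = np.zeros(n_out)                        # output biases

    def forward(self, x):
        # Cost is dominated by the two matrix-vector products,
        # not by the evaluation of the activation functions.
        z = np.tanh(self.W1 @ x + self.b1)   # hidden-unit activations
        return self.W2 @ z + self.b2         # linear output units

    def save(self, path):
        np.savez(path, W1=self.W1, b1=self.b1, W2=self.W2, b2=self.b2)

net = MLP(n_in=4, n_hidden=5, n_out=1)
y = net.forward(np.zeros(4))   # a fixed, small number of operations
```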
This contrasts with many conventional techniques which achieve high speed at the expense of restriction to linear transformations, or which solve non-linear problems by means of iterative, and hence computationally intensive, approaches. Of course, it should be remembered that the process of training a neural network can be computationally intensive and slow, although for many applications training is performed only during the development phase, with the network being used as a feedforward system when processing new data.

The second reason why neural networks can give very high processing speeds is that they are intrinsically highly parallel systems and so can be implemented in special-purpose parallel hardware. This gives an additional increase in speed in addition to that resulting from the feedforward nature of the network mapping.

Even with serial processor hardware, it is possible to exploit the structure of the neural network mapping to improve processing speed. For instance, many DSP (digital signal processing) and workstation processors can perform vector-matrix operations very efficiently. Since the number of weights in a typical network is generally much larger than the number of nodes, the dominant contribution to the computation comes from the product-and-sum stages rather than from the evaluation of activation functions, and this is essentially a vector-matrix operation. The resulting improvements in processing speed apply to the training phase of the network as well as to its subsequent use in feedforward mode.

It is also relatively straightforward to implement neural networks on arrays of processors such as transputers, or even a network of workstations. During training, the network can be replicated on all of the processors, which then each deal with a subset of the patterns. Alternatively, different parts of the network (for instance successive layers) can be assigned to different processors, which then operate in a pipeline fashion. This relative ease of parallel implementation should be contrasted with the often severe problems of making efficient use of multiple processors with many conventional methods. If a conventional algorithm relies at some point on a single serial calculation before the rest of the algorithm can proceed, then only one processor can perform this step while the other processors remain idle.

For very high processing speeds a network can be implemented in special-purpose hardware in which the various components of the network (weights and nodes) are mapped directly into elements of the hardware system. A flexible modular implementation of the multilayer perceptron, built from conventional surface mount technology in a VME rack system,70 was recently used successfully for real-time feedback control of a tokamak plasma.71 This system used multiplying DAC's (digital to analogue converters) acting as digitally-set resistors to provide the weights and biases, and temperature-compensated transistor circuits to implement the non-linear sigmoids. This gives a system which has analogue signal paths but in which the synaptic weights can be set digitally, allowing the weights to be specified to reasonably high precision.

Much of the research on hardware implementations of neural networks focuses on VLSI (Very Large Scale Integration) techniques, and these can be broadly divided into digital and analogue approaches. Digital systems make use of highly developed silicon fabrication technology, are robust to small variations in the fabrication process, and offer the flexibility to be reconfigured in software to give a wide variety of architectures. They also support network training algorithms and can therefore speed up this computationally intensive process.

By contrast, analogue systems suffer from low precision weights, and are sensitive to process variations. Also, they do not at present support learning, which must be done separately in software on a conventional computer. They do, however, offer a very high density of processing elements. The Intel ETANN (Electrically Trainable Analogue Neural Network) chip, currently the only analogue neural network chip available commercially, contains over 10,000 weights, giving an effective processing capability of 4 GFlops (4 × 10^9 floating point operations per second) per chip, which is comparable with a large supercomputer. The chips can easily be cascaded to build larger networks with correspondingly higher equivalent processing capacity.

Research is also underway into wafer-scale integration of neural networks, and also into optical and optoelectronic implementations. These latter two make use of modulated laser beams to perform the basic vector-matrix operation, with sigmoidal non-linearities implemented either in non-linear optics or in conventional electronics. Holographic systems are often used to implement the weights.

VIII. EXAMPLE APPLICATIONS

In this section we shall review a number of applications of neural networks in the area of scientific instrumentation. This is in no way intended to be a comprehensive survey of applications, which would be well beyond the scope of this review, but rather a selection of examples to illustrate ideas developed in earlier sections. Information on where to find other applications can be found in the review of neural network literature in the Appendix.

A. Interpolation

One of the simplest ways to use a neural network is as a form of multivariate non-linear regression to find a smooth interpolating function from a set of data points. As an example we consider the problem of predicting a quantity known as the (dimensionless) energy confinement time τ̂_E of a tokamak plasma from a knowledge of four dimensionless variables constructed from experimentally measured quantities. The dimensionless variables are denoted q (safety factor), ν_* (ratio of effective electron collisionality to trapped electron bounce frequency), β_p (poloidal beta), and ρ̂ (normalized electron Larmor radius). The precise definitions of these quantities are not important here; more detailed information on this application can be found in Ref. 72. The goal is to predict τ̂_E from knowledge of these dimensionless variables, and so we are seeking a functional relationship of the form

τ̂_E = F(q, ν_*, β_p, ρ̂).   (51)
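One way to represent F is a single-hidden-layer network applied to log-transformed inputs, as in the network of Fig. 24. The sketch below uses arbitrary, untrained parameters purely to show the functional form; a real network would be trained on the experimental database.

```python
import numpy as np

def tau_E_hat(q, nu_star, beta_p, rho_hat, W1, b1, w2, b2):
    """Network form of Eq. (51): inputs and output are taken in log
    space to compress their dynamic range; one hidden layer of tanh units."""
    u = np.log(np.array([q, nu_star, beta_p, rho_hat]))
    z = np.tanh(W1 @ u + b1)          # hidden-unit activations
    return np.exp(w2 @ z + b2)        # invert the output log transform

# Illustrative, untrained parameters (4 inputs, 3 hidden units).
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (3, 4))
b1 = np.zeros(3)
w2 = rng.normal(0, 0.5, 3)
b2 = 0.0

tau = tau_E_hat(2.0, 0.1, 0.5, 0.01, W1, b1, w2, b2)
```

Because the output is exponentiated, the predicted confinement time is positive by construction, which is itself a small piece of prior knowledge built into the representation.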
FIG. 24. Network structure used for predicting the normalized energy confinement time τ̂_E of a tokamak plasma in terms of a set of dimensionless experimentally measured quantities q, ν_*, β_p, and ρ̂. Note how the basic input and output quantities have been processed by taking logarithms in order to compress their dynamic range. The bias units have been omitted for clarity. (From Ref. 72.) [Figure: network diagram with inputs ln q, ln ν_*, ln β_p, and ln ρ̂.]

In principle, this function could be predicted from plasma physics considerations, but in practice the physical processes are much too complex, and so empirical methods are used.

The conventional approach to this problem is to make the arbitrary assumption that the function F( ) in Eq. (51) takes the form of a product of powers of the independent variables, so that

FIG. 25. The solid curve shows the behavior of the energy confinement time τ̂_E versus the input variables for the energy confinement time problem, corresponding to the network shown in Fig. 24. Since there are four input variables, the horizontal axis has been taken along the direction of the first principal component of the test data set, and the parameter y measures distance along this direction. The dashed curve shows the corresponding results obtained using the linear regression. Note that the linear regression function necessarily produces a power law behavior, while the neural network function is able to represent a more general class of functions and hence can capture more of the structure in the data. (From Ref. 72.) [Figure: τ̂_E spanning roughly 0.00 to 0.08, plotted against y from -1 to 3.]
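The conventional power-law assumption amounts to linear regression in the logarithms of the variables, since a product of powers satisfies ln τ̂_E = ln a_0 + Σ_i α_i ln x_i. A minimal sketch, using synthetic data generated from an exact power law (the exponent values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic positive data standing in for (q, nu_*, beta_p, rho_hat).
X = rng.lognormal(size=(200, 4))
true_alpha = np.array([0.5, -0.3, 0.8, -1.2])      # assumed exponents
tau = 2.0 * np.prod(X ** true_alpha, axis=1)       # exact power law, a0 = 2

# A product of powers is linear in log space, so ordinary least squares
# on the log-transformed variables recovers the prefactor and exponents.
A = np.column_stack([np.ones(len(X)), np.log(X)])
coef, *_ = np.linalg.lstsq(A, np.log(tau), rcond=None)
# coef[0] estimates ln a0, and coef[1:] the exponents alpha_i.
```

This is exactly the restriction illustrated by the dashed curve of Fig. 25: the fitted function is forced to be a power law, whereas the network can represent a more general class of functions.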
FIG. 26. A modified version of the network shown in Fig. 24 in which τ̂_E is constrained to be a power law function of the input variable ρ̂. This is achieved by permitting only a single connection from the ρ̂ input to the output (and using a linear output unit) to ensure that ln τ̂_E is a linear function of ln ρ̂. This provides an example of how prior knowledge can be built into the structure of a network. (From Ref. 72.)

FIG. 27. Schematic cross-section of an oil pipeline containing a mixture of oil, water, and gas in a stratified configuration. Also shown is the path of a gamma beam whose attenuation provides information on the quantities of the three substances present in the pipe (gamma densitometry). The quantities x_o, x_w, and x_g represent the path lengths in the oil, water, and gas phases respectively. If attenuation measurements are made at two different wavelengths, and use is made of the constraint x_o + x_w + x_g = X (where X is the total path length within the pipe), then the values of the three path lengths can be determined. By making such measurements along several chords through the pipe, sufficient information can be obtained to allow the volume fractions of the phases to be determined (with the aid of a neural network mapping) even though the 3 phases may exhibit a variety of different geometrical configurations. (From Ref. 67.)

nection from the ln ρ̂ input to the output, and since the output unit has a linear activation function, this achieves the required effect. The single weight from the ln ρ̂ input to the output represents the parameter a in Eq. (54).

Another important feature of neural networks when used to perform non-linear interpolation is their ability to learn how to combine data from several sensors to produce meaningful outputs without the need to develop a detailed physical model to describe the required data transformations. This is called sensor fusion or data fusion, and plays a role in many neural network applications. An example of the fusion of magnetic field data with line-of-sight optical data to generate spatial profiles of electron density in a plasma can be found in Ref. 73.

B. Classification

We next discuss an application of neural networks involving classification. The problem concerns the monitoring of oil flow along pipelines which carry a mixture of oil, water, and gas, and the aim is to provide a non-invasive technique for measurement of oil flow rates, a problem of considerable importance to the oil industry. The approach described in Ref. 67 is based on the technique of dual-energy gamma densitometry. This involves measurement of the attenuation of a collimated beam of mono-energetic gammas passing through the pipe, as indicated in Fig. 27. For a gamma beam passing through a single homogeneous substance the fraction of the beam intensity I attenuated per unit length is constant, and so the intensity would decay exponentially with distance according to

I = I_0 exp(−μρx),   (55)

where I_0 is the beam intensity in the absence of matter, x is the path length within the material, ρ is the mass density of the material, and μ is the mass absorption coefficient for the particular material at a particular gamma energy. For a gamma beam passing through a combination of oil, water, and gas, the intensity of the beam decays like

I = I_0 exp(−μ_o ρ_o x_o − μ_w ρ_w x_w − μ_g ρ_g x_g),   (56)

where x_o, x_w, and x_g represent the path lengths through each of the three phases, as indicated in Fig. 27. The measurement from a single beam line does not provide sufficient information to determine all 3 path lengths, and so a second gamma beam of a different energy is passed along the same path as the first beam. Since the absorption coefficients are different at the two energies, the measured attenuation of the beam provides a second independent piece of information. Finally, the three path lengths are constrained to add up to the total path length through the pipe,

x_o + x_w + x_g = X,   (57)

as shown in Fig. 27. We therefore have enough information to extract the individual path lengths, which can be expressed analytically in terms of the measured attenuations.67

Measurements from a single dual-energy beamline are insufficient to determine the volume fractions of the three phases within the pipe, since the phases can occur in a variety of geometrical configurations, as illustrated in Fig. 28. This shows 4 model configurations of 3-phase flows used to generate synthetic data for the neural network study. However, by making measurements along several beamlines, information on the configuration of the phases can be obtained. The system considered in Ref. 67 consists of 3 vertical and 3 horizontal beams arranged asymmetrically.

Multi-phase flows are notoriously difficult to model numerically and so it is not possible to use a first-principles approach to the interpretation of the data from the densitometer.
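The path-length extraction from Eqs. (55)-(57) reduces to a small linear system: taking A_j = −ln(I_j / I_0) at the two energies, together with the constraint (57), gives three linear equations in the three path lengths. A sketch, with illustrative (not tabulated) attenuation coefficients:

```python
import numpy as np

# Illustrative mu*rho products (per cm) for oil, water, gas at the
# two gamma energies; these values are assumed, not measured.
k1 = np.array([0.20, 0.25, 0.0003])   # energy 1
k2 = np.array([0.12, 0.17, 0.0002])   # energy 2
X_total = 10.0                        # total path length in the pipe, cm

x_true = np.array([3.0, 5.0, 2.0])    # oil, water, gas path lengths, cm

# Simulated attenuation measurements from Eq. (56): A_j = -ln(I_j / I_0).
A1 = k1 @ x_true
A2 = k2 @ x_true

# Eq. (56) at the two energies plus the constraint (57) form a 3x3 system.
M = np.vstack([k1, k2, np.ones(3)])
b = np.array([A1, A2, X_total])
x = np.linalg.solve(M, b)             # recovers the three path lengths
```

The system is solvable because the absorption coefficients differ between the two energies, which is precisely why the second beam provides independent information.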
It is, however, possible to collect large amounts of training data by attaching the densitometer system to a standard multi-phase test rig. This problem is therefore well suited to analysis by neural network techniques.

From the system of 6 dual-energy densitometers we can extract the corresponding 6 values of path length in oil and 6 values of path length in water [the remaining 6 path lengths in gas represent redundant information by virtue of Eq. (57)]. This gives 12 measurements from which we can attempt to determine the geometrical phase configuration. Synthetic data were generated from the 4 model phase configurations shown in Fig. 28. To generate each data point in the training set, one of the configurations was selected at random with equal probability. Then the fractions of oil, water, and gas for this configuration were selected randomly with uniform probability distribution, subject to the constraint that they must add to unity. The 12 independent path lengths are then calculated geometrically, and these form the inputs to a neural network. The dominant source of noise in this application arises from photon statistics, and these are included in the data using the correct Poisson distribution.

FIG. 28. Four model configurations of 3-phase flow, used to generate training and test data for a neural network which is trained to predict the fractional volumes of oil, water, and gas in the pipe. Inputs to the network are taken from 6 dual-energy gamma densitometers of the kind illustrated in Fig. 27. (From Ref. 67.) [Panels: homogeneous, stratified, annular, and inverse annular; legend: gas, oil, water.]

In order to predict the phase configuration, the network is given 4 outputs, one for each of the configurations shown in Fig. 28, and a 1-of-N coding is used as described in Sec. V. Networks were trained using a data set of 1,000 examples, and then tested using a further 1,000 independent examples. The network structure consisted of a multilayer perceptron with a single hidden layer of logistic sigmoidal units and an output layer also of logistic sigmoidal units. In order to compare the network against a more conventional approach, the same data were used to train a single-layer network having sigmoidal output units, which corresponds to a form of linear discriminant function.1,53 The number of hidden units in the network was selected by training several networks and selecting the one with the best performance on the test set, as described in Sec. V, which gave a network having 5 hidden units. Results from the classification problem are summarized in Table I.

More detail on the performance of the network in determining the phase configuration can be obtained from "confusion matrices" which show, for each actual configuration, how the examples were distributed according to the predicted configurations. For perfect classification all entries would be zero except on the leading diagonal. Here the configurations have been ordered as (homogeneous, stratified, annular, inverse annular). The confusion matrices, for both training and test sets, generated by the trained network having 5 hidden units are shown below.

[Confusion matrices, with actual configuration by row and predicted configuration by column; the recoverable entries lie almost entirely on the leading diagonal (training-set diagonal entries include 259, 239, and 242; test-set entries include 255 and 247).]

In this particular application, the real interest is in being able to determine the volume fractions of the three phases, and in particular of the oil. Once the phase configuration is known, these volume fractions can be calculated geometrically from the path length information. However, a more direct approach is to train a network to map the path length information from the densitometers directly onto the volume fractions. This leads to a network with 12 inputs, and 2 outputs corresponding to the volume fractions of oil and water (the volume fraction of gas being redundant information). This is the application which generated the plot of training and test errors versus the number of hidden units shown in Fig. 22.
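The 1-of-N target coding and the construction of a confusion matrix used above can be sketched as follows (the label sequences are illustrative; in practice the predicted class is taken as the largest of the network's outputs):

```python
import numpy as np

configs = ["homogeneous", "stratified", "annular", "inverse annular"]

def one_of_n(label, n=4):
    """1-of-N target coding: one output per class, set to 1 for the true class."""
    t = np.zeros(n)
    t[label] = 1.0
    return t

def confusion_matrix(actual, predicted, n=4):
    """Rows: actual configuration; columns: predicted configuration."""
    C = np.zeros((n, n), dtype=int)
    for a, p in zip(actual, predicted):
        C[a, p] += 1
    return C

# Illustrative labels only; indices follow the ordering in `configs`.
actual    = [0, 0, 1, 2, 3, 3]
predicted = [0, 0, 1, 2, 3, 1]
C = confusion_matrix(actual, predicted)
# Perfect classification would leave all off-diagonal entries zero.
```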
TABLE I. Results for neural network prediction of phase configurations. Values of zero indicate errors of less than 1.0×10-*.

C. Inverse problems

A large proportion of the data processing tasks encountered in instrumentation applications can be classified as inverse problems. The meaning of this term is best illustrated with a particular example. Consider the general tomography problem illustrated in Fig. 29. The goal is to determine the local spatial distribution of a quantity Q(r) from a number of line-integral measurements

λ_i = ∫_{Γ_i} Q(r) dr   (58)

made along various lines of sight Γ_i through a given region of space. Here, the quantity Q(r) might for instance be the soft x-ray emissivity from a tokamak plasma. Other examples include x-ray absorption tomography in medical applications, and ultrasonic tomography for non-destructive testing.

The tomography problem is characterized by the existence of a well-defined forward problem in which we suppose that the spatial distribution Q(r) is known and we wish to predict the values of the line integrals λ_i. This problem is well defined and has a unique solution obtained simply from the evaluation of the integrals in Eq. (58). However, in practice we are required to solve the inverse problem of determining the function Q(r) from a finite number of measurements λ_i. This inverse problem is ill-posed60,74 because there are infinitely many functions Q(r) which give rise exactly to the same given set of line integral measurements. In addition, the measurements may be corrupted by noise, and so we do not necessarily seek a solution which fits the data exactly.

FIG. 29. The tomography problem involves determining the local spatial distribution of a quantity Q(r) from line integral measurements λ_i = ∫_{Γ_i} Q(r) dr made along a number of lines of sight Γ_i. If the function Q(r) were known, then the evaluation of the various line integrals (this is called the forward problem) would be straightforward and would give unique results. In practice, we must solve the inverse problem of finding Q(r) from the finite set of measured values given by the λ_i, which is ill-posed since there are infinitely many solutions. Neural networks are well-suited to the solution of many inverse problems.

Many of the problems which arise in data analysis to which neural networks may be applied are inverse problems. (Note that curve fitting, and network training, are themselves inverse problems.) Examples include the reconstruction of electron density profiles in tokamaks from line integral measurements, and the simultaneous fitting of several overlapping Gaussians to complex spectra. There is generally a well-defined forward problem which may have a fast solution, but the inverse problem is often ill-posed, and with conventional approaches may require computationally intensive iterative techniques to find a solution. The neural network approach offers the advantages of very high speed and can avoid the need for a good initial guess which is often a source of difficulty with conventional iterative methods. In some instances the forward problem can be used to generate synthetic data which can be used to train the network.

There is, however, a potential difficulty in applying neural networks to inverse problems which can lead to very poor results unless appropriate care is taken. Consider the simple example of training a network on the inverse problem shown in Fig. 30. The mapping from the variable u to the variable v, shown by the solid curve, is single-valued (in other words it is a function). If, however, we consider the inverse mapping from v to u, then we see that this is not single-valued for a range of v values between v_1 and v_2. As discussed in Sec. V the output of a trained network approximates the conditional average of the target data given by Eq. (44). If data

FIG. 30. Schematic example of an inverse problem for which a direct application of neural networks would give incorrect results. The mapping from v to u, shown by the solid curve, is multivalued for values of v in the range v_1 to v_2. A network trained by minimizing a sum-of-squares error based on training data generated from the solid curve would give a result of the form indicated by the dashed curve (this result follows from the fact that the function represented by the trained network is given by the conditional average of the target data, as illustrated in Fig. 18). In the region where u(v) is multi-valued, the network outputs can be substantially different from the desired values.
1828 Rev. Sci. Instrum., Vol. 65, No. 6, June 1994 Neural networks
from the solid curve in Fig. 30 are used to train a network, the resulting network mapping will have the form shown by the dashed curve. In the range where the data is multivalued the output of the network can be completely spurious, since the average of several values of u may itself not be a valid value for that variable for the given value of v. This problem is not resolved by increasing the quantity of data or by improvements in the training procedure.

When applying neural networks to inverse problems it is therefore essential to anticipate the possibility that the data may be multivalued. One approach to resolving this problem is to exclude all but one of the branches of the inverse mapping (or to train separate networks for each of the branches if all possible solutions are needed). For a detailed example of how this technique is applied in practice, in this case to the determination of the coefficients in a Gaussian function fitted to a spectral line, see Ref. 22.

D. Control applications

In this review we have concentrated almost entirely on neural networks for data analysis, and indeed this represents the area where these techniques are currently having the greatest practical impact. However, neural networks also offer considerable promise for the solution of many complex problems in non-linear control.

Feedforward networks, of the kind considered in this review, can be used to perform a non-linear mapping within the context of a conventional linear feedback control loop. This technique has been exploited successfully for the feedback control of tokamak plasmas,71,75 as illustrated in Fig. 31. Here the inputs to the network consist of a number of magnetic signals (typically between 10 and 100) obtained from pick-up coils located around the tokamak vacuum vessel. These are mapped by the neural network onto a set of geometrical parameters which describe the position and shape of the boundary of the plasma. The values for these parameters as predicted by the neural network are compared with desired values which have been preprogrammed as functions of time prior to the plasma pulse. The resulting error signals are then sent to standard PID (proportional-integral-differential) linear control amplifiers which adjust the position and shape of the plasma by changing the currents in a number of control coils.

FIG. 31. Neural networks have recently been used for real-time feedback control of the position and shape of the plasma in the COMPASS tokamak experiment, using the control system shown here. Inputs to the network consist of a set of signals m from magnetic pick-up coils which surround the tokamak vacuum vessel. These are mapped by the network onto the values of a set of variables yk which describe the position and shape of the plasma boundary. Comparison of these variables with their desired values yi (which are preprogrammed to have specific time variations) gives error signals which are sent, via control amplifiers, to sets of feedback control coils which can modify the position and shape of the plasma boundary. Due to the very high speed (~10 μs) at which the feedback loop must operate, a fully parallel hardware implementation of the neural network was used. (From Ref. 71.)

The network is trained off-line in software from a large data set of example plasma configurations obtained by numerical solution of the plasma equilibrium equations. In order to achieve real-time operation, the network was implemented in special purpose hybrid digital-analogue hardware71 described in Sec. VII. Values for the network weights, obtained from the software simulation, are loaded into the network prior to the plasma pulse. This system recently achieved the first real-time control of a tokamak plasma by a neural network.72

This application provides another example of the use of prior knowledge in neural networks. It is a consequence of the linearity of Maxwell's equations that, if all of the currents in the tokamak system are scaled by a constant factor, the magnetic field values will be scaled by the same constant factor and the plasma position and shape will be unchanged. This implies that the mapping from measured signals to the position and shape parameters, represented by the network, should have the property that, if all the inputs are scaled by the same factor, the outputs should remain unchanged. Since the order of magnitude of the inputs can vary by a factor of up to 100 during a plasma pulse, there is considerable benefit in building in this prior knowledge explicitly. This is achieved by dividing all inputs by the value of the total plasma current. A hardware implementation of this normalization process was developed for real-time operation. If this prior knowledge were not included in the network structure, the network would have to learn the invariance property purely from the examples in the data set.

Another recent real-time application of neural networks was the control of 6 mirror segments in an astronomical optical telescope in order to perform real-time cancellation of distortions due to atmospheric turbulence.76 This technique, called adaptive optics, involves changing the effective mirror shape every 10 ms. Conventional approaches involve iterative algorithms to calculate the required deformations of the mirror, and are computationally prohibitive. The neural network provides a fast alternative, which achieves high accuracy. When the control loop is closed the image quality shows a strong improvement, with a resolution close to that of the Hubble space telescope. In this case the network was implemented on an array of transputers.

Neural networks can also be used as non-linear adaptive components within a control loop. In this case the network continues to learn while acting as a controller, and in principle can learn to control complex non-linear systems by trial and error. This raises a number of interesting issues connected with the fact that the training data which the network sees is itself dependent on the control actions of the network.
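The multivalued-inverse failure discussed above is easy to reproduce numerically. In this sketch (an invented forward mapping v = u², standing in for the curve of Fig. 30), the conditional average E[u|v] that a least-squares-trained network would approximate is estimated directly by binning; it sits near zero between the two branches and is not a valid pre-image.

```python
import numpy as np

rng = np.random.default_rng(1)

# Forward mapping u -> v is single-valued; its inverse is not: every
# v in (0, 1) has two pre-images, u = +sqrt(v) and u = -sqrt(v).
u = rng.uniform(-1.0, 1.0, 20000)
v = u**2

# Least-squares training makes a network approximate the conditional
# average E[u | v]; estimate that average directly by binning in v.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(v, bins) - 1
cond_mean = np.array([u[idx == k].mean() for k in range(10)])

# The two branches cancel: the averaged "solution" is near zero for
# every v, and mapping it forward (squaring) does not reproduce v.
centers = 0.5 * (bins[:-1] + bins[1:])
residual = np.abs(cond_mean**2 - centers)
```

More data or better training does not change this outcome, since the conditional average itself is the quantity being approximated; only restricting to one branch (or training one network per branch) does.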
Such issues take us well beyond the scope of this review, however, and so we must refer the interested reader to the literature for further details.77-79

IX. DISCUSSION

In this review we have focused our attention on feedforward neural networks viewed as general parameterized non-linear mappings between multi-dimensional spaces. Such networks provide a powerful set of new data analysis and data processing tools with numerous instrumentation applications. While feedforward networks currently account for the majority of applications there are many other network models, performing a variety of different functions, which we do not have space to discuss in detail here. Instead we give a brief overview of some of the topics which have been omitted, along with pointers to the literature. We then conclude with a few remarks on the future of neural computing.

A. Other network models

Most of the network models described so far are trained by a supervised learning process in which the network is supplied with input vectors together with the corresponding target vectors. There are other network models which are trained by unsupervised learning in which only the input vectors xq are supplied to the network. The goal in this case is to model structure within the data rather than learn a functional mapping.

One example of unsupervised training is called density estimation in which the network forms a model of the probability distribution p(x) of the data as a function of x.53,1 We have already encountered one example of this in Sec. IV, using the Gaussian mixture model in Eq. (35). Another example is clustering in which the goal is to discover any clumping of the data which may indicate structure having some particular significance.53,80 Yet another application of unsupervised methods is data visualization in which the data is projected onto a 2-dimensional surface embedded in the original d-dimensional space, allowing the data to be visualized on a computer screen.81 In this case the training process corresponds to an iterative optimization of the location of the surface in order to capture as much of the structure in the data as possible. Unsupervised networks are also used for dimensionality reduction of the data prior to treatment with supervised learning techniques in order to mitigate the effects of the curse of dimensionality.

One of the restrictions placed on the networks discussed in this review is that they should have a feedforward structure so that the output values become explicit functions of the inputs. If we consider network diagrams with connections which form loops then the network acquires a dynamical behavior in which the activations of the units must be calculated by evolving differential equations through time. A class of such networks having some historical significance is that developed by Hopfield,15,16 who showed that, if the connection from unit a to unit b has the same strength as the connection from unit b back to unit a, then the evolution of the network corresponds to a relaxation described by an energy function, thereby ensuring that the network evolves to a stationary state. Such networks can act as associative memories which reconstruct a complete pattern from a partial cue, or from a corrupted version of that pattern. They have also been used to solve combinatorial optimization problems, such as the placing of components in an integrated circuit or the scheduling of steps in a manufacturing process.

Another aspect of the techniques considered in this review is that all of the input data have been treated as static vectors. There is also considerable interest in being able to deal effectively with time varying signals. The simplest, and most common, approach is to sample the time series at regular intervals and then treat a succession of observed values as a static vector which can then be used as the input vector of a standard feedforward network. This approach has been used with considerable success both for classification of time series in problems such as speech recognition81 and for prediction of future values of the time series82 in applications such as financial forecasting or the prediction of sunspot activity. A more comprehensive approach would, however, make use of dynamical networks of the kind discussed above.

It should be emphasized that most of these neural network techniques have their counterparts in conventional methods. In many cases the neural network provides a non-linear extension of some well known linear technique. Anyone wishing to make serious use of neural networks is therefore recommended to become familiar with these conventional approaches.53-55

Throughout this review we have discussed learning in neural networks in terms of the minimization of an error function. However, learning and generalization in neural networks can also be formulated in terms of a Bayesian inference framework,83-86 and this is currently an active area of research.

B. Future developments

Feedforward neural networks are now becoming well established as methods for data processing and interpretation, and as such will find an ever greater range of practical applications both in scientific instrumentation and many other fields. However, it is clear too that the connectionist paradigm for information processing is a very rich one which, 50 years after the pioneering work of McCulloch and Pitts, we are only just beginning to explore. It is likely to be a very long time before artificial neural networks approach the complexity or performance of their biological counterparts. Nevertheless, the fact that biological systems achieve such impressive feats of information processing using this basic connectionist approach will remain as a constant source of inspiration. While it would be unwise to speculate on future technical developments in this field, there can be little doubt that the future will be an exciting one.

APPENDIX: GUIDE TO THE NEURAL COMPUTING LITERATURE

The last few years have witnessed a dramatic growth of activity in neural computing accompanied by a huge range of books, journals, and conference proceedings. Here we aim to
provide an overview of the principal sources of information on neural networks, although we cannot hope to be exhaustive.

The following journals specialize in neural networks. It should be emphasized, however, that the subject spans many disciplines and that important contributions also appear in a range of journals specializing in other subjects.

Neural Networks is published bimonthly by Pergamon Press and first appeared in 1988. It covers biological, mathematical, and technological aspects of neural networks, and a subscription is included with membership of the International Neural Network Society.

Neural Computation is a high quality multidisciplinary letters journal published quarterly by MIT Press.

Network is another cross-disciplinary journal and is published quarterly by the Institute of Physics in the U.K.

International Journal of Neural Systems is published quarterly by World Scientific and also covers a broad range of topics.

IEEE Transactions on Neural Networks is a journal with a strong emphasis on artificial networks and technology and appears bimonthly.

Neural Computing and Applications is a new journal concerned primarily with applications and is published quarterly by Springer-Verlag.

Neurocomputing is published bimonthly by Elsevier.

There are currently well over 100 books available on neural networks and it is impossible to survey them all. Many of the introductory texts give a rather superficial treatment, generally with little insight into the key issues which often make the difference between successful applications and failures. Some of the better books are those given in Refs. 87 and 88. A more comprehensive account of the material covered in this review can be found in Ref. 1.

One of the largest conferences on neural networks is the International Joint Conference on Neural Networks (IJCNN) held in the USA (and also in the Far East) with the proceedings published by IEEE. A scan through the substantial volumes of the proceedings gives a good indication of the tremendous range of applications now being found for neural network techniques. A similar annual conference is the World Congress on Neural Networks. A comparable, though somewhat smaller, conference is held each year in Europe as the International Conference on Artificial Neural Networks (ICANN). An excellent meeting is the annual Neural Information Processing Systems conference (NIPS) whose proceedings are published under the title Advances in Neural Information Processing Systems by Morgan Kaufmann. These proceedings provide a snapshot of the latest research activity across almost all aspects of neural networks, and are highly recommended. Details of future conferences can generally be found in the various neural network journals.

1. C. M. Bishop, Neural Networks for Statistical Pattern Recognition (Oxford University Press, Oxford, 1994).
2. H. White, Neural Comput. 1, 425 (1989).
3. C. M. Bishop, Proc. IEE, Proceedings: Vision, Image and Speech; Special Issue on Neural Networks (1994).
4. E. R. Kandel and J. H. Schwartz, Principles of Neuroscience, 2nd ed. (Elsevier, New York, 1985).
5. Scientific American, special issue on Mind and Brain, September (1992).
6. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, Cambridge, 1986), Vol. 2.
7. W. S. McCulloch and W. Pitts, Bull. Math. Biophys. 5, 115 (1943).
8. D. O. Hebb, The Organization of Behaviour (Wiley, New York, 1949).
9. F. Rosenblatt, Psychol. Rev. 65, 386 (1958).
10. F. Rosenblatt, Principles of Neurodynamics (Spartan Books, Washington, DC, 1962).
11. B. Widrow, Self-Organizing Systems, edited by G. T. Yovitts (Spartan Books, Washington, DC, 1962).
12. B. Widrow and M. E. Hoff, Adaptive Switching Circuits (IRE WESCON Convention Record, New York, 1960), p. 96.
13. B. Widrow and M. Lehr, Proc. IEEE 78, 1415 (1990).
14. M. Minsky and S. Papert, Perceptrons (MIT Press, Cambridge, 1969), also available in an expanded edition (1990).
15. J. J. Hopfield, Proc. Natl. Acad. Sci. 79, 2554 (1982).
16. J. J. Hopfield, Proc. Natl. Acad. Sci. 81, 3088 (1984).
17. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
18. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, Cambridge, 1986), Vol. 1.
19. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, Cambridge, 1986), Vol. 3.
20. Neurocomputing: Foundations of Research, edited by J. A. Anderson and E. Rosenfeld (MIT Press, Cambridge, 1988).
21. Neurocomputing, edited by J. A. Anderson and E. Rosenfeld (MIT Press, Cambridge, 1990), Vol. 2.
22. C. M. Bishop and C. M. Roach, Rev. Sci. Instrum. 63, 4450 (1992).
23. C. M. Bishop, C. M. Roach, and M. G. von Hellerman, Plasma Phys. Control. Fusion 35, 765 (1993).
24. K. Funahashi, Neural Networks 2, 183 (1989).
25. G. Cybenko, Math. Control, Signals Syst. 2, 304 (1989).
26. K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
27. K. Hornik, Neural Networks 4, 251 (1991).
28. V. Y. Kreinovich, Neural Networks 4, 381 (1991).
29. A. R. Gallant and H. White, Neural Networks 5, 129 (1992).
30. Y. Le Cun et al., Neural Computation 1, 541 (1989).
31. J. F. Kolen and J. B. Pollack, in Advances in Neural Information Processing Systems (Morgan Kaufmann, San Mateo, CA, 1991), Vol. 3, p. 860.
32. C. M. Bishop, Neural Computation 4, 494 (1992).
33. H. Robbins and S. Monro, Ann. Math. Stat. 22, 400 (1951).
34. J. Kiefer and J. Wolfowitz, Ann. Math. Stat. 23, 462 (1952).
35. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. (Cambridge University Press, Cambridge, 1992).
36. J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimisation and Non-linear Equations (Prentice-Hall, Englewood Cliffs, NJ, 1983).
37. E. M. Johansson, F. U. Dowla, and D. M. Goodman, Int. J. Neural Syst. 2, 291 (1992).
38. D. F. Shanno, Math. Operations Res. 3, 244 (1978).
39. R. Battiti, Complex Syst. 3, 331 (1989).
40. D. S. Broomhead and D. Lowe, Complex Syst. 2, 321 (1988).
41. J. Moody and C. J. Darken, Neural Comput. 1, 281 (1989).
42. M. J. D. Powell, in Algorithms for Approximation, edited by J. C. Mason and M. G. Cox (Clarendon, Oxford, 1987).
43. C. A. Micchelli, Constructive Approx. 2, 11 (1986).
44. E. Hartman, J. D. Keeler, and J. Kowalski, Neural Comput. 2, 210 (1990).
45. J. Park and I. W. Sandberg, Neural Comput. 3, 246 (1991).
46. S. Chen, C. F. N. Cowan, and P. M. Grant, IEEE Trans. Neural Networks 2, 302 (1991).
47. J. MacQueen, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (University of California Press, Berkeley, CA, 1967), Vol. 1, p. 281.
48. G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering (Marcel Dekker, New York, 1988).
49. A. P. Dempster, N. M. Laird, and D. B. Rubin, J. R. Stat. Soc. B 39, 1 (1977).
50. G. H. Golub and W. Kahan, SIAM Num. Analysis 2, 205 (1965).
51. M. D. Richard and R. P. Lippmann, Neural Comput. 3, 461 (1991).
52. D. W. Ruck et al., IEEE Trans. Neural Networks 1, 296 (1990).
53. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
54. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. (Academic, San Diego, CA, 1990).
55. P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach (Prentice-Hall, Hemel Hempstead, UK, 1982).
56. D. J. Hand, Discrimination and Classification (Wiley, New York, 1981).
57. D. Lowe and A. R. Webb, Network 1, 299 (1990).
58. J. Moody, in Advances in Neural Information Processing Systems (Morgan Kaufmann, San Mateo, CA, 1993).
59. S. Geman, E. Bienenstock, and R. Doursat, Neural Comput. 4, 1 (1992).
60. A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-posed Problems (Winston, Washington, DC, 1977).
61. G. Wahba, Ann. Statist. 13, 1378 (1985).
62. C. M. Bishop, Neural Comput. 3, 579 (1991).
63. T. Poggio and F. Girosi, Proc. IEEE 78, 1481 (1990).
64. T. Poggio and F. Girosi, Science 247, 978 (1990).
65. C. M. Bishop, IEEE Trans. Neural Networks 4, 882 (1993).
66. M. Stone, Operationsforsch. Statist. Ser. Statist. 9, 127 (1978).
67. C. M. Bishop and G. D. James, Nucl. Instrum. Methods Phys. Res. A 327, 580 (1993).
68. R. E. Bellman, Adaptive Control Processes (Princeton University Press, Princeton, NJ, 1961).
69. I. T. Jolliffe, Principal Component Analysis (Springer, New York, 1986).
70. C. M. Bishop, P. Cox, P. Haynes, C. M. Roach, M. E. U. Smith, T. N. Todd, and D. L. Trotman, in Neural Network Applications, edited by J. G. Taylor (Springer, London, 1992), p. 114.
71. C. M. Bishop, P. Cox, P. Haynes, C. M. Roach, M. E. U. Smith, T. N. Todd, and D. L. Trotman, Neural Computation (to be published).
72. L. Allen and C. M. Bishop, Plasma Phys. Control. Fusion 34, 1291 (1992).
73. C. M. Bishop, I. Strachan, J. O'Rourke, G. Maddison, and P. Thomas, Neural Comput. Appl. 1, 4 (1993).
74. V. A. Morozov, Methods for Solving Ill-posed Problems (Springer, Berlin, 1984).
75. J. B. Lister and H. Schnurrenberger, Nucl. Fusion 31, 1291 (1991).
76. D. G. Sandler, T. K. Barrett, D. A. Palmer, R. Q. Fugate, and W. J. Wild, Nature 351, 300 (1991).
77. K. J. Åström and B. Wittenmark, Adaptive Control (Addison-Wesley, Redwood City, CA, 1989).
78. W. T. Miller, R. S. Sutton, and P. J. Werbos, Neural Networks for Control (MIT Press, Cambridge, 1990).
79. Handbook of Intelligent Control, edited by D. A. White and D. A. Sofge (Van Nostrand Reinhold, New York, 1992).
80. T. Kohonen, Self-Organization and Associative Memory, 3rd ed. (Springer, London, 1989).
81. R. P. Lippmann, Neural Comput. 1, 1 (1989).
82. A. Lapedes and R. Farber, in Neural Information Processing Systems, edited by D. Z. Anderson (American Institute of Physics, New York, 1988), p. 442.
83. D. J. C. MacKay, Neural Comput. 4, 415 (1992).
84. D. J. C. MacKay, Neural Comput. 4, 448 (1992).
85. D. J. C. MacKay, Neural Comput. 4, 720 (1992).
86. W. L. Buntine and A. S. Weigend, Complex Syst. 5, 603 (1991).
87. J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, CA, 1991).
88. R. Hecht-Nielsen, Neurocomputing (Addison-Wesley, Redwood City, CA, 1990).