
PHYSICS REPORTS (Review Section of Physics Letters) 207, Nos. 3—5 (1991) 215—259.

North-Holland

NEURAL NETWORKS AND APPLICATIONS TUTORIAL

I. GUYON
AT&T Bell Laboratories, Holmdel, NJ 07733, USA

Abstract:
The importance of neural networks has grown dramatically during this decade. While only a few years ago they were primarily of academic
interest, now dozens of companies and many universities are investigating the potential use of these systems and products are beginning to appear.
The idea of building a machine whose architecture is inspired by that of the brain has roots which go far back in history. Nowadays,
technological advances in computers and the availability of custom integrated circuits permit simulations of hundreds or even thousands of neurons.
In conjunction, the growing interest in learning machines, non-linear dynamics and parallel computation has spurred renewed attention to artificial
neural networks.
Many tentative applications have been proposed, including decision systems (associative memories, classifiers, data compressors and optimizers)
and parametric models for signal processing purposes (system identification, automatic control, noise canceling, etc.). While they do not always
outperform standard methods, neural network approaches are already used in some real world applications for pattern recognition and signal
processing tasks.
The tutorial is divided into six lectures that were presented at the Third Graduate Summer Course on Computational Physics (September 3—7,
1990) on Parallel Architectures and Applications, organized by the European Physical Society: (1) Introduction: machine learning and biological
computation. (2) Adaptive artificial neurons (perceptron, ADALINE, sigmoid units, etc.): learning rules and implementations. (3) Neural network
systems: architectures, learning algorithms. (4) Applications: pattern recognition, signal processing, etc. (5) Elements of learning theory: how to
build networks which generalize. (6) A case study: a neural network for on-line recognition of handwritten alphanumeric characters.

1. Introduction

1.1. Brains and machines

Computers offer more and more computational power but still perform poorly on very important
tasks that animals and/or humans can achieve easily:
• perceiving objects in natural scenes and noting their relations,
• controlling movements,
• retrieving information by associations,
• understanding natural language,
• elaborating strategies.
Similar lists can be found in many publications on machine learning, artificial intelligence or neural
networks. Are these problems inherently impossible to solve with computers, and, if so, why?
Computers are becoming extremely good at managing data bases, performing arithmetic and even
algebraic computations. These tasks being very mechanistic, some philosophers and scientists argue that
the limitations of machines lie in their “mechanical” way of processing information [1]: one of the
frequently used arguments is that computers which are built using the same principles as the Turing
machine fall under Gödel’s theorem of incompleteness. Other arguments raise the question of the
determinism of computers as opposed to the stochastic behavior of biological systems.
On the other hand, it is more and more widely admitted that the brain itself is a machine [33]. The
debate on whether thinking is produced by the same physical elements from which our body is built, or

rather by some different, maybe immaterial, source (the soul?), has roots which go far back in history.
Although as many as 2400 years ago Democritus was already conjecturing that the soul, that is the
“... residence of the principle and the rule of our actions”, is nothing but a part of our body like the
hands, the feet, the eyes, whether there is something different in essence between brains and machines is
still a central question.
By assuming that the brain is just a machine, we raise the hope that this machine could be imitated
by a computer. Our current knowledge about the central nervous system is far from sufficient to
achieve a faithful copy of the brain. While biologists are interested in modeling all the different levels of
the organization of the brain, from the molecular level to the behavioral level, physicists, mathemati-
cians, psychologists and engineers are primarily interested in schematic models that retain only essential
features enabling them to understand and reproduce some of the brain functions. A historical sketch of
the early days of the field can be found in ref. [36].

1.2. Biological modeling and engineering

What common features do real brains and artificial neural networks share?
In fig. 1 is reproduced a drawing of the visual cortex of the rat executed by Ramón y Cajal in 1888
[116]. Brain anatomy was already well known a century ago and Cajal’s neuro-anatomy book is still
a reference today. Though neuro-physiology has made a lot of progress this century [75], the way

Fig. 1. Drawing of the visual cortex of the rat (Ramón y Cajal, 1888 [116]).

information is processed in the brain is still largely under debate. Artificial neural networks retain only
very basic features of the brain that are commonly admitted: the parallel distributed processing and a
simple neuron model (the formal neuron). Yet, reasonably small systems of artificial neurons (1000
neurons with 100 connections per neuron, compared to about 10^11 neurons and about 10 000 connections per
neuron for the human brain) have interesting emergent properties also observed for the brain: learning
from examples and generalization, associative memory, tolerance to failures of neurons and/or con-
nection deletions.
The neuron models used by engineers are very simple and incorporate only a small amount of the
known characteristics of real neurons (fig. 2a [127]). Neurons receive information through a lot of
ramifications (dendrites) to which synapses from other neurons transmit the nervous influx: chemical
transmitters liberated at the synapses modify the membrane potential of the neuron; if the potential
exceeds a threshold, the neuron sends a nervous influx along its output fiber, the axon. Learning
consists of modifications of the synaptic efficacies according to correlations between the activity of the
pre-synaptic neuron and the post-synaptic neuron (Hebb’s principle, 1949 [57]). The formal neuron,
first introduced by McCulloch and Pitts in 1943 [98] (fig. 2b), simply performs a sum of its inputs,
weighted by coefficients (called synaptic coefficients) to obtain its total input (also called “potential”).
The output of the formal neuron is obtained by comparing the value of its potential to a threshold
value: +1 if it is over the threshold, —1 otherwise. Formal neurons learn by modifying their synaptic
coefficients with algorithms inspired by Hebb’s principle (see section 2). Many other characteristics of


Fig. 2. Real neuron versus formal neuron. (a) Real neuron (from “The neuron” by C. Stevens. Copyright © September 1979 by Scientific
American, Inc. All rights reserved. [127]). (b) McCulloch—Pitts formal neuron [98].

real neurons are sometimes introduced in formal neurons, including replacing the linear computation of
the potential by a more realistic function and the thresholding with a probabilistic decision. However,
most applications make use only of the simplest formal neuron.

1.3. Neuro-computers?

The word “neuro-computer” was recently introduced by the media to refer to special purpose
hardware and fast parallel processors that implement artificial neural networks. Figure 3 gives a rough
idea of neural network requirements in terms of speed and storage capacity [3]. The area given for the
applications should be understood as an indication of trends only. The biological networks (round dots)
and the VLSI networks (squares) are positioned with a precision of about one order of magnitude.
Because of the similarity between the formal neuron and a linear filter (they both perform a weighted
sum), the hardware technology of neural networks has a lot in common with that of linear filters.
Indeed, some of the custom integrated circuits of fig. 3 were designed to implement linear filters and
are mentioned for comparison.
Clearly, networks of formal neurons can be simulated quite easily on conventional sequential
computers. The training phase and the utilization phase do not have the same requirements though.
The design and the training of a neural network require a lot of flexibility, due to frequent modifications
of the architecture and of the training algorithm. On the contrary, the network in utilization phase only
needs to run fast. Hence, the training phase is almost always performed with software simulators
(Rochester, SN, Mac Brain, Neuralware, Nestor, etc.) running on PCs or workstations. Accelerator
boards for PCs are available from several neural network startup companies [HNC (Hecht—Nielsen
Neurocomputers), SAIC (Science Applications International Corporation), etc.], which sell jointly a

Fig. 3. Neural network requirements: speed versus weight storage (bits). (From DARPA [3].)
I. Guyon, Neural networks and applications tutorial 219

Table 1
VLSI versus biology. (From DARPA [3])

                           VLSI      Biology    Ratio
Clock rate (Hz)            10^8      10^3       10^5
Memory density (cm^-3)     10^7      10^12      10^-5

corresponding software simulator. Information regarding these products can be found in various neural
network journals, including: Neural Network Review, Neurocomputers, Intelligence, Neural Network
News. Fast simulation machines have been obtained by programming supercomputers (Warp systolic
array, Connection Machine, Hypercube), or by programming arrays of processors like transputers or
DSPs (Digital Signal Processors) [42, 133, 22].
The utilization phase is only in rare cases performed with PCs or workstations, with a few exceptions
[52]. DSPs provide a gain in speed that meets the requirements of some applications in control, speech
and character recognition [88]. But other applications, in particular in vision, require utilization of the
parallelism of neural architectures. For a review on existing special purpose hardware that can
implement neural networks, one can refer to the DARPA study [3], which also compares technologies
(optics, analog and digital VLSI). For analog VLSI designs, one can refer to refs. [100, 63]; for optical
computing to ref. [7]. The best chips available evaluate hundreds of giga-connections per second and
can implement more than 10 000 binary connections [49] (or, after reconfiguration, fewer connections but
with analog depth). Special purpose chips can be up to 10^4 times faster than workstations, though the
peak speed is rarely achieved in practice because of communication bottlenecks with host computers.
Silicon technology and biological technology do not have the same specifications (table 1). VLSI
memory densities are very low compared to those of biological memories (10^7 cm^-3 for VLSI versus
10^12 cm^-3 for neural systems). Conversely, the speed at which real neurons operate is quite slow
compared to speeds achieved by artificial neurons (biological clock rate 10^3 Hz versus VLSI clock rate
10^8 Hz). Therefore, processor multiplexing is necessary to make best use of VLSI technology. Some
artificial neural networks are in essence convolutional and can be implemented very naturally with
processor multiplexing (see for instance the TDNN in sections 3, 4 and 6). In general, technological
constraints make artificial neural network hardware look quite different from biological neural systems.

1.4. Neural network challenges

At the beginning of this introduction, we asked ourselves whether it would be possible to build an
“artificial brain”. The current state of the art does not allow an answer to be given yet. We are
currently limited more by our understanding of the brain and our algorithms that try to imitate brain
functions than by the available hardware that can implement these algorithms. The following pages,
while giving some insight into the neural network algorithms, can also help the reader measure the
distance between existing systems and “intelligent” artificial neural networks. The positive aspect is that
there is still a lot of work to be done!

2. Adaptive artificial neurons

In this section, we describe some elementary learning mechanisms for formal neurons. A lot of the
material of this section is covered by Kohonen’s book [79], by the book on Parallel Distributed
Processing [12] and the literature on linear discriminant functions [40, 130] and linear adaptive filters
[142]. The mathematical foundations of perceptrons can be found in Minsky and Papert’s book [102].

2.1. The formal neuron

The basic element of artificial neural networks is the formal neuron (also simply called neuron). Its
first version, shown in fig. 2b, was introduced in 1943 by McCulloch and Pitts [98] and is still very
widely used.
The formal neuron has binary inputs and outputs. It computes its “potential” v by performing a sum
of its inputs x_i weighted by the synaptic coefficients w_i. The potential is then compared to a threshold. If
it exceeds the threshold, the neuron output y is +1, otherwise, it is −1. For convenience, we set the
threshold to be 0 and compensate by setting an additional input x_0 to be 1, thus using the weight w_0 as a
bias value.
Networks of McCulloch—Pitts neurons can perform any boolean function, provided the weights are
set to adequate values. The proof is very simple: neurons can simulate 2-input NAND gates; 2-input
NANDs form a universal basis; hence, neurons form a universal basis.
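As an illustration of this argument (a modern Python sketch, not part of the original lectures), the fragment below implements a McCulloch—Pitts neuron on ±1 inputs, with a hand-set weight vector realizing a 2-input NAND; the particular weight values are one valid choice among many.

import numpy as np

def mcculloch_pitts(x, w):
    # McCulloch-Pitts formal neuron: y = sgn(w . x), with x[0] = 1 as bias input.
    v = np.dot(w, x)                    # the "potential"
    return 1 if v > 0 else -1           # threshold at 0

# Hand-set weights implementing a 2-input NAND on +/-1 inputs:
# y = sgn(1.5 - x1 - x2) is -1 only when x1 = x2 = +1.
w_nand = np.array([1.5, -1.0, -1.0])    # [bias weight w_0, w_1, w_2]

for x1 in (-1, 1):
    for x2 in (-1, 1):
        x = np.array([1.0, x1, x2])     # x[0] = 1 plays the role of the bias input
        print(x1, x2, "->", mcculloch_pitts(x, w_nand))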
Given that neural networks can implement any boolean function, the next question is: given a
desired function F, how do we pick a network (i.e. find the architecture and weights) that can
implement it? Strategies for designing architectures for given applications will be presented in sections
3, 4, 5 and 6. In this section, we present several training schemes for adapting the weights of one unit
during a supervised learning session, that is, when examples of (input, output) pairs are given.

2.2. How to train one formal neuron?

“When an action of cell A is near enough to excite cell B and repeatedly or persistently takes part in
firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency,
as one of the cells firing B, is increased.” (D. O. Hebb, 1949) [57].
This learning principle formulated by Hebb in 1949 is the essence of all training algorithms for neural
networks. One of its simplest versions as a supervised learning algorithm for training formal neurons
was proposed by Cooper et al. [34] and is known under the name of Hebb’s rule. It consists in
incrementing (respectively decrementing) the synaptic weight w_i by a fixed amount α > 0, every time
there is a coincidence (respectively anti-coincidence) between the pre-synaptic activation x_i^p and the
desired output d^p of the neuron,

Δw_i = α x_i^p d^p .   (1)

Other algorithms, such as the perceptron algorithm [119],

Δw_i = −α x_i^p (y^p − d^p) ,   (2)

and the Widrow—Hoff algorithm [141],

Δw_i = −α x_i^p (v^p − d^p) ,   (3)

are similar in spirit. For the perceptron algorithm, a modification of the weights occurs only if d^p is
different from y^p. For the Widrow—Hoff algorithm, the weight modification is modulated by the
difference between the desired output d^p and the potential v^p. Detailed explanations about these
algorithms and their differences can be found in the book of Duda and Hart [40].
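To make eqs. (1)—(3) concrete, here is a minimal Python sketch of the three update rules applied to a single unit (our own illustration, not code from the paper; x is assumed to include the bias input x_0 = 1):

import numpy as np

def hebb_update(w, x, d, alpha=0.1):
    # Hebb's rule, eq. (1): Delta w_i = alpha * x_i * d
    return w + alpha * x * d

def perceptron_update(w, x, d, alpha=0.1):
    # Perceptron rule, eq. (2): the weights change only when the output disagrees with d.
    v = np.dot(w, x)
    y = 1.0 if v > 0 else -1.0          # y = sgn(v)
    return w - alpha * x * (y - d)      # zero update when y == d

def widrow_hoff_update(w, x, d, alpha=0.1):
    # Widrow-Hoff (LMS) rule, eq. (3): Delta w_i = -alpha * x_i * (v - d)
    v = np.dot(w, x)                    # the potential, no thresholding
    return w - alpha * x * (v - d)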

2.3. The perceptron

The perceptron, as a “machine”, was introduced in 1957 by Rosenblatt [119]. It consists of a first
layer of units (not necessarily formal neurons) performing an initial computation that can be thought of
as a preprocessing, or a change in representation from {x_i} to {ξ_j} (fig. 4). The second layer consists of
a single formal neuron. The first layer is not adaptive, only the weights of the formal neuron in the
second layer are trained.
The purpose of the preliminary treatment of the first layer is to change the representation so that the
neuron in the second layer can learn the desired mapping. This neuron provides a linear decision
boundary,

w_0 + w_1 ξ_1 + w_2 ξ_2 + ··· + w_n ξ_n = 0 .

Therefore, the desired mapping can be learned if the problem is linearly separable in its intermediate
representation {ξ_j} (fig. 5). A detailed discussion about what perceptrons can do is provided in Minsky
and Papert’s book [102].
The perceptron learning algorithm [eq. (2)] that was initially invented to train the “machine”
perceptron described above, can be used to train McCulloch—Pitts neurons independently of any
preprocessing layer. This algorithm can be viewed as an on-line gradient descent algorithm (stochastic
gradient) minimizing the perceptron cost function,

E = Σ_{p “missed”} |v^p| .   (4)

This cost function is a continuous differentiable function which penalizes the bad answers (p “missed”)
over the training examples (p = 1, . . . , m). The perceptron learning algorithm converges only if the

Fig. 4. The perceptron (Rosenblatt, 1957 [119]).

Fig. 5. What can one perceptron do? Learn problems that are linearly separable in their intermediate representation. (a) A linearly separable
problem. (b) A non-linearly separable problem.

problem is linearly separable [102]. Other algorithms that converge even for non-linearly separable
problems have been proposed to minimize the perceptron cost function [40, 82].
Perceptrons have been used extensively in pattern recognition [40, 130]. Systems derived from the
perceptrons and using sigma-pi units (polynomial discriminant functions, high-order neural networks)
[112, 109, 47], or radial basis functions [28, 113, 118] are still commonly used.

2.4. The ADALINE

The ADALINE (ADAptive LINear Element) was introduced in 1959 by B. Widrow (fig. 6). It is a
simple linear unit that is associated with a particular training algorithm proposed by Widrow and Hoff
[141] [see eq. (3)]. This algorithm has been widely used in signal processing to train linear filters, linear
predictors, etc. [142], and can be used to train McCulloch—Pitts neurons as well.

Fig. 6. The ADALINE (Widrow, 1959 [141]).

Similarly to the perceptron learning algorithm, the Widrow—Hoff algorithm is an on-line (or
stochastic) gradient descent algorithm. It minimizes the mean-squared-error (MSE) cost function:

E = Σ_{p=1}^m (v^p − d^p)^2 .   (5)

The Widrow—Hoff algorithm is often also called LMS (Least Mean Squares algorithm).
Other algorithms for training linear units by minimizing the MSE can be found in refs. [40,
79].

2.5. The sigmoid unit

A large variety of units inspired by the early model of McCulloch and Pitts have been proposed in
the literature [12, 97, 134, 62]. Some of them introduce stochastic decisions. The Boltzmann unit [59],
for instance, is activated with probability

P(y = 1) = p(v) ,   P(y = −1) = 1 − p(v) ,

where p(v) = 1/[1 + exp(−βv)] follows the Boltzmann distribution. We will restrict ourselves to
deterministic units in the following however.
One of the most widely used units in the last few years is a slightly modified version of the
McCulloch—Pitts neuron: the sigmoid unit [121] (fig. 7). The threshold function is replaced by a
smoother activation function, typically a scaled hyperbolic tangent,

f(v) = a tanh(bv) ,   (6)

where a and b are positive parameters. Notice that, with a simple variable change, the sigmoid unit
gives the expected value of the Boltzmann unit having the same weights [60].
The sigmoid unit can be trained with the so-called “delta-rule”, which is nothing other than the
Widrow—Hoff or LMS algorithm applied to the sigmoid unit rather than to a linear unit,

Δw_i = −α x_i^p (y^p − d^p) f′(v^p) ,   (7)

Fig. 7. The sigmoid unit.



where y^p = f(v^p). This rule minimizes the mean-squared-error cost function,

E = Σ_{p=1}^m (y^p − d^p)^2 .   (8)
Networks of sigmoid units can be trained to approximate a large class of continuous valued functions
[37, 65, 32, 128, 44]. One of their advantages over McCulloch—Pitts neurons is that they provide a
differentiable function of their inputs which makes it possible to use gradient descent techniques to train
multilayer neural networks [121]. This will be described in section 3.
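A minimal sketch of the delta rule in Python (our illustration; the scaling constants a and b below are arbitrary positive choices, not values prescribed by the text):

import numpy as np

a, b = 1.716, 2.0 / 3.0                      # example values; any positive a, b work

def f(v):                                    # scaled hyperbolic tangent, eq. (6)
    return a * np.tanh(b * v)

def f_prime(v):
    return a * b * (1.0 - np.tanh(b * v) ** 2)

def delta_rule_update(w, x, d, alpha=0.05):
    # "Delta rule", eq. (7): Delta w_i = -alpha * x_i * (y - d) * f'(v)
    v = np.dot(w, x)
    y = f(v)
    return w - alpha * x * (y - d) * f_prime(v)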

3. Neural network systems

In section 2, we presented the formal neurons and mentioned that networks of such neurons can do
almost everything. But, we explained only how to train one such unit. In this section, we explain how a
network with arbitrary architecture can be trained.
The material of this section is partially covered by the books previously referenced [79, 121, 40].
Additional information can be also found in the textbooks on neural networks [137, 124, 10]. It is often
useful to refer directly to the original papers referenced in the text.

3.1. What are hidden units?

Some of the basic architectures (fig. 8) do not present particular difficulties. For instance, the
one-layer network can be considered as a set of independent neurons, having the same inputs. Notice
that we represent input units as squares to distinguish them from neurons, since they do not perform
any computation. Each neuron can therefore be trained independently, with one of the learning
mechanisms described in section 2 [eqs. (1), (2) and (3)]. Fully connected networks [96, 61], whose
neurons serve both as input and output units, can be trained similarly [108, 82, 39]. Conversely, the
multilayer feed-forward network is an example of a network that cannot be trained directly with the
learning rules of section 2.
Neurons that are neither input nor output units of the network play a particular role (fig. 9). They
are called “hidden” units. During supervised learning, desired values are provided to the output units
only. Output units can be trained with one of the algorithms mentioned in section 2, but what about
hidden units? When there is an error at the output of the network, hidden units are also “responsible”
for this error. How do they share this responsibility, that is, how to penalize (or reward) them? This
problem, known as the credit assignment problem, has found several solutions that will be described in
this section.

3.2. Supervised learning and back-propagation

In fig. 10, we present a more general way of stating the problem of supervised learning than in the
previous section. Supervised learning consists of minimizing a cost function E that cumulates the errors
e^p between the actual outputs y^p of the system and the desired outputs d^p, for given inputs {x_i^p},
i = 0, . . . , n. The mean-squared-error is a very commonly used cost function, but several other ones
have also been proposed and used [58, 128, 55].



Fig. 8. Basic architectures: one-layer, fully connected, and feed-forward networks.

Fig. 9. General architecture: input units, hidden units and output units.



Fig. 10. Supervised learning. (a) The adaptive system (ANN) receives inputs x^p and produces outputs y^p, which are compared to the desired outputs d^p to give the error e^p. (b) Pattern p squared error: e^p = Σ_i (y_i^p − d_i^p)^2, p = 1, . . . , m; cost function: E = Σ_p e^p; learning: min_W E.

In fig. 11, we present two very classical ways of performing the minimization of E: Monte Carlo
methods and gradient descent. Though gradient descent methods are known to lead to suboptimal
solutions (local minima of E), they are usually preferred over Monte Carlo calculations (such as
simulated annealing [78]) for computational time reasons. All recent work on neural networks shows
that gradient descent methods provide very acceptable solutions.
The error back-propagation algorithm, derived by several authors [138, 105, 85, 120], provides an
easy and elegant way of performing on-line (or stochastic) gradient descent to train neural networks
(fig. 12). The outputs of all neurons are first computed during the forward propagation pass. The errors

Fig. 11. Classical optimization methods. (a) Monte Carlo calculation: make a random elementary step in weight space; compute ΔE; accept the new set of weights with probability P = 1/[1 + exp(βΔE)]. (b) Gradient descent: Δw_ij = −α ∂E/∂w_ij, α > 0. N.B. On-line gradient descent (stochastic gradient): Δw_ij = −α ∂e^p/∂w_ij.



Fig. 12. The error back-propagation algorithm for a feed-forward layered network: an algorithm to perform on-line gradient descent. Example: feed-forward network and mean squared error, E = Σ_p e^p, e^p = Σ_{j output} (y_j^p − d_j^p)^2, p = 1, . . . , m. A forward propagation pass is followed by a backward propagation pass. Weight update: Δw_ij = −α δ_j y_i, α > 0, where y_j = f(v_j) and
δ_j = (y_j − d_j) f′(v_j) if j is in the output layer,
δ_j = f′(v_j) Σ_k w_kj δ_k otherwise.

of the output units are computed by comparing the outputs to the desired outputs. The errors of the
hidden units are computed during the back-propagation pass. A given hidden unit j does not have a
desired output value; its error is proportional to a weighted sum of the errors of its successors (units k
receiving the output of j). The coefficients of the weighted sum are the w_kj measuring the synaptic
strength of the connection from j to k. The derivation of this algorithm [120] was made possible by the use of
a non-linear function which is differentiable. The most widely used such function is the hyperbolic
tangent [eq. (6)].
Back-propagation is a simple algorithm, but its speed of convergence can vary by several orders of
magnitude if it is not used properly. A few practical suggestions will be given in section 6. Other, more
elaborate, optimization algorithms from the gradient descent family have been applied to neural
networks [106, 20], ensuring better or faster convergence.
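The following Python sketch (ours, not code from the original lectures) spells out one on-line back-propagation step for a two-layer tanh network, together with a toy XOR training loop; the initialization scale and learning rate are arbitrary choices:

import numpy as np

def f(v):                                    # tanh activation, eq. (6) with a = b = 1
    return np.tanh(v)

def f_prime(v):
    return 1.0 - np.tanh(v) ** 2

def backprop_step(x, d, W1, W2, alpha=0.1):
    # W1: (hidden, inputs+1); W2: (outputs, hidden+1); column 0 holds the bias weights.
    x = np.concatenate(([1.0], x))           # x_0 = 1 bias input
    v1 = W1 @ x                              # forward propagation pass
    h = np.concatenate(([1.0], f(v1)))
    v2 = W2 @ h
    y = f(v2)
    delta2 = (y - d) * f_prime(v2)           # backward pass: output deltas first
    delta1 = f_prime(v1) * (W2[:, 1:].T @ delta2)   # hidden errors, weighted by w_kj
    W2 -= alpha * np.outer(delta2, h)        # on-line gradient descent updates
    W1 -= alpha * np.outer(delta1, x)
    return 0.5 * np.sum((y - d) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 3))      # 3 hidden units, 2 inputs + bias
W2 = rng.normal(scale=0.5, size=(1, 4))
data = [([-1, -1], [-1]), ([-1, 1], [1]), ([1, -1], [1]), ([1, 1], [-1])]
for epoch in range(2000):                    # learn XOR on +/-1 patterns
    for x, d in data:
        backprop_step(np.array(x, float), np.array(d, float), W1, W2)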

3.3. Unsupervised learning

So far, we described only supervised learning schemes, that is learning given examples of input—
output pairs. There are classes of problems where desired outputs (targets, class label, etc.) are
unknown. In fig. 13a, the problem of designing a classifier with unsupervised learning is described [130].
The cost function involves discrete variables (class labels) and cannot be optimized with standard
gradient descent. Other techniques are used, known as “clustering” algorithms. The well known
K-means clustering method is given as an example [40, 130] (fig. 13b).
Fig. 13. Unsupervised learning. (a) Unsupervised classification: classification problem, classes undefined; X = {x^p}, set of samples, p = 1, . . . , m; C = {c_p}, corresponding class labels, c_p = 1, . . . , K; the cost function E(X, C) involves discrete variables (the class labels), hence a clustering algorithm is used. (b) Example: K-means clustering, E = Σ_{p=1}^m ||x^p − μ_{c_p}||^2; begin with the centroids μ_k set at random; assign each sample to the closest centroid; recompute the centroids; stop when the classification does not change.

Fig. 14. Topology preserving mapping (Kohonen, 1980 [79]). Similarity matching (competition, find j*): ||x^p − w_{j*}|| = min_k ||x^p − w_k||, k = 1, . . . , K. Updating (cooperation): Δw_k = α (x^p − w_k), α > 0, for k ∈ N_{j*}, a topological neighborhood of j* that shrinks during the different phases of training.
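A minimal Python sketch of the K-means procedure of fig. 13b (our illustration; initializing the centroids on random samples is one common choice):

import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    # Minimize E = sum_p ||x^p - mu_{c_p}||^2 by alternating assignment and update.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]    # initial centroids: random samples
    for _ in range(n_iter):
        # Assign each sample to the closest centroid.
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        new_mu = np.array([X[c == k].mean(0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                 # stop when the classification is stable
            break
        mu = new_mu
    return c, mu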

Several unsupervised learning algorithms for neural networks have been proposed [79, 15, 93, 31].
We choose to explain one of these algorithms, known by the name “Kohonen maps” (fig. 14). It is
inspired by models for the generation of topological cortical maps. In these models, neurons are
arranged in a two-dimensional array, with long range inhibitory interactions (neurons in competition)
and short range positive interactions (neurons in cooperation). Therefore, neurons sensitive to similar
stimuli tend to become neighbors in this two-dimensional topology.
Kohonen’s procedure [79] reproduces this competition/cooperation scheme. After training with this
algorithm, similar input vectors activate neighboring units (topology preserving mapping). The method
is similar to a nearest neighbor technique of classification, using peculiar prototypes: the weights of a
given neuron are the coefficients of a corresponding prototype; neurons compute a Euclidean
distance between their input and the prototype (which differs from the computation usually
performed by formal neurons); the decision is taken according to the maximum output.
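A compact Python sketch of this competition/cooperation scheme (ours; the grid size, learning rate and fixed neighborhood radius are simplifying assumptions — Kohonen's schedules shrink both during training):

import numpy as np

def train_kohonen_map(X, grid=(8, 8), n_epochs=20, alpha=0.2, radius=2):
    # A 2D array of units; each unit k holds a prototype w_k.
    rng = np.random.default_rng(0)
    H, Wd = grid
    w = rng.normal(size=(H, Wd, X.shape[1]))
    coords = np.dstack(np.mgrid[0:H, 0:Wd])          # grid position of each unit
    for _ in range(n_epochs):
        for x in X:
            # Competition: find the unit whose prototype is closest (Euclidean distance).
            d = ((w - x) ** 2).sum(-1)
            winner = np.unravel_index(np.argmin(d), d.shape)
            # Cooperation: update the winner and its topological neighborhood.
            dist = np.abs(coords - np.array(winner)).max(-1)   # Chebyshev grid distance
            mask = (dist <= radius)[..., None]
            w += alpha * mask * (x - w)
    return w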

3.4. Semi-supervised learning

As can be inferred from the previous examples of training algorithms, learning is nothing but the
minimization of a cost function calculated with the training examples. For supervised learning, the cost
function is an increasing function of the disparity between outputs of the system and desired outputs.
For unsupervised learning, desired outputs are not specified, the cost function involves only similarities
between patterns, according to some arbitrary similarity criterion.
In some intermediate cases, outputs are not specified, but the effect of these outputs on a system of
interest can actually be measured and turned into a cost to be minimized (fig. 15). Assume that I want

Fig. 15. Semi-supervised learning with unknown targets: the targets for the outputs y are unknown; goal = min expected value of e(y), measured on a system of interest.

Fig. 16. Learning Vector Quantization (Kohonen, 1986 [79]). Each class has several reference patterns. Similarity matching (find j*): ||x^p − w_{j*}|| = min_k ||x^p − w_k||, k = 1, . . . , K. Updating: Δw_{j*} = +α (x^p − w_{j*}) if x^p belongs to the class of j*; Δw_{j*} = −α (x^p − w_{j*}) if x^p does not belong to the class of j*.

to train a neural network to play tennis. But I am a very poor professor since I do not play tennis
myself; I know only the rules! The network is given no information about desired outputs (the right
trajectories of the racket), but it is told when it has lost.
Under certain conditions, neural networks can learn how to achieve such tasks with the back-
propagation algorithm. Errors can be propagated through the system (or through a model of the
system) to the neural network. The only requirement is that the system (or the model of the system) is a
differentiable function [139].
Another example of a semi-supervised learning algorithm applied to a classification problem is given
in fig. 16: LVQ (Learning Vector Quantization) [79]. Several units are designated to become the
representatives of a given class, but desired outputs are not given. Imagine that you have non
homogeneous classes: for handwritten digits for instance, different styles of 7, with or without cross bar.
The system is given the freedom to define by itself subclasses for the different variants. This is achieved
by reinforcing units that give maximum response for a given input, and have the right corresponding
class label, while penalizing the units that give maximum response but have the wrong class label.
Other semi-supervised learning algorithms include reinforcement learning [143]. Back-propagation
networks can be trained to perform operations similar to principal component analysis, which can be
thought of as a semi-supervised learning: the targets are identical to the inputs, the outputs of interest
are given by hidden units [26].
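A minimal Python sketch of the reinforce/penalize scheme just described (our illustration of the basic LVQ step; it omits the refinements of LVQ2):

import numpy as np

def lvq_update(prototypes, labels, x, c, alpha=0.05):
    # Move the closest prototype toward x if its class label matches c (reinforce),
    # away from x otherwise (penalize).
    j = np.argmin(((prototypes - x) ** 2).sum(-1))   # unit giving maximum response
    sign = +1.0 if labels[j] == c else -1.0
    prototypes[j] += sign * alpha * (x - prototypes[j])
    return prototypes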

3.5. Time-varying signals

The importance of time-varying signals justifies giving them a separate subsection, although the
principles described next can be directly derived from the previous discussion.
A time-delay neuron (fig. 17) is simply a neuron receiving inputs that have been delayed in time. A

Fig. 17. Introducing delays: the time-delay neuron (a simple neuron whose inputs are presented together with delayed copies).

Fig. 18. Introducing delays: NetTalk (Sejnowski and Rosenberg [122]). Input: codes of 7 letters of the text to be read (left context, letter to pronounce, right context); hidden units (time-delay neurons); output: phoneme.

simple example of the use of time-delay neurons is shown in fig. 18 [122, 123]. The task is to associate a
letter (appropriately coded) with a corresponding phoneme (also appropriately coded). The text is
scanned by the network (called NetTalk). Each time-delay unit has access to the right and left context,
in addition to the code of the letter to be pronounced. The second layer is fully connected and does not
involve time-delay neurons. After training with the back-propagation algorithm, the network is able to
“read” a text on which it was not trained, with reasonable accuracy.
A more complicated case is shown in fig. 19. The outputs of the first layer of time-delay neurons are
themselves delayed. The second layer is also composed of time-delay neurons, etc. Notice that the
delays can be different at each layer. Such a network is called TDNN (Time Delay Neural Network)
[84, 136]. The network in fig. 19a can also be represented in an alternate way (fig. 19b). The input
representation is unfolded in time. Each time-delay neuron (inset in fig. 19b) is physically replicated
along the time axis (instead of the same neuron being reused in time, as data come in). The result is a
feed-forward network that can be trained with standard back-propagation, with the constraint that
repeated neurons have identical weights (they are indeed the same neuron). This procedure is also
called “weight sharing”. It is realized in practice by averaging the weight updates of the neurons sharing
the same weights.
Examples of applications of TDNNs are given in section 4 and the details of the design of a TDNN
for a particular application are given in section 6.
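The two ingredients of a TDNN — a layer of time-delay neurons and weight sharing — can be sketched in Python as follows (our notation; a full implementation would back-propagate through the convolution rather than receive per-copy gradients):

import numpy as np

def tdnn_layer(x, kernel):
    # One layer of time-delay neurons as a 1D convolution.
    # x: (T, n_features); kernel: (delay_window, n_features). The same neuron is
    # replicated along the time axis, i.e. the kernel weights are shared.
    T = x.shape[0]
    D = kernel.shape[0]
    return np.array([np.tanh(np.sum(kernel * x[t:t + D]))
                     for t in range(T - D + 1)])

def shared_weight_update(per_copy_grads, alpha=0.1):
    # Weight sharing: average the updates of all replicated copies of the neuron,
    # so that they keep identical weights after the step.
    return -alpha * np.mean(per_copy_grads, axis=0)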

3.6. Introducing feedback

Numerous signal processing applications require models that incorporate feedback. One way of
representing feedback nets is the canonical representation illustrated in fig. 20 [110]. This representa-
tion is convenient for explaining how to use back-propagation to train feedback nets, but it is not
unique and other representations can be used as well as other training algorithms [8, 111, 144, 107].

Fig. 19. Introducing delays: Time Delay Neural Network (Lang and Hinton [84], Waibel [135]). (a) Introduction of delays in the hidden layers: each neuron has full connections on a restricted input field, and the data flow through the delays. (b) The input representation unfolded in time: the outputs of different neurons form one feature vector, and the same neuron is repeated along the time axis.

Once the network is in the canonical representation, the feedback can be replaced by a “physical”
replication of the network (fig. 21). The number of copies determines the horizon on which the network
will be trained. Training is performed with back-propagation “through time”, that is, back-propagation
on the extended network containing all the copies and with the constraint that repeated identical
neurons share the same weights.
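The following Python sketch (ours) makes back-propagation through time concrete for the smallest possible case, a single recurrent tanh unit: the unit is unrolled into one copy per time step, and the gradients of the shared weights are accumulated over all copies.

import numpy as np

def bptt_scalar(w_in, w_rec, xs, ds, alpha=0.01):
    # Recurrent unit z(t) = tanh(w_in*x(t) + w_rec*z(t-1)), replicated once per
    # time step; the copies share w_in and w_rec, so their gradients are summed.
    T = len(xs)
    z = np.zeros(T + 1)                     # z[0] is the initial state
    for t in range(T):                      # forward through all copies
        z[t + 1] = np.tanh(w_in * xs[t] + w_rec * z[t])
    g_in = g_rec = 0.0
    dz = 0.0                                # gradient flowing back through the state
    for t in reversed(range(T)):            # backward through all copies
        dz += z[t + 1] - ds[t]              # error injected at this copy's output
        dv = dz * (1.0 - z[t + 1] ** 2)     # through the tanh
        g_in += dv * xs[t]                  # shared-weight gradients accumulate
        g_rec += dv * z[t]
        dz = dv * w_rec                     # propagate to the previous copy
    return w_in - alpha * g_in, w_rec - alpha * g_rec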

Fig. 20. Introducing feedback. (a) Canonical representation: inputs, state variables and outputs, with unit delays in the feedback loop. (b) Example of a feedback network. (c) Same network as (b), in canonical representation. (From Personnaz et al. [110].)

Fig. 21. Introducing feedback: back-propagation through time. The network is replicated into successive copies z(0), z(1), z(2), z(3), and back-propagation is applied to the extended network. (From Personnaz et al. [110].)

4. Applications

This section is devoted to the description of a few examples that illustrate applications of the
algorithms described in section 3. As mentioned in section 1, special custom chips for neural networks
and linear filters are almost interchangeable, because of the similarity between the formal neuron
and linear filters. Applications of neural networks also reflect this similarity, and in particular image
processing with neural networks is very reminiscent of linear signal processing (convolutions).
Table 2 lists the main areas of application of neural networks. A large variety of examples can be
found easily in the proceedings of the three large annual neural network conferences: IJCNN
(International Joint Conference on Neural Networks) [69, 70], INNC (International Neural Network
Conference) [71] and NIPS (Neural Information Processing Systems) [132]. Other examples can be
found in the proceedings of various conferences: ICASSP (International Conference on Acoustic
Speech and Signal Processing) [68], ICPR (International Conference on Pattern Recognition) [67], ISIC
(International Symposium on Intelligent Control) [66].

4.1. Fixed weights

In sections 2 and 3, we focussed on the determination of the weights by learning. However, some
techniques involving formal neurons do not require training by example. The weights are set a priori by
the designer. Such is the case of well known techniques in image, speech and signal processing that
involve matched filtering, feature extraction, classification, and other filtering operations. Their
similarities with neural network computations render possible their implementation on a neural network
chip [72, 49]. Other applications with user predefined weights are more specific to neural networks:
combinatorial optimization networks [73] and other ad hoc networks [64].
Some examples of convolutional kernels commonly used in image processing and pattern recognition
are shown in fig. 22. Operations of skeletonization (fig. 23) and feature extraction (fig. 24) were

Table 2
Applications of neural networks

No learning (fixed weights): image/speech processing and pattern recognition (matched filtering, feature extraction, classification); signal processing (filtering); combinatorial optimization (traveling salesman); ad hoc networks (A/D converter).

Supervised learning: image/speech processing and pattern recognition (adaptive filtering, classification, PDF estimation); signal processing (adaptive filtering, system identification, control, prediction); statistical data analysis (PDF estimation, curve fitting); diagnosis (trained expert system).

Semi-supervised learning: image/speech processing and pattern recognition (learning vector quantization, coding, adaptation to individual styles); control (robotics, congestion control); statistical data analysis (principal components); function optimization (reinforcement learning).

Unsupervised learning: image/speech processing and pattern recognition (vector quantization, coding, data compression, clustering); combinatorial optimization (traveling salesman).

Fig. 22. Convolution of kernels in image processing (example: Laplacian operator).
Fig. 23. Skeletonization (skeletonization kernel).
Fig. 24. Feature extraction (example: end-of-line detector).

performed using various such kernels on a neural network chip [72] that can convolve in parallel up to
50 (7 x 7) ternary kernels at a rate of 0.5 million pixels per second. A more recent reconfigurable chip
[49] can perform similar operations with different kernel sizes and weight precision [for instance
convolutions of up to 64 (16 x 16) ternary kernels can be performed in parallel, at a rate of 2 million
pixels per second].
Machines designed to perform very fast weighted sums such as neural network chips can also
implement classical classification algorithms based on pattern matching (nearest neighbors, Parzen
windows, potential functions, etc. [40]) that are close parents of “neural” techniques (Kohonen maps
and LVQ [79], RCE [117]). These pattern matching algorithms can sometimes be reduced to a network
of formal neurons (fig. 25 [53]), or of radial basis functions [28].
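For illustration, a fixed-weight convolution can be written directly as a network of formal neurons (a Python sketch of the principle only; the ternary kernel below is an illustrative vertical-edge detector, not a kernel from the cited chips):

import numpy as np

def formal_neuron_convolution(img, kernel, threshold=0.0):
    # Each output pixel is a formal neuron: it computes a weighted sum of the
    # pixels under the kernel and thresholds the result.
    H, W = img.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1), dtype=int)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            v = np.sum(kernel * img[i:i + k, j:j + k])
            out[i, j] = 1 if v > threshold else 0
    return out

# Illustrative ternary kernel (weights in {-1, 0, +1}): a vertical-edge detector.
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])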

4.2. Supervised learning

A lot of applications of neural networks simply use a multilayer feed-forward network with full
connection between layers. This simple architecture already yields results that are better than other

Fig. 25. Classification. (a) Nearest neighbors. (b) Parzen windows.

Fig. 26. Handwritten digit recognition with a convolutional network (Le Cun et al. [88]). 256 input units; layer H1: 12 x 64 = 768 hidden units, ~20 000 links from 12 kernels 5 x 5; layer H2: 12 x 16 = 192 hidden units, ~40 000 links from 12 kernels 5 x 5 x 8; layer H3: 30 hidden units, fully connected, ~6000 links; 10 output units, fully connected, ~300 links.

methods in a number of cases (prediction of the secondary structure of proteins [115], prediction of
water consumption [30], classification of sonar signals [48], autonomous vehicle navigation [114],
playing backgammon [129]), or results that are comparable in performance and accuracy but better in
robustness (image compression [35], control of an active magnetic bearing system [23]).
We present here two examples of more elaborate architectures involving weight sharing: the
handwritten digit recognizer of Le Cun et al. [88] (fig. 26) and the TDNN of Waibel et al. [135] (fig.
27). These networks are both convolutional networks.
The digit recognizer (fig. 26) performs consecutively:
• 2D convolutions of 12 (5 x 5) kernels in layer H1, resulting in 12 feature maps,
• 2D convolutions of 12 (5 x 5 x 8) kernels in layer H2, on some of the previously obtained maps,
• classification in layer H3 and the output layer, which are fully connected.
The entire network is trained with back-propagation, with the constraint that neurons associated with
the same kernel share the same weights. Various other networks have been proposed for character
recognition [43, 29, 53].

Fig. 27. Phoneme classification with a TDNN (Waibel et al. [136] © 1989 IEEE). Input layer: 15 frames at a 10 msec frame rate; Hidden Layer 1 and Hidden Layer 2 are built from time-delay units; the output layer integrates the evidence over time.

The phoneme classifier (fig. 27) works very similarly, except that 2D convolutions are replaced by
1D convolutions. The architecture is that of a TDNN as described in section 3. Many other
architectures specific to speech recognition problems have been proposed [54, 21, 27, 83, 92]. For a
review of applications of neural networks in speech recognition and speech processing one can refer to
ref. [95]. A collection of papers on neural networks applied to speech can be found in a special issue of
Speech Communication [5].

4.3. Semi-supervised learning

Applications of semi-supervised learning have been demonstrated in robotics and control (with
back-propagation) [17, 74, 103], in function optimization (with reinforcement learning) [143] and in
classification (with LVQ) [79].
We present an example of a classical control problem: the broom balancer (fig. 28). The neural
network must provide a sequence of actions which maintain the inverted pendulum in equilibrium. In

Fig. 28. Control of a broom balancer (Jordan and Jacobs [74]). The controller receives the state variables (|θ|, sgn(θ), etc.) and its action unit drives the plant; a forward model of the plant, together with a temporal difference unit, provides the training signal.

the solution of Jordan and Jacobs [74], an auxiliary network (the forward model) is first trained to
emulate the dynamics of the system and provides a differentiable model. After the forward model is
trained, its weights are frozen. Then the neural network controller itself is trained. No desired actions
are given to the controller (as would happen during a supervised learning session). Instead, the system
is given the information on whether it succeeded or failed in maintaining the pendulum in equilibrium
and errors are back-propagated through the forward model to train the controller.
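The scheme can be sketched as follows (a linearized toy in Python, our own construction and notation, not the architecture of ref. [74]): the cost is differentiated through the frozen forward model, and only the controller weights move.

import numpy as np

def controller_step(x, W_c, M, alpha=0.01):
    # Controller: action u = tanh(W_c @ x).  Frozen forward model: s = M @ [x, u]
    # predicts the next state.  Cost e = ||s||^2 (stay near equilibrium).
    u = np.tanh(W_c @ x)
    z = np.concatenate([x, u])
    s = M @ z                                # forward model prediction
    ds = 2.0 * s                             # de/ds
    du = M[:, len(x):].T @ ds                # error back-propagated through the model
    dv = du * (1.0 - u ** 2)                 # ... and through the controller's tanh
    W_c -= alpha * np.outer(dv, x)           # only the controller weights are updated
    return W_c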
Another example of semi-supervised learning is given in fig. 29. It is a variant of the Learning Vector
Quantization algorithm (LVQ2 [81]), applied by McDermott and Katagiri [99] to the same task of
phoneme recognition as the TDNN described above. The network shown in fig. 29 consists of:
• one layer which performs a convolution with the 112 (16 x 7) kernels defined by LVQ2,
• one layer which picks the maximum activation (corresponding to minimum Euclidean distance) of
the kernels, in each category, at each time step,
• one layer which averages over time the outputs of each category.

4.4. Unsupervised learning

Unsupervised learning techniques such as Kohonen maps have found applications for instance in
pattern recognition [79] and combinatorial optimization [13].
Kohonen has used his topological map technique to perform classification of Finnish phonemes [79]
(fig. 30). This map has been incorporated in a larger system functioning as a phonetic typewriter.

Fig. 29. Phoneme recognition using LVQ2 (McDermott and Katagiri [99] © 1989 IEEE). Input: 112 values (16 melscale filterbank coefficients, 15 frames, 10 msec frame rate); one layer convolves the 112 (16 x 7) kernels defined by LVQ2; the next layer picks, for each category (/b/, /d/, /g/), the maximum activation at each time step; the final activations are obtained by averaging over time.

Fig. 30. Topological map of Finnish phonemes (Kohonen [79]). (a) Units giving maximum answer for each phoneme. (b) Phonemes for which given units give maximum answer.

4.5. Comparisons

In the following, we present some comparisons between networks and comparisons with reference
methods.
An evaluation of a large variety of methods has been made by Lee and Lippmann [91] (fig. 31) on
several classification problems. As far as accuracy is concerned, the results show that most classifiers
perform similarly, except when there is a clear mismatch between the task and the classifier. Notice that
in this study, however, neural networks were not particularly carefully designed for the tasks addressed.

Fig. 31. Comparison of various classifiers (Lee and Lippmann [91]). (a) The problems: BULLSEYE (dimensionality 2, training set size 500, testing set size 500, 2 classes); DISJOINT (dimensionality 2, training set size 500, testing set size 500, 2 classes); DIGIT (dimensionality 22 cepstra, training set size 70, testing set size 112, 16 training and 16 testing sets, 7 digits, talker dependent); VOWEL (dimension 2 formants, training set size 338, testing set size 330, 10 vowels, talker independent). (b) The performance on test examples.
Fig. 31 (contd.). (b) Error rates of the various classifiers on the BULLSEYE, DISJOINT, DIGIT and VOWEL problems.

Another evaluation was done by Guyon et al. on a handwritten digits task [53] (fig. 32) showing that
well designed neural networks outperform standard classical methods in accuracy while also requiring
less storage capacity. On the other hand, poorly designed networks can also perform badly and waste
storage resources.
The importance of the design of the network is also illustrated in the study of Le Cun [86] (fig. 33).
The task evaluated here is again that of handwritten digit recognition. Feed-forward networks fully
connected between layers (Net-i and Net-2) are compared to a network with local connections between
layers (Net-3) and to convolutional networks (Net-4 and Net-5). Convolutional networks both perform
better and require less storage capacity, since a lot of neurons share the same kernel (the number of
weights is much smaller than the number of links).
Finally, several speech recognition techniques are compared in tables 3 and 4. The TDNN is
compared to HMM (Hidden Markov Models) in table 3 [135], showing that on this particular task the
TDNN performs better than HMM. The LVQ is compared to K-means clustering and TDNN in table 4
[99]. LVQ outperforms both methods. Recent work on hybrid methods combining HMM and LVQ [77],
HMM and hidden-control networks [92] or LVQ and TDNN [45], shows that it is possible to obtain even
better performance by taking the best of different methods.

Fig. 32. Handwritten digit recognition: benchmark studies (Guyon et al. [53]). (a) Storage requirement. (b) Classification performance on the test set (well classified, rejected, misclassified). Methods compared: nearest neighbors, Parzen windows, one-layer net (grandmother cells), fully connected dynamic net, one-layer net separating classes 2 by 2, two-layer net; input representations: pixel maps (pm), feature maps (fm), polynomial expansion (pe).

Fig. 33. Handwritten digit recognition: comparison of five network architectures (Le Cun [86, 87]). (a) Net-1: single layer; Net-2: twelve hidden units fully connected; Net-3: two hidden layers locally connected; Net-4: two hidden layers locally connected, convolutional constraints in the first hidden layer; Net-5: three hidden layers, two of them with local connections and convolutional constraints. (b) Performance on the test set (all nets obtain 100% correct on the training set):

network architecture             links    weights    performance
single layer network (Net-1)      2570       2570       80 %
two layer network (Net-2)         3240       3240       87 %
locally connected (Net-3)         1226       1226       88.5 %
constrained network (Net-4)       2266       1132       94 %
constrained network 2 (Net-5)     5194       1060       98.4 %

Table 3
Phoneme recognition, comparison of methods: TDNN versus HMM (Waibel [135])

Speaker   Tokens     TDNN errors   TDNN rate (%)        HMM errors   HMM rate (%)
MAU       b (227)         4          98.2                   18         92.1
          d (179)         3          98.3  (all: 98.8)       6         96.7  (all: 92.9)
          g (252)         1          99.6                   23         90.9
MHT       b (208)         2          99.0                    8         96.2
          d (170)         0          100   (all: 99.1)       3         98.2  (all: 97.2)
          g (254)         4          98.4                    7         97.2
MNM       b (216)        11          94.9                   27         87.5
          d (178)         1          99.4  (all: 97.5)      13         92.7  (all: 90.9)
          g (256)         4          98.4                   19         92.6

Table 4
Phoneme recognition. Comparison of methods: LVQ2 versus K-means and TDNN (McDermott
and Katagiri [99])

        No. of errors/    LVQ2         LVQ2      K-means   TDNN
Task    No. of tokens     % correct    total %   total %   total %
b       2/227             99.1
d       0/179             100          99.2      78.7      99.0
g       3/252             98.8
p       6/15              60.0
t       0/440             100          98.9      95.7      98.7
k       5/500             99.0
m       4/481             99.2
n       7/265             97.4         98.8      83.7      96.6
N       4/488             99.2
s       4/538             99.3
sh      0/316             100          99.4      98.8      99.3
h       0/207             100
z       3/115             97.4
—       —                 —            100       100       100
r       0/722             100
w       1/78              98.7         99.6      99.2      99.9
y       3/174             98.3
a       0/600             100
i       2/600             99.7
u       14/600            97.7         99.1      96.7      98.6
e       6/600             99.0
o       4/600             99.3

5. Elements of learning theory

In this section, we make a synthesis of recent work in learning theory. Several questions have been
addressed in the past few years by theorists, including:
(1) What class of problems can neural networks learn?
(2) How can neural networks be trained?
(3) Under what conditions do neural networks generalize?
We have already partially answered the first two questions:
(1) Neural networks can learn almost everything [37, 65, 32, 128, 44] (see section 2).
(2) Algorithms have been proposed for training neural networks by example (see section 3).
In principle, one can always find a neural network that can solve a given problem, provided that
there is no restriction on the size of the network and that an infinite amount of data is available. In
practice one has to deal with a limited amount of resources, and has to rely on the generalization
abilities of the network. In this section, we examine this last question of generalization and extract
guidelines from learning theory to choose an architecture and to improve the training efficiency.
The approach we will describe uses the tools of statistical physics. Details and proofs about these
results are given in refs. [38, 131, 126]. Other approaches leading to similar conclusions include the
worst case analysis [6, 18], and other studies of mathematical statistics [16, 140]. See also the
proceedings of the Second Annual Workshop on Computational Learning Theory [41].

5.1. How to design a network?

The main difference between the classical approach of Artificial Intelligence (AI) and Neural
Networks (NN) is that AI requires considerably detailed programming whereas NN rely heavily on
learning. To some extent, neural networks can be considered as a set of new statistical estimators that
can perform probability density estimation and function approximation (curve fitting). Their advantage
over other statistical estimators is that a priori knowledge about the problem can be incorporated into the
architecture of the network. This allows a considerable reduction in the amount of data needed to
train.
To get a feeling for the type of problems we face when designing and training a neural network, let
us make an analogy with curve fitting (fig. 34).
Assume that you want to fit experimental points with a polynomial. With a small number of points,
fitting with a high-order polynomial can lead to very stupid interpolations: the curve goes through all
the experimental points, but values between experimental points do not make sense (poor generaliza-
tion). The polynomial has too many free parameters, we do not have enough data to constrain the
solution sufficiently (underdetermination).
Let us try again to fit these same points with a line. We can find a regression solution that does not
go through all the experimental points but whose values between experimental points make a lot more
sense. In this second case we have more experimental points than free parameters (overdetermination)
which ensures reasonable generalization.
The number of free parameters is just a simple measure of the complexity of the model. In the
following, we will show how to relate complexity and number of training examples.
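The analogy is easy to reproduce numerically (our own toy data, in Python):

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 8))
y = 2 * x + 0.1 * rng.normal(size=8)   # noisy samples of a linear function

high = np.polyfit(x, y, 7)   # 8 free parameters for 8 points: underdetermined fit
line = np.polyfit(x, y, 1)   # 2 free parameters: overdetermined, smooth fit

x_new = np.linspace(0, 1, 5)
print(np.polyval(high, x_new))   # interpolates the samples, oscillates between them
print(np.polyval(line, x_new))   # sensible values between the sample points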

Fig. 34. Analogy with curve fitting. Fit the points (x_p, F(x_p)) with a high-degree polynomial G_w(x): underdetermination. Fit the same points with a line G_w(x): overdetermination.

5.2. What is learning?

We will only address the problem of a fixed architecture, determined prior to training. Algorithms to
grow networks during training [101], to prune the connections [90] and genetic algorithms [24, 56] will
not be considered here.
In the following, unless otherwise stated, we work with a fixed architecture. The choice of an
architecture and of weight domains (weights can be quantized, binary, bounded, etc.) defines a space of
hypotheses, that is the space of all networks accessible by learning, {G_w} (fig. 35). A given function F
can be learned if and only if there is a non-empty subset of {G_w} that implements F.
Before learning, we can try to make a “lucky guess”. The probability of picking a network at random
that actually implements F is

= ~F~’

where 12,. and ~(~) are respectively the volume of the subset of networks that can implement F and the
volume of {G~}itself. From now on we assume that this probability is non zero and that F can be
learned by the network.
We define the entropy of the network architecture (intrinsic entropy) as:

= —~ p~ln p~P,
1. Guyon, Neural networks and applications tutorial 245

NETWORKS
THAT IMPLEMENT
THE DESIRED FUNCTION F.
{G. . .}
OUR CHOICE OF
ARCHITECTURE:

w
SPACE.

VOLUME ~F VOLUME ç~(O)


BEFORE TRAINING.

Fig. 35. Space of hypotheses prior to training (from Solla [126]).

where the summation runs on all functions H implementable by the network. If we have m training
examples, our guess is made easier since some networks can be eliminated, those which disagree with F
for the training examples. Our space of hypotheses therefore reduces (fig. 36). Our probability of
making a lucky guess increases,

P^{(m)} = Ω_F / Ω^{(m)} > P^{(0)} > 0.

Some networks now have zero probability of being selected (the ones we have rejected, given the examples); therefore the entropy of the network decreases,

0 ≤ S^{(m)} = − Σ_H P_H^{(m)} ln P_H^{(m)} ≤ S^{(0)}.

Learning is an entropy reduction process.
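This can be made concrete on a toy hypothesis space (an assumed setup, not from the text): let {G_W} be all 2^8 = 256 Boolean functions of three binary inputs, with a uniform prior, so S^{(0)} = ln 256. Each example (x, F(x)) eliminates the hypotheses that disagree with F on x, and the entropy of the surviving, re-normalized distribution decreases towards zero.

```python
# Learning as entropy reduction on a toy hypothesis space (illustrative).
import itertools, math, random

hypotheses = list(itertools.product([0, 1], repeat=8))  # all truth tables on 8 inputs
target = random.choice(hypotheses)                      # the function F to learn

consistent = hypotheses
for m, x in enumerate(random.sample(range(8), 8), start=1):  # examples (x, F(x))
    consistent = [h for h in consistent if h[x] == target[x]]
    p_lucky = 1.0 / len(consistent)       # P(m) = Omega_F / Omega(m), Omega_F = 1
    entropy = math.log(len(consistent))   # S(m) for the uniform posterior
    print(f"m={m}: {len(consistent):3d} hypotheses left, "
          f"P={p_lucky:.4f}, S={entropy:.3f} nats")
```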

5.3. Information versus complexity

The goal is to reach zero entropy. The question is: how many examples are needed to reach this
goal?
Let us define the learning efficiency as

η = (S^{(0)} − S^{(m)}) / m.

Fig. 36. Space of hypotheses after learning m examples (from Solla [126]).

Zero entropy is reached for the critical number of examples,

m_c = S^{(0)} / η,

determined by the choice of the architecture (intrinsic entropy S^{(0)}) and the choice of the examples (training efficiency η).
The entropy can be used as a measure of the complexity of the network: the larger the intrinsic entropy S^{(0)} of the network, the more additional information, i.e. the more examples m_c, is needed to train it.
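As a purely illustrative numerical check of this relation (the numbers are assumptions, not from the text), consider an architecture implementing N = 2^100 functions with a uniform prior, and suppose each well-chosen example halves the set of surviving hypotheses, i.e. η = ln 2 nats per example:

```latex
S^{(0)} = \ln N = \ln 2^{100} = 100 \ln 2 \approx 69.3\ \text{nats},
\qquad
m_c = \frac{S^{(0)}}{\eta} = \frac{100 \ln 2}{\ln 2} = 100\ \text{examples}.
```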

5.4. Bias versus variance

It is always desirable to reduce the critical number of examples m_c needed to train the network. One strategy is to reduce as much as possible the intrinsic entropy S^{(0)}. Given a network capable of implementing N functions, the maximum entropy corresponds to an equipartition of {G_W},

P_H^{(0)} = 1/N,

for all functions H. Then,

S^{(0)} = − Σ_H P_H^{(0)} ln P_H^{(0)} = ln N.

If the network architecture is biased, the a priori probabilities P_H^{(0)} are not all equal, and the intrinsic entropy is lower. Hence, if the architecture is biased, m_c is smaller.
How can the bias affect generalization?
Let us pick a network G_W at random, after introduction of m training examples. The squared bias is then given by

‖⟨G_W⟩_m − F‖^2,

where ⟨G_W⟩_m is the expected value of G_W (according to the underlying probability distribution over networks, given the m examples), and ‖·‖ is, for instance, the Euclidean norm. The expected generalization ability of network G_W can be measured by computing the expected value of the squared error,

⟨‖G_W − F‖^2⟩_m = ‖⟨G_W⟩_m − F‖^2 + ⟨‖G_W − ⟨G_W⟩_m‖^2⟩_m.

It is composed of two terms. The first one we recognize as the squared bias, and the second one is the variance (fig. 37). Ultimately, when we reach m_c, both terms vanish. But in practice the number of available examples m is often much smaller than m_c, and therefore there is a tradeoff between the bias and the variance.
Figure 38 shows an example of this tradeoff between bias and variance [46]. Three different methods have been tried on a problem of handwritten digit recognition: a two-layer feed-forward network trained with back-propagation, Parzen windows, and k nearest neighbors [40]. The number of training examples was kept constant. The percentage of errors on the test set gives an estimate of the generalization ability of the network. It is plotted versus a parameter characteristic of the architecture: the window width σ for Parzen windows, the number of neighbors k for nearest neighbors, and the number of hidden units N_h for the neural network. The curve goes through a minimum number of errors corresponding to the best tradeoff between bias and variance.

Fig. 37. The bias versus variance tradeoff for the choice of the initial space of hypotheses.

Fig. 38. An example of the bias versus variance tradeoff in handwritten digit recognition: the influence of the characteristic parameter of three different classification methods (Parzen windows, k-NN, back-propagation) on the percentage of errors on the test set. (From Geman, Bienenstock and Doursat [46].)
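The decomposition can be estimated empirically in the curve-fitting setting of section 5.1. The sketch below (target function, noise level and degrees are illustrative assumptions) averages polynomial fits over many independently drawn training sets and separates the squared bias from the variance; low degrees give high bias, high degrees give high variance.

```python
# Empirical bias/variance decomposition for polynomial fits (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.0, 1.0, 100)
f = lambda x: np.sin(2 * np.pi * x)          # the target function F

for degree in (1, 3, 9):                     # the complexity parameter
    fits = []
    for _ in range(200):                     # 200 independent training sets
        y = f(x_train) + 0.2 * rng.standard_normal(x_train.size)
        fits.append(np.polyval(np.polyfit(x_train, y, degree), x_test))
    fits = np.array(fits)
    bias2 = np.mean((fits.mean(axis=0) - f(x_test)) ** 2)  # ||<G>_m - F||^2
    var = np.mean(fits.var(axis=0))                        # <||G - <G>_m||^2>_m
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```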

5.5. In practice

• Very general networks can implement a large variety of functions but are the hardest to train. The critical number of training examples required to obtain good generalization is proportional to the intrinsic entropy S^{(0)}, characteristic of the generality of the architecture.
• Good hints should be introduced to bias the neural network architecture towards the desired function. Hints reduce the generality of the network architecture and therefore its intrinsic entropy.
• Bad hints bias the network towards undesired functions. For a given number of training examples, the parameters of the architecture can be optimally chosen to obtain the best tradeoff between bias and variance.
• The training efficiency η can be improved by choosing an appropriate training set: the least predictable example is the one yielding the highest generalization improvement.

6. Case study: design of a TDNN for handwritten character recognition

The purpose of this section is to illustrate the notions introduced in the previous sections about the design of neural networks to solve particular problems. Emphasis is placed on the importance of the preprocessing. This preprocessing can be fairly simple but is essential in order to format the data in a way which is compatible with the neural network. Special care has also been given to the design of the network, to incorporate a priori knowledge of the designer and reduce the intrinsic entropy (see section 5). Strategies to improve the learning efficiency are also described.

6.1. The application

The task we addressed is the design of a character recognizer for a touch terminal [52]. This work
was carried out at Bell Labs, in collaboration with P. Albrecht, Y. Le Cun, J. Denker and W. Hubbard.
The touch terminal consists of a transparent touch-sensitive screen overlaid on a standard liquid
crystal display (LCD). One can write on the touch screen using any sort of stylus, even one’s finger. We
will refer to any such writing instrument as “the pen” for simplicity. When the pen is touching the
screen, points are returned regularly and displayed immediately under the pen (electronic ink). The
system we used is represented in fig. 39. The touch terminal is connected to a standard PC through two
interface boards. Portable laptop PCs with a touch interface (electronic notebooks) are appearing on the market. These devices are particularly convenient for traveling salesmen, insurance agents, nurses, etc., who frequently have to access databases through keywords and fill in forms while standing or in an awkward position: the pen replaces both the keyboard and the mouse. In addition, the electronic
notebook can serve as pocket calculator, blackboard, mailbox, etc. After a day of work, the electronic
notebook can be connected directly to the main computer to update the central database.
For this application, it is important not to be handicapped by poor recognition performance of
handwritten characters. In this work, we restricted ourselves to a problem of limited difficulty:
• Characters are drawn in disjoint boxes, so that only trivial segmentation is needed.
• Either digits or uppercase letters are considered.
• The system is trained to be writer independent.
• No use is made of syntax, semantics or other contextual information.

Fig. 39. The AT&T touch terminal.
We used a set of 12 000 characters from approximately 250 writers. The performance of the system
was evaluated on a set of 2500 characters from a disjoint set of writers. The data were collected among
AT&T staff at Columbus, Ohio.

6.2. The method

Many different approaches have been proposed in the literature, both for optical character recognition and for so-called on-line recognition. The methods used in these two cases are usually very different: the former deals with two-dimensional images, while the latter deals with time-sequential signals. We chose to preserve the sequential nature of the information provided by the touch screen.
This choice was justified by the following preliminary simulations. It is possible to remove the time information (by projecting the x, y, t trajectory onto the x, y plane) and smooth the lines so that a classifier designed for optical character recognition can be used. We tried this approach, using the neural network classifier of Le Cun et al. [89], and achieved an error rate of less than 5% on zipcode digits and 8% on uppercase letters. However, we found that a neural network which incorporates the time information performs better, while using simpler preprocessing and a smaller number of parameters.
It turns out that the ordering of the sequence of points contains more information than the exact
timing of the points. Our preprocessing resamples the character and then encodes local geometric
information such as the position of the pen, the direction of the movement and the curvature of the
trajectory.
The classification is then performed by a neural network classifier. We use a multi-layer feed-forward network. The layers between the input layer and the output layer (the hidden layers) provide additional degrees of freedom. The network is trained by a gradient descent method (the back-propagation algorithm). To ensure that training leads to good generalization performance (performance on previously unseen patterns), the network architecture is constrained to make it sensitive to local topological features [89, 84, 136, 25].

6.3. The preprocessing

The purpose of the preprocessing is to create an intermediate representation which simplifies the problem for the neural network while incurring only a small computational overhead. This involves two complementary processes:
• Reducing variability which does not help discrimination between classes, thereby building up some invariance.
• Enhancing variability which does help discrimination between classes.
Examples of raw data are shown in fig. 40. The first steps of our preprocessing (namely centering, rescaling and resampling) greatly reduce the meaningless variability. In particular, time and scale distortions are removed. The last step of our preprocessing enhances the useful variability by capturing information about the local slope and the local curvature.
To make the intermediate representation invariant to translations and scale distortions, the characters are centered and rescaled.

Fig. 40. Examples of handwritten characters entered on the touch terminal.

Fig. 41. Preprocessing. (a) Resampling. (b) Frame representation, a vector of seven components f(n): f_0 = pen-up/pen-down; f_1 = (x − x_0)/δ_y and f_2 = (y − y_0)/δ_y, coordinates; f_3 = cos θ and f_4 = sin θ, direction; f_5 = cos φ and f_6 = sin φ, curvature.

The raw data contain enormous variations in writing speed. Training a neural network with such data
leads to poor performance (60% errors). Resampling to obtain a constant number of regularly spaced points on the trajectory yields much better performance. The resampling algorithm uses simple linear interpolation. An example of a character before and after resampling is shown in fig. 41a. Resampled characters are represented as a sequence of points [x(n), y(n)], regularly spaced in arc length (as opposed to the input sequence, which was regularly spaced in time). The two states of the pen (up and down) are encoded as the value (+1 or −1) of an additional variable penup(n). The initial number of pen-down points recorded by the touch terminal varies between roughly 5 and 200, with an average around 50 points. The resampled characters have 81 points, including pen-up points.
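A minimal sketch of this resampling step follows (the implementation details are assumptions consistent with the description above; pen-up handling for multi-stroke characters is omitted).

```python
# Arc-length resampling by linear interpolation (sketch).
import numpy as np

def resample(points: np.ndarray, n_out: int = 81) -> np.ndarray:
    """points: (n, 2) trajectory sampled regularly in time;
    returns (n_out, 2) points regularly spaced in arc length."""
    seg = np.diff(points, axis=0)
    ds = np.hypot(seg[:, 0], seg[:, 1])            # segment lengths
    s = np.concatenate(([0.0], np.cumsum(ds)))     # cumulative arc length
    s_new = np.linspace(0.0, s[-1], n_out)         # regular spacing in s
    x = np.interp(s_new, s, points[:, 0])          # piecewise-linear interpolation
    y = np.interp(s_new, s, points[:, 1])
    return np.column_stack((x, y))
```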
We encode the direction of a stroke by estimating the direction cosines of the tangent to the curve at point n (see fig. 41b). These parameters can also be thought of as (discrete approximations to) the first derivatives with respect to arc length, dx/ds and dy/ds, where ds = (dx^2 + dy^2)^{1/2}. It is also desirable to incorporate information related to the curvature. Unfortunately, the second derivatives d^2x/ds^2 and d^2y/ds^2 are not bounded. Therefore, we choose to measure the local curvature by the angle between two elementary segments, φ(n) = θ(n + 1) − θ(n − 1), as shown in fig. 41b. This angle is encoded by its cosine and sine, which can be computed from the direction cosines of θ(n − 1) and θ(n + 1) through trigonometric formulas.
These four parameters (cos θ, sin θ, cos φ, sin φ) were chosen for several reasons: they do not require the computation of a transcendental function, they are bounded, and for a smooth curve they change smoothly, without branch cuts. (A one-parameter encoding of the angle would not have these nice properties.)
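A sketch of this encoding under the same assumptions is given below; the direction cosines are ratios of coordinate differences to segment lengths, and cos φ and sin φ follow from the angle-difference identities cos(α − β) = cos α cos β + sin α sin β and sin(α − β) = sin α cos β − cos α sin β, so no transcendental function is ever evaluated.

```python
# Direction and curvature encoding without transcendental functions (sketch).
import numpy as np

def direction_and_curvature(points: np.ndarray):
    """points: (n, 2) resampled trajectory; returns cos/sin of theta and phi."""
    seg = np.diff(points, axis=0)
    ds = np.hypot(seg[:, 0], seg[:, 1]) + 1e-12     # guard against zero length
    cos_t, sin_t = seg[:, 0] / ds, seg[:, 1] / ds   # tangent direction cosines
    # phi(n) = theta(n + 1) - theta(n - 1), via the angle-difference identities
    # (segments two apart approximate the two elementary segments around n):
    cos_p = cos_t[2:] * cos_t[:-2] + sin_t[2:] * sin_t[:-2]
    sin_p = sin_t[2:] * cos_t[:-2] - cos_t[2:] * sin_t[:-2]
    return cos_t, sin_t, cos_p, sin_p
```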

6.4. The frame representation

We define a frame to be a feature vector that is part of a sequence parameterized by the frame number n. To summarize the results of the previous section, the intermediate representation created by the preprocessor consists of a sequence of frames with seven components:

f_0(n) = penup(n), f_1(n) = [x(n) − x_0]/δ_y, f_2(n) = [y(n) − y_0]/δ_y,
f_3(n) = cos θ(n), f_4(n) = sin θ(n), f_5(n) = cos φ(n), f_6(n) = sin φ(n),   (9)

where x_0 and y_0 are the coordinates of the center of the character and δ_y is the height of the character.
All these components are bounded and vary between −1 and +1, with the exception of f_1(n), which may occasionally go slightly out of these bounds. In general, normalization of the inputs is very important: the speed at which a weight changes is proportional to the covariance of its incoming signal. Therefore, the input variables should be normalized to have similar covariances. This makes the learning speeds more uniform, and allows the use of larger values for the gradient step α (see fig. 12).

Fig. 42. Intermediate representation produced by the preprocessor.
An example of this representation is shown in fig. 42. Time (as parameterized by the frame number n) increases from left to right. Each frame is represented as a column of boxes; the size of a box indicates the magnitude of a component and its color indicates the sign (black = negative; white = positive). The trajectory of the pen can be followed on this intermediate representation. Point A is in the middle of a line: the intermediate representation around point A shows constant direction and curvature (components f_3(n), f_4(n), f_5(n) and f_6(n) are constant). Point B is at the extremity of an angle: this corresponds to sharp changes in the direction and the curvature.
Local features of this kind (lines, curves, edges, ...) are well known to be useful for character recognition; the preprocessor makes this information readily available to the adaptive neural network described below.

6.5. The network architecture

The intermediate representation produced by the preprocessor (fig. 42) captures local topological
features along the curve of the character. It is then used as the input of a neural network which is
trained to make use of these simple local features and extract more complex, more global features.
The network is a TDNN [84], already described in section 3. Its architecture is represented in fig. 19b. It is a feed-forward layered network; in this case we have four layers of weights connecting five layers of units (since we count the input as a degenerate "layer #0").
The connections are restricted to be local. A particular unit has a receptive field that is limited in the time direction; that is, all its inputs come from a group of consecutive frames in the previous layer. The receptive field of unit i has considerable overlap with the receptive field of unit i − 1. This induces a topology on the input space, giving the network a hint that it should be looking for sequences.
The network is constrained to be convolutional. That is, the weights connecting one frame (in layer l + 1) to its receptive field (in layer l) are the same as the weights connecting the next frame to the next receptive field. The motivation for this is that we expect a particular meaningful feature (e.g., a line or a curve) to occur at different times within the sequence. It also means that there are far fewer free parameters, which facilitates training and improves generalization.
The final specialization of our network is that each layer has a coarser time representation than the preceding layer. This is implemented by subsampling [87, 25]: only one out of every s values of the convolution is kept (and actually computed).
The loss of time resolution in the feature vectors (due to subsampling) is partially compensated by an increase in the number of features. This is called "bi-pyramidal" scaling. The number of units, and therefore the information-carrying capacity, of each layer is reduced less than the subsampling alone might suggest. The output layer consists of a single frame of 36 coefficients (10 for digits and 26 for letters) that can be considered as the ultimate, most abstract features.
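The three constraints (locality, weight sharing and subsampling) can be captured in a few lines. The sketch below is illustrative only; the receptive-field width, strides and layer sizes are assumptions, not the actual specifications of fig. 43.

```python
# One TDNN layer: local receptive fields, weights shared across time
# (convolution), and subsampling by a stride s (illustrative sketch).
import numpy as np

def tdnn_layer(frames, weights, bias, stride=2):
    """frames: (T, F_in); weights: (F_out, K, F_in) with K the receptive-field
    width; returns (T_out, F_out), T_out = (T - K) // stride + 1."""
    T, _ = frames.shape
    F_out, K, _ = weights.shape
    out = np.empty(((T - K) // stride + 1, F_out))
    for i, t in enumerate(range(0, T - K + 1, stride)):   # slide over time
        window = frames[t:t + K]            # one receptive field
        # the same weights at every position t: the convolutional constraint
        out[i] = np.tanh(np.tensordot(weights, window, axes=([1, 2], [0, 1])) + bias)
    return out

# Bi-pyramidal scaling (illustrative sizes): the time resolution shrinks
# while the number of features per frame grows from layer to layer.
x = np.random.randn(81, 7)                                          # preprocessed character
h1 = tdnn_layer(x, 0.1 * np.random.randn(8, 5, 7), np.zeros(8))     # -> (39, 8)
h2 = tdnn_layer(h1, 0.1 * np.random.randn(16, 5, 8), np.zeros(16))  # -> (18, 16)
print(x.shape, h1.shape, h2.shape)
```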
In the terms used in section 5, all these constraints on the architecture contribute to substantially reducing the intrinsic entropy of the network. Specializing the network facilitates learning only if the resulting structure reflects the designer's a priori knowledge of the problem. Otherwise the network is a priori biased towards wrong solutions.
The specifications of our network are summarized in fig. 43.

Fig. 43. Specifications of our TDNN. Bi-pyramidal scaling of the parameters.

6.6. The training

The neuron used is a sigmoid unit (fig. 7) whose activation function is a scaled hyperbolic tangent,

f(v) = a tanh(bv),

where a = 1.716 and b = 2/3 [89]. By convention, the input x_0 is always set to 1, so that the weight w_{i0} serves as a bias value.
— Note that we use a symmetric activation function (an odd sigmoid). Non-symmetric activation functions (ranging between 0 and 1, for instance) yield much slower convergence [76].
— The choice of the parameters a and b of the sigmoid has to be made in conjunction with the choice of the target values for the output layer and the initialization of the weights.
— It is important to choose target values within the range of the sigmoid. With this choice of a and b, target values can conveniently be chosen to be −1 and +1. Many authors set the target values for desired outputs equal to the asymptotic values of the activation function. This has many undesirable effects: target values on the asymptotes tend to drive the weights to infinity, slow down learning by orders of magnitude, and degrade generalization performance.

— Initial values of the weights should not be too big (as that might saturate the units and produce tiny gradients) and not too small (as this also makes the gradients very small and the learning initially very slow; catastrophes with all weights going to 0 might also occur). The value of the weighted sum v (or potential) should stay in the linear part of the sigmoid. This imposes the requirement that random initial values of the weights be smaller for units with many inputs and larger for units with few inputs. In practice, with the values of a and b chosen above, the weights are initialized with random values uniformly distributed between −2.4/F_i and +2.4/F_i, where F_i is the total number of inputs of unit i.
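A short sketch of the unit and this initialization (the helper names are ours; the constants are those given above):

```python
# Scaled hyperbolic tangent unit and fan-in-dependent initialization (sketch).
import numpy as np

A, B = 1.716, 2.0 / 3.0

def activation(v):
    return A * np.tanh(B * v)            # odd, symmetric sigmoid

def init_weights(fan_in, rng=np.random.default_rng(0)):
    # smaller initial weights for units with many inputs, so that the
    # weighted sum v stays in the linear part of the sigmoid
    return rng.uniform(-2.4 / fan_in, 2.4 / fan_in, size=fan_in)
```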
The weights are adjusted during a supervised training session which performs gradient descent in weight space with a mean-squared-error (MSE) cost function:

E = (1/p) Σ_{k=1}^{p} Σ_{l=1}^{c} (y_l^k − d_l^k)^2,

where p is the number of training patterns, c the number of output units (one per class, i.e. 36), y_l^k the state of output unit l when pattern number k is presented at the input of the network, and d_l^k the corresponding target value. The target values are binary: d_l^k = +1 if l = class(k) and −1 otherwise.
The training algorithm is a stochastic gradient procedure (on-line gradient descent). It is a modified version of the back-propagation algorithm [120], which makes use of a diagonal approximation of the Hessian matrix to set the learning rate optimally [19]. The training algorithm, when it modifies the weights, must preserve the convolutional constraint on the connection pattern. This is implemented by "weight sharing" (see refs. [89, 87] and section 3).
— For classification problems, always use on-line (stochastic) weight updates, as opposed to batch updates. With stochastic updates the weights are updated once after each pattern presentation; with batch updates the gradients are accumulated over the whole training set before the weights are updated. Stochastic updates are orders of magnitude faster on problems with a large and redundant database, although they are harder to vectorize/parallelize.
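The difference between the two schemes can be sketched for a single layer of tanh units (an illustrative simplification: the actual network has four weight layers and uses the diagonal second-order step of ref. [19], which is not reproduced here).

```python
# Stochastic versus batch gradient descent on the MSE cost (sketch).
import numpy as np

def stochastic_epoch(W, X, D, lr=0.01):
    # one weight update after EACH pattern presentation
    for x, d in zip(X, D):
        y = np.tanh(W @ x)
        W -= lr * np.outer((y - d) * (1.0 - y**2), x)  # per-pattern gradient
    return W

def batch_epoch(W, X, D, lr=0.01):
    # gradients accumulated over the whole training set, then ONE update
    grad = np.zeros_like(W)
    for x, d in zip(X, D):
        y = np.tanh(W @ x)
        grad += np.outer((y - d) * (1.0 - y**2), x)
    return W - lr * grad / len(X)
```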
— Slow convergence might be caused by large differences in learning speed in various parts of the network. The gradient steps (learning rates α) should be chosen so that all units converge at the same speed. In particular, α should be smaller in the last layers than in the first layers, since the last layers tend to have larger gradients. Units with many inputs should have a smaller learning rate than units with few inputs.
— The least predictable example is the one carrying the most new information. Examples from different classes and from different writers should be alternated. Shuffling the examples is critical for improving the speed of convergence. Queries and selective sampling also improve the training efficiency [14].
The training session is stopped according to a cross-validation criterion. For this purpose a validation set of 500 examples, distinct from both the training set and the test set, was defined. The training is stopped when the mean-squared error on the examples of the validation set stops decreasing significantly or starts increasing. This usually corresponds to fewer than 30 learning cycles through the entire training set. The network which gives the smallest mean-squared error on the validation set is kept and re-tested with an independent test set of 2500 examples from different writers. The result of this last test is what we call the performance of the network.

6.7. The results

Training was performed with a set of 12 000 examples produced by a large variety of writers. Using
the cross-validation procedure, training was stopped after 21 passes through the training set. At that
point, the percentage of mistakes was 0.3% on the training set, and 4.0% on the validation set.
The network was then tested on the test set (2500 examples from a disjoint set of writers). With the
simplest classification scheme consisting of picking the maximum output, the performance was 3.4%
mistakes on the whole test set, 2.3% if tested on digits only and 3.8% if tested on uppercase letters
only.
Using the validation set, two thresholds were set so as to reject ambiguous or meaningless patterns. A pattern is rejected if the maximum output is below a first threshold θ_max and/or the difference between the two biggest outputs is below a second threshold θ_diff. The threshold θ_diff was adjusted to give at most 1% substitution errors on the validation set: θ_diff = 0.3. The other threshold, θ_max, was fixed at 0. Applying these thresholds to the test set, the network rejected 7.2% of the patterns as unclassifiable and made 0.7% substitution errors.
— This protocol of selecting the thresholds on the validation set was necessary so that we could certify that all the parameters of the system were determined without reference to the test set.
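The rejection rule itself is simple; a sketch follows (the threshold values are those quoted above, and the function name is ours).

```python
# Reject ambiguous or meaningless patterns by thresholding the outputs (sketch).
import numpy as np

def classify_with_rejection(outputs, theta_max=0.0, theta_diff=0.3):
    """outputs: the 36 network outputs for one pattern; returns class or None."""
    top2 = np.sort(outputs)[-2:]           # the two largest outputs
    if top2[1] < theta_max or (top2[1] - top2[0]) < theta_diff:
        return None                        # rejected
    return int(np.argmax(outputs))
```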
There was no a priori guarantee that the central feature of our system, namely the emphasis on the sequential structure of the data, would be an advantage rather than a disadvantage. After all, for any particular static pixel map, there are many stroke sequences that will produce it. This can (in principle) only increase the intra-class variability and complicate the recognition process.
The remarkable thing is that, in practice, the sequential information is highly advantageous, and is indeed required for the high recognition performance that we obtained.

7. Where to find information?

Books and reviews: Pioneering work from the early 1940s has been republished in ref. [11]. There are several general review papers [133, 94, 80, 50, 51] and reviews on more specific topics [9, 95, 140, 139, 103]. The classics of neural networks are refs. [102, 79, 121, 100]. Recently, several textbooks have been published, including refs. [137, 124, 10]. Related topics can be found in refs. [40, 130, 142, 104]. Special issues of IEEE Computer [4] and Speech Communication [5] are devoted to neural networks. A special issue of Scientific American is devoted to the brain [2]. An extensive survey of the field has been published by DARPA [3].
Journals: Interesting articles are scattered across a variety of journals from mathematics, physics, biology, pattern recognition, speech, machine learning, etc. Journals specially devoted to neural networks include: Neural Computation, IEEE Transactions on Neural Networks, Neural Networks, Neurocomputing, International Journal of Neural Systems, The International Journal of Neural Networks, and Network: Computation in Neural Systems. Journals publishing reviews of books, papers, conferences and products include: Neural Network Review, Neurocomputers, Intelligence, and Neural Network News.
Conferences: The three main annual conferences on Neural Networks are: IJCNN (International
Joint Conference on Neural Networks), INNC (International Neural Network Conference) and NIPS
(Neural Information Processing Systems).

Acknowledgement

I would like to thank the entire Neural Network research group at Bell Labs. This tutorial is, to a large extent, the result of my interaction with them over the last two years. Special thanks to Jane Bromley and Ofer Matan for their help with the manuscript.

References

[1] A.R. Anderson, ed., Minds and Machines, in: Contemporary Perspectives in Philosophy (Prentice-Hall, Englewood Cliffs, 1964).
[2] The Brain, Special Issue, Sci. Am. (1979).
[3] DARPA Neural Network Study, AFCEA, USA (1988).
[4] Artificial Neural Systems, Special Issue, IEEE Computer 21(3) (1988).
[5] Neurospeech, Special Issue, Speech Comm. 9(1) (1990).
[6] Y.S. Abu-Mostafa, The Vapnik–Chervonenkis dimension: information versus complexity in learning, Neural Comput. 1 (1989) 312–317.
[7] Y.S. Abu-Mostafa and D. Psaltis, Optical neural computers, Sci. Am. 225(3) (1987) 88–95.
[8] L.B. Almeida, A learning rule for asynchronous perceptrons with feedback in a combinatorial environment, in: Proc. ICNN 87, San Diego (IEEE, 1987).
[9] S. Amari and K. Maginu, Statistical neurodynamics of associative memory, Neural Networks 1 (1988) 63–73.
[10] D.J. Amit, Modeling Brain Function (Cambridge Univ. Press, Cambridge, 1989).
[11] J. Anderson and E. Rosenfeld, eds, Neurocomputing: Foundations of Research (MIT Press, Cambridge, 1988).
[12] J.A. Anderson, J.W. Silverstein, S.A. Ritz and R.S. Jones, Distinctive features, categorical perception, and probability learning: some
applications of a neural model, Psychol. Rev. 84 (1977) 413—451.
[13] B. Angéniol, G. de la Croix Vaubois and J.-Y. Le Texier, Self-organizing feature maps and the traveling salesman problem, Neural Networks 1 (1988) 289–293.
[14] L. Atlas, D. Cohn and R. Ladner, Training connectionist networks with queries and selective sampling, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 566–573.
[15] H.B. Barlow, Unsupervised learning, Neural Comput. 1 (1989) 295—311.
[16] A. Barron and R. Barron, Statistical learning networks: a unifying view, in: Symposium on the Interface: Statistics and Computing Science, Reston VA (April 1988).
[17] A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike elements that solve difficult learning control problems, IEEE Trans. Systems, Man, Cybernet. (1983).
[18] E.B. Baum and D. Haussler, What size net gives valid generalization? Neural Comput. 1 (1989) 151–160.
[19] S. Becker and Y. Le Cun, Improving the convergence of back-propagation learning with second-order methods, Tech. Rep. CRG-TR-88-5, University of Toronto Connectionist Research Group (1988).
[20] S. Becker and Y. Le Cun, Improving the convergence of back-propagation learning with second-order methods, Proc. 1988 Connectionist Models Summer School, eds D. Touretzky, G. Hinton and T. Sejnowski (Kaufmann, San Mateo, 1989) pp. 29–37.
[21] Y. Bengio, R. De Mori and R. Cardin, Speaker independent speech recognition with neural networks and speech knowledge, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 218–225.
[22] F. Blayo, Les implémentations VLSI de réseaux de neurones, in: Tutorial de l'Université d'été CIRILLE, Lyon (1989).
[23] H. Bleuler, D. Diez, G. Lauber, U. Meyer and D. Zlatnik, Non-linear neural network control with application example, IEEE Proc. Int. Conf. Neural Networks, Vol. 1, Paris (1990) pp. 201–204.
[24] L.B. Booker, D.E. Goldberg and J.H. Holland, Classifier systems and genetic algorithms, Cognitive Science and Machine Intelligence Laboratory TR-8, The University of Michigan, Michigan (1987).
[25] L.-Y. Bottou, Master's Thesis, EHEI, Université de Paris 5 (1988).
[26] H. Bourlard and Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biol. Cybernet. 59 (1988) 291–294.
[27] J.S. Bridle, Alpha-nets: a recurrent 'neural' network architecture with a hidden Markov model interpretation, Speech Comm. 9 (1990) 83–92.
[28] D.S. Broomhead and D. Lowe, Multi-variate functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[29] D.J. Burr, Experiments with a connectionist text reader, IEEE 1st Ann. Int. Conf. Neural Networks, Vol. 4, San Diego (1987) pp. 717–724.
[30] S. Canu, R. Sobral and R. Lengelle, Formal neural network as an adaptive model for water demand, IEEE Proc. Int. Conf. Neural Networks, Vol. 1, Paris (1990) pp. 131–135.
[31] G.A. Carpenter and S. Grossberg, The ART of adaptive pattern recognition by self-organizing neural network, IEEE Computer 21(3) (1988) 77–88.
[32] S.M. Carroll and B.W. Dickinson, Construction of neural nets using the Radon transform, IEEE Proc. Int. Joint Conf. Neural Networks, Vol. 1, Washington (1989).

[33] J.P. Changeux, l’Homme Neuronal (Fayard, Paris, 1983).


[34] L.N. Cooper, F. Liberman and E. Oja, A theory for the acquisition and loss of neuron specificity in visual cortex, Biol. Cybernet. 33 (1979)
9—28.
[35] G.W. Cottrell, P. Munro and D. Zipser, Learning internal representations from grey-scale images: an example of extensional programming, Proc. 9th Ann. Conf. of the Cognitive Science Society (Erlbaum, Hillsdale, 1987) pp. 461–473.
[36] J.D. Cowan, Neural networks: the early days, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 828–842.
[37] G. Cybenko, Approximation by superpositions of sigmoidal functions, Math. Control Signals Systems 2 (1989) 303–314.
[38] J. Denker, D. Schwartz, B. Wittner, S.A. Solla, R. Howard, L. Jackel and J. Hopfield, Large automatic learning, rule extraction and generalization, Complex Systems 1 (1987) 877–922.
[39] S. Diederich and M. Opper, Phys. Rev. Lett. 58 (1987) 9.
[40] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
[41] Rivest et al., eds, Proc. 2nd Ann. Workshop on Computational Learning Theory, Santa Cruz (Kaufmann, 1989).
[42] B.M. Forrest et al., Implementing neural network models on parallel computers, Computer J. 30 (1987) 413–419.
[43] K. Fukushima, Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,
Biol. Cybernet. 36 (1980) 193—202.
[44] K.I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183—192.
[45] P. Gallinari, A neural net classifier combining unsupervised and supervised learning, IEEE Proc. Int. Conf. Neural Networks, Vol. 1, Paris (1990) pp. 375–378.
[46] S. Geman, E. Bienenstock and R. Doursat, Neural networks and the bias/variance problem, Neural Comput., to be published.
[47] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee and D. Chen, Higher order recurrent networks and grammatical inference, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 380–387.
[48] R.P. Gorman and T.J. Sejnowski, Analysis of hidden units in a layered network trained to classify sonar targets, Neural Networks 1 (1988)
75—79.
[49] H.P. Graf and D. Henderson, A reconfigurable CMOS neural network, ISSCC Dig. Tech. Papers, IEEE Int. Solid-State Circuits Conf.
(1990).
[50] S. Grossberg, Nonlinear neural networks: principles, mechanisms and architectures, Neural Networks 1 (1988) 17—61.
[51] I. Guyon, Neural network systems, Proc. 5th Int. Symp. on Numerical Methods in Engineering, Vol. 1, eds R. Gruber et al. (Springer, Berlin, 1989) pp. 203–210.
[52] I. Guyon, P. Albrecht, Y. Le Cun, J. Denker and W. Hubbard, Design of a neural network character recognizer for a touch terminal, Pattern Recognition 24 (1991) 105–119.
[53] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker and Y. Le Cun, Comparing different neural network architectures for classifying handwritten digits, IEEE Proc. Int. Joint Conf. on Neural Networks, Vol. 2, Washington (1989) pp. 127–132.
[54] H. Bourlard and C.J. Wellekens, Multilayer perceptrons and automatic speech recognition, IEEE Proc. ICNN, San Diego (1987).
[55] J.B. Hampshire II and A.H. Waibel, A novel objective function for improved phoneme recognition using time delay neural networks, IEEE Proc. Int. Joint Conf. on Neural Networks, Vol. 1, Washington (1989) pp. 235–241.
[56] S. Hanson, Meiosis network, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 533–541.
[57] D.O. Hebb, The Organization of Behavior (Wiley, New York, 1949).
[58] G.E. Hinton, Connectionist learning procedures, Tech. Rep. Carnegie-Mellon University, Pittsburgh (1987).
[59] G.E. Hinton and T.J. Sejnowski, Learning and relearning in Boltzmann machines, in: Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, Vol. 1 (Bradford, Cambridge, MA, 1986).
[60] G.E. Hinton, Deterministic Boltzmann learning performs steepest descent in weight-space, Neural Comput. 1 (1989) 143–150.
[61] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. USA 79 (1982) 2554–2558.
[62] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Nat. Acad. Sci. USA 81 (1984) 3088–3092.
[63] J.J. Hopfield, The effectiveness of analogue "neural network" hardware, Network 1 (1990) 27–40.
[64] J.J. Hopfield and D.W. Tank, Simple 'neural' optimization networks: an A/D converter, signal decision circuit, and linear programming circuit, IEEE Trans. Circuits Syst. CS-33 (1986) 533.
[65] K. Hornik, M. Stinchcombe and H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks, Neural Networks 2 (1989) 359–368.
[66] IEEE Proc. Int. Symp. on Intelligent Control, Albany NY (September 1989).
[67] IEEE Proc. 10th Int. Conf. on Pattern Recognition, Atlantic City NJ (June 1990).
[68] IEEE Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Albuquerque NM (April 1990).
[69] IEEE Proc. Int. Joint Conf. on Neural Networks, Washington DC (January 1990).
[70] IEEE Proc. Int. Joint Conf. on Neural Networks, San Diego CA (June 1990).
[71] IEEE Proc. Int. Conf. Neural Networks, Paris (July 1990).

[72] L.D. Jackel, H.P. Graf, W. Hubbard, J.S. Denker, D. Henderson and I. Guyon, An application of neural net chips: handwritten digit recognition, IEEE Proc. Int. Conf. on Neural Networks, Vol. 2, San Diego (1988) pp. 107–115.
[73] J.J. Hopfield and D.W. Tank, Neural computation of decisions in optimization problems, Biol. Cybernet. 52 (1985) 141–152.
[74] M.I. Jordan and R.A. Jacobs, Learning to control an unstable system with forward modeling, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 324–331.
[75] E.R. Kandel and J.H. Schwartz, Principles of Neural Science (North-Holland, Amsterdam, 1981).
[76] I. Kanter, Y. Le Cun and S. Solla, Second order properties of error surfaces: learning time and generalization, NIPS 90, Advan. Neural Inform. Proc. Systems 3, eds R. Lippmann et al. (IEEE, Kaufmann, San Mateo, 1991).
[77] S. Katagiri, private communication (1990).
[78] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[79] T. Kohonen, Self-Organization and Associative Memory, 2nd Ed. (Springer, Berlin, 1987).
[80] T. Kohonen, An introduction to neural computing, Neural Networks 1 (1988) 3–16.
[81] T. Kohonen, G. Barna and R. Chrisley, Statistical pattern recognition with neural networks: benchmarking studies, IEEE 2nd Int. Conf. on Neural Networks, San Diego CA (July 1988) pp. 61–68.
[82] W. Krauth and M. Mezard, Learning algorithms with optimal stability in neural networks, J. Phys. A 20 (1987) L745.
[83] G. Kuhn, Connected recognition with a recurrent network, Speech Comm. 9 (1990) 41–48.
[84] K.J. Lang and G.E. Hinton, A time delay neural network architecture for speech recognition, Tech. Rep. CMU-CS-152, Carnegie-Mellon University, Pittsburgh PA (1988).
[85] Y. Le Cun, A learning scheme for asymmetric threshold networks, Proc. Cognitiva 85, Paris, France (1985) pp. 599–604.
[86] Y. Le Cun, Generalization and network design strategies, Tech. Rep. CRG-TR-89-4, University of Toronto Connectionist Research Group (1989).
[87] Y. Le Cun, Generalization and network design strategies, in: Connectionism in Perspective, eds R. Pfeifer, Z. Schreter, F. Fogelman and L. Steels (Elsevier, Amsterdam, 1989).
[88] Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard and L.D. Jackel, Back-propagation applied to handwritten zipcode recognition, Neural Comput. 1 (1989) 541–551.
[89] Y. Le Cun, J.S. Denker and S. Solla, Optimal brain damage, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 598–605.
[90] Y. Le Cun, L.D. Jackel, B. Boser, J.S. Denker, H.P. Graf, I. Guyon, D. Henderson, R.E. Howard and W. Hubbard, Handwritten digit recognition: application of neural network chips and automatic learning, IEEE Communications Mag. (November 1989) 41–46.
[91] Y. Lee and R.P. Lippmann, Practical characteristics of neural networks and conventional pattern classifiers on artificial and speech problems, NIPS 89, Advan. Neural Inform. Proc. Systems 2, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1990) pp. 168–177.
[92] E. Levin, Word recognition using hidden control neural architecture, IEEE Int. Conf. on Acoustics Speech and Signal Processing, Albuquerque NM (1990).
[93] R. Linsker, How to generate ordered maps by maximizing the mutual information between input and output, Neural Comput. 1 (1989) 402–411.
[94] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Mag. 3(4) (1987) 4–22.
[95] R.P. Lippmann, Review of neural networks for speech recognition, Neural Comput. 1 (1989) 1—38.
[96] W.A. Little, The existence of persistent states in the brain, Math. Biosci. 19 (1974) 101—120.
[97] W.A. Little and G. Shaw, A statistical theory of short and long term memory, Behav. Biol. 14 (1975) 115–133.
[98] W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133.
[99] E. McDermott and S. Katagiri, Shift-invariant, multi-category phoneme recognition using Kohonen's LVQ2, IEEE Int. Conf. on Acoustics Speech and Signal Processing, Vol. 1, Glasgow (1989) pp. 81–84.
[100] C. Mead, Analog VLSI and Neural Systems (Addison-Wesley, Reading, 1989).
[101] M. Mezard and J.-P. Nadal, Learning in feed-forward neural networks: the tiling algorithm, J. Phys. A 22 (1990) 2191–2203.
[102] M. Minsky and S. Papert, Perceptrons (MIT Press, Cambridge, 1969).
[103] K.S. Narendra and K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1990) 4–27.
[104]K.S. Narendra and M.A.L. Thathachar, Learning Automata (Prentice-Hall, Englewood Cliffs, 1989).
[105] D.B. Parker, Learning-logic, Tech. Rep. TR-47, Sloan School of Management, MIT, Cambridge MA (1985).
[106] D.B. Parker, Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation, and second order Hebbian learning, IEEE 1st Ann. Int. Conf. on Neural Networks, San Diego CA (June 1987).
[107] B.A. Pearlmutter, Learning state space trajectories in recurrent neural networks, Neural Comput. 1 (1989) 263–269.
[108] L. Personnaz, I. Guyon and G. Dreyfus, Collective computational properties of neural networks: new learning mechanisms, Phys. Rev. A 34 (1986) 4217–4228.
[109] L. Personnaz, I. Guyon and G. Dreyfus, High-order neural networks: information storage without errors, Europhys. Lett. 4 (1987) 863–867.
[110] L. Personnaz, O. Nerrand and G. Dreyfus, Apprentissage et mise en oeuvre des réseaux de neurones bouclés, in: Journées Internationales de Sciences Informatiques (Tunis, 1990).

[111] F.J. Pineda, Generalization of back propagation to recurrent and higher order neural networks, Proc. IEEE Conf. on Neural Information
Processing Systems, Denver CO (November 1987).
[112] T. Poggio, Biol. Cybernet. 19 (1975) 201.
[113] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990) 978—982.
[114] D.A. Pomerleau, ALVINN: an autonomous land vehicle in a neural network, NIPS 88, Advan. Neural Inform. Proc. Systems 1, ed. D.S. Touretzky (IEEE, Kaufmann, San Mateo, 1989) pp. 305–313.
[115] N. Qian and T.J. Sejnowski, Predicting the secondary structure of globular proteins using neural network models, J. Mol. Biol. 147 (1988) 195–197.
[116] S. Ramón y Cajal, Histologie du Système Nerveux de l'Homme et des Vertébrés (Maloine, Paris, 1911).
[117] D.L. Reilly, L.N. Cooper and C. Elbaum, A neural model for category learning, Biol. Cybernet. 45 (1982) 35—41.
[118] S. Renals and R. Rohwer, Phoneme classification experiments using radial basis functions, IEEE Proc. Int. Joint Conf. on Neural Networks,
Vol. 1, Washington DC (1989) pp. 461—467.
[119] F. Rosenblatt, Principles of Neurodynamics (Spartan Books, New York, 1962).
[120] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation, in: Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, Vol. 1 (Bradford Books, Cambridge MA, 1986) pp. 318—362.
[121] D.E. Rumelhart, J.L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (Bradford Books, Cambridge MA, 1986).
[122] T.J. Sejnowski and C.R. Rosenberg, NETtalk: a parallel network that learns to read aloud, Tech. Rep. 86-01, Department of Electrical Engineering and Computer Science, Johns Hopkins University, Baltimore MD (1986).
[123] T.J. Sejnowski and C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987) 145–168.
[124] P.K. Simpson, Artificial Neural Systems (Pergamon, Oxford, 1989).
[125] S.A. Solla, E. Levin and M. Fleisher, Accelerated learning in layered neural networks, Complex Systems 2 (1988) 625–640.
[126] E. Levin, N. Tisby and S.A. Solla, A statistical approach to learning and generalization in layered neural networks, Proc. IEEE 78 (1990) 1568–1574.
[127]C. Stevens, The neuron, in: The Brain, Special Issue, Sci. Am. (1979).
[128] M. Stinchcombe and H. White, Universal approximation using feed-forward networks with nonsigmoid hidden layer activation function,
IEEE Proc. Int. Joint Conf. on Neural Networks, Vol. 1, Washington DC (1989) pp. 613—617.
[129] G. Tesauro and T.J. Sejnowski, A parallel network that learns to play backgammon, Artificial Intelligence 39 (1989) 357—390.
[130]C.W. Therrien, Decision, Estimation and Classification: an Introduction to Pattern Recognition and Related Topics (Wiley, New York, 1989).
[131] N. Tishby, E. Levin and S.A. Solla, Consistent inference of probabilities in layered networks: predictions and generalization, IEEE Proc. Int. Joint Conf. on Neural Networks, Washington DC (1989).
[132] D.S. Touretzky, ed., NIPS 89, Advan. Neural Inform. Proc. Systems 2 (IEEE, Kaufmann, San Mateo, 1990).
[133] P. Treleaven and M. Vellasco, Neural computing overview, The Second European Seminar on Neural Networks, Neural Computing: Commercial Prospects (1989).
[134] C. von der Malsburg, Self-organization of orientation sensitive cells in striate cortex, Kybernetik 14 (1973) 85–100.
[135] A. Waibel, Consonant recognition by modular construction of large phonemic time-delay neural networks, NIPS 88, Advan. Neural Inform. Proc. Systems 1, ed. D.S. Touretzky (Kaufmann, San Mateo, 1989) pp. 215–223.
[136] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoustics Speech Signal Proc. 37 (1989) 328–339.
[137] P. Wasserman, Neural Computing: Theory and Practice (Van Nostrand Reinhold, Princeton, 1989).
[138] P. Werbos, Beyond Regression, PhD thesis, Harvard University (1974).
[139] P.J. Werbos, Backpropagation and neurocontrol: a review and prospectus, IEEE Proc. Int. Joint Conf. on Neural Networks, Vol. 1, Washington DC (1989) pp. 209–216.
[140] H. White, Learning in artificial neural networks: a statistical perspective, Neural Comput. 1 (1989) 425—464.
[141] B. Widrow and M.E. Hoff, Adaptive switching circuits, in: IRE WESCON Cony. Record, Part 4 (1960) pp. 96—104.
[142]B. Widrow and S.D. Stearns, Adaptive Signal Processing (Prentice-Hall, Englewood Cliffs, 1985).
[143] R.J. Williams and J. Peng, Reinforcement learning algorithms as function optimizers, IEEE Proc. Int. Joint Conf. on Neural Networks, Vol. 2, Washington DC (1989) pp. 89–95.
[144] R.J. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1 (1989) 270–280.
