UNIT 3

Optimization techniques for Gradient Descent

Gradient Descent is a widely used optimization algorithm for machine learning models.
However, there are several optimization techniques that can be used to improve the
performance of Gradient Descent. Here are some of the most popular optimization techniques
for Gradient Descent:

Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent
algorithm. Learning Rate Scheduling involves changing the learning rate during the training
process, such as decreasing the learning rate as the number of iterations increases. This
technique helps the algorithm to converge faster and avoid overshooting the minimum.
Momentum-based Updates: The Momentum-based Gradient Descent technique involves
adding a fraction of the previous update to the current update. This technique helps the
algorithm to overcome local minima and accelerates convergence.
Batch Normalization: Batch Normalization is a technique used to normalize the inputs to
each layer of the neural network. This helps the Gradient Descent algorithm to converge faster
and avoid vanishing or exploding gradients.
Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term
to the cost function proportional to the magnitude of the weights. This helps to prevent
overfitting and improve the generalization of the model.
Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning
rate adaptively during the training process. Examples include Adagrad, RMSprop, and Adam.
These techniques adjust the learning rate based on the historical gradient information, which
can improve the convergence speed and accuracy of the algorithm.
Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost
function to update the parameters. Examples include Newton’s Method and Quasi-Newton
Methods. These methods can converge faster than Gradient Descent, but require more
computation and may be less stable.
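As a rough illustration of a few of the techniques above, the sketch below shows a step-decay learning-rate schedule, a weight-decay term added to the gradient, and an Adam-style adaptive update. It is a minimal sketch assuming NumPy; the function names and default hyperparameters are illustrative, not taken from any particular library.

```python
import numpy as np

def lr_schedule(lr0, step, decay=0.01):
    # Learning rate scheduling: shrink the step size as the iteration count grows.
    return lr0 / (1.0 + decay * step)

def weight_decay_grad(w, grad, lam=1e-4):
    # Weight decay: an L2 penalty on the cost adds lambda * w to the gradient.
    return grad + lam * w

def adam_update(w, grad, m, v, step, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adaptive learning rate (Adam-style): keep exponentially weighted averages of the
    # gradient (m) and the squared gradient (v), then use bias-corrected versions.
    # step starts at 1 so the bias correction is well defined.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**step)
    v_hat = v / (1 - beta2**step)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```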
Gradient Descent is an iterative optimization algorithm, used to find the minimum value for a
function. The general idea is to initialize the parameters to random values, and then take small
steps in the direction of the “slope” at each iteration. Gradient descent is highly used in
supervised learning to minimize the error function and find the optimal values for the
parameters. Various extensions have been designed for the gradient descent algorithms. Some
of them are discussed below:
Momentum method: This method is used to accelerate the gradient descent algorithm by
taking into consideration the exponentially weighted average of the gradients.

RMSprop: RMSprop was proposed by Geoffrey Hinton of the University of Toronto. The
intuition is to apply an exponentially weighted average to the second moment of the
gradients (dW²), and to use this running average to scale the step size for each parameter.
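A minimal sketch of these two update rules is shown below, assuming NumPy arrays for the parameters and gradients; the function names and coefficients are illustrative.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: exponentially weighted average of past gradients (the "velocity").
    velocity = beta * velocity + (1 - beta) * grad
    return w - lr * velocity, velocity

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # RMSprop: exponentially weighted average of the squared gradient (second moment),
    # used to scale the step size per parameter.
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s
```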

The goal of the gradient descent is to minimise a given function which, in our case, is the loss
function of the neural network. To achieve this goal, it performs two steps iteratively.

1. Compute the slope (gradient), that is, the first-order derivative of the function at the
current point.

2. Move in the direction opposite to the slope from the current point, by a step proportional
to the computed gradient.
So, the idea is to pass the training set through the hidden layers of the neural network and then
update the parameters of the layers by computing the gradients using the training samples from
the training dataset.

Think of it like this. Suppose a man is at the top of a valley and wants to get to the bottom.
He walks down the slope, deciding his next position based on his current one, and stops when he
reaches the bottom of the valley, which was his goal.
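The two iterative steps can be written as a very small sketch; the function and variable names here are illustrative.

```python
def grad_descent_1d(df, w0, lr=0.1, n_iters=100):
    # Step 1: compute the slope (first derivative) at the current point.
    # Step 2: move opposite to the slope by lr * slope.
    w = w0
    for _ in range(n_iters):
        slope = df(w)
        w = w - lr * slope
    return w

# Example: minimize f(w) = (w - 3)^2, whose derivative is 2*(w - 3).
w_min = grad_descent_1d(lambda w: 2 * (w - 3), w0=0.0)   # converges toward 3
```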

Batch Gradient Descent

In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So that’s just one step of gradient descent in one epoch.

Batch Gradient Descent is great for convex or relatively smooth error manifolds. In this case, we
move somewhat directly towards an optimum solution.
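A hedged sketch of one Batch Gradient Descent run for a linear model with squared error is shown below; the model, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    # One step per epoch: the mean gradient over the whole training set.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        preds = X @ w
        grad = (2.0 / n) * X.T @ (preds - y)   # mean gradient of the squared error
        w -= lr * grad
    return w
```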
Stochastic Gradient Descent

In Batch Gradient Descent we considered all the examples for every step of Gradient
Descent. But what if our dataset is very large? Deep learning models crave data: the more data
we have, the better the model can become. Suppose our dataset has 5 million examples; then
just to take one step, the model has to calculate the gradients of all 5 million examples.
This is not efficient. To tackle this problem we have Stochastic Gradient Descent.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single
step. We do the following steps in one epoch for SGD:

1. Take an example

2. Feed it to Neural Network


3. Calculate its gradient

4. Use the gradient we calculated in step 3 to update the weights

5. Repeat steps 1–4 for all the examples in training dataset

Since we are considering just one example at a time, the cost will fluctuate over the training
examples and will not necessarily decrease at every step. But in the long run, you will see the
cost decreasing, with fluctuations.
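The steps above can be sketched as follows for a linear model with squared error; the shuffling, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):   # take one example at a time
            err = X[i] @ w - y[i]            # feed it to the model (forward pass)
            grad = 2.0 * err * X[i]          # gradient from this single example
            w -= lr * grad                   # update the weights immediately
    return w
```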

Mini Batch Gradient Descent

We have seen Batch Gradient Descent and Stochastic Gradient Descent. Batch Gradient Descent
works well for smoother error curves and converges directly towards a minimum; SGD is
preferable when the dataset is large and tends to make faster progress on such datasets. But
since SGD uses only one example at a time, we cannot take advantage of a vectorized
implementation, which can slow down the computations. To tackle this problem, a mixture of
Batch Gradient Descent and SGD is used: we use neither the whole dataset at once nor a single
example at a time. Instead, we use a batch of a fixed number of training examples, smaller than
the actual dataset, called a mini-batch. This gives us the advantages of both of the former
variants. So, after creating the mini-batches of fixed size, we do the following steps in one
epoch (a code sketch follows this list):

1. Pick a mini-batch

2. Feed it to Neural Network

3. Calculate the mean gradient of the mini-batch

4. Use the mean gradient we calculated in step 3 to update the weights

5. Repeat steps 1–4 for the mini-batches we created
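The same loop, sketched for mini-batches (again for a linear model with squared error; the batch size, learning rate, and epoch count are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.05, epochs=10):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]                        # pick a mini-batch
            preds = X[batch] @ w                                         # feed it to the model
            grad = (2.0 / len(batch)) * X[batch].T @ (preds - y[batch])  # mean gradient of the mini-batch
            w -= lr * grad                                               # update the weights
    return w
```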

HYPOTHESIS SEARCH SPACE

The concept of a hypothesis is fundamental in Machine Learning and data science endeavours.
In the realm of machine learning, a hypothesis serves as an initial assumption made by data
scientists and ML professionals when attempting to address a problem. Machine learning
involves conducting experiments based on past experiences, and these hypotheses are crucial
in formulating potential solutions. It’s important to note that in machine learning discussions,
the terms “hypothesis” and “model” are sometimes used interchangeably. However, a
hypothesis represents an assumption, while a model is a mathematical representation employed
to test that hypothesis. This section on “Hypothesis in Machine Learning” explores key aspects
related to hypotheses in machine learning and their significance.

Hypothesis in Machine Learning


A hypothesis in machine learning is the model’s presumption regarding the connection
between the input features and the result. It is an illustration of the mapping function that the
algorithm is attempting to discover using the training set. To minimize the discrepancy
between the expected and actual outputs, the learning process involves modifying the weights
that parameterize the hypothesis. The objective is to optimize the model’s parameters to
achieve the best predictive performance on new, unseen data, and a cost function is used to
assess the hypothesis’ accuracy.
Hypothesis Space (H)
The hypothesis space is the set of all possible legal hypotheses. This is the set from which the
machine learning algorithm determines the single best hypothesis that describes the target
function or the outputs.
Hypothesis (h)
A hypothesis is a function that best describes the target in supervised machine learning. The
hypothesis that an algorithm comes up with depends on the data and also on the restrictions
and bias that we have imposed on the data.
For a simple linear model, a hypothesis can be written as:
y = mx + b
Where,
 y = output (range)
 m = slope of the line
 x = input (domain)
 b = intercept
To better understand the hypothesis space and a hypothesis, consider a coordinate plot showing
the distribution of some data: every choice of m and b defines one hypothesis (one line), and
the set of all such lines is the hypothesis space.
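To make this concrete, here is a small sketch in Python: a toy dataset, a tiny hypothesis space of candidate lines h(x) = m*x + b, and selection of the best hypothesis by mean squared error. The data values and candidate pairs are made up for illustration.

```python
import numpy as np

# Toy data roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# A small hypothesis space: candidate (m, b) pairs defining h(x) = m*x + b.
hypothesis_space = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]

def mse(m, b):
    # Error of one hypothesis on the data.
    return np.mean((y - (m * x + b)) ** 2)

# The learning algorithm picks the single hypothesis that best fits the data.
best_h = min(hypothesis_space, key=lambda h: mse(*h))
print(best_h)   # (2.0, 1.0)
```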

INDUCTIVE BIAS

Inductive bias can be defined as the set of assumptions or biases that a learning algorithm
employs to make predictions on unseen data based on its training data. These assumptions are
inherent in the algorithm’s design and serve as a foundation for learning and generalization.
The inductive bias of an algorithm influences how it selects a hypothesis (a possible
explanation or model) from the hypothesis space (the set of all possible hypotheses) that best
fits the training data. It helps the algorithm navigate the trade-off between fitting the training
data too closely (overfitting) and constraining the model so strongly that it fits poorly
(underfitting), so that it can generalize well to unseen data.
Importance of Inductive Bias
Inductive bias is crucial in machine learning as it helps algorithms generalize from limited
training data to unseen data. Without a well-defined inductive bias, algorithms may struggle to
make accurate predictions or may overfit the training data, leading to poor performance on new
data.
Understanding the inductive bias of an algorithm is essential for model selection, as different
biases may be more suitable for different types of data or tasks. It also provides insights into
how the algorithm is learning and what assumptions it is making about the data, which can aid
in interpreting its predictions and results.
Challenges and Considerations
While inductive bias is essential for learning, it can also introduce limitations and challenges.
Biases that are too strong or inappropriate for the data can lead to poor generalization or biased
predictions. Balancing bias with variance (the variability of predictions) is a key challenge in
machine learning, requiring careful tuning and model selection.
Additionally, the choice of inductive bias can impact the interpretability of the model. Simpler
biases may lead to more interpretable models, while more complex biases may sacrifice
interpretability for improved performance.

ERROR FUNCTIONS

The loss function, also referred to as the error function, is a crucial component in machine
learning that quantifies the difference between the predicted outputs of a machine learning
algorithm and the actual target values. For example, within a regression problem to predict car
prices based on historical data, a loss function evaluates a neural network prediction based on a
training sample from the training dataset. The loss function quantifies the gap or the error margin
of the car price predicted by the network to the actual price.

The resulting value, the loss, reflects the accuracy of the model's predictions. During training, a
learning algorithm such as the backpropagation algorithm uses the gradient of the loss function
with respect to the model's parameters to adjust these parameters and minimize the loss,
effectively improving the model's performance on the dataset. Often, the terms loss function and
cost function are used interchangeably; despite this, both terms have distinct definitions:

As mentioned earlier, the loss function, also known as the error function, quantifies how well a
single prediction of the machine learning algorithm is compared to the actual target value. The
key takeaway is that a loss function applies to a single training example and is part of the overall
model's learning process that provides the signal by which the model's learning algorithm
updates the weights and parameters. The cost function, sometimes called the objective function,
is an average of the loss function of an entire training set containing several training examples.
The cost function quantifies the model's performance on the whole training dataset.

Mean Square Error (MSE) / L2 Loss


The Mean Square Error(MSE) or L2 loss is a loss function that quantifies the magnitude of the
error between a machine learning algorithm prediction and an actual output by taking the average
of the squared difference between the predictions and the target values. Squaring the difference
between the predictions and actual target values results in a higher penalty assigned to more
significant deviations from the target value. A mean of the errors normalizes the total errors
against the number of samples in a dataset or observation.

The mathematical equation for Mean Square Error (MSE) or L2 Loss is:

MSE = (1/n) * Σ(yᵢ - ŷᵢ)², where yᵢ is the actual target value and ŷᵢ is the corresponding prediction.

Mean Absolute Error (MAE) / L1 Loss


Mean Absolute Error (MAE), also known as L1 Loss, is a loss function used in regression tasks
that calculates the average absolute differences between predicted values from a machine
learning model and the actual target values. Unlike Mean Squared Error (MSE), MAE does not
square the differences, treating all errors with equal weight regardless of their magnitude.

The mathematical equation for Mean Absolute Error (MAE) or L1 Loss is:

MAE = (1/n) * Σ|yᵢ - ŷᵢ|, where yᵢ is the actual target value and ŷᵢ is the corresponding prediction.
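A minimal sketch of both loss functions, with a small made-up example; yᵢ are targets and ŷᵢ predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences: larger deviations are penalized more heavily.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean of absolute differences: every error is weighted equally.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))   # (0.25 + 0 + 2.25) / 3 = 0.8333...
print(mae(y_true, y_pred))   # (0.5 + 0 + 1.5) / 3 = 0.6666...
```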


7 Supervised Hebbian Learning

Objectives
Theory and Examples
Linear Associator
The Hebb Rule
Performance Analysis
Pseudoinverse Rule
Application
Variations of Hebbian Learning
Summary of Results
Solved Problems
Epilogue
Further Reading
Exercises

Objectives
The Hebb rule was one of the first neural network learning laws. It was
proposed by Donald Hebb in 1949 as a possible mechanism for synaptic
modification in the brain and since then has been used to train artificial
neural networks.
In this chapter we will use the linear algebra concepts of the previous two
chapters to explain why Hebbian learning works. We will also show how
the Hebb rule can be used to train neural networks for pattern recognition.


Theory and Examples


Donald O. Hebb was born in Chester, Nova Scotia, just after the turn of the
century. He originally planned to become a novelist, and obtained a degree
in English from Dalhousie University in Halifax in 1925. Since every first-
rate novelist needs to have a good understanding of human nature, he be-
gan to study Freud after graduation and became interested in psychology.
He then pursued a master’s degree in psychology at McGill University,
where he wrote a thesis on Pavlovian conditioning. He received his Ph.D.
from Harvard in 1936, where his dissertation investigated the effects of
early experience on the vision of rats. Later he joined the Montreal Neuro-
logical Institute, where he studied the extent of intellectual changes in
brain surgery patients. In 1942 he moved to the Yerkes Laboratories of Pri-
mate Biology in Florida, where he studied chimpanzee behavior.
In 1949 Hebb summarized his two decades of research in The Organization
of Behavior [Hebb49]. The main premise of this book was that behavior
could be explained by the action of neurons. This was in marked contrast
to the behaviorist school of psychology (with proponents such as B. F. Skin-
ner), which emphasized the correlation between stimulus and response and
discouraged the use of any physiological hypotheses. It was a confrontation
between a top-down philosophy and a bottom-up philosophy. Hebb stated
his approach: “The method then calls for learning as much as one can about
what the parts of the brain do (primarily the physiologist’s field), and relat-
ing the behavior as far as possible to this knowledge (primarily for the psy-
chologist); then seeing what further information is to be had about how the
total brain works, from the discrepancy between (1) actual behavior and (2)
the behavior that would be predicted from adding up what is known about
the action of the various parts.”
The most famous idea contained in The Organization of Behavior was the
postulate that came to be known as Hebbian learning:
Hebb’s Postulate “When an axon of cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic change
takes place in one or both cells such that A’s efficiency, as one of the cells fir-
ing B, is increased.”
This postulate suggested a physical mechanism for learning at the cellular
level. Although Hebb never claimed to have firm physiological evidence for
his theory, subsequent research has shown that some cells do exhibit Heb-
bian learning. Hebb’s theories continue to influence current research in
neuroscience.
As with most historic ideas, Hebb’s postulate was not completely new, as
he himself emphasized. It had been foreshadowed by several others, includ-
ing Freud. Consider, for example, the following principle of association
stated by psychologist and philosopher William James in 1890: “When two


brain processes are active together or in immediate succession, one of


them, on reoccurring tends to propagate its excitement into the other.”

Linear Associator
Hebb’s learning law can be used in combination with a variety of neural
network architectures. We will use a very simple architecture for our initial
presentation of Hebbian learning. In this way we can concentrate on the
learning law rather than the architecture. The network we will use is the linear associator,
which is shown in Figure 7.1. (This network was introduced independently by James Anderson
[Ande72] and Teuvo Kohonen [Koho72].)

Figure 7.1 Linear Associator (input p is R x 1, weight matrix W is S x R, output a = purelin(Wp))


The output vector a is determined from the input vector p according to:

a = Wp,   (7.1)

or

a_i = \sum_{j=1}^{R} w_{ij} p_j.   (7.2)

The linear associator is an example of a type of neural network called an associative memory.
The task of an associative memory is to learn Q pairs of prototype input/output vectors:

{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}.   (7.3)

In other words, if the network receives an input p = p_q then it should produce an output
a = t_q, for q = 1, 2, ..., Q. In addition, if the input is changed slightly (i.e., p = p_q + δ)
then the output should only be changed slightly (i.e., a = t_q + ε).


The Hebb Rule


How can we interpret Hebb’s postulate mathematically, so that we can use
it to train the weight matrix of the linear associator? First, let’s rephrase
the postulate: If two neurons on either side of a synapse are activated si-
multaneously, the strength of the synapse will increase. Notice from Eq.
(7.2) that the connection (synapse) between input p j and output a i is the
weight w ij . Therefore Hebb’s postulate would imply that if a positive p j
produces a positive a i then w ij should increase. This suggests that one
mathematical interpretation of the postulate could be

The Hebb Rule:

w_{ij}^{new} = w_{ij}^{old} + α f_i(a_{iq}) g_j(p_{jq}),   (7.4)

where p_{jq} is the jth element of the qth input vector p_q; a_{iq} is the ith element of the
network output when the qth input vector is presented to the network; and α is a positive
constant, called the learning rate. This equation says that the change in the weight w_{ij} is
proportional to a product of functions of the activities on either side of the synapse. For this
chapter we will simplify Eq. (7.4) to the following form

w_{ij}^{new} = w_{ij}^{old} + α a_{iq} p_{jq}.   (7.5)

Note that this expression actually extends Hebb’s postulate beyond its
strict interpretation. The change in the weight is proportional to a product
of the activity on either side of the synapse. Therefore, not only do we in-
crease the weight when both p j and a i are positive, but we also increase
the weight when they are both negative. In addition, this implementation
of the Hebb rule will decrease the weight whenever p j and a i have opposite
sign.
The Hebb rule defined in Eq. (7.5) is an unsupervised learning rule. It does
not require any information concerning the target output. In this chapter
we are interested in using the Hebb rule for supervised learning, in which
the target output is known for each input vector. (We will revisit the unsu-
pervised Hebb rule in Chapter 13.) For the supervised Hebb rule we substi-
tute the target output for the actual output. In this way, we are telling the
algorithm what the network should do, rather than what it is currently do-
ing. The resulting equation is

w_{ij}^{new} = w_{ij}^{old} + t_{iq} p_{jq},   (7.6)

where t_{iq} is the ith element of the qth target vector t_q. (We have set the learning rate α
to one, for simplicity.)
Notice that Eq. (7.6) can be written in vector notation:

W^{new} = W^{old} + t_q p_q^T.   (7.7)


If we assume that the weight matrix is initialized to zero and then each of the Q input/output
pairs are applied once to Eq. (7.7), we can write

W = t_1 p_1^T + t_2 p_2^T + ... + t_Q p_Q^T = \sum_{q=1}^{Q} t_q p_q^T.   (7.8)

This can be represented in matrix form:

W = [t_1 t_2 ... t_Q] [p_1^T; p_2^T; ...; p_Q^T] = T P^T,   (7.9)

where

T = [t_1 t_2 ... t_Q],   P = [p_1 p_2 ... p_Q].   (7.10)
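In code, the supervised Hebb rule of Eq. (7.8)/(7.9) is a single matrix product. The sketch below assumes NumPy, with the prototype inputs and targets stored as the columns of P and T; a numeric check using the orthonormal example below appears later in this section.

```python
import numpy as np

def hebb_weights(P, T):
    # Supervised Hebb rule with W initialized to zero and each prototype pair
    # applied once: W = sum_q t_q p_q^T = T P^T  (Eq. 7.8 / 7.9).
    # P holds the prototype inputs as columns, T holds the targets as columns.
    return T @ P.T
```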

Performance Analysis
Let's analyze the performance of Hebbian learning for the linear associator. First consider the
case where the p_q vectors are orthonormal (orthogonal and unit length). If p_k is input to the
network, then the network output can be computed

a = W p_k = (\sum_{q=1}^{Q} t_q p_q^T) p_k = \sum_{q=1}^{Q} t_q (p_q^T p_k).   (7.11)

Since the p_q are orthonormal,

p_q^T p_k = 1 for q = k, and p_q^T p_k = 0 for q ≠ k.   (7.12)

Therefore Eq. (7.11) can be rewritten

a = W p_k = t_k.   (7.13)

The output of the network is equal to the target output. This shows that, if
the input prototype vectors are orthonormal, the Hebb rule will produce the
correct output for each input.


But what about non-orthogonal prototype vectors? Let's assume that each p_q vector is unit
length, but that they are not orthogonal. Then Eq. (7.11) becomes

a = W p_k = t_k + \sum_{q ≠ k} t_q (p_q^T p_k).   (7.14)

Because the vectors are not orthogonal, the network will not produce the correct output. The
magnitude of the error will depend on the amount of correlation between the prototype input
patterns.
As an example, suppose that the prototype input/output vectors are

p_1 = [0.5, -0.5, 0.5, -0.5]^T, t_1 = [1, -1]^T;   p_2 = [0.5, 0.5, -0.5, -0.5]^T, t_2 = [1, 1]^T.   (7.15)

(Check that the two input vectors are orthonormal.)


The weight matrix would be

W = T P^T = [1 1; -1 1] [0.5 -0.5 0.5 -0.5; 0.5 0.5 -0.5 -0.5] = [1 0 0 -1; 0 1 -1 0].   (7.16)

If we test this weight matrix on the two prototype inputs we find

W p_1 = [1 0 0 -1; 0 1 -1 0] [0.5, -0.5, 0.5, -0.5]^T = [1, -1]^T,   (7.17)

and

W p_2 = [1 0 0 -1; 0 1 -1 0] [0.5, 0.5, -0.5, -0.5]^T = [1, 1]^T.   (7.18)

Success!! The outputs of the network are equal to the targets.
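A quick numeric check of this example, under the same assumptions as the earlier sketch (NumPy, prototypes stored as columns):

```python
import numpy as np

P = np.array([[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]])  # columns p1, p2
T = np.array([[1, 1], [-1, 1]])                                     # columns t1, t2

W = T @ P.T           # Hebb rule: [[1, 0, 0, -1], [0, 1, -1, 0]]  (Eq. 7.16)
print(W @ P[:, 0])    # [ 1. -1.]  = t1
print(W @ P[:, 1])    # [ 1.  1.]  = t2
```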


Now let's revisit the apple and orange recognition problem described in Chapter 3. Recall that
the prototype inputs were

p_1 = [1, -1, -1]^T (orange),   p_2 = [1, 1, -1]^T (apple).   (7.19)

(Note that they are not orthogonal.) If we normalize these inputs and choose as desired outputs
-1 and 1, we obtain

p_1 = [0.5774, -0.5774, -0.5774]^T, t_1 = -1;   p_2 = [0.5774, 0.5774, -0.5774]^T, t_2 = 1.   (7.20)

Our weight matrix becomes

W = T P^T = [-1 1] [0.5774 -0.5774 -0.5774; 0.5774 0.5774 -0.5774] = [0 1.1548 0].   (7.21)

So, if we use our two prototype patterns,

W p_1 = [0 1.1548 0] [0.5774, -0.5774, -0.5774]^T = -0.6668,   (7.22)

W p_2 = [0 1.1548 0] [0.5774, 0.5774, -0.5774]^T = 0.6668.   (7.23)

The outputs are close, but do not quite match the target outputs.

Pseudoinverse Rule
When the prototype input patterns are not orthogonal, the Hebb rule pro-
duces some errors. There are several procedures that can be used to reduce
these errors. In this section we will discuss one of those procedures, the
pseudoinverse rule.
Recall that the task of the linear associator was to produce an output of t_q for an input of
p_q. In other words,

W p_q = t_q,   q = 1, 2, ..., Q.   (7.24)


If it is not possible to choose a weight matrix so that these equations are exactly satisfied,
then we want them to be approximately satisfied. One approach would be to choose the weight
matrix to minimize the following performance index:

F(W) = \sum_{q=1}^{Q} ||t_q - W p_q||^2.   (7.25)

If the prototype input vectors p q are orthonormal and we use the Hebb rule
to find W, then F(W) will be zero. When the input vectors are not orthogo-
nal and we use the Hebb rule, then F(W) will be not be zero, and it is not
clear that F(W) will be minimized. It turns out that the weight matrix that
will minimize F(W) is obtained by using the pseudoinverse matrix, which
we will define next.
First, let's rewrite Eq. (7.24) in matrix form:

WP = T,   (7.26)

where

T = [t_1 t_2 ... t_Q],   P = [p_1 p_2 ... p_Q].   (7.27)

Then Eq. (7.25) can be written

F(W) = ||T - WP||^2 = ||E||^2,   (7.28)

where

E = T - WP,   (7.29)

and

||E||^2 = \sum_i \sum_j e_{ij}^2.   (7.30)

Note that F(W) can be made zero if we can solve Eq. (7.26). If the P matrix has an inverse, the
solution is

W = T P^{-1}.   (7.31)

However, this is rarely possible. Normally the p_q vectors (the columns of P) will be
independent, but R (the dimension of p_q) will be larger than Q (the number of p_q vectors).
Therefore, P will not be a square matrix, and no exact inverse will exist.


It has been shown [Albe72] that the weight matrix that minimizes Eq. (7.25) is given by the
pseudoinverse rule:

W = T P^+,   (7.32)

where P^+ is the Moore-Penrose pseudoinverse. The pseudoinverse of a real matrix P is the
unique matrix that satisfies

P P^+ P = P,
P^+ P P^+ = P^+,
P^+ P = (P^+ P)^T,
P P^+ = (P P^+)^T.   (7.33)

When the number, R, of rows of P is greater than the number of columns, Q, of P, and the
columns of P are independent, then the pseudoinverse can be computed by

P^+ = (P^T P)^{-1} P^T.   (7.34)
To test the pseudoinverse rule (Eq. (7.32)), consider again the apple and orange recognition
problem. Recall that the input/output prototype vectors are

p_1 = [1, -1, -1]^T, t_1 = -1;   p_2 = [1, 1, -1]^T, t_2 = 1.   (7.35)

(Note that we do not need to normalize the input vectors when using the pseudoinverse rule.)
The weight matrix is calculated from Eq. (7.32):

W = T P^+ = [-1 1] ([1 1; -1 1; -1 -1])^+,   (7.36)

where the pseudoinverse is computed from Eq. (7.34):

P^+ = (P^T P)^{-1} P^T = [3 1; 1 3]^{-1} [1 -1 -1; 1 1 -1] = [0.25 -0.5 -0.25; 0.25 0.5 -0.25].   (7.37)

7-9
7 Supervised Hebbian Learning

This produces the following weight matrix:

W = T P^+ = [-1 1] [0.25 -0.5 -0.25; 0.25 0.5 -0.25] = [0 1 0].   (7.38)

Let's try this matrix on our two prototype patterns.

W p_1 = [0 1 0] [1, -1, -1]^T = -1,   (7.39)

W p_2 = [0 1 0] [1, 1, -1]^T = 1.   (7.40)

The network outputs exactly match the desired outputs. Compare this re-
sult with the performance of the Hebb rule. As you can see from Eq. (7.22)
and Eq. (7.23), the Hebbian outputs are only close, while the pseudoinverse
rule produces exact results.
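As a sketch, the pseudoinverse rule can be reproduced with NumPy's built-in pseudoinverse; the numbers below are the apple and orange prototypes of Eq. (7.35).

```python
import numpy as np

P = np.array([[1, 1], [-1, 1], [-1, -1]])   # columns: orange p1, apple p2
T = np.array([[-1, 1]])                     # targets t1 = -1, t2 = 1

W = T @ np.linalg.pinv(P)          # pseudoinverse rule (Eq. 7.32): W = T P^+
print(np.round(W, 4))              # [[0. 1. 0.]]
print(W @ P[:, 0], W @ P[:, 1])    # [-1.]  [1.]  -- exact targets
```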

Application
Now let's see how we might use the Hebb rule on a practical, although greatly oversimplified,
pattern recognition problem. For this problem we will use a special type of associative memory,
the autoassociative memory. In an autoassociative memory the desired output vector is equal to
the input vector (i.e., t_q = p_q). We will use an autoassociative memory to store a set of
patterns and then to recall these patterns, even when corrupted patterns are provided as input.
The patterns we want to store are the digits {0, 1, 2}, each displayed on a 6x5 grid. (Since we
are designing an autoassociative memory, these patterns represent both the input vectors and
the targets.) We need to convert these digits to vectors, which will become the prototype
patterns for our network. Each white square will be represented by a "-1", and each dark square
will be represented by a "1". Then, to create the input vectors, we will scan each 6x5 grid one
column at a time. For example, the first prototype pattern will be

p_1 = [-1 1 1 1 1 -1 1 -1 -1 -1 -1 1 ... 1 -1]^T.   (7.41)

The vector p 1 corresponds to the digit “0”, p 2 to the digit “1”, and p 3 to the
digit “2”. Using the Hebb rule, the weight matrix is computed


W = p_1 p_1^T + p_2 p_2^T + p_3 p_3^T.   (7.42)

(Note that p_q replaces t_q in Eq. (7.8), since this is an autoassociative memory.)


Because there are only two allowable values for the elements of the proto-
type vectors, we will modify the linear associator so that its output ele-
ments can only take on values of “-1” or “1”. We can do this by replacing
the linear transfer function with a symmetrical hard limit transfer func-
tion. The resulting network is displayed in Figure 7.2.

Figure 7.2 Autoassociative Network for Digit Recognition (input p is 30 x 1, weight matrix W is 30 x 30, output a = hardlims(Wp))
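A minimal sketch of such an autoassociative recall, assuming NumPy; the helper name and the comment about occlusion are illustrative, and the actual 30-element digit patterns are not reproduced here.

```python
import numpy as np

def autoassociative_recall(patterns, probe):
    # patterns: list of +/-1 prototype vectors to store; W = sum_q p_q p_q^T (Eq. 7.42).
    W = sum(np.outer(p, p) for p in patterns)
    # Symmetrical hard limit transfer function: output elements forced to +1 or -1.
    return np.where(W @ probe >= 0, 1, -1)

# Hypothetical use: store the three 30-element digit patterns, then present an
# occluded probe (e.g. lower half set to -1) and check whether the stored
# prototype is recovered.
```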


Now let’s investigate the operation of this network. We will provide the net-
work with corrupted versions of the prototype patterns and then check the
network output. In the first test, which is shown in Figure 7.3, the network
is presented with a prototype pattern in which the lower half of the pattern
is occluded. In each case the correct pattern is produced by the network.

Figure 7.3 Recovery of 50% Occluded Patterns


In the next test we remove even more of the prototype patterns. Figure 7.4
illustrates the result of removing the lower two-thirds of each pattern. In
this case only the digit “1” is recovered correctly. The other two patterns
produce results that do not correspond to any of the prototype patterns.
This is a common problem in associative memories. We would like to design
networks so that the number of such spurious patterns would be mini-
mized. We will come back to this topic again in Chapter 18, when we dis-
cuss recurrent associative memories.


Figure 7.4 Recovery of 67% Occluded Patterns


In our final test we will present the autoassociative network with noisy ver-
sions of the prototype pattern. To create the noisy patterns we will random-
ly change seven elements of each pattern. The results are shown in Figure
7.5. For these examples all of the patterns were correctly recovered.

Figure 7.5 Recovery of Noisy Patterns


To experiment with this type of pattern recognition problem, use the Neural
Network Design Demonstration Supervised Hebb (nnd7sh).

Variations of Hebbian Learning


There have been a number of variations on the basic Hebb rule. In fact,
many of the learning laws that will be discussed in the remainder of this
text have some relationship to the Hebb rule.
One of the problems of the Hebb rule is that it can lead to weight matrices
having very large elements if there are many prototype patterns in the
training set. Consider again the basic rule:
W^{new} = W^{old} + t_q p_q^T.   (7.43)

A positive parameter α, called the learning rate, can be used to limit the amount of increase
in the weight matrix elements, if the learning rate is less than one, as in:

W^{new} = W^{old} + α t_q p_q^T.   (7.44)

We can also add a decay term, so that the learning rule behaves like a smoothing filter,
remembering the most recent inputs more clearly:

W^{new} = W^{old} + α t_q p_q^T - γ W^{old} = (1 - γ) W^{old} + α t_q p_q^T,   (7.45)

where γ is a positive constant less than one. As γ approaches zero, the learning law becomes
the standard rule. As γ approaches one, the learning


law quickly forgets old inputs and remembers only the most recent pat-
terns. This keeps the weight matrix from growing without bound.
The idea of filtering the weight changes and of having an adjustable learn-
ing rate are important ones, and we will discuss them again in Chapters
10, 12, 15, 16, 18 and 19.
If we modify Eq. (7.44) by replacing the desired output with the difference between the desired
output and the actual output, we get another important learning rule:

W^{new} = W^{old} + α (t_q - a_q) p_q^T.   (7.46)

This is sometimes known as the delta rule, since it uses the difference be-
tween desired and actual output. It is also known as the Widrow-Hoff algo-
rithm, after the researchers who introduced it. The delta rule adjusts the
weights so as to minimize the mean square error (see Chapter 10). For this
reason it will produce the same results as the pseudoinverse rule, which
minimizes the sum of squares of errors (see Eq. (7.25)). The advantage of
the delta rule is that it can update the weights after each new input pattern
is presented, whereas the pseudoinverse rule computes the weights in one
step, after all of the input/target pairs are known. This sequential updating
allows the delta rule to adapt to a changing environment. The delta rule
will be discussed in detail in Chapter 10.
The basic Hebb rule will be discussed again, in a different context, in Chap-
ter 13. In the present chapter we have used a supervised form of the Hebb
rule. We have assumed that the desired outputs of the network, t_q, are known, and can be used
in the learning rule. In the unsupervised Hebb rule, which is discussed in Chapter 13, the
actual network output is used instead of the desired network output, as in:

W^{new} = W^{old} + α a_q p_q^T,   (7.47)

where a q is the output of the network when p q is given as the input (see
also Eq. (7.5)). This unsupervised form of the Hebb rule, which does not re-
quire knowledge of the desired output, is actually a more direct interpreta-
tion of Hebb’s postulate than is the supervised form discussed in this
chapter.
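A brief sketch of the filtered Hebb rule and the delta rule, assuming NumPy and illustrative values for the learning rate and decay constant:

```python
import numpy as np

def hebb_with_decay(W, t, p, alpha=0.1, gamma=0.05):
    # Filtered Hebb rule (Eq. 7.45): old weights decay, recent patterns dominate.
    return (1 - gamma) * W + alpha * np.outer(t, p)

def delta_rule(W, t, p, alpha=0.1):
    # Delta (Widrow-Hoff) rule (Eq. 7.46): update driven by the output error t - a.
    a = W @ p
    return W + alpha * np.outer(t - a, p)
```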

11 Backpropagation

As shown earlier in this chapter, the response of this network is a superposition of S¹ sigmoid
functions.
Figure 11.11 illustrates the network response as the number of neurons in the first layer
(hidden layer) is increased. Unless there are at least five neurons in the hidden layer, the
network cannot accurately represent g(p).

(Panels show the 1-2-1, 1-3-1, 1-4-1, and 1-5-1 network responses.)

Figure 11.11 Effect of Increasing the Number of Hidden Neurons


To summarize these results, a 1-S¹-1 network, with sigmoid neurons in the hidden layer and
linear neurons in the output layer, can produce a response that is a superposition of S¹
sigmoid functions. If we want to approximate a function that has a large number of inflection
points, we will need to have a large number of neurons in the hidden layer.
Use the MATLAB® Neural Network Design Demonstration Function Ap-
proximation (nnd11fa) to develop more insight into the capability of a two-
layer network.

Convergence
In the previous section we presented some examples in which the network
response did not give an accurate approximation to the desired function,
even though the backpropagation algorithm produced network parameters
that minimized mean square error. This occurred because the capabilities
of the network were inherently limited by the number of hidden neurons it
contained. In this section we will provide an example in which the network
is capable of approximating the function, but the learning algorithm does
not produce network parameters that produce an accurate approximation.
In the next chapter we will discuss this problem in more detail and explain
why it occurs. For now we simply want to illustrate the problem.


The function that we want the network to approximate is

g(p) = 1 + sin(πp) for -2 ≤ p ≤ 2.   (11.55)

To approximate this function we will use a 1-3-1 network, where the trans-
fer function for the first layer is log-sigmoid and the transfer function for
the second layer is linear.
Figure 11.12 illustrates a case where the learning algorithm converges to
a solution that minimizes mean square error. The thin blue lines represent
intermediate iterations, and the thick blue line represents the final solu-
tion, when the algorithm has converged. (The numbers next to each curve
indicate the sequence of iterations, where 0 represents the initial condition
and 5 represents the final solution. The numbers do not correspond to the
iteration number. There were many iterations for which no curve is repre-
sented. The numbers simply indicate an ordering.)

Figure 11.12 Convergence to a Global Minimum


Figure 11.13 illustrates a case where the learning algorithm converges to
a solution that does not minimize mean square error. The thick blue line
(marked with a 5) represents the network response at the final iteration.
The gradient of the mean square error is zero at the final iteration, therefore we have a local
minimum, but we know that a better solution exists, as
evidenced by Figure 11.12. The only difference between this result and the
result shown in Figure 11.12 is the initial condition. From one initial con-
dition the algorithm converged to a global minimum point, while from an-
other initial condition the algorithm converged to a local minimum point.



Figure 11.13 Convergence to a Local Minimum


Note that this result could not have occurred with the LMS algorithm. The
mean square error performance index for the ADALINE network is a qua-
dratic function with a single minimum point (under most conditions).
Therefore the LMS algorithm is guaranteed to converge to the global min-
imum as long as the learning rate is small enough. The mean square error
for the multilayer network is generally much more complex and has many
local minima (as we will see in the next chapter). When the backpropaga-
tion algorithm converges we cannot be sure that we have an optimum so-
lution. It is best to try several different initial conditions in order to ensure
that an optimum solution has been obtained.

Generalization
In most cases the multilayer network is trained with a finite number of ex-
amples of proper network behavior:

{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}.   (11.56)

This training set is normally representative of a much larger class of pos-


sible input/output pairs. It is important that the network successfully gen-
eralize what it has learned to the total population.
For example, suppose that the training set is obtained by sampling the following function:

g(p) = 1 + sin((π/4) p),   (11.57)

at the points p = -2, -1.6, -1.2, ..., 1.6, 2. (There are a total of 11 input/target pairs.)
In Figure 11.14 we see the response of a 1-2-1 network that has


been trained on this data. The black line represents g(p), the blue line represents the network
response, and the '+' symbols indicate the training set.

Figure 11.14 1-2-1 Network Approximation of g(p)

We can see that the network response is an accurate representation of g(p). If we were to find
the response of the network at a value of p that was not contained in the training set (e.g.,
p = -0.2), the network would still produce an output close to g(p). This network generalizes
well.
Now consider Figure 11.15, which shows the response of a 1-9-1 network that has been trained on
the same data set. Note that the network response accurately models g(p) at all of the training
points. However, if we compute the network response at a value of p not contained in the
training set (e.g., p = -0.2) the network might produce an output far from the true response
g(p). This network does not generalize well.
Figure 11.15 1-9-1 Network Approximation of g(p)

The 1-9-1 network has too much flexibility for this problem; it has a total of 28 adjustable
parameters (18 weights and 10 biases), and yet there are only 11 data points in the training
set. The 1-2-1 network has only 7 parameters and is therefore much more restricted in the types
of functions that it can implement.
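The parameter counts quoted above can be checked with a small calculation (a sketch; the formula follows directly from the 1-S¹-1 architecture, counting first-layer weights and biases plus second-layer weights and the output bias).

```python
def n_params(s1):
    # 1-S1-1 network: first layer has s1 weights + s1 biases,
    # second layer has s1 weights + 1 bias, i.e. 3*s1 + 1 in total.
    return (1 * s1 + s1) + (s1 * 1 + 1)

print(n_params(2))   # 7  (the 1-2-1 network)
print(n_params(9))   # 28 (the 1-9-1 network)
```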
For a network to be able to generalize, it should have fewer parameters than
there are data points in the training set. In neural networks, as in all mod-
eling problems, we want to use the simplest network that can adequately
represent the training set. Don’t use a bigger network when a smaller net-
work will work (a concept often referred to as Ockham’s Razor).
An alternative to using the simplest network is to stop the training before
the network overfits. A reference to this procedure and other techniques to
improve generalization are given in Chapter 13.
To experiment with generalization in neural networks, use the MATLAB®
Neural Network Design Demonstration Generalization (nnd11gn).

13 Generalization

Objectives
Theory and Examples
Problem Statement
Methods for Improving Generalization
Estimating Generalization Error - The Test Set
Early Stopping
Regularization
Bayesian Analysis
Bayesian Regularization
Relationship Between Early Stopping and Regularization
Summary of Results
Solved Problems
Epilogue
Further Reading
Exercises

Objectives
One of the key issues in designing a multilayer network is determining the
number of neurons to use. In effect, that is the objective of this chapter.
In Chapter 11 we showed that if the number of neurons is too large, the net-
work will overfit the training data. This means that the error on the train-
ing data will be very small, but the network will fail to perform as well
when presented with new data. A network that generalizes well will per-
form as well on new data as it does on the training data.
The complexity of a neural network is determined by the number of free pa-
rameters that it has (weights and biases), which in turn is determined by
the number of neurons. If a network is too complex for a given data set,
then it is likely to overfit and to have poor generalization.
In this chapter we will see that we can adjust the complexity of a network
to fit the complexity of the data. In addition, this can be done without
changing the number of neurons. We can adjust the effective number of
free parameters without changing the actual number of free parameters.


Theory and Examples


Mark Twain once said “We should be careful to get out of an experience
only the wisdom that is in it-and stop there; lest we be like the cat that sits
down on a hot stove-lid. She will never sit down on a hot stove-lid again-
and that is well; but also she will never sit down on a cold one any more.”
(From Following the Equator, 1897.)
That is the objective of this chapter. We want to train neural networks to
Generalization get out of the data only the wisdom that is in it. This concept is called gen-
eralization. A network trained to generalize will perform as well in new sit-
uations as it does on the data on which it was trained.
The key strategy we will use for obtaining good generalization is to find the
simplest model that explains the data. This is a variation of a principle
Ockham’s Razor called Ockham’s razor, which is named after the English logician William
of Ockham, who worked in the 14th Century. The idea is that the more
complexity you have in your model, the greater the possibility for errors.
In terms of neural networks, the simplest model is the one that contains
the smallest number of free parameters (weights and biases), or, equiva-
lently, the smallest number of neurons. To find a network that generalizes
well, we need to find the simplest network that fits the data.
There are at least five different approaches that people have used to pro-
duce simple networks: growing, pruning, global searches, regularization,
and early stopping. Growing methods start with no neurons in the network
and then add neurons until the performance is adequate. Pruning methods
start with large networks, which likely overfit, and then remove neurons
(or weights) one at a time until the performance degrades significantly.
Global searches, such as genetic algorithms, search the space of all possible
network architectures to locate the simplest model that explains the data.
The final two approaches, regularization and early stopping, keep the net-
work small by constraining the magnitude of the network weights, rather
than by constraining the number of network weights. In this chapter we
will concentrate on these two approaches. We will begin by defining the
problem of generalization and by showing examples of both good and poor
generalization. We will then describe the regularization and early stopping
methods for training neural networks. Finally, we will demonstrate how
these two methods are, in effect, performing the same operation.

Problem Statement
Let’s begin our discussion of generalization by defining the problem. We
start with a training set of example network inputs and corresponding tar-
get outputs:

{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}.   (13.1)

For our development of the concept of generalization, we will assume that the target outputs
are generated by

t_q = g(p_q) + ε_q,   (13.2)

where g(.) is some unknown function, and ε_q is a random, independent and zero-mean noise
source. Our training objective will be to produce a neural network that approximates g(.),
while ignoring the noise.
The standard performance index for neural network training is the sum squared error on the
training set:

F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q),   (13.3)

where a_q is the network output for input p_q. We are using the variable E_D to represent the
sum squared error on the training data, because later we will modify the performance index to
include an additional term.
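As a small sketch of Eq. (13.2) and Eq. (13.3), the code below generates noisy targets from an assumed underlying function and computes the sum squared error E_D of a network's outputs; the choice of g, noise level, and sample points is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.linspace(-1, 1, 21)
g = lambda p: 1 + np.sin(np.pi * p / 4)        # an assumed underlying function g(.)
t = g(p) + 0.1 * rng.standard_normal(p.size)   # targets t_q = g(p_q) + eps_q  (Eq. 13.2)

def sum_squared_error(t, a):
    # E_D of Eq. (13.3): sum over the training set of (t_q - a_q)^T (t_q - a_q).
    return np.sum((t - a) ** 2)
```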
Overfitting The problem of overfitting is illustrated in Figure 13.1. The blue curve rep-
resents the function g(.). The large open circles represent the noisy target
points. The black curve represents the trained network response, and the
smaller circles filled with crosses represent the network response at the
training points. In this figure we can see that the network response exactly
matches the training points. However, it does a very poor job of matching
the underlying function. It overfits.
There are actually two kinds of errors that occur in Figure 13.1. The first
type of error, which is caused by overfitting, occurs for input values be-
tween -3 and 0. This is the region where all of the training data points occur. The network
response in this region overfits the training data and
will fail to perform well for input values that are not in the training set. The
Interpolation network does a poor job of interpolation; it fails to accurately approximate
the function near the training points.
The second type of error occurs for inputs in the region between 0 and 3.
The network fails to perform well in this region, not because it is overfit-
Extrapolation ting, but because there is no training data there. The network is extrapo-
lating beyond the range of the input data.
In this chapter we will discuss methods for preventing errors of interpola-
tion (overfitting). There is no way to prevent errors of extrapolation, unless
the data that is used to train the network covers all regions of the input
space where the network will be used. The network has no way of knowing
what the true function looks like in regions where there is no data.



Figure 13.1 Example of Overfitting and Poor Extrapolation


In Figure 13.2 we have an example of a network that has been trained to
generalize well. The network has the same number of weights as the net-
work of Figure 13.1, and it was trained using the same data set, but it has
been trained in such a way that it does not fully use all of the weights that
are available. It only uses as many weights as necessary to fit the data. The
network response does not fit the function perfectly, but it does the best job
it can, based on limited and noisy data.


Figure 13.2 Example of Good Interpolation and Poor Extrapolation


In both Figure 13.1 and Figure 13.2 we can see that the network fails to ex-
trapolate accurately. This is understandable, since the network has been
provided with no information about the characteristics of the function out-


side of the range -3 ≤ p ≤ 0. The network response outside this range will
be unpredictable. This is why it is important to have training data for all
regions of the input space where the network will be used. It is usually not
difficult to determine the required input range when the network has a sin-
gle input, as in this example. However, when the network has many inputs,
it becomes more difficult to determine when the network is interpolating and when it is
extrapolating.
This problem is illustrated in a simple way in Figure 13.3. On the left side
of this figure we see the function that is to be approximated. The range for
the input variables is – 3  p 1  3 and – 3  p 2  3 . The neural network was
trained over these ranges of the two variables, but only for p 1  p 2 . There-
fore, both p 1 and p 2 cover their individual ranges, but only half of the total
input space is covered. When p 1  p 2 , the network is extrapolating, and we
can see on the right side of Figure 13.3 that the network performs poorly in
this region. (See Problem P13.4 for another example of extrapolation.) If
there are many input variables, it will be quite difficult to determine when
the network is interpolating and when it is extrapolating. We will discuss
some practical ways of dealing with this problem in Chapter 22.


Figure 13.3 Function (a) and Neural Network Approximation (b)

Methods for Improving Generalization


The remainder of this chapter will discuss methods for improving the gen-
eralization capability of neural networks. As we discussed earlier, there are
a number of approaches to this problem - all of which try to find the sim-
plest network that will fit the data. These approaches fit into two general
categories: restricting the number of weights (or, equivalently, the number
of neurons) in the network, or restricting the magnitude of the weights. We
will concentrate on two methods that we have found to be particularly use-
ful: early stopping and regularization. Both of these approaches attempt to
restrict the magnitude of the weights, although they do so in very different


ways. At the end of this chapter, we will demonstrate the approximate


equivalence of the two methods.
We should note that in this chapter we are assuming that there is a limited
amount of data with which to train the network. If the amount of data is
unlimited, which in practical terms means that the number of data points
is significantly larger than the number of network parameters, then there
will not be a problem of overfitting.

Estimating Generalization Error - The Test Set


Before we discuss methods for improving the generalization capability of
neural networks, we should first discuss how we can estimate this error for
a specific neural network. Given a limited amount of available data, it is
important to hold aside a certain subset during the training process. After
the network has been trained, we will compute the errors that the trained
Test Set network makes on this test set. The test set errors will then give us an in-
dication of how the network will perform in the future; they are a measure
of the generalization capability of the network.
In order for the test set to be a valid indicator of generalization capability,
there are two important things to keep in mind. First, the test set must
never be used in any way to train the neural network, or even to select one
network from a group of candidate networks. The test set should only be
used after all training and selection is complete. Second, the test set must
be representative of all situations for which the network will be used. This
can sometimes be difficult to guarantee, especially when the input space is
high-dimensional or has a complex shape. We will discuss this problem in
more detail in Chapter 22, Practical Training Issues.
In the remaining sections of this chapter, we will assume that a test set has
been removed from the data set before training begins, and that this set
will be used at the completion of training to measure generalization capa-
bility.

Early Stopping
The first method we will discuss for improving generalization is also the
simplest method. It is called early stopping [WaVe94]. The idea behind this
method is that as training progresses the network uses more and more of
its weights, until all weights are fully used when training reaches a mini-
mum of the error surface. By increasing the number of iterations of train-
ing, we are increasing the complexity of the resulting network. If training
is stopped before the minimum is reached, then the network will effectively
be using fewer parameters and will be less likely to overfit. In a later sec-
tion of this chapter we will demonstrate how the number of parameters
changes as the number of iterations increases.
In order to use early stopping effectively, we need to know when to stop the
Cross-Validation training. We will describe a method, called cross-validation, that uses a

Validation Set validation set to decide when to stop [Sarl95]. The available data (after re-
moving the test set, as described above) is divided into two parts: a training
set and a validation set. The training set is used to compute gradients or
Jacobians and to determine the weight update at each iteration. The vali-
dation set is an indicator of what is happening to the network function "in between" the
training points, and its error is monitored during the training process. When the error on the
validation set goes up for several iterations,
the training is stopped, and the weights that produced the minimum error
on the validation set are used as the final trained network weights.
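A minimal sketch of this early-stopping loop is shown below; train_step and val_error are assumed callables standing in for one training iteration and the validation-set error, and the patience value is an illustrative choice.

```python
import numpy as np

def train_with_early_stopping(train_step, val_error, w0, max_iters=1000, patience=10):
    # train_step(w) returns updated weights; val_error(w) returns validation error.
    w, best_w = w0, w0
    best_err, bad_iters = np.inf, 0
    for _ in range(max_iters):
        w = train_step(w)
        err = val_error(w)
        if err < best_err:                 # validation error still improving
            best_err, best_w, bad_iters = err, w, 0
        else:
            bad_iters += 1
            if bad_iters >= patience:      # error rose for several iterations: stop
                break
    return best_w                          # weights at the validation minimum
```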
This process is illustrated in Figure 13.4. The graph at the bottom of this
figure shows the progress of the training and validation performance indi-
ces, F (the sum squared errors), during training. Although the training er-
ror continues to go down throughout the training process, a minimum of
the validation error occurs at the point labeled “a,” which corresponds to
training iteration 14. The graph at the upper left shows the network re-
sponse at this early stopping point. The resulting network provides a good
fit to the true function. The graph at the upper right demonstrates the net-
work response if we continue to train to point “b,” where the validation er-
ror has increased and the network is overfitting.

(Top: network response at early stopping point "a" and at the later point "b". Bottom: training
and validation error F versus iteration.)

Figure 13.4 Illustration of Early Stopping

The basic concept for early stopping is simple, but there are several practi-
cal issues to be addressed. First, the validation set must be chosen so that
it is representative of all situations for which the network will be used. This
is also true for the test and training sets, as we mentioned earlier. Each set
must be roughly equivalent in its coverage of the input space, although the
size of each set may be different.
When we divide the data, approximately 70% is typically used for training,
with 15% for validation and 15% for testing. These are only approximate
numbers. A complete discussion of how to select the amount of data for the
validation set is given in [AmMu97].
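A small sketch of such a split, with the 70/15/15 proportions hard-coded for illustration:

```python
import numpy as np

def split_data(X, T, seed=0):
    # Roughly 70% training, 15% validation, 15% test.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.7 * len(X)), int(0.15 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], T[tr]), (X[va], T[va]), (X[te], T[te])
```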
Another practical point to be made about early stopping is that we should
use a relatively slow training method. During training, the network will
use more and more of the available network parameters (as we will explain
in the last section of this chapter). If the training method is too fast, it will
likely jump past the point at which the validation error is minimized.
To experiment with the effect of early stopping, use the MATLAB® Neural
Network Design Demonstration Early Stopping (nnd13es).

Regularization
The second method we will discuss for improving generalization is called
regularization. For this method, we modify the sum squared error perfor-
mance index of Eq. (13.3) to include a term that penalizes network com-
plexity. This concept was introduced by Tikhonov [Tikh63]. He added a
penalty, or regularization, term that involved the derivatives of the approx-
imating function (neural network in our case), which forced the resulting
function to be smooth. Under certain conditions, this regularization term
can be written as the sum of squares of the network weights, as in
F(x) = β E_D + α E_W = β \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + α \sum_{i=1}^{n} x_i^2,   (13.4)

where the ratio α/β controls the effective complexity of the network solution. The larger this
ratio is, the smoother the network response. (Note that we could have used a single parameter
here, but developments in later sections will require two parameters.)
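A direct sketch of Eq. (13.4), assuming NumPy arrays for the targets, the network outputs, and the parameter vector x (weights and biases); the default values of α and β are illustrative.

```python
import numpy as np

def regularized_index(t, a, x, alpha=0.01, beta=1.0):
    # F(x) = beta * E_D + alpha * E_W  (Eq. 13.4)
    E_D = np.sum((t - a) ** 2)     # sum squared error on the training set
    E_W = np.sum(x ** 2)           # sum of squares of the weights and biases
    return beta * E_D + alpha * E_W
```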
Why do we want to penalize the sum squared weights, and how is this sim-
ilar to reducing the number of neurons? Consider again the example mul-
tilayer network shown in Figure 11.4. Recall how increasing a weight
increased the slope of the network function. You can see this effect again in
2
Figure 13.5, where we have changed the weight w 1 1 from 0 to 2. When the
weights are large, the function created by the network can have large
slopes, and is therefore more likely to overfit the training data. If we re-
strict the weights to be small, then the network function will create a

13-8
Methods for Improving Generalization

smooth interpolation through the training data - just as if the network had
a small number of neurons.

(The figure shows the network response for w_{1,1}^2 = 0 and for w_{1,1}^2 = 2.)

Figure 13.5 Effect of Weight on Network Response


To experiment with the effect of weight changes on the network function, use
the MATLAB® Neural Network Design Demonstration Network Function
(nnd11nf).
The key to the success of the regularization method in producing a network that generalizes
well is the correct choice of the regularization ratio α/β. Figure 13.6 illustrates the effect
of changing this ratio. Here we have trained a 1-20-1 network on 21 noisy samples of a sine
wave.
In the figure, the blue line represents the true function, and the large open circles represent
the noisy data. The black curve represents the trained network response, and the smaller
circles filled with crosses represent the network response at the training points. From the
figure, we can see that the ratio α/β = 0.01 produces the best fit to the true function. For
ratios larger than this, the network response is too smooth, and for ratios smaller than this,
the network overfits.
There are several techniques for setting the regularization parameter. One
approach is to use a validation set, such as we described in the section on
early stopping; the regularization parameter is set to minimize the squared
error on the validation set [GoLa98]. In the next two sections we will de-
scribe a different technique for automatically setting the regularization pa-
rameter. It is called Bayesian regularization.


(Panels show the trained network response for α/β = 0, α/β = 0.01, α/β = 0.25, and α/β = 1.)

Figure 13.6 Effect of Regularization Ratio


To experiment with the effect of regularization, use the MATLAB® Neural
Network Design Demonstration Regularization (nnd13reg).

Bayesian Analysis
Thomas Bayes was a Presbyterian minister who lived in England during
the 1700’s. He was also an amateur mathematician. His most important
work was published after his death. In it, he presented what is now known
as Bayes’ Theorem. The theorem states that if you have two random
events, A and B , then the conditional probability of the occurrence of A ,
given the occurrence of B can be computed as

P  B A P  A 
P  A B  = ------------------------------- . (13.5)
P B

Eq. (13.5) is called Bayes’ rule. Each of the terms in this expression has a
name by which it is commonly referred. P  A  is called the prior probability.
It tells us what we know about A before we know the outcome of B . P  A B 
is called the posterior probability. This tells us what we know about A after
we learn about B . P  B A  is the conditional probability of B given A . Nor-
mally this term is given by our knowledge of the system that describes the
relationship between B and A . P  B  is the marginal probability of the
event B , and it acts as a normalization factor in Bayes’ rule.
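A tiny numeric illustration of Bayes' rule, with made-up probabilities:

```python
# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2.
P_A = 0.3
P_B_given_A = 0.8
P_B_given_notA = 0.2

P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)   # marginal probability of B
P_A_given_B = P_B_given_A * P_A / P_B                  # Bayes' rule (Eq. 13.5)
print(P_A_given_B)   # 0.24 / 0.38 ≈ 0.632
```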
