UNIT 3
Gradient Descent is a widely used optimization algorithm for machine learning models. Several optimization techniques can be used to improve its performance. Here are some of the most popular ones:
Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent
algorithm. Learning Rate Scheduling involves changing the learning rate during the training
process, such as decreasing the learning rate as the number of iterations increases. This
technique helps the algorithm to converge faster and avoid overshooting the minimum.
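A minimal sketch of one such schedule (step decay); this is not from the original notes, and the decay factor and step interval are illustrative choices.

```python
import numpy as np

def step_decay_lr(initial_lr, iteration, drop=0.5, iterations_per_drop=100):
    """Halve the learning rate every `iterations_per_drop` iterations (illustrative values)."""
    return initial_lr * drop ** (iteration // iterations_per_drop)

# The step size shrinks as training progresses, which helps avoid
# overshooting the minimum near convergence.
for it in (0, 100, 200, 300):
    print(it, step_decay_lr(0.1, it))
```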
Momentum-based Updates: The Momentum-based Gradient Descent technique involves
adding a fraction of the previous update to the current update. This technique helps the
algorithm to overcome local minima and accelerates convergence.
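A short sketch (not from the original notes) of a momentum update, where the new step keeps a fraction of the previous update; the learning rate and momentum coefficient are illustrative.

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    """One momentum update: keep a fraction `beta` of the previous update
    and add the current (negative) gradient step."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Example: two consecutive updates on f(w) = w^2 (gradient 2w).
w, v = 5.0, 0.0
for _ in range(2):
    w, v = momentum_step(w, v, grad=2 * w)
print(w, v)
```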
Batch Normalization: Batch Normalization is a technique used to normalize the inputs to
each layer of the neural network. This helps the Gradient Descent algorithm to converge faster
and avoid vanishing or exploding gradients.
Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term
to the cost function proportional to the magnitude of the weights. This helps to prevent
overfitting and improve the generalization of the model.
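A small sketch (an assumption of mine, not from the notes) of how the weight-decay penalty enters the gradient: penalizing (lam/2)*||w||^2 simply adds lam*w to the plain loss gradient.

```python
import numpy as np

def weight_decay_gradient(grad, w, lam=1e-4):
    """Gradient of loss + (lam/2) * ||w||^2: weight decay adds lam * w
    to the plain loss gradient, shrinking large weights at each step."""
    return grad + lam * np.asarray(w)

# The update pulls weights toward zero in addition to following the loss gradient.
print(weight_decay_gradient(grad=np.array([0.2, -0.1]), w=np.array([3.0, -4.0]), lam=0.01))
```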
Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning
rate adaptively during the training process. Examples include Adagrad, RMSprop, and Adam.
These techniques adjust the learning rate based on the historical gradient information, which
can improve the convergence speed and accuracy of the algorithm.
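As an illustration of an adaptive method, here is a minimal Adam update sketch (not from the original notes); the hyperparameter values are the commonly used defaults, given here only as an assumption.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates with bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Example: a few Adam steps on f(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, grad=2 * (w - 3), m=m, v=v, t=t, lr=0.1)
print(w)  # approaches 3
```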
Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost
function to update the parameters. Examples include Newton’s Method and Quasi-Newton
Methods. These methods can converge faster than Gradient Descent, but require more
computation and may be less stable.
Gradient Descent is an iterative optimization algorithm used to find the minimum value of a function. The general idea is to initialize the parameters to random values and then take small steps down the "slope" at each iteration. Gradient descent is widely used in supervised learning to minimize the error function and find the optimal values for the parameters. Various extensions of the gradient descent algorithm have been designed; some of them are discussed below:
Momentum method: This method is used to accelerate the gradient descent algorithm by
taking into consideration the exponentially weighted average of the gradients.
RMSprop: RMSprop was proposed by Geoffrey Hinton of the University of Toronto. The intuition is to apply an exponentially weighted average to the second moment of the gradients (dW²).
The goal of gradient descent is to minimise a given function, which in our case is the loss function of the neural network. To achieve this, it performs two steps iteratively:
1. Compute the slope (gradient), that is, the first-order derivative of the function at the current point
2. Move from the current point in the direction opposite to the slope, by an amount proportional to the gradient (scaled by the learning rate)
So, the idea is to pass the training set through the hidden layers of the neural network and then
update the parameters of the layers by computing the gradients using the training samples from
the training dataset.
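A minimal sketch of this two-step loop for a generic differentiable function; `grad_loss` is a placeholder of mine for the gradient of whichever loss is being minimized, not something defined in the notes.

```python
import numpy as np

def gradient_descent(grad_loss, w0, lr=0.1, n_iters=100):
    """Repeatedly (1) compute the gradient at the current point and
    (2) move in the opposite direction by lr times that gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad_loss(w)   # step 1: slope at the current point
        w = w - lr * g     # step 2: move against the slope
    return w

# Example: minimizing f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
print(gradient_descent(lambda w: 2 * (w - 3), w0=[0.0]))  # converges toward 3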
Think of it like this: suppose a man is at the top of a valley and wants to get to the bottom. He walks down the slope, deciding his next position based on his current position, and stops when he reaches the bottom of the valley, which was his goal.
In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So that’s just one step of gradient descent in one epoch.
Batch Gradient Descent is great for convex or relatively smooth error manifolds. In this case, we
move somewhat directly towards an optimum solution.
Stochastic Gradient Descent
In Batch Gradient Descent we considered all the examples for every step of Gradient Descent. But what if our dataset is very large? Deep learning models crave data: the more data, the better the chance of a good model. Suppose our dataset has 5 million examples; then just to take one step, the model has to calculate the gradients of all 5 million examples. This is not efficient. To tackle this problem we have Stochastic Gradient Descent.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step. We do the following steps in one epoch for SGD:
1. Take a single training example
2. Feed it to the network and compute the gradient of the loss for that example
3. Use that gradient to update the parameters
4. Repeat steps 1–3 for every example in the training set
Since we are considering just one example at a time, the cost will fluctuate over the training examples and will not necessarily decrease at every step. But in the long run you will see the cost decreasing, with fluctuations.
We have seen Batch Gradient Descent and Stochastic Gradient Descent. Batch Gradient Descent works well for smoother error curves and moves fairly directly towards a minimum; SGD is preferable when the dataset is large and converges faster in that setting. But since SGD uses only one example at a time, we cannot exploit a vectorized implementation, which can slow down the computations. To tackle this problem, a mixture of Batch Gradient Descent and SGD is used.
We neither use the whole dataset at once nor a single example at a time. Instead, we use a batch of a fixed number of training examples, smaller than the actual dataset, and call it a mini-batch. Doing this helps us achieve the advantages of both of the former variants. So, after creating the mini-batches of fixed size, we do the following steps in one epoch (a code sketch follows the list):
1. Pick a mini-batch
2. Feed it to the network and compute the mean gradient over the mini-batch
3. Use that mean gradient to update the parameters
4. Repeat steps 1–3 for all the mini-batches we created
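A sketch of mini-batch gradient descent (not from the original notes); the linear-regression example and all parameter values are illustrative. Setting `batch_size=1` recovers SGD and `batch_size=len(X)` recovers Batch Gradient Descent.

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, w0, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent: shuffle, split into mini-batches,
    and update with the mean gradient of each mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # pick a mini-batch
            g = grad_fn(w, X[idx], y[idx])         # mean gradient over the mini-batch
            w = w - lr * g                         # update the parameters
    return w

# Example: linear regression with a mean-squared-error gradient.
X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5])
grad_mse = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
print(minibatch_gd(X, y, grad_mse, w0=np.zeros(3)))
```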
The concept of a hypothesis is fundamental in Machine Learning and data science endeavours.
In the realm of machine learning, a hypothesis serves as an initial assumption made by data
scientists and ML professionals when attempting to address a problem. Machine learning
involves conducting experiments based on past experiences, and these hypotheses are crucial
in formulating potential solutions. It’s important to note that in machine learning discussions,
the terms “hypothesis” and “model” are sometimes used interchangeably. However, a
hypothesis represents an assumption, while a model is a mathematical representation employed
to test that hypothesis. This section on “Hypothesis in Machine Learning” explores key aspects
related to hypotheses in machine learning and their significance.
INDUCTIVE BIAS
Inductive bias can be defined as the set of assumptions or biases that a learning algorithm
employs to make predictions on unseen data based on its training data. These assumptions are
inherent in the algorithm’s design and serve as a foundation for learning and generalization.
The inductive bias of an algorithm influences how it selects a hypothesis (a possible explanation or model) from the hypothesis space (the set of all possible hypotheses) that best fits the training data. It helps the algorithm navigate the trade-off between fitting the training data too closely (overfitting) and adopting assumptions too restrictive to capture the data (underfitting), so that it can generalize well to unseen data.
Importance of Inductive Bias
Inductive bias is crucial in machine learning as it helps algorithms generalize from limited
training data to unseen data. Without a well-defined inductive bias, algorithms may struggle to
make accurate predictions or may overfit the training data, leading to poor performance on new
data.
Understanding the inductive bias of an algorithm is essential for model selection, as different
biases may be more suitable for different types of data or tasks. It also provides insights into
how the algorithm is learning and what assumptions it is making about the data, which can aid
in interpreting its predictions and results.
Challenges and Considerations
While inductive bias is essential for learning, it can also introduce limitations and challenges.
Biases that are too strong or inappropriate for the data can lead to poor generalization or biased
predictions. Balancing bias with variance (the variability of predictions) is a key challenge in
machine learning, requiring careful tuning and model selection.
Additionally, the choice of inductive bias can impact the interpretability of the model. Simpler
biases may lead to more interpretable models, while more complex biases may sacrifice
interpretability for improved performance.
ERROR FUNCTIONS
The loss function, also referred to as the error function, is a crucial component in machine
learning that quantifies the difference between the predicted outputs of a machine learning
algorithm and the actual target values. For example, within a regression problem to predict car
prices based on historical data, a loss function evaluates a neural network prediction based on a
training sample from the training dataset. The loss function quantifies the gap, or error margin, between the car price predicted by the network and the actual price.
The resulting value, the loss, reflects the accuracy of the model's predictions. During training, a
learning algorithm such as the backpropagation algorithm uses the gradient of the loss function
with respect to the model's parameters to adjust these parameters and minimize the loss,
effectively improving the model's performance on the dataset. Often, the terms loss function and
cost function are used interchangeably; despite this, both terms have distinct definitions:
As mentioned earlier, the loss function, also known as the error function, quantifies how well a
single prediction of the machine learning algorithm is compared to the actual target value. The
key takeaway is that a loss function applies to a single training example and is part of the overall
model's learning process that provides the signal by which the model's learning algorithm
updates the weights and parameters. The cost function, sometimes called the objective function,
is an average of the loss function of an entire training set containing several training examples.
The cost function quantifies the model's performance on the whole training dataset.
The mathematical equation for Mean Squared Error (MSE), or L2 loss, is

$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$

The mathematical equation for Mean Absolute Error (MAE), or L1 loss, is

$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$

where $y_i$ is the actual target value, $\hat{y}_i$ is the predicted value, and $N$ is the number of examples.
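A short sketch of both loss functions in code (not part of the original notes); the car-price numbers in the example are made up purely for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error (L2 loss): average squared gap between target and prediction."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error (L1 loss): average absolute gap."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# Example with illustrative car-price predictions.
print(mse([20000, 31000], [21000, 29500]))  # 1625000.0
print(mae([20000, 31000], [21000, 29500]))  # 1250.0
```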
7 Supervised Hebbian Learning

Chapter outline: Objectives; Theory and Examples; Linear Associator; The Hebb Rule; Performance Analysis; Pseudoinverse Rule; Application; Variations of Hebbian Learning; Summary of Results; Solved Problems; Epilogue; Further Reading; Exercises
Objectives
The Hebb rule was one of the first neural network learning laws. It was
proposed by Donald Hebb in 1949 as a possible mechanism for synaptic
modification in the brain and since then has been used to train artificial
neural networks.
In this chapter we will use the linear algebra concepts of the previous two
chapters to explain why Hebbian learning works. We will also show how
the Hebb rule can be used to train neural networks for pattern recognition.
Linear Associator
Hebb’s learning law can be used in combination with a variety of neural
network architectures. We will use a very simple architecture for our initial
presentation of Hebbian learning. In this way we can concentrate on the
learning law rather than the architecture. The network we will use is the linear associator, which is shown in Figure 7.1. (This network was introduced independently by James Anderson [Ande72] and Teuvo Kohonen [Koho72].)
Figure 7.1: Linear associator network: input p (R×1), weight matrix W (S×R), output a (S×1), with a = purelin(Wp).

Since purelin is the identity function, the output vector is determined by

$a = Wp$ ,   (7.1)

or, element by element,

$a_i = \sum_{j=1}^{R} w_{ij}\, p_j$ .   (7.2)

The task of the linear associator is to learn Q pairs of prototype input/output vectors:

$\{p_1, t_1\},\ \{p_2, t_2\},\ \ldots,\ \{p_Q, t_Q\}$ .   (7.3)
The Hebb Rule

Hebb's postulate, interpreted mathematically for the linear associator, gives the learning rule

$w_{ij}^{new} = w_{ij}^{old} + \alpha\, f_i(a_{iq})\, g_j(p_{jq})$ ,   (7.4)

where $p_{jq}$ is the jth element of the qth input vector $p_q$; $a_{iq}$ is the ith element of the network output when the qth input vector is presented to the network; and $\alpha$ is a positive constant, called the learning rate. This equation says that the change in the weight $w_{ij}$ is proportional to a product of functions of the activities on either side of the synapse. For this chapter we will simplify Eq. (7.4) to the following form:

$w_{ij}^{new} = w_{ij}^{old} + a_{iq}\, p_{jq}$ .   (7.5)
Note that this expression actually extends Hebb’s postulate beyond its
strict interpretation. The change in the weight is proportional to a product
of the activity on either side of the synapse. Therefore, not only do we increase the weight when both $p_j$ and $a_i$ are positive, but we also increase the weight when they are both negative. In addition, this implementation of the Hebb rule will decrease the weight whenever $p_j$ and $a_i$ have opposite signs.
The Hebb rule defined in Eq. (7.5) is an unsupervised learning rule. It does
not require any information concerning the target output. In this chapter
we are interested in using the Hebb rule for supervised learning, in which
the target output is known for each input vector. (We will revisit the unsu-
pervised Hebb rule in Chapter 13.) For the supervised Hebb rule we substi-
tute the target output for the actual output. In this way, we are telling the
algorithm what the network should do, rather than what it is currently doing. The resulting equation is

$w_{ij}^{new} = w_{ij}^{old} + t_{iq}\, p_{jq}$ ,   (7.6)

where $t_{iq}$ is the ith element of the qth target vector $t_q$. (We have set the learning rate to one, for simplicity.)
Notice that Eq. (7.6) can be written in vector notation:
$W^{new} = W^{old} + t_q\, p_q^{T}$ .   (7.7)
If we assume that the weight matrix is initialized to zero and then each of
the Q input/output pairs are applied once to Eq. (7.7), we can write
$W = t_1 p_1^{T} + t_2 p_2^{T} + \cdots + t_Q p_Q^{T} = \sum_{q=1}^{Q} t_q p_q^{T}$ .   (7.8)

In matrix form,

$W = \begin{bmatrix} t_1 & t_2 & \cdots & t_Q \end{bmatrix} \begin{bmatrix} p_1^{T} \\ p_2^{T} \\ \vdots \\ p_Q^{T} \end{bmatrix} = T P^{T}$ ,   (7.9)

where

$T = \begin{bmatrix} t_1 & t_2 & \cdots & t_Q \end{bmatrix}$ , $\quad P = \begin{bmatrix} p_1 & p_2 & \cdots & p_Q \end{bmatrix}$ .   (7.10)
Performance Analysis
Let's analyze the performance of Hebbian learning for the linear associator. First consider the case where the $p_q$ vectors are orthonormal (orthogonal and of unit length). If $p_k$ is input to the network, then the network output can be computed as

$a = W p_k = \left( \sum_{q=1}^{Q} t_q p_q^{T} \right) p_k = \sum_{q=1}^{Q} t_q \left( p_q^{T} p_k \right)$ .   (7.11)

Since the $p_q$ are orthonormal,

$p_q^{T} p_k = 1$ for $q = k$, and $p_q^{T} p_k = 0$ for $q \neq k$ .   (7.12)

Therefore Eq. (7.11) can be rewritten

$a = W p_k = t_k$ .   (7.13)
The output of the network is equal to the target output. This shows that, if
the input prototype vectors are orthonormal, the Hebb rule will produce the
correct output for each input.
But what about non-orthogonal prototype vectors? Let's assume that each $p_q$ vector is of unit length, but that the vectors are not orthogonal. Then Eq. (7.11) becomes

$a = W p_k = t_k + \sum_{q \neq k} t_q \left( p_q^{T} p_k \right)$ ,   (7.14)

where the second term is the error.
Because the vectors are not orthogonal, the network will not produce the
correct output. The magnitude of the error will depend on the amount of
correlation between the prototype input patterns.
As an example, suppose that the prototype input/output vectors are

$p_1 = \begin{bmatrix} 0.5 \\ -0.5 \\ 0.5 \\ -0.5 \end{bmatrix}$, $t_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $\quad p_2 = \begin{bmatrix} 0.5 \\ 0.5 \\ -0.5 \\ -0.5 \end{bmatrix}$, $t_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ .   (7.15)

(These two input vectors are orthonormal.) The weight matrix from the Hebb rule is

$W = T P^{T} = \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} 0.5 & -0.5 & 0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & -0.5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{bmatrix}$ .   (7.16)

If we apply the two prototype inputs to the network, the outputs are

$W p_1 = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} 0.5 \\ -0.5 \\ 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ ,   (7.17)

and

$W p_2 = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.5 \\ -0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ .   (7.18)

The outputs equal the targets, as expected for orthonormal inputs.
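The following NumPy sketch (not part of the original text) reproduces Eqs. (7.15)-(7.18), showing the Hebb rule $W = TP^T$ recovering the targets for orthonormal inputs.

```python
import numpy as np

# Prototype input/output vectors from Eq. (7.15)
P = np.array([[0.5, 0.5],
              [-0.5, 0.5],
              [0.5, -0.5],
              [-0.5, -0.5]])        # columns are p1, p2 (orthonormal)
T = np.array([[1, 1],
              [-1, 1]])             # columns are t1, t2

W = T @ P.T                         # supervised Hebb rule, Eq. (7.16)
print(W)                            # [[ 1  0  0 -1], [ 0  1 -1  0]]
print(W @ P)                        # columns reproduce t1 and t2 (Eqs. 7.17-7.18)
```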
Now let's revisit the apple and orange recognition problem described in Chapter 3. Recall that the prototype inputs were

$p_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}$ (orange), $\quad p_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}$ (apple) .   (7.19)

(Note that they are not orthogonal.) If we normalize these inputs and choose as desired outputs -1 and 1, we obtain

$p_1 = \begin{bmatrix} 0.5774 \\ -0.5774 \\ -0.5774 \end{bmatrix}$, $t_1 = -1$, $\quad p_2 = \begin{bmatrix} 0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix}$, $t_2 = 1$ .   (7.20)

The Hebb rule then gives the weight matrix

$W = T P^{T} = \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} 0.5774 & -0.5774 & -0.5774 \\ 0.5774 & 0.5774 & -0.5774 \end{bmatrix} = \begin{bmatrix} 0 & 1.1548 & 0 \end{bmatrix}$ ,   (7.21)

and the outputs for the two prototype patterns are

$W p_1 = \begin{bmatrix} 0 & 1.1548 & 0 \end{bmatrix} \begin{bmatrix} 0.5774 \\ -0.5774 \\ -0.5774 \end{bmatrix} = -0.6668$ ,   (7.22)

$W p_2 = \begin{bmatrix} 0 & 1.1548 & 0 \end{bmatrix} \begin{bmatrix} 0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix} = 0.6668$ .   (7.23)
The outputs are close, but do not quite match the target outputs.
Pseudoinverse Rule
When the prototype input patterns are not orthogonal, the Hebb rule pro-
duces some errors. There are several procedures that can be used to reduce
these errors. In this section we will discuss one of those procedures, the
pseudoinverse rule.
Recall that the task of the linear associator was to produce an output of t q
for an input of p q . In other words,
$W p_q = t_q , \quad q = 1, 2, \ldots, Q$ .   (7.24)
If the weight matrix cannot be chosen so that these equations are all exactly satisfied, one approach is to choose W to minimize the performance index

$F(W) = \sum_{q=1}^{Q} \left\| t_q - W p_q \right\|^{2}$ .   (7.25)
If the prototype input vectors $p_q$ are orthonormal and we use the Hebb rule to find W, then F(W) will be zero. When the input vectors are not orthogonal and we use the Hebb rule, then F(W) will not be zero, and it is not clear that F(W) will be minimized. It turns out that the weight matrix that
will minimize F(W) is obtained by using the pseudoinverse matrix, which
we will define next.
First, let's rewrite Eq. (7.24) in matrix form:

$W P = T$ ,   (7.26)

where

$T = \begin{bmatrix} t_1 & t_2 & \cdots & t_Q \end{bmatrix}$ , $\quad P = \begin{bmatrix} p_1 & p_2 & \cdots & p_Q \end{bmatrix}$ .   (7.27)
Then Eq. (7.25) can be written

$F(W) = \left\| T - W P \right\|^{2} = \left\| E \right\|^{2}$ ,   (7.28)

where

$E = T - W P$ ,   (7.29)

and

$\left\| E \right\|^{2} = \sum_i \sum_j e_{ij}^{2}$ .   (7.30)
Note that F(W) can be made zero if we can solve Eq. (7.26). If the P matrix
has an inverse, the solution is
$W = T P^{-1}$ .   (7.31)
It has been shown [Albe72] that the weight matrix that minimizes Eq. (7.25) is given by the pseudoinverse rule:

$W = T P^{+}$ ,   (7.32)

where $P^{+}$ is the Moore-Penrose pseudoinverse. The pseudoinverse of a real matrix P is the unique matrix that satisfies

$P P^{+} P = P$ , $\quad P^{+} P P^{+} = P^{+}$ , $\quad (P^{+} P)^{T} = P^{+} P$ , $\quad (P P^{+})^{T} = P P^{+}$ .   (7.33)

When the number of rows of P is greater than the number of columns and the columns of P are independent, the pseudoinverse can be computed by

$P^{+} = \left( P^{T} P \right)^{-1} P^{T}$ .   (7.34)
To test the pseudoinverse rule (Eq. (7.32)), consider again the apple and orange recognition problem. Recall that the input/output prototype vectors are

$p_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}$, $t_1 = -1$, $\quad p_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}$, $t_2 = 1$ .   (7.35)

(Note that we do not need to normalize the input vectors when using the pseudoinverse rule.)

The weight matrix is calculated from Eq. (7.32):

$W = T P^{+} = \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 1 \\ -1 & -1 \end{bmatrix}^{+}$ .   (7.36)

Using Eq. (7.34), the pseudoinverse is

$P^{+} = \left( P^{T} P \right)^{-1} P^{T} = \begin{bmatrix} 0.25 & -0.5 & -0.25 \\ 0.25 & 0.5 & -0.25 \end{bmatrix}$ ,

so the weight matrix is $W = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$, and the outputs for the two prototype patterns are

$W p_1 = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = -1$ ,   (7.39)

$W p_2 = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = 1$ .   (7.40)
The network outputs exactly match the desired outputs. Compare this re-
sult with the performance of the Hebb rule. As you can see from Eq. (7.22)
and Eq. (7.23), the Hebbian outputs are only close, while the pseudoinverse
rule produces exact results.
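A small NumPy check (not from the text) of the pseudoinverse rule on the apple/orange prototypes of Eq. (7.35), using `numpy.linalg.pinv` for $P^{+}$.

```python
import numpy as np

P = np.array([[1.0, 1.0],
              [-1.0, 1.0],
              [-1.0, -1.0]])      # columns are p1 (orange) and p2 (apple)
T = np.array([[-1.0, 1.0]])       # targets t1 = -1, t2 = 1

W = T @ np.linalg.pinv(P)         # pseudoinverse rule, Eq. (7.32)
print(W)                          # [[0. 1. 0.]]
print(W @ P)                      # [[-1. 1.]] -- exactly matches the targets
```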
Application
Now let’s see how we might use the Hebb rule on a practical, although
greatly oversimplified, pattern recognition problem. For this problem we
will use a special type of associative memory — the autoassociative memory. In an autoassociative memory the desired output vector is equal to the input vector (i.e., $t_q = p_q$). We will use an autoassociative memory to store a set of patterns and then to recall these patterns, even when corrupted patterns are provided as input.
The patterns we want to store represent the digits {0, 1, 2}, each displayed on a 6×5 grid. (Since we are designing an autoassociative memory, these patterns serve as both the input vectors and the targets.) We need to convert these digits to vectors, which will become the prototype patterns for our network. Each white square will be represented by a "-1", and each dark square will be represented by a "1". Then, to create the input vectors, we will scan each 6×5 grid one column at a time. For example, the first prototype pattern will be
$p_1 = \begin{bmatrix} -1 & 1 & 1 & 1 & 1 & -1 & 1 & -1 & -1 & -1 & -1 & 1 & \cdots \end{bmatrix}^{T}$ ,   (7.41)

where the remaining elements of this 30-element vector are obtained by scanning the rest of the grid. The vector $p_1$ corresponds to the digit "0", $p_2$ to the digit "1", and $p_3$ to the digit "2". Using the Hebb rule, the weight matrix is computed as
$W = p_1 p_1^{T} + p_2 p_2^{T} + p_3 p_3^{T}$ .   (7.42)
Figure 7.2: Autoassociative network for digit recall: p and a are 30×1 vectors, W is 30×30, and a = hardlims(Wp).
Variations of Hebbian Learning

A positive parameter $\alpha$, called the learning rate, can be used to limit the amount of increase in the weight matrix elements, if the learning rate is less than one, as in

$W^{new} = W^{old} + \alpha\, t_q p_q^{T}$ .   (7.44)

We can also add a decay term, so that the learning rule behaves like a smoothing filter, remembering the most recent inputs more clearly:

$W^{new} = W^{old} + \alpha\, t_q p_q^{T} - \gamma W^{old} = (1 - \gamma) W^{old} + \alpha\, t_q p_q^{T}$ ,   (7.45)

where the decay rate $\gamma$ is a positive constant less than one. As $\gamma$ approaches one, the learning
law quickly forgets old inputs and remembers only the most recent pat-
terns. This keeps the weight matrix from growing without bound.
The idea of filtering the weight changes and of having an adjustable learn-
ing rate are important ones, and we will discuss them again in Chapters
10, 12, 15, 16, 18 and 19.
If we modify Eq. (7.44) by replacing the desired output with the difference
between the desired output and the actual output, we get another impor-
tant learning rule:
$W^{new} = W^{old} + \alpha\, (t_q - a_q)\, p_q^{T}$ .   (7.46)
This is sometimes known as the delta rule, since it uses the difference be-
tween desired and actual output. It is also known as the Widrow-Hoff algo-
rithm, after the researchers who introduced it. The delta rule adjusts the
weights so as to minimize the mean square error (see Chapter 10). For this
reason it will produce the same results as the pseudoinverse rule, which
minimizes the sum of squares of errors (see Eq. (7.25)). The advantage of
the delta rule is that it can update the weights after each new input pattern
is presented, whereas the pseudoinverse rule computes the weights in one
step, after all of the input/target pairs are known. This sequential updating
allows the delta rule to adapt to a changing environment. The delta rule
will be discussed in detail in Chapter 10.
The basic Hebb rule will be discussed again, in a different context, in Chap-
ter 13. In the present chapter we have used a supervised form of the Hebb
rule. We have assumed that the desired outputs of the network, t q , are
known, and can be used in the learning rule. In the unsupervised Hebb
rule, which is discussed in Chapter 13, the actual network output is used
instead of the desired network output, as in

$W^{new} = W^{old} + \alpha\, a_q p_q^{T}$ ,   (7.47)
where a q is the output of the network when p q is given as the input (see
also Eq. (7.5)). This unsupervised form of the Hebb rule, which does not re-
quire knowledge of the desired output, is actually a more direct interpreta-
tion of Hebb’s postulate than is the supervised form discussed in this
chapter.
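A small NumPy sketch (not from the text) of the update rules in Eqs. (7.45) and (7.46); the learning rate and decay values are illustrative.

```python
import numpy as np

def hebb_update(W, t, p, alpha=0.5, gamma=0.1):
    """Supervised Hebb rule with learning rate and decay, Eq. (7.45)."""
    return (1 - gamma) * W + alpha * np.outer(t, p)

def delta_update(W, t, p, alpha=0.5):
    """Delta (Widrow-Hoff) rule, Eq. (7.46): uses the output error t - a."""
    a = W @ p                          # linear associator output
    return W + alpha * np.outer(t - a, p)

# Example: one update of each rule for a single input/target pair.
W = np.zeros((2, 4))
p = np.array([0.5, -0.5, 0.5, -0.5])
t = np.array([1.0, -1.0])
print(hebb_update(W, t, p))
print(delta_update(W, t, p))
```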
11 Backpropagation
As noted earlier in this chapter, the response of this network is a superposition of $S^1$ sigmoid functions.

Figure 11.11 illustrates the network response as the number of neurons in the first layer (hidden layer) is increased. Unless there are at least five neurons in the hidden layer, the network cannot accurately represent $g(p)$.
Figure 11.11: Responses of 1-2-1, 1-3-1, 1-4-1 and 1-5-1 networks (the approximation improves as hidden-layer neurons are added).
Convergence
In the previous section we presented some examples in which the network
response did not give an accurate approximation to the desired function,
even though the backpropagation algorithm produced network parameters
that minimized mean square error. This occurred because the capabilities
of the network were inherently limited by the number of hidden neurons it
contained. In this section we will provide an example in which the network
is capable of approximating the function, but the learning algorithm does
not produce network parameters that produce an accurate approximation.
In the next chapter we will discuss this problem in more detail and explain
why it occurs. For now we simply want to illustrate the problem.
To approximate this function we will use a 1-3-1 network, where the trans-
fer function for the first layer is log-sigmoid and the transfer function for
the second layer is linear.
Figure 11.12 illustrates a case where the learning algorithm converges to
a solution that minimizes mean square error. The thin blue lines represent
intermediate iterations, and the thick blue line represents the final solu-
tion, when the algorithm has converged. (The numbers next to each curve
indicate the sequence of iterations, where 0 represents the initial condition
and 5 represents the final solution. The numbers do not correspond to the
iteration number. There were many iterations for which no curve is repre-
sented. The numbers simply indicate an ordering.)
Figure 11.12: Intermediate and final network responses, with curves labeled 0 (initial condition) through 5 (final solution).
Generalization
In most cases the multilayer network is trained with a finite number of examples of proper network behavior:

$\{p_1, t_1\},\ \{p_2, t_2\},\ \ldots,\ \{p_Q, t_Q\}$ .   (11.56)

As an example, consider approximating the function

$g(p) = 1 + \sin\!\left(\frac{\pi}{4}\, p\right)$ .   (11.57)
The figures below show the responses of a 1-9-1 network and a 1-2-1 network that have been trained on this data. The black line represents $g(p)$, the blue line represents the network response, and the '+' symbols indicate the training set.

[Figure: responses of the trained 1-9-1 and 1-2-1 networks over $-2 \le p \le 2$.]
The 1-9-1 network has too much flexibility for this problem; it has a total
of 28 adjustable parameters (18 weights and 10 biases), and yet there are
only 11 data points in the training set. The 1-2-1 network has only 7 parameters and is therefore much more restricted in the types of functions that it can implement.
For a network to be able to generalize, it should have fewer parameters than
there are data points in the training set. In neural networks, as in all mod-
eling problems, we want to use the simplest network that can adequately
represent the training set. Don’t use a bigger network when a smaller net-
work will work (a concept often referred to as Ockham’s Razor).
An alternative to using the simplest network is to stop the training before the network overfits. References to this procedure and other techniques to improve generalization are given in Chapter 13.
To experiment with generalization in neural networks, use the MATLAB®
Neural Network Design Demonstration Generalization (nnd11gn).
13 Generalization

Chapter outline: Objectives; Theory and Examples; Problem Statement; Methods for Improving Generalization; Estimating Generalization Error - The Test Set; Early Stopping; Regularization; Bayesian Analysis; Bayesian Regularization; Relationship Between Early Stopping and Regularization; Summary of Results; Solved Problems; Epilogue; Further Reading; Exercises
Objectives
One of the key issues in designing a multilayer network is determining the
number of neurons to use. In effect, that is the objective of this chapter.
In Chapter 11 we showed that if the number of neurons is too large, the net-
work will overfit the training data. This means that the error on the train-
ing data will be very small, but the network will fail to perform as well
when presented with new data. A network that generalizes well will per-
form as well on new data as it does on the training data.
The complexity of a neural network is determined by the number of free pa-
rameters that it has (weights and biases), which in turn is determined by
the number of neurons. If a network is too complex for a given data set,
then it is likely to overfit and to have poor generalization.
In this chapter we will see that we can adjust the complexity of a network
to fit the complexity of the data. In addition, this can be done without
changing the number of neurons. We can adjust the effective number of
free parameters without changing the actual number of free parameters.
Problem Statement
Let’s begin our discussion of generalization by defining the problem. We
start with a training set of example network inputs and corresponding tar-
get outputs:
$\{p_1, t_1\},\ \{p_2, t_2\},\ \ldots,\ \{p_Q, t_Q\}$ .   (13.1)
We will assume that the targets are generated by

$t_q = g(p_q) + \varepsilon_q$ ,   (13.2)

where $g(\cdot)$ is some unknown function, and $\varepsilon_q$ is a random, independent, zero-mean noise source. Our training objective will be to produce a neural network that approximates $g(\cdot)$, while ignoring the noise.
The standard performance index for neural network training is the sum
squared error on the training set:
$F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^{T} (t_q - a_q)$ ,   (13.3)

where $a_q$ is the network output for input $p_q$. We are using the variable $E_D$ to represent the sum squared error on the training data, because later we will modify the performance index to include an additional term.
The problem of overfitting is illustrated in Figure 13.1. The blue curve represents the function $g(\cdot)$. The large open circles represent the noisy target points. The black curve represents the trained network response, and the smaller circles filled with crosses represent the network response at the training points. In this figure we can see that the network response exactly matches the training points. However, it does a very poor job of matching the underlying function. It overfits.
There are actually two kinds of errors that occur in Figure 13.1. The first type of error, which is caused by overfitting, occurs for input values between -3 and 0. This is the region where all of the training data points occur. The network response in this region overfits the training data and will fail to perform well for input values that are not in the training set. The network does a poor job of interpolation; it fails to accurately approximate the function near the training points.

The second type of error occurs for inputs in the region between 0 and 3. The network fails to perform well in this region, not because it is overfitting, but because there is no training data there. The network is extrapolating beyond the range of the input data.
In this chapter we will discuss methods for preventing errors of interpola-
tion (overfitting). There is no way to prevent errors of extrapolation, unless
the data that is used to train the network covers all regions of the input
space where the network will be used. The network has no way of knowing
what the true function looks like in regions where there is no data.
[Figures 13.1 and 13.2: network responses to the noisy data over $-3 \le p \le 3$.]
The training data all occurs inside the range $-3 \le p \le 0$; the network response outside this range will be unpredictable. This is why it is important to have training data for all
regions of the input space where the network will be used. It is usually not
difficult to determine the required input range when the network has a sin-
gle input, as in this example. However, when the network has many inputs,
it becomes more difficult to determine when the network is interpolating and when it is extrapolating.

This problem is illustrated in a simple way in Figure 13.3. On the left side of this figure we see the function that is to be approximated. The range for the input variables is $-3 \le p_1 \le 3$ and $-3 \le p_2 \le 3$. The neural network was trained over these ranges of the two variables, but only for $p_1 \ge p_2$. Therefore, both $p_1$ and $p_2$ cover their individual ranges, but only half of the total input space is covered. When $p_1 < p_2$, the network is extrapolating, and we
can see on the right side of Figure 13.3 that the network performs poorly in
this region. (See Problem P13.4 for another example of extrapolation.) If
there are many input variables, it will be quite difficult to determine when
the network is interpolating and when it is extrapolating. We will discuss
some practical ways of dealing with this problem in Chapter 22.
Figure 13.3: a) the function t to be approximated; b) the network response a (the network performs poorly in the region where it must extrapolate).
Early Stopping
The first method we will discuss for improving generalization is also the
simplest method. It is called early stopping [WaVe94]. The idea behind this
method is that as training progresses the network uses more and more of
its weights, until all weights are fully used when training reaches a mini-
mum of the error surface. By increasing the number of iterations of train-
ing, we are increasing the complexity of the resulting network. If training
is stopped before the minimum is reached, then the network will effectively
be using fewer parameters and will be less likely to overfit. In a later sec-
tion of this chapter we will demonstrate how the number of parameters
changes as the number of iterations increases.
In order to use early stopping effectively, we need to know when to stop the training. We will describe a method, called cross-validation, that uses a validation set to decide when to stop [Sarl95]. The available data (after re-
moving the test set, as described above) is divided into two parts: a training
set and a validation set. The training set is used to compute gradients or
Jacobians and to determine the weight update at each iteration. The vali-
dation set is an indicator of what is happening to the network function “in
between” the training points, and its error is monitored during the training
process. When the error on the validation set goes up for several iterations,
the training is stopped, and the weights that produced the minimum error
on the validation set are used as the final trained network weights.
This process is illustrated in Figure 13.4. The graph at the bottom of this
figure shows the progress of the training and validation performance indi-
ces, F (the sum squared errors), during training. Although the training er-
ror continues to go down throughout the training process, a minimum of
the validation error occurs at the point labeled “a,” which corresponds to
training iteration 14. The graph at the upper left shows the network re-
sponse at this early stopping point. The resulting network provides a good
fit to the true function. The graph at the upper right demonstrates the net-
work response if we continue to train to point “b,” where the validation er-
ror has increased and the network is overfitting.
Figure 13.4: Early stopping. The bottom graph shows the training and validation performance indices F versus iteration (the validation error minimum occurs at point a; point b is reached if training continues). The upper graphs show the network response at points a and b.
The basic concept for early stopping is simple, but there are several practi-
cal issues to be addressed. First, the validation set must be chosen so that
it is representative of all situations for which the network will be used. This
is also true for the test and training sets, as we mentioned earlier. Each set
must be roughly equivalent in its coverage of the input space, although the
size of each set may be different.
When we divide the data, approximately 70% is typically used for training,
with 15% for validation and 15% for testing. These are only approximate
numbers. A complete discussion of how to select the amount of data for the
validation set is given in [AmMu97].
Another practical point to be made about early stopping is that we should
use a relatively slow training method. During training, the network will
use more and more of the available network parameters (as we will explain
in the last section of this chapter). If the training method is too fast, it will
likely jump past the point at which the validation error is minimized.
To experiment with the effect of early stopping, use the MATLAB® Neural
Network Design Demonstration Early Stopping (nnd13es).
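A minimal sketch (not from the text) of early stopping with a validation set; `train_step` and `val_error` stand in for whatever training update and error measure are being used, and `patience` is an illustrative choice for "several iterations".

```python
import copy

def train_with_early_stopping(params, train_step, val_error, max_iters=1000, patience=5):
    """Stop when the validation error has not improved for `patience` iterations,
    and return the weights that produced the minimum validation error."""
    best_params, best_err, bad_iters = copy.deepcopy(params), float("inf"), 0
    for _ in range(max_iters):
        params = train_step(params)      # one update computed on the training set
        err = val_error(params)          # error monitored on the validation set
        if err < best_err:
            best_params, best_err, bad_iters = copy.deepcopy(params), err, 0
        else:
            bad_iters += 1
            if bad_iters >= patience:    # validation error went up for several iterations
                break
    return best_params
```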
Regularization
The second method we will discuss for improving generalization is called
regularization. For this method, we modify the sum squared error perfor-
mance index of Eq. (13.3) to include a term that penalizes network com-
plexity. This concept was introduced by Tikhonov [Tikh63]. He added a
penalty, or regularization, term that involved the derivatives of the approx-
imating function (neural network in our case), which forced the resulting
function to be smooth. Under certain conditions, this regularization term
can be written as the sum of squares of the network weights, as in
$F(x) = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^{T} (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^{2}$ ,   (13.4)

where the ratio $\alpha / \beta$ controls the effective complexity of the network solution. The larger this ratio is, the smoother the network response. (Note that we could have used a single parameter here, but developments in later sections will require two parameters.)
Why do we want to penalize the sum squared weights, and how is this sim-
ilar to reducing the number of neurons? Consider again the example mul-
tilayer network shown in Figure 11.4. Recall how increasing a weight
increased the slope of the network function. You can see this effect again in Figure 13.5, where we have changed the weight $w_{1,1}^{2}$ from 0 to 2.
weights are large, the function created by the network can have large
slopes, and is therefore more likely to overfit the training data. If we re-
strict the weights to be small, then the network function will create a
13-8
Methods for Improving Generalization
smooth interpolation through the training data - just as if the network had
a small number of neurons.
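A short sketch (not from the text) of the regularized performance index of Eq. (13.4); `alpha` and `beta` are the two parameters whose ratio controls the smoothness, and the example values are arbitrary.

```python
import numpy as np

def regularized_index(errors, weights, alpha=0.01, beta=1.0):
    """F = beta * E_D + alpha * E_W (Eq. 13.4): sum squared errors plus
    a penalty on the sum of squared weights."""
    e_d = np.sum(np.square(errors))     # sum squared error on the training set
    e_w = np.sum(np.square(weights))    # sum of squares of the network weights
    return beta * e_d + alpha * e_w

# The penalty contributes 2 * alpha * w to the gradient for each weight,
# pushing the weights toward small values and a smoother response.
print(regularized_index(errors=np.array([0.1, -0.2]), weights=np.array([1.0, -3.0, 0.5])))
```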
Figure 13.5: Network response with $w_{1,1}^{2} = 0$ versus $w_{1,1}^{2} = 2$ (the larger weight produces a steeper response).
[Figure: network responses for regularization ratios $\alpha/\beta$ = 0, 0.01, 0.25 and 1.]
Bayesian Analysis
Thomas Bayes was a Presbyterian minister who lived in England during
the 1700’s. He was also an amateur mathematician. His most important
work was published after his death. In it, he presented what is now known
as Bayes’ Theorem. The theorem states that if you have two random
events, A and B , then the conditional probability of the occurrence of A ,
given the occurrence of B can be computed as
P B A P A
P A B = ------------------------------- . (13.5)
P B
Eq. (13.5) is called Bayes' rule. Each of the terms in this expression has a name by which it is commonly referred. $P(A)$ is called the prior probability. It tells us what we know about A before we know the outcome of B. $P(A \mid B)$ is called the posterior probability. This tells us what we know about A after we learn about B. $P(B \mid A)$ is the conditional probability of B given A. Normally this term is given by our knowledge of the system that describes the relationship between B and A. $P(B)$ is the marginal probability of the event B, and it acts as a normalization factor in Bayes' rule.
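A tiny worked example (not from the text) of Eq. (13.5); the event probabilities used here are made up purely for illustration.

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), with P(B) obtained by
# summing over the two ways B can occur (with A and with not-A).
p_a = 0.3                                   # prior P(A)
p_b_given_a = 0.8                           # conditional probability P(B|A)
p_b_given_not_a = 0.1                       # P(B | not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # marginal P(B)
p_a_given_b = p_b_given_a * p_a / p_b       # posterior P(A|B)
print(p_a_given_b)                          # about 0.774
```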