Chapter 7
Neural Networks
Outline
◼ 1. Neural Networks Representation
◼ 2. Appropriate problems for Neural Network Learning
◼ 3. Perceptrons
◼ 4. Multilayer Networks and the Backpropagation algorithm
◼ 5. Remarks on the Backpropagation algorithm
◼ 6. Neural network application development
◼ 7. Benefits and Limitations of Neural networks
◼ 8. Neural network applications
◼ 9. Time series prediction using Neural Networks
1. NEURAL NETWORK REPRESENTATION
◼ An ANN is composed of processing elements called neurons or perceptrons, organized in different ways to form the network's structure.
Processing Elements
◼ An ANN consists of perceptrons. Each of the perceptrons
receives inputs, processes inputs and delivers a single output.
The network
◼ Each ANN is composed of a collection of perceptrons
grouped in layers. A typical structure is shown in Fig.2.
Figure 2
2. Appropriate Problems for Neural Network Learning
◼ ANN learning is well-suited to problems in which the training data corresponds to noisy, complex data. It is also applicable to problems for which more symbolic representations are often used.
◼ The back-propagation (BP) algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the following characteristics:
❑ Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
❑ Output is discrete or real valued
❑ Output is a vector of values
❑ Possibly noisy data
❑ Long training times accepted
❑ Fast evaluation of the learned function required.
❑ Not important for humans to understand the weights
◼ Examples:
❑ Speech phoneme recognition
❑ Image classification
❑ Financial prediction
3. PERCEPTRONS
◼ A perceptron takes a vector of real-valued inputs, calculates a
linear combination of these inputs, then outputs
❑ a 1 if the result is greater than some threshold
❑ –1 otherwise.
Figure 3. A perceptron
◼ To simplify notation, we imagine an additional constant input x0 = 1, allowing us to write the above inequality as
$$\sum_{i=0}^{n} w_i x_i > 0$$
or in vector form as
$$\vec{w} \cdot \vec{x} > 0$$
For brevity, we will sometimes write the perceptron function as
$$o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$$
◼ Learning a perceptron involves choosing values for the weights
w0, w1,…, wn.
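A minimal sketch of this computation in Python (the weight and input values are illustrative, not from the text):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold the linear combination w.x; x[0] is the constant input x0 = 1."""
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([-0.5, 1.0, 1.0])  # w0 (threshold weight), w1, w2
x = np.array([1.0, 0.3, 0.4])   # x0 = 1, x1, x2
print(perceptron_output(w, x))  # w.x = -0.5 + 0.3 + 0.4 = 0.2 > 0, so prints 1
```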
Representation Power of Perceptrons
◼ We can view the perceptron as representing a hyperplane decision
surface in the n-dimensional space of instances (i.e. points). The
perceptron outputs a 1 for instances lying on one side of the
hyperplane and outputs a –1 for instances lying on the other side, as
in Figure 4.
◼ The equation for this decision hyperplane is $\vec{w} \cdot \vec{x} = 0$.
How to train a perceptron
◼ Although we are interested in learning networks of many
interconnected units, let us begin by understanding how to
learn the weights for a single perceptron.
◼ Here, learning means determining a weight vector that causes the perceptron to produce the correct +1 or –1 output for each of the given training examples.
◼ Several algorithms are known to solve this learning
problem. Here we consider two:
❑ the perceptron training rule and
❑ the delta rule.
Perceptron training rule
◼ One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed, until the perceptron classifies all training examples correctly.
◼ Weights are modified at each step according to the perceptron training rule, which revises the weight $w_i$ associated with input $x_i$ as follows:
$$w_i \leftarrow w_i + \Delta w_i$$
where $\Delta w_i = \eta\,(t - o)\,x_i$
◼ Here:
t is the target output value for the current training example
o is the perceptron output
$\eta$ is a small constant (e.g., 0.1) called the learning rate
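A minimal sketch of this rule in Python (the OR data and parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, T, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    X holds examples with the constant input x0 = 1 prepended; T holds targets in {-1, +1}."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != t:                   # (t - o) is zero when the example is correct
                w += eta * (t - o) * x
                errors += 1
        if errors == 0:                  # all training examples classified correctly
            break
    return w

# Linearly separable example: the OR function.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([-1, 1, 1, 1])
w = train_perceptron(X, T)
```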
Perceptron training rule (cont.)
◼ The role of the learning rate is to moderate the degree
to which weights are changed at each step. It is usually
set to some small value (e.g. 0.1) and is sometimes
made to decrease as the number of weight-tuning
iterations increases.
◼ We can prove that the algorithm will converge
❑ if the training data is linearly separable
❑ and $\eta$ is sufficiently small.
◼ If the data is not linearly separable, convergence is not
assured.
The Delta Rule (Gradient Descent)
◼ Although the perceptron training rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable. A second training rule, called the delta rule, is designed to overcome this difficulty.
◼ The key idea of delta rule: to use gradient descent to search the
space of possible weight vectors to find the weights that best fit the
training examples.
◼ The delta rule is important because it provides the basis for the
back-propagation algorithm, which can learn networks with many
interconnected units.
◼ The delta training rule: consider the task of training an unthresholded perceptron, that is, a linear unit, for which the output o is given by:
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n \quad (1)$$
◼ Thus, a linear unit corresponds to the first stage of a perceptron,
without the threshold.
Delta rule
◼ In order to derive a weight learning rule for linear units, let us specify a measure for the training error of a weight vector, relative to the training examples.
◼ The training error can be computed as the following squared error:
$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \quad (2)$$
where D is the set of training examples, $t_d$ is the target output for training example d, and $o_d$ is the output of the linear unit for d.
Figure 5. The error surface
How can we calculate the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of E w.r.t. each
component of the vector w.
Derivation of the Gradient Descent Rule
◼ This vector derivative is called the gradient of E with respect to the vector $\vec{w} = \langle w_0, \ldots, w_n \rangle$, written $\nabla E(\vec{w})$:
$$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right] \quad (3)$$
The gradient descent training rule is
$$\vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \quad \text{where } \Delta\vec{w} = -\eta \nabla E(\vec{w}) \quad (4)$$
◼ Here $\eta$ is a positive constant called the learning rate, which determines the step size in the gradient descent search.
◼ The negative sign is present because we want to move the weight vector in the direction that decreases E. This training rule can also be written in its component form
$$w_i \leftarrow w_i + \Delta w_i, \quad \text{where } \Delta w_i = -\eta \frac{\partial E}{\partial w_i} \quad (5)$$
$$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id}) \quad (6)$$
where $x_{id}$ denotes the single input component $x_i$ for the training example d.
We now have an equation that gives $\partial E / \partial w_i$ in terms of the linear unit inputs $x_{id}$, outputs $o_d$, and target values $t_d$ associated with the training examples. Substituting Equation (6) into Equation (5) yields the weight update rule for gradient descent:
$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id} \quad (7)$$
Figure 6. The gradient descent algorithm for training a linear unit. Within the algorithm, each weight increment is first accumulated over all training examples,
$$\Delta w_i \leftarrow \Delta w_i + \eta\,(t - o)\,x_i \quad (8)$$
and each weight is then updated by
$$w_i \leftarrow w_i + \Delta w_i \quad (9)$$
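A Python sketch of the batch procedure of Figure 6 (function and parameter names are illustrative):

```python
import numpy as np

def gradient_descent_linear_unit(X, T, eta=0.05, epochs=1000):
    """Batch gradient descent for a linear unit o = w.x (x0 = 1 prepended in X):
    accumulate the weight increments over all examples, then update once."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, t in zip(X, T):
            o = np.dot(w, x)              # linear unit output (no threshold)
            delta_w += eta * (t - o) * x  # Equation (8)
        w += delta_w                      # Equation (9)
    return w
```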
Stochastic Approximation to Gradient Descent
◼ The key difficulties in applying gradient descent are:
❑ Converging to a local minimum can sometimes be quite slow.
❑ If there are multiple local minima in the error surface, then there is no
guarantee that the procedure will find the global minimum.
◼ One common variation on gradient descent to alleviate these
difficulties is called incremental gradient descent (or stochastic
gradient descent).
◼ The key differences between standard gradient descent and
stochastic gradient descent are:
❑ In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
❑ The modified training rule is like the training rule given by Equation (7)
except that as we iterate through each example we update the weight
according to
$$\Delta w_i = \eta\,(t - o)\,x_i \quad (10)$$
where t, o and xi are the target value, unit output, and the ith input.
◼ To modify the gradient descent algorithm in Figure 6 to implement this stochastic approximation, Equation (9), $w_i \leftarrow w_i + \Delta w_i$, is simply deleted, and Equation (8), $\Delta w_i \leftarrow \Delta w_i + \eta\,(t - o)\,x_i$, is replaced by
$$w_i \leftarrow w_i + \eta\,(t - o)\,x_i$$
◼ One way to view this stochastic gradient descent is to consider a
distinct error function defined for each individual training example
d as follows
$$E_d(\vec{w}) = \frac{1}{2}(t_d - o_d)^2$$
where td and od are the target value and the unit output value for
training example d.
We thus arrive at the stochastic gradient descent algorithm (Figure 7).
◼ Summing over multiple examples in standard gradient descent
requires more computation per weight update step. On the other hand,
because it uses the true gradient, standard gradient descent is often
used with a larger step size per weight update than stochastic
gradient descent.
Figure 7. The stochastic gradient descent algorithm for training a linear unit. Here the weight update performed for each training example is
$$w_i \leftarrow w_i + \eta\,(t - o)\,x_i \quad (11)$$
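The stochastic variant in the same sketch form as the batch version above; the only structural change is that the weights are updated inside the inner loop:

```python
import numpy as np

def stochastic_gd_linear_unit(X, T, eta=0.05, epochs=1000):
    """Incremental (stochastic) variant of Figure 6: the weight vector is
    updated immediately after each example, per Equation (11)."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = np.dot(w, x)        # linear unit output
            w += eta * (t - o) * x  # per-example update, Equation (11)
    return w
```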
◼ Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it uses the gradients of the individual $E_d$ rather than the overall gradient of E to guide its search.
◼ Both stochastic and standard gradient descent methods
are commonly used in practice.
Summary
◼ Perceptron training rule
❑ Perfectly classifies the training data
❑ Converges, provided the training examples are linearly separable
4. MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM
◼ Single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
◼ This section discusses how to learn such multilayer networks
using a gradient descent algorithm similar to that discussed in
the previous section.
A Differentiable Threshold Unit
◼ What type of unit should serve as the basis for multilayer networks?
• Perceptron: not differentiable -> cannot use gradient descent
• Linear unit: multiple layers of linear units still produce only linear functions
• Sigmoid unit: a smoothed, differentiable threshold function
Figure 7. The sigmoid threshold unit. The sigmoid threshold unit computes
$$o = \sigma(\vec{w} \cdot \vec{x}), \quad \text{where } \sigma(y) = \frac{1}{1 + e^{-y}} \quad (12)$$
The sigmoid unit
◼ Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a continuous function of its input.
◼ The sigmoid function $\sigma(x)$ is also called the logistic function.
◼ Interesting property: $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
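A small Python sketch of the sigmoid unit and the derivative property above:

```python
import numpy as np

def sigmoid(x):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d(sigma)/dx = sigma(x) * (1 - sigma(x)), the property exploited by backpropagation."""
    s = sigmoid(x)
    return s * (1.0 - s)

def sigmoid_unit_output(w, x):
    """A sigmoid unit: linear combination of the inputs followed by the sigmoid."""
    return sigmoid(np.dot(w, x))
```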
The Backpropagation (BP) Algorithm
◼ The BP algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs.
◼ Because we are considering networks with multiple output units rather than a single unit as before, we begin by redefining E to sum the errors over all of the network output units:
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 \quad (13)$$
where outputs is the set of output units in the network, and $t_{kd}$ and $o_{kd}$ are the target and output values associated with the k-th output unit and training example d.
The Backpropagation algorithm (Fig. 8)
Initialize all weights to small random numbers
Until satisfied, do
for each training example, do
1. Input the training example to the network and compute the network outputs
2. For each output unit k
$$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k) \quad (14)$$
3. For each hidden unit h
$$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\, \delta_k \quad (15)$$
4. Update each network weight $w_{ji}$:
$$w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji}$$
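A runnable sketch of this algorithm for one hidden layer of sigmoid units (a hypothetical illustration: the unit counts, learning rate, and XOR data are chosen for the example, not taken from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=3, eta=0.1, epochs=5000):
    """Stochastic-gradient backpropagation for one hidden layer of sigmoid
    units; biases are handled by a constant input of 1 in column 0."""
    rng = np.random.default_rng(0)
    n_in, n_out = X.shape[1], T.shape[1]
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # hidden weights (incl. bias)
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))  # output weights (incl. bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            # Step 1: forward pass
            xh = np.append(1.0, x)                 # prepend the bias input
            o_h = sigmoid(W_h @ xh)                # hidden unit outputs
            xo = np.append(1.0, o_h)
            o_k = sigmoid(W_o @ xo)                # network outputs
            # Step 2: output unit error terms, Equation (14)
            delta_k = o_k * (1 - o_k) * (t - o_k)
            # Step 3: hidden unit error terms, Equation (15)
            delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)
            # Step 4: weight updates, w_ji <- w_ji + eta * delta_j * x_ji
            W_o += eta * np.outer(delta_k, xo)
            W_h += eta * np.outer(delta_h, xh)
    return W_h, W_o

# Hypothetical use: learn XOR, which a single perceptron cannot represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = train_backprop(X, T)
```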
◼ In the BP algorithm, step 1 propagates the input forward through the network, and steps 2, 3 and 4 propagate the errors backward through the network.
◼ The main loop of BP repeatedly iterates over the training examples. For each training example, it applies the ANN to the example, calculates the error of the network output for this example, computes the gradient with respect to the error on the example, and then updates all weights in the network. This gradient descent step is iterated until the ANN performs acceptably well.
◼ A variety of termination conditions can be used to halt the
procedure.
❑ One may choose to halt after a fixed number of iterations through
the loop, or
❑ once the error on the training examples falls below some
threshold, or
❑ once the error on a separate validation set of examples meets some criterion.
An Example of the Backpropagation algorithm
Note: The sigmoid function is applied to the hidden layer and the output layer.
Table 1: Initial input and weight values
x1 x2 x3 w14 w15 w24 w25 w34 w35 w46 w56 w04 w05 w06
-----------------------------------------------------------------------------------
1 0 1 0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2 -0.4 0.2 0.1
Table 4: Calculation for weight updating
Weight New value
------------------------------------------------------------------------------
w46 -0.3+(0.9)(0.1311)(0.332)= -0.261
w56 -0.2+(0.9)(0.1311)(0.525)= -0.138
w14 0.2 +(0.9)(-0.0087)(1) = 0.192
w15 -0.3 +(0.9)(-0.0065)(1) = -0.306
w24 0.4+ (0.9)(-0.0087)(0) = 0.4
w25 0.1+ (0.9)(-0.0065)(0) = 0.1
w34 -0.5+ (0.9)(-0.0087)(1) = -0.508
w35 0.2 + (0.9)(-0.0065)(1) = 0.194
w06 0.1 + (0.9)(0.1311) = 0.218
w05 0.2 + (0.9)(-0.0065)=0.194
w04 -0.4 +(0.9)(-0.0087) = -0.408
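The numbers in Tables 1 and 4 can be reproduced in a few lines of Python. The target output t = 1 and the learning rate eta = 0.9 are inferred from the Table 4 calculations rather than stated in the extracted slides:

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Inputs and weights from Table 1
x1, x2, x3 = 1, 0, 1
w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
w46, w56, w04, w05, w06 = -0.3, -0.2, -0.4, 0.2, 0.1
t, eta = 1, 0.9   # assumed target and learning rate, consistent with Table 4

# Forward pass (hidden units 4 and 5, output unit 6)
o4 = sigmoid(x1*w14 + x2*w24 + x3*w34 + w04)   # ~0.332
o5 = sigmoid(x1*w15 + x2*w25 + x3*w35 + w05)   # ~0.525
o6 = sigmoid(o4*w46 + o5*w56 + w06)            # ~0.474

# Backward pass: error terms per Equations (14) and (15)
d6 = o6 * (1 - o6) * (t - o6)                  # ~0.1311
d5 = o5 * (1 - o5) * (w56 * d6)                # ~-0.0065
d4 = o4 * (1 - o4) * (w46 * d6)                # ~-0.0087

# One weight update, matching the first row of Table 4:
w46_new = w46 + eta * d6 * o4                  # ~-0.261
```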
Adding Momentum
◼ Because BP is a widely used algorithm, many variations have been developed. The most common is to alter the weight-update rule in Step 4 of the algorithm by making the weight update on the n-th iteration depend partially on the update that occurred during the (n-1)-th iteration, as follows:
$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1) \quad (18)$$
Here $\Delta w_{ji}(n)$ is the weight update performed during the n-th iteration through the main loop of the algorithm.
- The n-th iteration update depends on the (n-1)-th iteration update.
- $\alpha$: a constant between 0 and 1 called the momentum.
Role of momentum term:
- keep the ball rolling through small local minima in the error
surface.
- Gradually increase the step size of the search in regions where
the gradient is unchanging, thereby speeding convergence.
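A minimal sketch of this update in Python (the function and argument names are illustrative):

```python
import numpy as np

def momentum_update(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """One weight update with momentum, per Equation (18):
    delta_w(n) = eta * delta_j * x_ji + alpha * delta_w(n-1).
    Here grad_term stands for delta_j * x_ji and alpha is the momentum."""
    delta_w = eta * grad_term + alpha * prev_delta
    return w + delta_w, delta_w  # keep delta_w for the next iteration's momentum term
```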
Derivation of the Backpropagation Rule
Recall the error function for training example d:
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Stochastic gradient descent involves iterating through the training
examples one at a time.
In other words, for each training example d, every $w_{ji}$ is updated by adding to it $\Delta w_{ji}$:
$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} \quad (21)$$
Notation
◼ $x_{ji}$ = the i-th input to unit j
◼ $w_{ji}$ = the weight associated with the i-th input to unit j
◼ $net_j = \sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)
◼ $o_j$ = the output computed by unit j
◼ $t_j$ = the target output for unit j
◼ $\sigma$ = the sigmoid function
◼ outputs = the set of units in the final layer of the network
◼ Downstream(j) = the set of units whose immediate inputs include the output of unit j
Derivation of the Backpropagation Rule (cont.)
◼ To begin, notice that weight $w_{ji}$ can influence the rest of the network only through $net_j$. So, we can use the chain rule to write:
$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji} \quad (22)$$
Case 1: Training rule for output unit weights.
◼ Just as $w_{ji}$ can influence the rest of the network only through $net_j$, $net_j$ can influence the network only through $o_j$. So, we can use the chain rule again to write:
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \quad (23)$$
When evaluating $\partial E_d / \partial o_j$, the derivatives will be zero for all output units k except when k = j.
We have:
$$\frac{\partial E_d}{\partial o_j} = -(t_j - o_j) \quad (24)$$
$$\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) \quad (25)$$
◼ Substituting expressions (24) and (25) into (23), we obtain:
$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j) \quad (26)$$
Combining this with Equations (21) and (22), we have the stochastic gradient descent rule for output units:
$$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji} \quad (27)$$
Note this training rule is exactly the weight update rule implemented by Equations (14) and (15) in the Backpropagation algorithm. Furthermore, we can see that $\delta_k$ in Equation (14) is equal to the quantity $-\partial E_d / \partial net_k$.
Case 2: Training rule for hidden unit weights
Rearranging terms and using $\delta_j$ to denote $-\partial E_d / \partial net_j$, we have
$$\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$$
as shown in Fig. 9.
5. REMARKS ON THE
BACKPROPAGATION ALGORITHM
6. NEURAL NETWORK APPLICATION
DEVELOPMENT
The development process for an ANN application has eight steps.
◼ Step 1: (Data collection) The data to be used for the training and
testing of ANN are collected. We have to consider that the
particular problem is amenable to ANN solution and that
adequate data exist and can be obtained.
◼ Step 2: (Training and testing data separation) The available data
are divided into training and testing data sets. For a moderately
sized data set, 80% of the data are randomly selected for training,
10% for testing, and 10% for secondary testing.
◼ Step 3: (Network architecture) A network architecture and a
learning method (training algorithm) are selected. Important
considerations are the exact number of nodes and the number of
layers.
◼ Step 4: (Parameter tuning and weight initialization) There are
parameters for tuning ANN to the desired learning
performance. Part of this step is initialization of the network
weights and parameters, followed by modification of the
parameters as training performance feedback is received.
❑ Often, the initial values are important in determining the
effectiveness and length of training.
◼ Step 5: (Data transformation) Transforms the application data
into the type and format required by the ANN.
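A sketch of Steps 2 and 5 combined in Python, assuming an 80/10/10 random split and min-max input scaling (one common transformation among several):

```python
import numpy as np

def prepare_data(X, y, seed=0):
    """Random 80/10/10 split into training, testing and secondary-testing
    sets, plus min-max scaling of the inputs to [0, 1]."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_test = int(0.8 * len(X)), int(0.1 * len(X))
    train, test, second = np.split(idx, [n_train, n_train + n_test])
    lo, hi = X[train].min(axis=0), X[train].max(axis=0)  # fit the scaling on training data only
    scale = lambda A: (A - lo) / (hi - lo + 1e-12)
    return (scale(X[train]), y[train]), (scale(X[test]), y[test]), (scale(X[second]), y[second])
```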
◼ Step 6: (Training) Training is conducted iteratively by
presenting input and known output data to the ANN. The ANN
computes the outputs and adjusts the weights until the
computed outputs are within an acceptable tolerance of the
known outputs for the input cases.
◼ Step 7: (Testing) Once the training has been completed,
it is necessary to test the network.
❑ The testing examines the performance of ANN using the
derived weights by measuring the ability of the network to
classify the testing data correctly.
◼ Step 8: (Implementation) Now a stable set of weights is obtained.
❑ Now ANN can reproduce the desired output given inputs like
those in the training set.
❑ The ANN is ready to use as a stand-alone system or as part of
another software system where new input data will be
presented to it and its output will be a recommended decision.
7. BENEFITS AND LIMITATIONS OF
NEURAL NETWORKS
7.1 Benefits of ANNs
◼ Usefulness for pattern recognition, classification, generalization, abstraction and interpretation of incomplete and noisy inputs (e.g. handwriting recognition, image recognition, voice and speech recognition, weather forecasting).
◼ Providing some characteristics to problem solving that are
difficult to simulate using the logical, analytical techniques of
expert systems and standard software technologies.
◼ Ability to solve new kinds of problems. ANNs are particularly
effective at solving problems whose solutions are difficult to
define. This opened up a new range of decision support
applications formerly either difficult or impossible to computerize.
◼ Robustness. ANNs tend to be more robust than their conventional counterparts. They have the ability to cope with incomplete or fuzzy data. ANNs can be very tolerant of faults if properly implemented.
◼ Fast processing speed. Because they consist of a large number
of massively interconnected processing units, all operating in
parallel on the same problem, ANNs can potentially operate at
considerable speed (when implemented on parallel processors).
◼ Flexibility and ease of maintenance. ANNs are very flexible in adapting their behavior to new and changing environments. They are also easier to maintain, with some having the ability to learn from experience to improve their own performance.
Network parameters
The following parameters of the ANN are chosen for a closer
inspection:
◼ The number of input units:
❑ The number of input units determines the number of periods the
ANN “looks into the past” when predicting the future. The number
of input units is equivalent to the size of the input window.
❑ The number of input units is equivalent to the number of
attributes of the input sample.
◼ The number of output units: depends on the number of classes
Network parameters (cont.)
◼ The learning rate $\eta$ (0 < $\eta$ < 1) is a scaling factor that tells the learning algorithm how strongly the weights of the connections should be adjusted for a given error. A higher $\eta$ can be used to speed up the learning process, but if $\eta$ is too high, the algorithm will skip over the optimum weights. (The learning rate is constant across presentations.)
◼ The momentum parameter $\alpha$ (0 < $\alpha$ < 1) is another number that affects the gradient descent of the weights: a momentum term is added that keeps the direction of the previous step, thus avoiding descent into local minima. (The momentum term is constant across presentations.)
8. SOME ANN APPLICATIONS
ANN application areas:
◼ Face detection
◼ Face recognition
◼ Object recognition
◼ Speech recognition
◼ Image retrieval
◼ Tax form processing to identify tax fraud
◼ Bankruptcy prediction
9. Time Series Prediction using Neural network
◼ Time series prediction: given an existing time series, we
model the time series in order to make accurate forecasts
◼ Why is it difficult?
❑ Limited quantity of data (Observed data series sometimes too
short to partition)
❑ Noise (Erroneous data points, obscuring component)
❑ Moving Average
❑ Nonstationarity (Fundamentals change over time,
nonstationary)
❑ Forecasting method selection (Statistics, Artificial intelligence)
◼ Neural networks have been widely used as time series
forecasters: most often these are feed-forward networks
which employ a sliding window over the input sequence.
◼ The neural network sees the time series X1,…,Xn in the form
of many mappings of an input vector to an output value.
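A sketch of the sliding-window construction in Python (the sine series and window size are illustrative):

```python
import numpy as np

def sliding_window(series, window=4):
    """Turn a time series X1..Xn into (input window, next value) pairs:
    inputs (X_{t-s+1}, ..., X_t), target X_{t+1}, matching the description above."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])  # the input window
        y.append(series[t])             # the value to predict
    return np.array(X), np.array(y)

# Hypothetical series: each training pattern maps 4 past values to the next one.
series = np.sin(np.linspace(0, 10, 200))
X, y = sliding_window(series, window=4)
```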
Prediction with Neural Networks
◼ A number of adjoining data points of the time series (the input window $X_{t-s+1}, X_{t-s+2}, \ldots, X_t$) are used as activation levels for the input units of the input layer.
Fig. 10. Learning a time series
Evaluation Methods of Forecasts
◼ There are three measures of accuracy of the prediction models:
MAPE, MAE and MSE.
◼ For all three measures, the smaller the value, the better the fit of
the model.
◼ Use these statistics to compare the predictive accuracy of the
different methods.
◼ MAPE (Mean Absolute Percentage Error) measures the accuracy of predicted time series values. It expresses accuracy as a percentage.
$$\mathrm{MAPE} = \frac{\sum_{t=1}^{n} \left| (y_t - y'_t) / y_t \right|}{n} \times 100$$
where yt is the actual value, y’t is the predicted value and n is the
number of observations.
MAE (Mean Absolute Error)
$$\mathrm{MAE} = \frac{\sum_{t=1}^{n} |y_t - y'_t|}{n}$$
where yt is the actual value, y’t is the predicted value and n is the
number of observations.
MSE (Mean Squared Error)
$$\mathrm{MSE} = \frac{\sum_{t=1}^{n} (y_t - y'_t)^2}{n}$$
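All three measures in a few lines of Python (the function name is illustrative; MAPE assumes no actual value is zero):

```python
import numpy as np

def forecast_errors(y, y_pred):
    """MAPE, MAE and MSE; y holds the actual values, y_pred the predictions."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    mape = np.mean(np.abs((y - y_pred) / y)) * 100  # undefined if any y_t is zero
    mae = np.mean(np.abs(y - y_pred))
    mse = np.mean((y - y_pred) ** 2)
    return mape, mae, mse
```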
R-squared
◼ $R^2$ compares the squared prediction errors against the variance of the actual series: $R^2 = 1 - \sum_{t}(y_t - y'_t)^2 / \sum_{t}(y_t - \bar{y})^2$, where $\bar{y}$ is the mean of the actual values; values closer to 1 indicate a better fit.
References
◼ Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997.
◼ M. Verleysen, K. Hlavackova, Learning in RBF Networks, Proc.
of Int. Conf. on Neural Networks (ICNN), Washington, D.C., June
3-6, 1996.
◼ F. Schwenker, H. A. Kestler, G. Palm, Three learning phases for
radial-basis-function networks, Neural Networks 14(2001) 439-
458.
◼ N. Benoudjit, C. Archambeau, A. Lendasse, J. Lee, M. Verleysen, Width optimization of the Gaussian kernels in RBF Networks, Proc. of European Symposium on Artificial Neural Networks, Bruges, Belgium, 24-26 April 2002, pp. 425-432.
◼ J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, 2006.
Terminology
◼ Neural network: mạng nơ ron, feed-forward neural network:
mạng nơ ron truyền thẳng, weight: trọng số, hidden layer: tầng
ẩn, bias: độ lệch, linear unit: đơn vị tuyến tính, linear
combination: tổ hợp tuyến tính, threshold: ngưỡng, weight
vector: véc tơ trọng số, perceptron training rule: luật huấn luyện
perceptron, linearly separable data: dữ liệu khả tách một cách
tuyến tính, gradient descent: suy giảm độ dốc, incremental
gradient descent: suy giảm độ dốc gia tăng, error function: hàm
lỗi, learning rate: hệ số học, back-propagation: lan truyền
ngược, error term: toán hạng sai số, sigmoid function: hàm
sigmoid, transfer function: hàm truyền, momentum constant:
hằng số quán tính