Chapter_7

Chapter 7 discusses neural networks, focusing on their representation, appropriate problems for learning, and the perceptron model. It explains the training rules for perceptrons, including the perceptron training rule and the delta rule, which utilizes gradient descent for weight adjustment. The chapter also introduces multilayer networks and the backpropagation algorithm, emphasizing their ability to model complex, nonlinear decision surfaces.

Chapter 7:

Neural Networks

Assoc. Prof. Dr. Duong Tuan Anh


HCMC University of Technology
July 2015

1
Outline
◼ 1. Neural Networks Representation
◼ 2. Appropriate problems for Neural Network Learning
◼ 3. Perceptrons
◼ 4. Multilayer Networks and the Backpropagation
algorithm
◼ 5. Remarks on the Backpropagation algorithm
◼ 6. Neural network application development
◼ 7. Benefits and Limitations of Neural networks
◼ 8. Neural network applications
◼ 9. Time series prediction using Neural Networks

2
1. NEURAL NETWORK REPRESENTATION
◼ An ANN is composed of processing elements called neurons or
perceptrons, organized in different ways to form the network’s
structure.
Processing Elements
◼ An ANN consists of perceptrons. Each perceptron receives
inputs, processes them, and delivers a single output.
The input can be raw input data or the output of other
perceptrons. The output can be the final result (e.g.
1 means yes, 0 means no) or it can be input to other
perceptrons.

3
The network
◼ Each ANN is composed of a collection of perceptrons
grouped in layers. A typical structure is shown in Fig.2.

Note the three layers: input, intermediate (called the hidden
layer) and output. Several hidden layers can be placed between
the input and output layers.

Figure 2

4
2. Appropriate Problems for Neural Network Learning
◼ ANN learning is well-suited to problems in which the training data
is noisy and complex (e.g. sensor data). It is also applicable to
problems for which more symbolic representations are often used.
◼ The back-propagation (BP) algorithm is the most commonly
used ANN learning technique. It is appropriate for problems with
the following characteristics:
❑ Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
❑ Output is discrete or real valued
❑ Output is a vector of values
❑ Possibly noisy data
❑ Long training times accepted
❑ Fast evaluation of the learned function required.
❑ Not important for humans to understand the weights
◼ Examples:
❑ Speech phoneme recognition
❑ Image classification
❑ Financial prediction

5
3. PERCEPTRONS
◼ A perceptron takes a vector of real-valued inputs, calculates a
linear combination of these inputs, then outputs
❑ a 1 if the result is greater than some threshold
❑ –1 otherwise.
◼ Given real-valued inputs x1 through xn, the output o(x1, …, xn)
computed by the perceptron is

o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0
               –1 otherwise

where each wi is a real-valued constant, or weight.
◼ Notice the quantity (–w0) is a threshold that the weighted
combination of inputs w1x1 + … + wnxn must surpass in order for
the perceptron to output a 1.

6
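To make the definition concrete, here is a minimal Python sketch of the thresholded perceptron output (the helper name perceptron_output and the example weights are illustrative, not from the chapter):

import numpy as np

def perceptron_output(w, x):
    # w[0] is the threshold weight w0; x excludes the constant input x0 = 1
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# Illustrative weights realizing logical AND over inputs in {0, 1}:
# -w0 = 0.8 is the threshold the weighted sum 0.5*x1 + 0.5*x2 must surpass.
w = np.array([-0.8, 0.5, 0.5])
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(w, x))   # -1, -1, -1, 1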
Figure 3. A perceptron: a linear unit followed by a threshold unit.

7
◼ To simplify notation, we imagine an additional constant input x0 =
1, allowing us to write the above inequality as

Σi=0..n wi xi > 0

or in vector form as

w · x > 0

For brevity, we will sometimes write the perceptron function as

o(x) = sgn(w · x)

◼ Learning a perceptron involves choosing values for the weights
w0, w1, …, wn.

8
Representation Power of Perceptrons
◼ We can view the perceptron as representing a hyperplane decision
surface in the n-dimensional space of instances (i.e. points). The
perceptron outputs a 1 for instances lying on one side of the
hyperplane and outputs a –1 for instances lying on the other side, as
in Figure 4.
◼ The equation for this decision hyperplane is w · x = 0.
Some sets of positive and negative examples cannot be
separated by any hyperplane. Those that can be separated
are called linearly separable sets of examples.

Figure 4. Decision surface

9
How to train a perceptron
◼ Although we are interested in learning networks of many
interconnected units, let us begin by understanding how to
learn the weights for a single perceptron.
◼ Here learning is to determine a weight vector that causes
the perceptron to produce the correct +1 or –1 for each of
the given training examples.
◼ Several algorithms are known to solve this learning
problem. Here we consider two:
❑ the perceptron training rule and
❑ the delta rule.

10
Perceptron training rule
◼ One way to learn an acceptable weight vector is to begin with
random weights, then iteratively apply the perceptron to
each training example, modifying the perceptron weights
whenever it misclassifies an example. This process is
repeated, iterating through the training examples as many
times as needed until the perceptron classifies all training
examples correctly.
◼ Weights are modified at each step according to the perceptron
training rule, which revises the weight wi associated with input
xi according to the rule:
wi ← wi + Δwi
where Δwi = η(t – o)xi
◼ Here:
t is the target output value for the current training example
o is the perceptron output
η is a small constant (e.g., 0.1) called the learning rate

11
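A short Python sketch of this training procedure (assuming numpy; the function name and the fixed maximum number of epochs are our own choices):

import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    # X: (m, n) inputs; t: targets in {-1, +1}.
    # A constant input x0 = 1 is prepended so w[0] plays the role of w0.
    X = np.hstack([np.ones((len(X), 1)), X])
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != target:
                w += eta * (target - o) * x   # wi <- wi + eta*(t - o)*xi
                mistakes += 1
        if mistakes == 0:        # all training examples classified correctly
            break
    return w

# Linearly separable data (logical OR) converges in a few epochs:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1])
print(train_perceptron(X, t))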
Perceptron training rule (cont.)
◼ The role of the learning rate is to moderate the degree
to which weights are changed at each step. It is usually
set to some small value (e.g. 0.1) and is sometimes
made to decrease as the number of weight-tuning
iterations increases.
◼ We can prove that the algorithm will converge
❑ If training data is linearly separable
❑ and η is sufficiently small.
◼ If the data is not linearly separable, convergence is not
assured.

12
The Delta Rule (Gradient Descent)
◼ Although the perceptron training rule finds a successful weight
vector when the training examples are linearly separable, it can fail
to converge if the examples are not linearly separable. A second
training rule, called the delta rule, is designed to overcome this
difficulty.
◼ The key idea of delta rule: to use gradient descent to search the
space of possible weight vectors to find the weights that best fit the
training examples.
◼ The delta rule is important because it provides the basis for the
back-propagation algorithm, which can learn networks with many
interconnected units.
◼ The delta training rule: consider the task of training an un-
thresholded perceptron, that is, a linear unit, for which the output o is
given by:
o = w0 + w1x1 + ··· + wnxn (1)
◼ Thus, a linear unit corresponds to the first stage of a perceptron,
without the threshold.

13
Delta rule
◼ In order to derive a weight learning rule for linear units, let
us specify a measure for the training error of a weight
vector, relative to the training examples.
◼ The training error can be computed as the following
squared error:

E(w) = ½ Σd∈D (td – od)²    (2)

where D is the set of training examples, td is the target
output for training example d, and od is the output of
the linear unit for training example d.
Here we characterize E as a function of the weight
vector because the linear unit output o depends on
this weight vector.
14
Hypothesis Space
◼ To understand the gradient descent algorithm, it is
helpful to visualize the entire space of possible weight
vectors and their associated E values, as shown in
Figure 5.
❑ Here the axes w0, w1 represent possible values for the two
weights of a simple linear unit. The w0, w1 plane represents
the entire hypothesis space.
❑ The vertical axis indicates the error E relative to some fixed
set of training examples. The error surface shown in the
figure summarizes the desirability of every weight vector in
the hypothesis space.
◼ For linear units, this error surface must be parabolic
with a single global minimum, and we desire the weight
vector that attains this minimum.

15
Figure 5. The error surface

How can we calculate the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of E w.r.t. each
component of the vector w.

16
Derivation of the Gradient Descent Rule
◼ This vector derivative is called the gradient of E with respect to the
vector <w0, …, wn>, written ∇E(w):

∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]    (3)

Notice ∇E(w) is itself a vector, whose components are the partial
derivatives of E with respect to each of the wi.
When interpreted as a vector in weight space, the gradient
specifies the direction that produces the steepest increase in E.
The negative of this vector therefore gives the direction of steepest
decrease.
Since the gradient specifies the direction of steepest increase of
E, the training rule for gradient descent is

w ← w + Δw, where Δw = –η ∇E(w)    (4)
17
◼ Here η is a positive constant called the learning rate, which
determines the step size in the gradient descent search.
◼ The negative sign is present because we want to move the
weight vector in the direction that decreases E. This training
rule can also be written in its component form

wi ← wi + Δwi
where
Δwi = –η ∂E/∂wi    (5)

which makes it clear that steepest descent is achieved by
altering each component wi of the weight vector in proportion to
∂E/∂wi.
The vector of ∂E/∂wi derivatives that form the gradient can be
obtained by differentiating E from Equation (2), as
18
∂E/∂wi = Σd∈D (td – od)(–xid)    (6)

where xid denotes the single input component xi for the training
example d.
We now have an equation that gives ∂E/∂wi in terms of the linear unit
inputs xid, output od and the target value td associated with the training
example. Substituting Equation (6) into Equation (5) yields the weight
update rule for gradient descent:

19
Δwi = η Σd∈D (td – od) xid    (7)

◼ The gradient descent algorithm for training linear units is as
follows:
❑ Pick an initial random weight vector. Apply the linear unit to all
training examples, then compute Δwi for each weight according to
Equation (7). Update each weight wi by adding Δwi, then repeat
the process. The algorithm is given in Figure 6.
◼ Because the error surface contains only a single global
minimum, this algorithm will converge to a weight vector with
minimum error, regardless of whether the training examples
are linearly separable, provided a sufficiently small η is used.
◼ If η is too large, the gradient descent search runs the risk of
overstepping the minimum in the error surface rather than
settling into it. For this reason, one common modification to
the algorithm is to gradually reduce the value of η as the
number of gradient descent steps grows.

20
Figure 6. Gradient Descent algorithm for training a linear unit.
The weight-update equations used in the figure:

Δwi ← Δwi + η(t – o)xi    (8)
wi ← wi + Δwi    (9)
21
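The algorithm of Figure 6 can be sketched in Python as follows (assuming numpy; the epoch count and learning rate are illustrative):

import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=500):
    # Batch gradient descent for a linear unit: the error is summed over
    # all training examples before the weights are updated (Equation 7).
    X = np.hstack([np.ones((len(X), 1)), X])   # constant input x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        o = X @ w                     # linear unit outputs for all of D
        w += eta * X.T @ (t - o)      # Delta wi = eta * sum_d (td - od) xid
    return w

# Noise-free example: the line t = 1 + 2x, so w converges toward [1, 2].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, t))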
Stochastic Approximation to Gradient Descent
◼ The key difficulties in applying gradient descent are:
❑ Converging to a local minimum can sometimes be quite slow.
❑ If there are multiple local minima in the error surface, then there is no
guarantee that the procedure will find the global minimum.
◼ One common variation on gradient descent to alleviate these
difficulties is called incremental gradient descent (or stochastic
gradient descent).
◼ The key differences between standard gradient descent and
stochastic gradient descent are:
❑ In standard gradient descent, the error is summed over all examples
before updating weights, whereas in stochastic gradient descent
weights are updated upon examining each training example.
❑ The modified training rule is like the training rule given by Equation (7)
except that as we iterate through each example we update the weight
according to
Δwi = η(t – o)xi    (10)
where t, o and xi are the target value, unit output, and the ith input.

22
◼ To modify the gradient descent algorithm in Figure 6 to
implement this stochastic approximation, Equation (9),
wi ← wi + Δwi, is simply deleted, and Equation (8),
Δwi ← Δwi + η(t – o)xi, is replaced by
wi ← wi + η(t – o)xi.
◼ One way to view this stochastic gradient descent is to consider a
distinct error function defined for each individual training example
d as follows:

Ed(w) = ½ (td – od)²

where td and od are the target value and the unit output value for
training example d.
We come to the stochastic gradient descent algorithm (Figure 7).

23
◼ Summing over multiple examples in standard gradient descent
requires more computation per weight update step. On the other hand,
because it uses the true gradient, standard gradient descent is often
used with a larger step size per weight update than stochastic
gradient descent.

Figure 7. Stochastic
gradient descent
algorithm

wi ← wi + η(t – o)xi    (11)

24
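A sketch of the stochastic version; the only change from the batch sketch above is that the weights are updated after each training example (Equation 10):

import numpy as np

def stochastic_gradient_descent(X, t, eta=0.01, epochs=500):
    X = np.hstack([np.ones((len(X), 1)), X])   # constant input x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)
            w += eta * (target - o) * x   # per-example update, Equation (10)
    return w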
◼ Stochastic gradient descent (i.e. incremental mode) can
sometimes avoid falling into local minima because it
uses the gradients of the various Ed rather than the overall
gradient of E to guide its search.
◼ Both stochastic and standard gradient descent methods
are commonly used in practice.
Summary
◼ Perceptron training rule
❑ Perfectly classifies the training data
❑ Converges, provided the training examples are linearly
separable

◼ Delta rule using gradient descent
❑ Converges asymptotically to the minimum-error hypothesis
❑ Converges regardless of whether the training data are linearly
separable

25
4. MULTILAYER NETWORKS AND THE
BACKPROPAGATION ALGORITHM
◼ Single perceptrons can only express linear decision surfaces.
In contrast, the kind of multilayer networks learned by the
backpropagation algorithm are capable of expressing a rich
variety of nonlinear decision surfaces.
◼ This section discusses how to learn such multilayer networks
using a gradient descent algorithm similar to that discussed in
the previous section.
A Differentiable Threshold Unit
◼ What type of unit should we use as the basis for multilayer networks?
• Perceptron: not differentiable -> can't use gradient descent
• Linear unit: multiple layers of linear units -> still produce only linear
functions
• Sigmoid unit: smoothed, differentiable threshold function

26
Figure 7. The sigmoid threshold unit.
The sigmoid function:

σ(x) = 1 / (1 + e^(–x))    (12)

27
The sigmoid unit
◼ Like the perceptron, the sigmoid unit first computes a linear
combination of its inputs, then applies a threshold to the result. In
the case of sigmoid unit, however, the threshold output is a
continuous function of its input.
◼ The sigmoid function σ(x) is also called the logistic function.
◼ Interesting properties:
• The output ranges between 0 and 1, increasing monotonically with its
input.
• The derivative is easily expressed in terms of the output:
dσ(x)/dx = σ(x)(1 – σ(x)).
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation

28
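A quick numerical check of the sigmoid and of the derivative property used by backpropagation (a sketch, assuming numpy):

import numpy as np

def sigmoid(x):
    # logistic function: output in (0, 1), monotonically increasing
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
analytic = sigmoid(x) * (1 - sigmoid(x))   # sigma'(x) = sigma(x)(1 - sigma(x))
numeric = np.gradient(sigmoid(x), x)       # finite-difference estimate
print(np.max(np.abs(analytic - numeric)))  # small, up to discretization error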
The Backpropagation (BP) Algorithm
◼ The BP algorithm learns the weights for a multilayer network, given
a network with a fixed set of units and interconnections. It employs
gradient descent to attempt to minimize the squared error between
the network output values and the target values for these outputs.
◼ Because we are considering networks with multiple output units
rather than a single unit as before, we begin by redefining E to sum
the errors over all of the network output units:

Ed(w) = ½ Σk∈outputs (tkd – okd)²    (13)

where outputs is the set of output units in the network, and tkd and okd
are the target and output values associated with the k-th output unit
and training example d.

29
The Backpropagation Algorithm (cont.)

◼ The BP algorithm is presented in Figure 8. The algorithm


applies to layered feed-forward networks containing 2
layers of sigmoid units, with units at each layer
connected to all units from the preceding layer.
◼ This is an incremental gradient descent version of
Backpropagation.
◼ The notation is as follows:
❑ xji denotes the input from node i to unit j, and wji denotes
the corresponding weight.
❑ δn denotes the error term associated with unit n. It plays a
role analogous to the quantity (t – o) in our earlier
discussion of the delta training rule.

30
The Backpropagation algorithm (Fig. 8)
Initialize all weights to small random numbers
Until satisfied, do
  for each training example, do
    1. Input the training example to the network and compute the
       network outputs
    2. For each output unit k
       δk ← ok(1 – ok)(tk – ok)    (14)
    3. For each hidden unit h
       δh ← oh(1 – oh) Σk∈outputs wkh δk    (15)
    4. Update each network weight wji
       wji ← wji + Δwji
       where Δwji = η δj xji    (16)

31
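A compact Python sketch of this algorithm for one hidden layer (assuming numpy; the vectorized form, variable names, and initialization range are our own choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden, eta=0.5, epochs=5000, seed=0):
    # X: (m, n_in) inputs; T: (m, n_out) targets in [0, 1].
    # Column 0 of each weight matrix is a bias weight (constant input 1).
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, X.shape[1] + 1))
    W_o = rng.uniform(-0.05, 0.05, (T.shape[1], n_hidden + 1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                  # 1. forward pass
            o_h = sigmoid(W_h @ x1)
            h1 = np.append(1.0, o_h)
            o_k = sigmoid(W_o @ h1)
            d_k = o_k * (1 - o_k) * (t - o_k)       # 2. Equation (14)
            d_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ d_k)   # 3. Equation (15)
            W_o += eta * np.outer(d_k, h1)          # 4. Equation (16)
            W_h += eta * np.outer(d_h, x1)
    return W_h, W_o

For example, with X the four XOR input patterns and T their labels, a 2-input, 2-hidden, 1-output network trained this way can learn the nonlinear XOR surface, though success can depend on the random seed and the number of epochs.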
◼ In the BP algorithm, step 1 propagates the input forward through the
network, and steps 2, 3 and 4 propagate the errors backward
through the network.
◼ The main loop of BP repeatedly iterates over the training examples.
For each training example, it applies the ANN to the example,
calculates the error of the network output for this example,
computes the gradient with respect to the error on the example, then
updates all weights in the network. This gradient descent step is
iterated until the ANN performs acceptably well.
◼ A variety of termination conditions can be used to halt the
procedure.
❑ One may choose to halt after a fixed number of iterations through
the loop, or
❑ once the error on the training examples falls below some
threshold, or
❑ once the error on a separate validation set of examples meets
some criterion.

32
An Example of the
Back-propagation
algorithm

Figure 8: An example of a multilayer feed-forward neural network. Assume
that the learning rate η is 0.9 and the first training example is X = (1,0,1),
whose class label is 1.

Note: The sigmoid function is applied at the hidden layer and the output layer.

33
Table 1: Initial input and weight values
x1 x2 x3 w14 w15 w24 w25 w34 w35 w46 w56 w04 w05 w06
-----------------------------------------------------------------------------------
1  0  1  0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2 -0.4 0.2 0.1

Table 2: The net input and output calculation
Unit j   Net input Ij                                  Output Oj
-----------------------------------------------------------------------------------
4        0.2 + 0 - 0.5 - 0.4 = -0.7                    1/(1+e^0.7) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                    1/(1+e^(-0.1)) = 0.525
6        (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105   1/(1+e^0.105) = 0.474

Table 3: Calculation of the error at each node
Unit j   δj
-----------------------------------------------------------------------------
6        (0.474)(1-0.474)(1-0.474) = 0.1311
5        (0.525)(1-0.525)(0.1311)(-0.2) = -0.0065
4        (0.332)(1-0.332)(0.1311)(-0.3) = -0.0087
34
Table 4: Calculation for weight updating
Weight New value
------------------------------------------------------------------------------
w46 -0.3+(0.9)(0.1311)(0.332)= -0.261
w56 -0.2+(0.9)(0.1311)(0.525)= -0.138
w14 0.2 +(0.9)(-0.0087)(1) = 0.192
w15 -0.3 +(0.9)(-0.0065)(1) = -0.306
w24 0.4+ (0.9)(-0.0087)(0) = 0.4
w25 0.1+ (0.9)(-0.0065)(0) = 0.1
w34 -0.5+ (0.9)(-0.0087)(1) = -0.508
w35 0.2 + (0.9)(-0.0065)(1) = 0.194
w06 0.1 + (0.9)(0.1311) = 0.218
w05 0.2 + (0.9)(-0.0065)=0.194
w04 -0.4 +(0.9)(-0.0087) = -0.408

35
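The hand calculations in Tables 2-4 can be reproduced with a few lines of numpy (a sketch; the matrix grouping of the weights is ours):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, 0.0, 1.0])                 # training example X = (1, 0, 1)
W_h = np.array([[0.2, 0.4, -0.5],             # w14, w24, w34
                [-0.3, 0.1, 0.2]])            # w15, w25, w35
b_h = np.array([-0.4, 0.2])                   # w04, w05
w_o = np.array([-0.3, -0.2])                  # w46, w56
b_o = 0.1                                     # w06
eta, t = 0.9, 1.0

o_h = sigmoid(W_h @ x + b_h)                  # Table 2: [0.332, 0.525]
o = sigmoid(w_o @ o_h + b_o)                  # Table 2: 0.474
d_o = o * (1 - o) * (t - o)                   # Table 3: 0.1311
d_h = o_h * (1 - o_h) * w_o * d_o             # Table 3: [-0.0087, -0.0065]

print(w_o + eta * d_o * o_h)                  # Table 4: new w46, w56
print(W_h + eta * np.outer(d_h, x))           # Table 4: new w14 ... w35
print(b_o + eta * d_o, b_h + eta * d_h)       # Table 4: new w06, w04, w05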
Adding Momentum
◼ Because BP is a widely used algorithm, many variations have been
developed. The most common is to alter the weight-update rule in
Step 4 of the algorithm by making the weight update on the n-th
iteration depend partially on the update that occurred during the
(n–1)-th iteration, as follows:

Δwji(n) = η δj xji + α Δwji(n–1)    (18)

Here Δwji(n) is the weight update performed during the n-th iteration
through the main loop of the algorithm.
- The n-th iteration update depends on the (n–1)-th iteration.
- α: a constant between 0 and 1 called the momentum.
Role of the momentum term:
- Keep the ball rolling through small local minima in the error
surface.
- Gradually increase the step size of the search in regions where
the gradient is unchanging, thereby speeding convergence.

36
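As a sketch, the momentum update of Equation (18) only requires remembering the previous update (the function name and default values are illustrative):

import numpy as np

def momentum_step(w, grad_term, prev_update, eta=0.05, alpha=0.9):
    # grad_term is the usual backpropagation term delta_j * x_ji;
    # Equation (18): update(n) = eta * grad_term + alpha * update(n-1)
    update = eta * grad_term + alpha * prev_update
    return w + update, update   # return the update for the next iteration

Each weight matrix then carries its own prev_update array, initialized to zeros.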
Derivation of the Backpropagation Rule
Recall from Equation (13):

Ed(w) = ½ Σk∈outputs (tkd – okd)²

Stochastic gradient descent involves iterating through the training
examples one at a time.
In other words, for each training example d, every wji is updated by
adding to it Δwji:

Δwji = –η ∂Ed/∂wji    (21)

where Ed is the error on training example d, summed over all output
units.

37
Notation
◼ xji = the i-th input to unit j
◼ wji = the weight associated with the i-th input to unit j
◼ netj = Σi wji xji (the weighted sum of inputs for unit j)
◼ oj = the output computed by unit j
◼ tj = the target output for unit j
◼ σ = the sigmoid function
◼ outputs = the set of units in the final layer of the network
◼ Downstream(j) = the set of units whose immediate inputs
include the output of unit j.

Now we derive an expression for ∂Ed/∂wji in order to implement
the stochastic gradient descent rule in Equation (21).

38
Derivation of the Backpropagation Rule (cont.)
◼ To begin, notice that weight wji can influence the rest of
the network only through netj. So, we can use the chain rule
to write:

∂Ed/∂wji = (∂Ed/∂netj)(∂netj/∂wji) = (∂Ed/∂netj) xji    (22)

Now our remaining task is to derive a convenient
expression for ∂Ed/∂netj.
We consider two cases: (1) the case where unit j is an
output unit and (2) the case where j is an internal unit.

39
Case 1: Training rule for output unit weights.
◼ Just as wji can influence the rest of the network only through netj,
netj can influence the network only through oj. So, we can use the
chain rule again to write:

∂Ed/∂netj = (∂Ed/∂oj)(∂oj/∂netj)    (23)

To begin, consider the first term in Equation (23):

∂Ed/∂oj = ∂/∂oj [ ½ Σk∈outputs (tk – ok)² ]

The derivatives on the right-hand side will be zero for all output units k
except when k = j.
40
We have:

∂Ed/∂oj = –(tj – oj)    (24)

Next consider the second term in Equation (23). Since oj = σ(netj),
the derivative ∂oj/∂netj is just the derivative of the sigmoid function,
which we have already noted is equal to σ(netj)(1 – σ(netj)).
Therefore,

∂oj/∂netj = oj(1 – oj)    (25)
41
◼ Substituting expressions (24) and (25) into (23), we obtain:

∂Ed/∂netj = –(tj – oj) oj(1 – oj)    (26)

And combining this with Equations (21) and (22), we have the
stochastic gradient descent rule for output units:

Δwji = η (tj – oj) oj(1 – oj) xji    (27)

Note this training rule is exactly the weight update rule implemented
by Equations (14) and (16) in the Backpropagation algorithm.
Furthermore, we can see that δk in Equation (14) is equal to the
quantity –∂Ed/∂netk.

42
Case 2: Training rule for Hidden Unit Weights

◼ In the case where j is a hidden unit in the network, the
derivation of the training rule for wji must take into
account the indirect ways in which wji can influence the
network outputs and hence Ed.
◼ For this reason, we will find it useful to refer to the set of
all units immediately downstream of unit j in the network.
We denote this set of units by Downstream(j).
◼ Notice that netj can influence the network outputs (and
therefore Ed) only through the units in Downstream(j).
Therefore, we can write

∂Ed/∂netj = Σk∈Downstream(j) (∂Ed/∂netk)(∂netk/∂netj)
          = Σk∈Downstream(j) (–δk) wkj oj(1 – oj)

43
Rearranging terms and using δj to denote –∂Ed/∂netj, we have

δj = oj(1 – oj) Σk∈Downstream(j) δk wkj

and Δwji = η δj xji
44
Hyperbolic Tangent Activation function
◼ Besides the sigmoid, we can use tanh as the activation function:
tanh(x) = (e^x – e^(–x)) / (e^x + e^(–x))
◼ Output ranges between –1 and 1, increasing monotonically with
its input.
◼ Property: tanh′(x) = 1 – [tanh(x)]².

Fig. 9.

45
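The derivative property can be verified numerically in a couple of lines (a sketch, assuming numpy):

import numpy as np

x = np.linspace(-3, 3, 61)
analytic = 1 - np.tanh(x) ** 2             # tanh'(x) = 1 - [tanh(x)]^2
numeric = np.gradient(np.tanh(x), x)       # finite-difference estimate
print(np.max(np.abs(analytic - numeric)))  # small, up to discretization error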
5. REMARKS ON THE
BACKPROPAGATION ALGORITHM

◼ Convergence and Local Minima


❑ Gradient descent to some local minimum
◼ Perhaps not global minimum...
❑ Heuristics to alleviate the problem of local
minima
◼ Add momentum
◼ Use stochastic gradient descent rather than true
gradient descent.
◼ Train multiple nets with different initial weights
using the same data.

46
6. NEURAL NETWORK APPLICATION
DEVELOPMENT
The development process for an ANN application has eight steps.
◼ Step 1: (Data collection) The data to be used for the training and
testing of ANN are collected. We have to consider that the
particular problem is amenable to ANN solution and that
adequate data exist and can be obtained.
◼ Step 2: (Training and testing data separation) The available data
are divided into training and testing data sets. For a moderately
sized data set, 80% of the data are randomly selected for training,
10% for testing, and 10% for secondary testing.
◼ Step 3: (Network architecture) A network architecture and a
learning method (training algorithm) are selected. Important
considerations are the exact number of nodes and the number of
layers.

47
◼ Step 4: (Parameter tuning and weight initialization) There are
parameters for tuning ANN to the desired learning
performance. Part of this step is initialization of the network
weights and parameters, followed by modification of the
parameters as training performance feedback is received.
❑ Often, the initial values are important in determining the
effectiveness and length of training.
◼ Step 5: (Data transformation) Transforms the application data
into the type and format required by the ANN.
◼ Step 6: (Training) Training is conducted iteratively by
presenting input and known output data to the ANN. The ANN
computes the outputs and adjusts the weights until the
computed outputs are within an acceptable tolerance of the
known outputs for the input cases.

48
◼ Step 7: (Testing) Once the training has been completed,
it is necessary to test the network.
❑ The testing examines the performance of ANN using the
derived weights by measuring the ability of the network to
classify the testing data correctly.
◼ Step 8: (Implementation) Now a stable set of weights is
obtained.
❑ Now ANN can reproduce the desired output given inputs like
those in the training set.
❑ The ANN is ready to use as a stand-alone system or as part of
another software system where new input data will be
presented to it and its output will be a recommended decision.

49
7. BENEFITS AND LIMITATIONS OF
NEURAL NETWORKS
7.1 Benefits of ANNs
◼ Usefulness for pattern recognition, classification, generalization,
abstraction and interpretation of incomplete and noisy inputs
(e.g. handwriting recognition, image recognition, voice and
speech recognition, weather forecasting).
◼ Providing some characteristics to problem solving that are
difficult to simulate using the logical, analytical techniques of
expert systems and standard software technologies.
◼ Ability to solve new kinds of problems. ANNs are particularly
effective at solving problems whose solutions are difficult to
define. This opened up a new range of decision support
applications formerly either difficult or impossible to computerize.

50
◼ Robustness. ANNs tend to be more robust than their
conventional counterparts. They have the ability to cope with
incomplete or fuzzy data. ANNs can be very tolerant of faults if
properly implemented.
◼ Fast processing speed. Because they consist of a large number
of massively interconnected processing units, all operating in
parallel on the same problem, ANNs can potentially operate at
considerable speed (when implemented on parallel processors).
◼ Flexibility and ease of maintenance. ANNs are very flexible in
adapting their behavior to new and changing environments. They
are also easier to maintain, with some having the ability to learn
from experience to improve their own performance.

7.2 Limitations of ANNs

◼ ANNs do not produce an explicit model, even though new cases can
be fed into them and new results obtained.
◼ ANNs lack explanation capabilities. Justification for results is
difficult to obtain because the connection weights usually do not
have obvious interpretations.

51
Network parameters
The following parameters of the ANN are chosen for a closer
inspection:
◼ The number of input units:
❑ The number of input units determines the number of periods the
ANN “looks into the past” when predicting the future. The number
of input units is equivalent to the size of the input window.
❑ The number of input units is equivalent to the number of
attributes of the input sample.
◼ The number of output units: depends on the number of classes

◼ The number of hidden units: Whereas it has been shown that
one hidden layer is sufficient to approximate any continuous
function, the number of hidden units necessary is “not known in
general”. Some examples of ANN architectures that have been
used for time series prediction are 8-8-1, 6-6-1, and 5-5-1.
◼ Note: Ash’s method can be used to estimate the number of units
in the hidden layer.

52
Network parameters (cont.)
◼ The learning rate η (0 < η < 1) is a scaling factor that tells the
learning algorithm how strongly the weights of the connections
should be adjusted for a given error. A higher η can be used to
speed up the learning process, but if η is too high, the algorithm
will skip over the optimum weights. (The learning rate η is constant
across presentations.)
◼ The momentum parameter α (0 < α < 1) is another number that
affects the gradient descent of the weights: the momentum term
keeps the direction of the previous step, thus avoiding descent
into local minima. (The momentum term is constant across
presentations.)

53
8. SOME ANN APPLICATIONS
ANN application areas:
◼ Face detection

◼ Face recognition

◼ Object recognition

◼ Handwritten character/digit recognition

◼ Speech recognition

◼ Image retrieval
◼ Tax form processing to identify tax fraud

◼ Enhancing auditing by finding irregularities

◼ Bankruptcy prediction

54
ANN applications (cont.)

◼ Customer credit scoring


◼ Loan approvals
◼ Credit card approval and fraud detection
◼ Financial prediction
◼ Energy forecasting
◼ Computer access security (intrusion detection and
classification of attacks)
◼ Fraud detection in mobile telecommunication networks

55
ANN software tools

◼ In WEKA, the data mining software, there is an
ANN tool for classification.
◼ Spice-Neuro software
◼ Neural Network Toolbox – MATLAB
◼ Neural Network in R
◼ Neural Network in Scikit-Learn

56
9. Time Series Prediction using Neural network
◼ Time series prediction: given an existing time series, we
model the time series in order to make accurate forecasts

◼ Example time series


❑ Financial (e.g., stock price, exchange rates)
❑ Physically observed (e.g., weather, sunspots, river flow)
◼ Why is it important?
❑ Preventing undesirable events by forecasting the event,
identifying the circumstances preceding the event, and taking
corrective action so the event can be avoided.
❑ Forecasting undesirable, yet unavoidable, events to preemptively
lessen their impact, e.g. flood prediction
❑ Profiting from forecasting (e.g., financial markets)

57
◼ Why is it difficult?
❑ Limited quantity of data (Observed data series sometimes too
short to partition)
❑ Noise (Erroneous data points, obscuring component)
❑ Moving Average
❑ Nonstationarity (Fundamentals change over time,
nonstationary)
❑ Forecasting method selection (Statistics, Artificial intelligence)
◼ Neural networks have been widely used as time series
forecasters: most often these are feed-forward networks
which employ a sliding window over the input sequence.
◼ The neural network sees the time series X1,…,Xn in the form
of many mappings of an input vector to an output value.

58
Prediction with Neural Networks
◼ A number of adjoining data points of the time series (the input
window Xt-s+1, Xt-s+2,…, Xt) are used as activation levels for the
input units of the input layer.

◼ The size s of the input window corresponds to the number
of input units of the neural network.

How to train a neural network for time series prediction


◼ In the forward path, these activation levels are propagated over
one hidden layer to one output unit. The error used for the
backpropagation learning algorithm is now computed by
comparing the value of the output unit with the transformed value
of the time series at time t+1. This error is propagated back to
the connections between output and hidden layer and to those
between hidden and input layer. After all weights have been
updated accordingly, one presentation has been completed.

59
Fig.10 Learning a time series

EX: A time series: 2 3 5 4 6 8 5 7 11 13 9 7

60
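A sketch of how the sliding window turns the series above into training pairs (the function name is ours; the window size s = 3 is illustrative):

import numpy as np

def sliding_window(series, s):
    # Each input row holds s adjoining values X(t-s+1) .. X(t);
    # the target is the next value X(t+1).
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + s] for i in range(len(series) - s)])
    y = series[s:]
    return X, y

X, y = sliding_window([2, 3, 5, 4, 6, 8, 5, 7, 11, 13, 9, 7], s=3)
print(X[0], y[0])   # [2. 3. 5.] -> 4.0: predict the value after the window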


Prediction with Neural Networks (cont.)

◼ Training a neural network with the backpropagation algorithm
usually requires that all the training samples be presented
many times; one pass through all the samples is called an
epoch. For example, the ANN may need 60 to 138 epochs.
◼ Neural networks can perform time series prediction with
high precision since they can capture the non-linear
characteristics of time series data.

61
Evaluating Methods of forecasts

◼ Divide the data into two sections - an initialization part


and a test part
❑ Use the initialization data set to train the neural network
❑ Use the trained neural network to forecast the test data
set and determine the forecast errors to evaluate the
predictive effectiveness.
❑ Evaluate errors (MAE, MPE, MSE, MAPE)

62
Evaluation Methods of Forecasts
◼ There are three measures of accuracy of the prediction models:
MAPE, MAE and MSE.
◼ For all three measures, the smaller the value, the better the fit of
the model.
◼ Use these statistics to compare the predictive accuracy of the
different methods.
◼ MAPE (Mean Absolute Percentage Error) measures the accuracy
of predicted time series values. It expresses accuracy as a
percentage.

MAPE = ( Σt=1..n |(yt – y′t) / yt| / n ) × 100

where yt is the actual value, y’t is the predicted value and n is the
number of observations.

63
MAE

◼ MAE (Mean Absolute Error) expresses accuracy in the
same units as the data, which helps conceptualize the
amount of error.

MAE = Σt=1..n |yt – y′t| / n
where yt is the actual value, y’t is the predicted value and n is the
number of observations.

64
MSE

◼ MSE (Mean Squared Error) is more sensitive to an unusually
large forecast error than MAE.

MSE = Σt=1..n (yt – y′t)² / n

where yt is the actual value, y′t is the predicted value and
n is the number of observations.

65
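The three error measures can be computed directly (a sketch, assuming numpy; y holds the actual values and y_pred the predictions):

import numpy as np

def mape(y, y_pred):
    # Mean Absolute Percentage Error, expressed in percent
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y - y_pred) / y)) * 100

def mae(y, y_pred):
    # Mean Absolute Error, in the same units as the data
    return np.mean(np.abs(np.asarray(y, float) - np.asarray(y_pred, float)))

def mse(y, y_pred):
    # Mean Squared Error; weights unusually large errors more than MAE
    return np.mean((np.asarray(y, float) - np.asarray(y_pred, float)) ** 2)

y_true, y_pred = [6, 8, 5, 7], [5.5, 7.0, 6.0, 7.5]
print(mape(y_true, y_pred), mae(y_true, y_pred), mse(y_true, y_pred))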
R-squared

R-squared (R²) measures the accuracy of predicted time series
values. Its value is in the range [0, 1].

66
Terminology
◼ Neural network: mạng nơ ron, feed-forward neural network:
mạng nơ ron truyền thẳng, weight: trọng số, hidden layer: tầng
ẩn, bias: độ lệch, linear unit: đơn vị tuyến tính, linear
combination: tổ hợp tuyến tính, threshold: ngưỡng, weight
vector: véc tơ trọng số, perceptron training rule: luật huấn luyện
perceptron, linearly separable data: dữ liệu khả tách một cách
tuyến tính, gradient descent: suy giảm độ dốc, incremental
gradient descent: suy giảm độ dốc gia tăng, error function: hàm
lỗi, learning rate: hệ số học, back-propagation: lan truyền
ngược, error term: toán hạng sai số, sigmoid function: hàm
sigmoid, transfer function: hàm truyền, momentum constant:
hằng số quán tính

68
