
Widrow & Hoff Learning Rule
Dr. Nitish Katal

1
Outline

• Delta Learning Rule


• Single & Multiple Output Layers
• Choice of Learning Rate

2
Learning by Error Minimization
• What are the best learning parameters?
• Learn from your mistakes.

• For an input (x_i, t) in the training set,
• The cost of a mistake is (t − y_IN), i.e. (t − Σ_i x_i w_i)

• Define the cost (or loss) for a particular weight vector w to be:
• The sum of squared costs over the training set

E = \frac{1}{2}\,(t - y_{IN})^2

• One strategy for learning:
• Find the w with the least cost on this data
3
Delta Rule /
Widrow & Hoff Learning rule
• Learning: minimizing the mean squared error

E = \frac{1}{2}\,(t - y_{IN})^2

• Different strategies exist for learning by optimization
• Gradient descent is a popular algorithm

• The Delta rule for adjusting the weight for each pattern is given as:

\Delta w_I = \alpha\,(t - y_{IN})\,x_I

4
Delta Rule
Now:
x : vector of activations of the inputs
y_IN : net input to the output unit Y
t : target

y_{IN} = \sum_{i=1}^{n} x_i w_i

Then, the squared error for a particular training pattern is:

E = \frac{1}{2}\,(t - y_{IN})^2

where E is a function of all of the weights w_i, i = 1, …, n
5
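To make these definitions concrete, here is a minimal Python sketch (the names net_input, squared_error, and the sample values are illustrative assumptions, not from the slides) that computes the net input y_IN and the squared error E for one training pattern:

```python
# Minimal sketch: net input and squared error for a single training pattern.
def net_input(x, w):
    # y_IN = sum_i x_i * w_i  (the bias term appears later, in the ADALINE algorithm)
    return sum(xi * wi for xi, wi in zip(x, w))

def squared_error(t, y_in):
    # E = 1/2 * (t - y_IN)^2
    return 0.5 * (t - y_in) ** 2

x = [1, -1]        # one bipolar input pattern (assumed example)
w = [0.3, 0.2]     # current weights (assumed example)
t = -1             # target for this pattern

y_in = net_input(x, w)
print(y_in, squared_error(t, y_in))   # -> 0.1 and 0.605
```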
Delta Rule … (contd.)
Gradient of E:

\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n} \right)

• The gradient points in the direction of the most rapid increase in E;
• The opposite direction gives the most rapid decrease in the error, so each weight is adjusted in the direction of -\frac{\partial E}{\partial w_I}.

Since:

\frac{\partial E}{\partial w_I} = \frac{\partial}{\partial w_I}\left[ \frac{1}{2}\,(t - y_{IN})^2 \right]
                               = -(t - y_{IN})\,\frac{\partial y_{IN}}{\partial w_I}
                               = -(t - y_{IN})\,x_I
6
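As a sanity check on this result, the following Python sketch (all names and values are assumptions for illustration) compares the analytic gradient −(t − y_IN)·x_I with a central finite-difference estimate for one pattern; the two should agree closely:

```python
# Sketch: verify dE/dw_I = -(t - y_IN) * x_I numerically for one pattern.
def net_input(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def error(x, w, t):
    return 0.5 * (t - net_input(x, w)) ** 2

x, w, t = [1.0, -1.0], [0.2, 0.4], -1.0   # assumed example values
eps = 1e-6

for I in range(len(w)):
    analytic = -(t - net_input(x, w)) * x[I]
    w_plus = list(w);  w_plus[I] += eps
    w_minus = list(w); w_minus[I] -= eps
    numeric = (error(x, w_plus, t) - error(x, w_minus, t)) / (2 * eps)
    print(I, analytic, numeric)
```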
Delta Rule … (contd.)
Derivation:

\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n} \right)

So:

\frac{\partial E}{\partial w_I} = \frac{\partial}{\partial w_I}\left[ \frac{1}{2}\,(t - y_{IN})^2 \right]

We know that:

y_{IN} = \sum_{i=1}^{n} x_i w_i

Thus:

\frac{\partial E}{\partial w_I} = \frac{\partial}{\partial w_I}\left[ \frac{1}{2}\left( t - \sum_{i=1}^{n} x_i w_i \right)^2 \right]
                               = \left( t - \sum_{i=1}^{n} x_i w_i \right)\frac{\partial}{\partial w_I}\left( t - \sum_{i=1}^{n} x_i w_i \right)
                               = -\left( t - \sum_{i=1}^{n} x_i w_i \right) x_I

• The gradient points in the direction of the most rapid increase in E;
• The opposite direction gives the most rapid decrease in the error.
7
Summary : Delta/WH/LMS Rule

\Delta w_I = \alpha\,(t - y_{IN})\,x_I

w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i

b(new) = b(old) + \alpha\,(t - y_{IN})

8
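A minimal Python sketch of this update rule for a single output unit (the function and variable names are illustrative, not from the slides):

```python
# One delta-rule (Widrow-Hoff / LMS) update for a single training pattern:
#   w_i(new) = w_i(old) + alpha * (t - y_IN) * x_i
#   b(new)   = b(old)   + alpha * (t - y_IN)
def delta_update(w, b, x, t, alpha):
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    err = t - y_in
    w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    b = b + alpha * err
    return w, b

w, b = delta_update(w=[0.1, 0.1], b=0.1, x=[1, 1], t=1, alpha=0.1)
print(w, b)   # -> approximately [0.17, 0.17] and 0.17 for these assumed starting values
```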
Choice of Learning Rate

9
Summary
• Real-life problems are complex and non-convex.
• Purpose of NN learning or training:
  • Minimise the output errors for a particular set of training data
  • By adjusting the network weights w_ij.
• Error function E(w_ij)
  • "Measures" how far the current network is from the desired one.
• Gradient of the error function
  • The partial derivatives ∂E(w_ij)/∂w_ij
  • Guide the direction to move in weight space to reduce the error.
• The learning rate α specifies:
  • The step sizes we take in weight space for each iteration of the weight-update equation.
• Keep iterating
  • Through weight space until the errors are 'small enough' (a loop sketch follows below).
• Choice of activation functions with derivatives

11
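The iteration summarized above can be sketched as a short training loop; this is only an illustration under assumed defaults (α = 0.1, an error tolerance, and a maximum epoch count as a safety cap), not a prescribed implementation:

```python
# Sketch of the overall cycle: apply delta-rule updates over the training set
# and keep iterating until the summed error is 'small enough' (or max_epochs is hit).
def train(patterns, alpha=0.1, tol=0.05, max_epochs=100):
    n = len(patterns[0][0])
    w, b = [0.0] * n, 0.0
    for epoch in range(max_epochs):
        total_error = 0.0
        for x, t in patterns:
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))
            err = t - y_in
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
            b += alpha * err
            total_error += 0.5 * err ** 2
        if total_error < tol:      # stop once the errors are 'small enough'
            break
    return w, b, total_error
```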
ADALINE
Learning Rule & Architecture

12
Outline

• ADALINE
• Architecture
• Learning Algorithm
• Example
ADALINE : Introduction
• Adaline: Adaptive Linear Neuron
• A NN having a single linear unit.
• Developed by Widrow and Hoff in 1960.

• Features of Adaline
  • Uses a bipolar activation function:

    f(y_{IN}) = \begin{cases} +1 & \text{if } y_{IN} \ge 0 \\ -1 & \text{if } y_{IN} < 0 \end{cases}

  • Uses the delta rule for training, to minimize the MSE between the actual output and the desired/target output.
  • The weights and the bias are adjustable.
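The bipolar activation above can be written as a one-line Python helper (the name bipolar_activation is an assumption for illustration):

```python
def bipolar_activation(y_in):
    # f(y_IN) = +1 if y_IN >= 0, else -1
    return 1 if y_in >= 0 else -1
```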
Algorithm
Step 0: Initialize all weights w_i to small random values (i = 1 … n, where n is the number of input neurons).
        Set the learning rate α.
Step 1: While the stopping criterion is FALSE, do Steps 2 - 6.
Step 2: For each bipolar training pair, inputs s and target t, do Steps 3 - 5.
Step 3: Set the activations of the input units:
        x_i = s_i (i = 1 … n)
Step 4: Compute the net input to the output unit:
        y_{IN} = b + \sum_i x_i w_i
Step 5: Update the weights and bias (i = 1 … n):
        w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i
        b(new) = b(old) + \alpha\,(t - y_{IN})
Step 6: Test the stopping criterion:
        If the largest weight change in Step 2 is smaller than a specified tolerance, then STOP; else continue.
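Below is a hedged Python sketch of Steps 0-6 (small random initialization, per-pattern updates, and the largest-weight-change stopping test). The defaults and the function name adaline_train are assumptions for illustration; note that the tolerance has to be chosen with the learning rate in mind, since per-pattern weight changes scale with α, and max_epochs caps the loop if the tolerance is never reached.

```python
import random

# Sketch of the ADALINE training algorithm (Steps 0-6).
def adaline_train(training_pairs, alpha=0.1, tol=0.01, max_epochs=1000):
    n = len(training_pairs[0][0])
    # Step 0: initialize weights and bias to small random values; set the learning rate.
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    b = random.uniform(-0.1, 0.1)
    for epoch in range(max_epochs):              # Step 1: repeat until stopping criterion
        largest_change = 0.0
        for s, t in training_pairs:              # Step 2: each bipolar pair (s, t)
            x = list(s)                          # Step 3: x_i = s_i
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))   # Step 4: net input
            err = t - y_in
            for i in range(n):                   # Step 5: update weights and bias
                dw = alpha * err * x[i]
                w[i] += dw
                largest_change = max(largest_change, abs(dw))
            b += alpha * err
        if largest_change < tol:                 # Step 6: largest weight change < tolerance?
            break
    return w, b
```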
Choice of Learning Rate
• Hecht-Nielsen (1990) proposed:
  • An upper bound for its value can be found from the largest eigenvalue of the correlation matrix R of the input vectors x(p):

    R = \frac{1}{P} \sum_{p=1}^{P} x(p)^T x(p)

  • α < one-half the reciprocal of the largest eigenvalue of R.
  • Commonly a small value is chosen (α = 0.1).

• For single-layer neurons with n inputs, Widrow et al. (1988) proposed:

    0.1 \le n\alpha \le 1.0
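A small numpy sketch of this bound, assuming the input vectors x(p) are stored as the rows of a matrix X (the example bipolar patterns and the variable names are assumptions for illustration):

```python
import numpy as np

# Correlation matrix R of the input vectors and the resulting learning-rate bound.
X = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]], dtype=float)  # rows are x(p)
P = X.shape[0]
R = X.T @ X / P                          # R = (1/P) * sum_p x(p)^T x(p)
lam_max = np.max(np.linalg.eigvalsh(R))  # largest eigenvalue of the symmetric matrix R
alpha_upper = 0.5 / lam_max              # alpha < one-half the reciprocal of lam_max
print(R, lam_max, alpha_upper)
```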
Application
• Bipolar Inputs & Outputs for AND Logic

x1   x2   B    Target t
 1    1   1     1
-1    1   1    -1
 1   -1   1    -1
-1   -1   1    -1
Application
Initialization:
• Set w1, w2, b, α = some small value

Update rule:
w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i
b(new) = b(old) + \alpha\,(t - y_{IN})

Inputs          Target   Net      Error        Weight Changes       Weights          MSE
x1   x2   B     t        y_IN     t − y_IN     Δw1   Δw2   Δb       w1   w2   b
1 1 1 1
-1 1 1 -1
1 -1 1 -1
-1 -1 1 -1
Sum of Error
1 1 1 1
-1 1 1 -1
1 -1 1 -1
-1 -1 1 -1
Sum of Error
Application
Initialization:
• Set w1, w2, b, α = some small value

Update rule:
w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i
b(new) = b(old) + \alpha\,(t - y_{IN})

Inputs          Target   Net      Error        Weight Changes       Weights          MSE
x1   x2   B     t        y_IN     t − y_IN     Δw1   Δw2   Δw3      w1   w2   w3
1 1 1 1
-1 1 1 -1
1 -1 1 -1
-1 -1 1 -1
Sum of Error
1 1 1 1
-1 1 1 -1
1 -1 1 -1
-1 -1 1 -1
Sum of Error
w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i
b(new) = b(old) + \alpha\,(t - y_{IN})
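The blank worksheet above can be filled in by running the updates directly. The sketch below assumes starting values w1 = w2 = b = 0.1 and α = 0.1 (illustrative choices, not values given on the slides) and prints one worksheet row per training pattern for two epochs, plus the summed squared error per epoch:

```python
# Fill in the worksheet: two epochs of delta-rule updates on bipolar AND.
data = [([1, 1], 1), ([-1, 1], -1), ([1, -1], -1), ([-1, -1], -1)]
w1 = w2 = b = 0.1      # assumed small initial values
alpha = 0.1            # assumed learning rate

for epoch in range(1, 3):
    sum_sq_error = 0.0
    for (x1, x2), t in data:
        y_in = b + x1 * w1 + x2 * w2
        err = t - y_in
        dw1, dw2, db = alpha * err * x1, alpha * err * x2, alpha * err
        w1, w2, b = w1 + dw1, w2 + dw2, b + db
        sum_sq_error += err ** 2
        print(f"{x1:3d} {x2:3d}  t={t:3d}  y_in={y_in:7.3f}  err={err:7.3f}  "
              f"dw1={dw1:6.3f}  dw2={dw2:6.3f}  db={db:6.3f}  "
              f"w1={w1:6.3f}  w2={w2:6.3f}  b={b:6.3f}")
    print(f"Epoch {epoch}: sum of squared errors = {sum_sq_error:.4f}")
```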
Application
w_i(new) = w_i(old) + \alpha\,(t - y_{IN})\,x_i
b(new) = b(old) + \alpha\,(t - y_{IN})
References
• L. Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications"
  • Chapter: 2
  • Topics: 2.4, 2.4.1, 2.4.2

• Machine Learning week 1: Cost Function, Gradient Descent and Univariate Linear
Regression
• https://bit.ly/3hSveVF

22
Thank You!

23
