
NN 2

This document provides an overview of supervised learning and perceptrons. It covers the following key concepts:
• Training and test data sets, with the training set containing inputs and target values.
• The architecture of neural networks, in particular single-layer feed-forward networks and perceptrons with one neuron.
• How perceptrons are used for binary classification by training the network to classify examples correctly into two classes.
• The perceptron learning algorithm, which searches for weights that separate the classes, reducing the number of misclassified examples over multiple iterations (epochs).


Supervised Learning

• Training and test data sets


• Training set: input & target

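As a minimal illustration (the values below are invented, not taken from the slides), a supervised data set is a collection of (input, target) pairs split into a training set and a test set:

```python
# Hypothetical toy data set: each example is (input vector, target value).
data = [
    ([1.0, 1.0], +1),
    ([1.0, -1.0], +1),
    ([-1.0, -1.0], -1),
    ([-1.0, 1.0], -1),
]

train_set = data[:3]   # used to fit the model (inputs & targets)
test_set = data[3:]    # held out to estimate generalization
```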
Perceptron: architecture
• We consider a feed-forward neural network with a single layer
• It is sufficient to study single-layer perceptrons with just one neuron:
Single layer perceptrons
• Generalization to single-layer perceptrons with more neurons is easy because:
  • the output units are independent of each other
  • each weight affects only one of the outputs
Perceptron: Neuron Model
• The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function:

  φ(v) = +1 if v ≥ 0
         -1 if v < 0
[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn and bias b feed a summing junction producing v, which passes through the sign function φ(v) to give the output y.]
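As an illustrative sketch (function and variable names are my own), the neuron model above is just a weighted sum of the inputs plus the bias, passed through the sign function:

```python
import numpy as np

def sign(v):
    """Sign activation: +1 if v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1

def perceptron_output(x, w, b):
    """Compute y = sign(w . x + b) for input x, weights w, bias b."""
    v = np.dot(w, x) + b        # induced local field v
    return sign(v)

# Example with hand-picked weights:
print(perceptron_output(np.array([1.0, -1.0]), np.array([0.5, 0.5]), 0.0))  # -> 1
```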
Perceptron for Classification
• The perceptron is used for binary
classification.
• Given training examples of classes C1 and C2, train the perceptron so that it correctly classifies the training examples:
– If the output of the perceptron is +1 then the input
is assigned to class C1
– If the output is -1 then the input is assigned to C2

Perceptron Training
• How can we train a perceptron for a
classification task?
• We try to find suitable values for the
weights in such a way that the training
examples are correctly classified.
• Geometrically, we try to find a hyper-
plane that separates the examples of the
two classes.

Perceptron Geometric View
The equation below describes a (hyper-)plane in the
input space consisting of real valued 2D vectors. The
plane splits the input space into two regions, each of
them describing one class.
  Σ_{i=1}^{2} wi xi + w0 = 0,   i.e.   w1x1 + w2x2 + w0 = 0

[Figure: the (x1, x2) plane split by the decision boundary w1x1 + w2x2 + w0 = 0; the region where w1x1 + w2x2 + w0 ≥ 0 is the decision region for class C1, the other side is the decision region for class C2.]
Example: AND
• Here is a representation of the AND
function
• White means false, black means true for
the output
• -1 means false, +1 means true for the
input

-1 AND -1 = false
-1 AND +1 = false
+1 AND -1 = false
+1 AND +1 = true
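As a quick check (these particular weights are chosen for illustration and are not given in the slides), AND is linearly separable and can be computed by a perceptron with weights w1 = w2 = 1 and bias b = -1:

```python
def sign(v):
    return 1 if v >= 0 else -1

def and_perceptron(x1, x2):
    # Hypothetical weights w1 = w2 = 1 and bias b = -1 separate true from false.
    return sign(1 * x1 + 1 * x2 - 1)

for x1 in (-1, +1):
    for x2 in (-1, +1):
        print(x1, x2, and_perceptron(x1, x2))
# Only (+1, +1) yields +1 (true); the other three inputs yield -1 (false).
```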
Example: AND continued
• A linear decision surface (i.e. a plane in
3D space) intersecting the feature
space (i.e. the 2D plane where z=0)
separates false from true instances

Example: AND continued
• Watch a perceptron learn the AND function:

Example: XOR
• Here’s the XOR function:
-1 XOR -1 = false
-1 XOR +1 = true
+1 XOR -1 = true
+1 XOR +1 = false

Perceptrons cannot learn functions that are not linearly separable, such as XOR.
Example: XOR continued
• Watch a perceptron try to learn XOR

Example: the digit 3 as an 8×8 pattern of ±1 inputs

-1 -1 -1 -1 -1 -1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1

Example
• How to train a perceptron to recognize this 3?
• Assign weight -1 to each input that equals -1 in the pattern, weight +1 to each input that equals +1, and set the bias to -63.
• Then the induced local field is +1 when the perceptron is presented with a "perfect" 3 (so the output is +1), and at most -1 for every other pattern (so the output is -1).
Example: the same 3 with one corrupted bit

-1 -1 -1 -1 -1 -1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 +1 +1 +1 -1 -1
-1 +1 -1 -1 -1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1

Example
• What if a slightly different 3 is to be recognized,
like the one in the previous slide?
• The original 3 with one bit corrupted would
produce a sum equal to –1.
• If the bias is set to -61, then this corrupted 3 is also recognized, as is every pattern with a single corrupted bit.
• The system has generalized from a single example of a corrupted pattern!
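A small sketch to verify this reasoning (the pattern follows the slides; the code itself is only illustrative): set each weight equal to the corresponding pixel of the reference 3 and compare the two bias choices.

```python
import numpy as np

# 8x8 reference pattern for the digit 3 (+1 = stroke, -1 = background).
three = np.array([
    [-1,-1,-1,-1,-1,-1,-1,-1],
    [-1,-1,+1,+1,+1,+1,-1,-1],
    [-1,-1,-1,-1,-1,+1,-1,-1],
    [-1,-1,-1,+1,+1,+1,-1,-1],
    [-1,-1,-1,-1,-1,+1,-1,-1],
    [-1,-1,-1,-1,-1,+1,-1,-1],
    [-1,-1,+1,+1,+1,+1,-1,-1],
    [-1,-1,-1,-1,-1,-1,-1,-1],
]).ravel()

w = three.copy()              # weights copy the reference pattern

corrupted = three.copy()
corrupted[33] = +1            # flip one bit (row 5, column 2, counting from 1)

for bias in (-63, -61):
    for name, x in (("perfect 3", three), ("corrupted 3", corrupted)):
        v = w @ x + bias      # induced local field
        print(bias, name, v, "->", +1 if v >= 0 else -1)
# With bias -63 only the perfect 3 gives +1; with bias -61 both give +1.
```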
Perceptron: Learning Algorithm

• Variables and parameters at iteration n of the learning algorithm:

  x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]^T
  w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]^T
  b(n) = bias
  y(n) = actual response
  d(n) = desired response
  η = learning rate parameter
The fixed-increment learning algorithm

n = 1;
initialize w(n) randomly;
while (there are misclassified training examples)
    select a misclassified augmented example (x(n), d(n));
    w(n+1) = w(n) + η d(n) x(n);
    n = n + 1;
end-while;

η = learning rate parameter (real number)
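A minimal sketch of this algorithm in code (names are my own; unlike the pseudocode above, this version sweeps over the whole training set each epoch and updates every misclassified example it meets):

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Fixed-increment perceptron learning on augmented inputs X (first column = +1)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])             # initialize w randomly
    for _ in range(max_epochs):
        misclassified = False
        for x_n, d_n in zip(X, d):
            y_n = 1 if w @ x_n >= 0 else -1     # actual response
            if y_n != d_n:                      # update only misclassified examples
                w = w + eta * d_n * x_n         # w(n+1) = w(n) + eta * d(n) * x(n)
                misclassified = True
        if not misclassified:                   # every example correctly classified
            break
    return w
```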
Example

Consider the 2-dimensional training set C1 ∪ C2, with

C1 = {(1, 1), (1, -1), (0, -1)} with class label +1
C2 = {(-1, -1), (-1, 1), (0, 1)} with class label -1

Train a perceptron on C1 ∪ C2.
A possible implementation
Consider the augmented training set C′1 ∪ C′2, with the first entry fixed to 1 (to treat the bias as an extra weight):

  (1, 1, 1), (1, 1, -1), (1, 0, -1), (1, -1, -1), (1, -1, 1), (1, 0, 1)

Replace x with -x for all x ∈ C′2 and use the following update rule:

  w(n+1) = w(n) + η x(n)   if w^T(n) x(n) ≤ 0
  w(n+1) = w(n)            otherwise

Epoch = one application of the update rule to every example of the training set. Execution of the learning algorithm terminates when the weights do not change after one epoch.
Execution
• The execution of the perceptron learning algorithm is illustrated below for each epoch, with w(1) = (1, 0, 0), η = 1, and transformed inputs
  (1, 1, 1), (1, 1, -1), (1, 0, -1), (-1, 1, 1), (-1, 1, -1), (-1, 0, -1)

  Adjusted pattern   Weight w(n) applied   w^T(n)x(n)   Update?   New weight
  (1, 1, 1)          (1, 0, 0)              1           No        (1, 0, 0)
  (1, 1, -1)         (1, 0, 0)              1           No        (1, 0, 0)
  (1, 0, -1)         (1, 0, 0)              1           No        (1, 0, 0)
  (-1, 1, 1)         (1, 0, 0)             -1           Yes       (0, 1, 1)
  (-1, 1, -1)        (0, 1, 1)              0           Yes       (-1, 2, 0)
  (-1, 0, -1)        (-1, 2, 0)             1           No        (-1, 2, 0)
  End of epoch 1
Execution
  Adjusted pattern   Weight w(n) applied   w^T(n)x(n)   Update?   New weight
  (1, 1, 1)          (-1, 2, 0)             1           No        (-1, 2, 0)
  (1, 1, -1)         (-1, 2, 0)             1           No        (-1, 2, 0)
  (1, 0, -1)         (-1, 2, 0)            -1           Yes       (0, 2, -1)
  (-1, 1, 1)         (0, 2, -1)             1           No        (0, 2, -1)
  (-1, 1, -1)        (0, 2, -1)             3           No        (0, 2, -1)
  (-1, 0, -1)        (0, 2, -1)             1           No        (0, 2, -1)
  End of epoch 2

At epoch 3 no weight changes occur (check!), so execution of the algorithm stops.
Final weight vector: (0, 2, -1).
The decision hyperplane is 2x1 - x2 = 0.
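As a quick sanity check of this trace (a sketch only; it mirrors the update rule above with η = 1), the following loop reproduces the final weight vector (0, 2, -1):

```python
import numpy as np

# Transformed (augmented, sign-flipped for C2) inputs from the slides.
X = np.array([(1, 1, 1), (1, 1, -1), (1, 0, -1),
              (-1, 1, 1), (-1, 1, -1), (-1, 0, -1)], dtype=float)

w = np.array([1.0, 0.0, 0.0])      # w(1) = (1, 0, 0), eta = 1
changed = True
while changed:
    changed = False
    for x in X:                    # one pass over X = one epoch
        if w @ x <= 0:             # misclassified (or on the boundary)
            w = w + x              # w(n+1) = w(n) + x(n)
            changed = True
print(w)                           # -> [ 0.  2. -1.]
```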
Result

[Figure: the six training points in the (x1, x2) plane with the decision boundary 2x1 - x2 = 0; the C1 points (+) lie on one side of the line, the C2 points (-) on the other, and the weight vector w is normal to the boundary.]
Termination of the learning algorithm
Suppose the classes C1, C2 are linearly separable (that is, there exists a hyperplane that separates them). Then the perceptron algorithm applied to C1 ∪ C2 terminates successfully after a finite number of iterations.

Proof:
Consider the set C containing the inputs of C1 ∪ C2 transformed by replacing x with -x for each x with class label -1.
For simplicity assume w(1) = 0 and η = 1.
Let x(1), …, x(k) ∈ C be the sequence of inputs that have been used after k iterations. Then

  w(2) = w(1) + x(1)
  w(3) = w(2) + x(2)
  …
  w(k+1) = w(k) + x(k)

and therefore w(k+1) = x(1) + x(2) + … + x(k).
Convergence theorem (proof)

Since C1 and C2 are linearly separable, there exists w* such that w*^T x > 0 for all x ∈ C.
Let α = min_{x ∈ C} w*^T x.
Then  w*^T w(k+1) = w*^T x(1) + … + w*^T x(k) ≥ kα.
By the Cauchy-Schwarz inequality,

  ||w*||² ||w(k+1)||² ≥ [w*^T w(k+1)]² ≥ k²α²

so that

  ||w(k+1)||² ≥ k²α² / ||w*||²     (A)
Convergence theorem (proof)
• Now we consider another route:

  w(k+1) = w(k) + x(k)
  ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k)      (Euclidean norm)

Since x(k) is misclassified, w^T(k) x(k) ≤ 0, hence

  ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²

With w(1) = 0:

  ||w(2)||² ≤ ||w(1)||² + ||x(1)||²
  ||w(3)||² ≤ ||w(2)||² + ||x(2)||²
  …
  ||w(k+1)||² ≤ Σ_{i=1}^{k} ||x(i)||²
Convergence theorem (proof)

• Let β = max_{x(n) ∈ C} ||x(n)||². Then

  ||w(k+1)||² ≤ k β     (B)

• For sufficiently large values of k, (B) comes into conflict with (A). Therefore k cannot exceed the value kmax for which (A) and (B) are both satisfied with equality:

  kmax² α² / ||w*||² = kmax β   ⇒   kmax = β ||w*||² / α²

• Hence the algorithm terminates successfully in at most β ||w*||² / α² iterations.
Perceptron: Limitations
• The perceptron can only model linearly
separable classes, like (those described by)
the following Boolean functions:
• AND
• OR
• COMPLEMENT
• It cannot model the XOR.

• You can experiment with these functions in the


Matlab practical lessons.
Gradient Descent Learning Rule
• Perceptron learning rule fails to converge
if examples are not linearly separable

• Gradient Descent: Consider linear unit


without threshold and continuous output o
(not just –1,1)
• o(x)=w0 + w1 x1 + … + wn xn
• Update the wi's so that they minimize the squared error

  E[w1, …, wn] = ½ Σ_{(x,d) ∈ D} (d - o(x))²

  where D is the set of training examples
• Replace the step function in the perceptron with a continuous (differentiable) function f; the simplest choice is a linear function.
• With or without the threshold, the Adaline is trained on the output of the function f rather than on the final thresholded output.

[Diagram: Adaline unit; the linear function f(x) is computed from the weighted inputs, followed by an optional +/- threshold for classification.]
Adaline: Adaptive Linear Element
• When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the mean
squared error.
• Adaline (Adaptive Linear Element):
– uses a linear neuron model and
– the Least-Mean-Square (LMS) learning algorithm
– useful for robust linear classification and regression
For an example (x, d) the error e(w) of the network is

  e(w) = d - Σ_{j=0}^{m} xj wj

and the squared error is

  E(w) = ½ e²
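For concreteness (the numbers below are made up), the error and squared error for one example can be computed as follows:

```python
import numpy as np

x = np.array([1.0, 0.5, -1.0])   # augmented input, x0 = 1 for the bias weight
w = np.array([0.1, 0.4, -0.2])   # current weights [w0, w1, ..., wm]
d = 1.0                          # desired response

e = d - w @ x                    # e(w) = d - sum_j x_j w_j
E = 0.5 * e ** 2                 # E(w) = e^2 / 2
print(e, E)                      # -> 0.5 0.125
```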
Adaline
• The total error E_tot is the mean of the squared errors of
all the examples.
• E_tot is a quadratic function of the weights, whose
derivative exists everywhere.
• Incremental gradient descent may be used to minimize
E_tot (see Sec. 3.1-3.2 of the online Gurney book, http://www.shef.ac.uk/psychology/gurney/notes/l3/subsection3_3_2.html, for a description of gradient descent).
• At each iteration the LMS algorithm selects an example and decreases the network error E for that example, even when the example is already correctly classified by the network.
Incremental Gradient Descent

• start from an arbitrary point in the weight space


• the direction in which the error E of an example (as a function of the weights) is decreasing most rapidly is the opposite of the gradient of E:

  -(gradient of E(w(n))) = -[ ∂E/∂w1, …, ∂E/∂wm ]

• take a small step (of size η) in that direction:

  w(n+1) = w(n) - η (gradient of E(w(n)))
Gradient Descent

D = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}

[Figure: the error surface E(w1, w2) over the weight space for this training set; gradient descent moves from (w1, w2) to (w1+Δw1, w2+Δw2), a step downhill.]
Gradient Descent
• Train the wi's so that they minimize the squared error

  E[w1, …, wm] = ½ Σ_{d∈D} (td - od)²

Gradient:

  ∇E[w] = [∂E/∂w0, …, ∂E/∂wm]

  Δw = -η ∇E[w]

  Δwi = -η ∂E/∂wi
      = -η ∂/∂wi ½ Σd (td - od)²
      = -η ∂/∂wi ½ Σd (td - Σi wi xi)²
      = -η Σd (td - od)(-xi)
      = η Σd (td - od) xi
Gradient Descent
Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x1, …, xm), t>, where (x1, …, xm) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1).
• Initialize each wi to some small random value
• Until the termination condition is met, Do
  – Initialize each Δwi to zero
  – For each <(x1, …, xm), t> in training_examples Do
    • Input the instance (x1, …, xm) to the linear unit and compute the output o
    • For each linear unit weight wi Do
      – Δwi ← Δwi + η (t - o) xi
  – For each linear unit weight wi Do
    • wi ← wi + Δwi
• Termination condition: the error falls below a given threshold
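A minimal sketch of this batch procedure (the function name, stopping threshold and iteration cap are my own choices):

```python
import numpy as np

def gradient_descent(X, t, eta=0.1, tol=1e-4, max_iters=1000):
    """Batch gradient descent for a linear unit o(x) = w0 + w1*x1 + ... + wm*xm."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for w0
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X_aug.shape[1])    # small random initial weights
    for _ in range(max_iters):
        o = X_aug @ w                                  # outputs for every example
        if 0.5 * np.sum((t - o) ** 2) < tol:           # termination: error under threshold
            break
        w += eta * X_aug.T @ (t - o)                   # wi <- wi + sum_d eta*(t-o)*xi
    return w

# Usage on the (non-separable) data set D from the earlier slide:
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1], dtype=float)
print(gradient_descent(X, t))   # -> weights near zero, the minimum-squared-error solution here
```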
Incremental Stochastic
Gradient Descent
• Batch mode: gradient descent over the entire data set D

  w = w - η ∇E_D[w],   E_D[w] = ½ Σ_{d∈D} (td - od)²

• Incremental mode: gradient descent over individual training examples d

  w = w - η ∇E_d[w],   E_d[w] = ½ (td - od)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
Weights Update Rule:
incremental mode

• Computation of the gradient of E:

  ∂E(w)/∂w = e ∂e/∂w = e (-x) = -e x

• Delta rule for the weight update:

  w(n+1) = w(n) + η e(n) x(n)
LMS learning algorithm

n = 1;
initialize w(n) randomly;
while (E_tot unsatisfactory and n < max_iterations)
    select an example (x(n), d(n));
    e(n) = d(n) - w^T(n) x(n);
    w(n+1) = w(n) + η e(n) x(n);
    n = n + 1;
end-while;

η = learning rate parameter (real number)

A modification uses  w(n+1) = w(n) + η e(n) x(n) / ||x(n)||
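A minimal incremental sketch of this LMS loop (the stopping criteria and parameter values are hypothetical):

```python
import numpy as np

def lms_train(X, d, eta=0.05, max_iterations=1000, target_error=1e-3):
    """LMS / delta-rule training of a linear unit on augmented inputs X."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])      # initialize w randomly
    for n in range(max_iterations):
        i = n % len(X)                               # cycle through the examples
        e = d[i] - w @ X[i]                          # e(n) = d(n) - w^T(n) x(n)
        w = w + eta * e * X[i]                       # w(n+1) = w(n) + eta * e(n) * x(n)
        E_tot = 0.5 * np.mean((d - X @ w) ** 2)      # mean of the squared errors
        if E_tot < target_error:                     # stop when E_tot is satisfactory
            break
    return w
```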
Perceptron Learning Rule VS.
Gradient Descent Rule
The perceptron learning rule is guaranteed to succeed if
• the training examples are linearly separable
• the learning rate η is sufficiently small

The linear unit training rule uses gradient descent and is
• guaranteed to converge to the hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contain noise
• even when the training data are not separable by H (the hypothesis space)
Comparison of Perceptron and Adaline
• Architecture: single-layer for both the perceptron and the Adaline
• Neuron model: non-linear (sign function) for the perceptron; linear for the Adaline
• Learning algorithm: the perceptron minimizes the number of misclassified examples; the Adaline minimizes the total squared error
• Application: linear classification (perceptron); linear classification and regression (Adaline)
Renaissance of Perceptron
Perceptron
  → Multi-Layer Perceptron (Back-Propagation, 1980s)
  → Support Vector Machine (Learning Theory, 1990s)