NEURAL NETWORKS
Chapter 20.5
Outline:
- Introduction & Basics
- Perceptrons
- Perceptron Learning and PLR
- Beyond Perceptrons
- Two-Layered Feed-Forward Neural Networks
2001-2004 JAMES D. SKRENTNY FROM NOTES BY C. DYER ET AL.
5/13/2012
Inspired by the biological nervous system: a neural network is composed of a large number of highly interconnected processing elements. It resembles the brain in two respects:
- knowledge is acquired by the network through a learning process
- interneuron connection strengths, known as synaptic weights, are used to store the knowledge
INTRODUCTION
Known as:
- Neural Networks (NNs)
- Artificial Neural Networks (ANNs)
- Connectionist Models
- Parallel Distributed Processing (PDP) Models
NNs are similar to the brain:
- knowledge is acquired experientially (learning)
- knowledge is stored in connections (weights)
Performance: about 10^2 msec and about 100 sequential neuron firings for "many" tasks.
Attractions of the NN approach:
- can be massively parallel: MIMD, optical computing, analog systems
- interesting complex global behavior emerges from a large collection of simple processing elements
- robust computation: can handle noisy and incomplete data, due to a fine-grained, distributed, and continuous knowledge representation
- fault tolerant: OK to have faulty elements and bad connections
- degrades gracefully: continues to function, at a lower level of performance, when portions of the network are faulty
Represent a network as a graph:
- nodes: units
- edges: links
[Figure: units arranged in layers, Layer 1 through Layer 3.]
Unit composition:
- a set of input links, from other units or from sensors of the environment
[Figure: a unit with inputs weighted w1 ... wn producing an output.]
Given n inputs, the unit's activation is defined by:
a = g( (w1 * x1) + (w2 * x2) + ... + (wn * xn) )
where the wi are the weights, the xi are the input values, and g() is a simple non-linear function. Letting in_i be the sum of wi * xi over all i, common choices for g are:
- step: output 1 if in_i >= t (the threshold), else 0
- sign: output 1 if in_i >= 0, else -1
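The activation computation above can be sketched as follows; the step threshold t and the example weights are illustrative, not from the slides:

```python
def unit_activation(weights, inputs, g):
    """Compute a = g(sum of wi * xi) for one unit."""
    in_ = sum(w * x for w, x in zip(weights, inputs))
    return g(in_)

def step(t):
    """Step activation with threshold t: 1 if in >= t, else 0."""
    return lambda in_: 1 if in_ >= t else 0

def sign(in_):
    """Sign activation: 1 if in >= 0, else -1."""
    return 1 if in_ >= 0 else -1

# a unit with weights .5, .5 and a step threshold of .75
a = unit_activation([0.5, 0.5], [1, 1], step(0.75))
```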
PERCEPTRONS: LINEAR THRESHOLD UNITS (LTU)
LTUs were studied in the 1950s, mainly as single-layered nets.
Perceptrons:
- simple 1-layer networks whose units act independently
- composed of linear threshold units (LTUs)
- a unit's inputs xi are weighted by wi and combined
- a step function computes the activation level a
[Figure: an LTU with inputs x1 ... xn weighted by w1 ... wn, plus a constant -1 input whose weight is the threshold t, feeding a summation and step function that produce a.]
PERCEPTRONS: AND EXAMPLE
AND Perceptron:
- inputs are 0 or 1; output is 1 when both x1 and x2 are 1
- weights: .5 on x1, .5 on x2, and .75 on a constant -1 input (the threshold)
- .5*1 + .5*1 + .75*(-1) = .25, so output = 1
- .5*0 + .5*0 + .75*(-1) = -.75, so output = 0
- of the 4 possible data points in the x1-x2 plane, the threshold acts like a separating line
PERCEPTRONS: OR EXAMPLE
OR Perceptron:
- inputs are 0 or 1; output is 1 when x1 or x2 is 1
- weights: .5 on x1, .5 on x2, and .25 on a constant -1 input (the threshold)
- .5*1 + .5*1 + .25*(-1) = .75, so output = 1
- .5*0 + .5*0 + .25*(-1) = -.25, so output = 0
- of the 4 possible data points in the x1-x2 plane, the threshold acts like a separating line
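The AND and OR examples can be checked with a small sketch; the LTU helper below follows the slides' convention of expressing the threshold as a weight on a constant -1 input:

```python
def ltu(weights, threshold_weight, inputs):
    """Linear threshold unit: weighted sum of inputs, with the
    threshold expressed as a weight on a constant -1 input."""
    in_ = sum(w * x for w, x in zip(weights, inputs)) + threshold_weight * (-1)
    return 1 if in_ >= 0 else 0

def AND(x1, x2):
    return ltu([0.5, 0.5], 0.75, [x1, x2])

def OR(x1, x2):
    return ltu([0.5, 0.5], 0.25, [x1, x2])

# print the truth tables for all 4 possible data points
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
```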
PERCEPTRON LEARNING
The network topology is fixed, so the only unknowns are the weights. Perceptrons learn by changing their weights:
- supervised learning is used: the correct output is given for each training example
- an example is a list of values for the input units
- the correct output is a list of desired values for the output units
PERCEPTRON LEARNING ALGORITHM
1. Initialize the weights in the network (usually random values)
2. Repeat until all examples are correctly classified or some other stopping criterion is met:
   for each example e in the training set do
   a. O = neural_net_output(network, e)
   b. update each weight using the Perceptron Learning Rule (PLR): wi = wi + a * (T - O) * xi, where T is the desired (teacher's) output and a is the learning rate
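A minimal sketch of this loop, learning the AND function with the update wi = wi + a*(T - O)*xi; the learning rate, epoch limit, and zero initial weights are illustrative assumptions:

```python
def train_perceptron(examples, n_inputs, a=0.1, max_epochs=100):
    """Perceptron Learning Rule: wi <- wi + a * (T - O) * xi.
    A constant -1 input is appended so its weight acts as the threshold."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        converged = True
        for xs, T in examples:
            xs = xs + [-1]  # threshold input
            O = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0
            if O != T:
                converged = False
                w = [wi + a * (T - O) * xi for wi, xi in zip(w, xs)]
        if converged:
            break
    return w

# learn AND from its 4 labeled data points
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = train_perceptron(and_examples, 2)
```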
PLR is also called the Delta Rule or the Widrow-Hoff Rule. PLR is a variant of a rule proposed by Rosenblatt in 1960. PLR is based on an idea of Hebb's:
- the strength of a connection between two units should be adjusted in proportion to the product of their simultaneous activations
- the product is used as a means of measuring the correlation between the values output by the two units
PLR is a "local" learning rule: only local information in the network is needed to update a weight. PLR performs gradient descent in "weight space": the rule iteratively adjusts all of the weights so that for each training example the error is monotonically non-increasing, i.e. approximately decreases.
The Perceptron Convergence Theorem says that if a set of examples is learnable, then PLR will find the necessary weights:
- in a finite number of steps
- independent of the initial weights
In other words, if a solution exists, PLR's gradient descent is guaranteed to find an optimal solution (i.e., 100% correct classification) for any 1-layer neural network.
PERCEPTRONS: XOR EXAMPLE
XOR Perceptron:
- inputs are 0 or 1; output is 1 when exactly one of x1, x2 is 1
- with weights .5, .5 and a threshold weight on a constant -1 input, what threshold could work?
- in the 2-D input space with 4 possible data points, can the positives be separated from the negatives using a straight line? No: no single separating line exists for XOR.
In general, the goal of learning in a perceptron is to adjust the separating hyperplane, which divides an n-dimensional input space (where n is the number of input units), by modifying the weights (and biases) until all of the examples with target value 1 are on one side of the hyperplane and all of the examples with target value 0 are on the other side.
BEYOND PERCEPTRONS
Perceptrons as a computing model are too weak because they can only learn linearly-separable functions. To enhance the computational ability, general neural networks have multiple layers of units. The challenge is to find a learning rule that works for multi-layered networks.
A feed-forward multi-layered network computes a function of the inputs and the weights.
- Input units (on the left or bottom): activation is determined by the environment.
- Perceptrons have input units followed by one layer of output units, i.e. no hidden units.
NNs with one hidden layer of a sufficient number of units can compute functions associated with convex classification regions in input space. NNs with two hidden layers are universal computing devices, although the complexity of the function is limited by the number of units:
- if too few, the network will be unable to represent the function
- if too many, the network will memorize examples and is subject to overfitting
TWO-LAYERED FEED-FORWARD NEURAL NETWORK
Notation: inputs I1 ... I6 set the input-unit activations ak = Ik; weights Wk,j are on links from input unit k to hidden unit j; aj is the activation of hidden unit j; weights Wj,i are on links from hidden unit j to output unit i; the output-unit activations are the network outputs, a1 = O1 and a2 = O2.
- Two-layered: count the layers of units that compute an activation (hidden units are Layer 1, output units are Layer 2)
- Feed-forward: each unit in a layer connects forward to all of the units in the next layer
- no cycles: no links within the same layer, no links to prior layers, no skipping layers
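A forward pass under this notation might be sketched as follows; the sigmoid activation anticipates the back-propagation discussion, and the sizes and weights in the usage lines are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(I, W_kj, W_ji):
    """Forward pass of a two-layered feed-forward network.
    I[k]       : input activations, a_k = I_k
    W_kj[k][j] : weight from input unit k to hidden unit j
    W_ji[j][i] : weight from hidden unit j to output unit i
    Returns (hidden activations a_j, output activations O_i)."""
    a_j = [sigmoid(sum(I[k] * W_kj[k][j] for k in range(len(I))))
           for j in range(len(W_kj[0]))]
    O = [sigmoid(sum(a_j[j] * W_ji[j][i] for j in range(len(a_j))))
         for i in range(len(W_ji[0]))]
    return a_j, O

# 3 inputs -> 2 hidden units -> 1 output unit
a_j, O = forward([1.0, 0.0, 1.0],
                 [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]],
                 [[0.7], [-0.3]])
```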
NEURAL NETWORKS
Chapter 20.5
Outline:
- Two-Layered Feed-Forward Neural Networks
- Solving XOR
- Learning in Multi-Layered Feed-Forward NNs
- Back-Propagation
- Computing the Change for Weights
- Other Issues & Applications
CONQUERING XOR
XOR network:
- inputs are 0 or 1; output is 1 when I1 is 1 and I2 is 0, or I1 is 0 and I2 is 1
- each unit in the hidden layer acts like a perceptron learning a separating line:
- the top hidden unit acts like an OR perceptron (weights .5, .5; threshold weight .25)
- the bottom hidden unit acts like an AND perceptron (weights .5, .5; threshold weight .75)
- the output unit O weights the OR unit's output by .5 and the AND unit's output by -.5
The output unit combines these separating lines by intersecting the "half-planes" defined by them: when OR is 1 and AND is 0, the output O is 1.
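A sketch of this network with step units; the hidden-unit weights and thresholds come from the slides, while the output threshold of .25 is an assumption chosen to make the truth table work:

```python
def step(in_):
    return 1 if in_ >= 0 else 0

def xor_net(i1, i2):
    """Two-layer XOR network: OR and AND hidden units, combined at the output."""
    h_or = step(0.5 * i1 + 0.5 * i2 - 0.25)    # OR unit, threshold .25
    h_and = step(0.5 * i1 + 0.5 * i2 - 0.75)   # AND unit, threshold .75
    # output: .5 on OR, -.5 on AND; threshold .25 (assumed)
    return step(0.5 * h_or - 0.5 * h_and - 0.25)

# print the XOR truth table
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, xor_net(i1, i2))
```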
LEARNING IN MULTI-LAYERED FEED-FORWARD NNs
PLR doesn't work in multi-layered feed-forward nets, since the desired values for the hidden units aren't known. We must again solve the Credit Assignment Problem:
- determine which weights to credit/blame for the output error in the network
- determine which weights in the network should be updated, and how to update them
BACK-PROPAGATION
Back-Propagation:
- a method for learning the weights in these networks
- generalizes PLR
The Back-Propagation approach:
- a gradient-descent algorithm that minimizes the error on the training data
- errors are propagated through the network, starting at the output units and working backwards towards the input units
BACK-PROPAGATION ALGORITHM
1. Initialize the weights in the network (usually random values, as in the PLA)
2. Repeat until all examples are correctly classified or some other stopping criterion is met:
   for each example e in the training set do
   a. forward pass: Oi = neural_net_output(network, e)
   b. Ti = desired output, i.e. Target or Teacher's output
   c. calculate the error (Ti - Oi) at the output units
   d. backward pass:
      i. compute Dwj,i for all weights from the hidden layer to the output layer
      ii. compute Dwk,j for all weights from the inputs to the hidden layer
Back-propagation performs a gradient descent search in weight space to learn the network weights. Given a network with n weights:
- each configuration of weights is a vector, W, of length n that defines an instance of the network
- W can be considered a point in an n-dimensional weight space, where each dimension is associated with one of the connections in the network
Given n output units in the network, the error on an example is:
E = ((T1 - O1)^2 + (T2 - O2)^2 + ... + (Tn - On)^2) / 2
where Ti is the target value for the ith output unit and Oi is the network output value for the ith output unit.
Visualized as a 2-D error surface in weight space: each point in the w1-w2 plane is a weight configuration, and each point has a total error E. The surface represents the errors for all weight configurations. The goal is to find a lower point on the error surface (a local minimum). Gradient descent follows the direction of steepest descent, i.e. where E decreases the most.
[Figure: error surface E over the (w1, w2) plane, with a sample configuration at w1 = .8, w2 = .3.]
The gradient is defined as:
Gradient_E = [dE/dw1, dE/dw2, ..., dE/dwn]
Then change the ith weight by:
Dwi = -a * dE/dwi
Computing the derivatives for the gradient direction requires an activation function that is continuous, differentiable, non-decreasing, and easily computed:
- we can't use the step function, as in LTUs
- instead use the sigmoid function 1/(1 + e^-x), where x is in_i, the weighted sum of inputs
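The sigmoid and its derivative, g'(x) = g(x) * (1 - g(x)), which is where the Oi * (1 - Oi) factors in the weight-update formulas come from; the finite-difference check at the end is just a sanity test:

```python
import math

def g(x):
    """Sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    """Sigmoid derivative: g(x) * (1 - g(x))."""
    gx = g(x)
    return gx * (1.0 - gx)

# check the derivative numerically at x = 0.3 with central differences
h = 1e-6
numeric = (g(0.3 + h) - g(0.3 - h)) / (2 * h)
```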
COMPUTING THE CHANGE FOR WEIGHTS: TWO-LAYER NEURAL NETWORK
For weights between hidden and output units, generalize PLR for sigmoid activation:
Dwj,i = -a * dE/dwj,i
      = -a * -aj * (Ti - Oi) * g'(ini)
      = a * aj * (Ti - Oi) * Oi * (1 - Oi)
where a is the learning rate parameter.
[Figure: for the weight w1,2 from hidden unit 1 to output unit 2, Dw1,2 is the product of the learning rate a, the activation a1 along the link, and the error term at output unit O2.]
For weights between inputs and hidden units:
- the error at an output unit is "distributed" back to each of the hidden units in proportion to the weight of the connection between them
- the total error is distributed to all of the hidden units that contributed to that error
- each hidden unit accumulates some error from each of the output units to which it is connected
For weights between inputs and hidden units:
Dwk,j = -a * dE/dwk,j
      = -a * -Ik * g'(inj) * S( wj,i * (Ti - Oi) * g'(ini) )
      = a * Ik * aj * (1 - aj) * S( wj,i * (Ti - Oi) * Oi * (1 - Oi) )
where:
- wk,j is the weight on the link from input k to hidden unit j
- wj,i is the weight on the link from hidden unit j to output unit i
- Ik is input value k
- S sums over the output units i
[Figure: for the weight w1,2 from input 1 to hidden unit 2, Dw1,2 is the product of the learning rate a, the input I1 along the link, the derivative a2 * (1 - a2) = g'(in2), and the error back-propagated from the output units through the weights W2,i.]
BACK-PROPAGATION ALGORITHM (with update rules)
1. Initialize the weights in the network (usually random values)
2. Repeat until all examples are correctly classified or some other stopping criterion is met:
   for each example e in the training set do
   a. forward pass: Oi = neural_net_output(network, e), computing the weighted sum and then the sigmoid activation at each unit
   b. Ti = desired output, i.e. Target or Teacher's output
   c. calculate the error (Ti - Oi) at the output units
   d. backward pass:
      i. compute Dwj,i = a * aj * (Ti - Oi) * Oi * (1 - Oi) for all weights from hidden units to output units
      ii. compute Dwk,j = a * Ik * aj * (1 - aj) * S( wj,i * (Ti - Oi) * Oi * (1 - Oi) ) for all weights from inputs to hidden units
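The full algorithm can be sketched end to end; this trains a two-layer sigmoid network on XOR using the Dwj,i and Dwk,j update formulas. The learning rate, hidden-layer size, epoch count, and random seed are illustrative choices, and thresholds are handled as weights on constant -1 inputs as in the earlier slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_xor(a=0.5, n_hidden=3, epochs=5000, seed=0):
    examples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    examples = [(I + [-1], T) for I, T in examples]  # threshold input
    n_in, n_out = 3, 1
    rng = random.Random(seed)
    W_kj = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    W_ji = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden + 1)]

    def forward(I):
        aj = [sigmoid(sum(I[k] * W_kj[k][j] for k in range(n_in)))
              for j in range(n_hidden)] + [-1]  # hidden threshold input
        O = [sigmoid(sum(aj[j] * W_ji[j][i] for j in range(n_hidden + 1)))
             for i in range(n_out)]
        return aj, O

    for _ in range(epochs):
        for I, T in examples:
            aj, O = forward(I)
            # output error terms: (Ti - Oi) * Oi * (1 - Oi)
            d = [(T[i] - O[i]) * O[i] * (1 - O[i]) for i in range(n_out)]
            # error back-propagated to each hidden unit (using the old W_ji)
            back = [sum(W_ji[j][i] * d[i] for i in range(n_out))
                    for j in range(n_hidden)]
            # Dwj,i = a * aj * (Ti - Oi) * Oi * (1 - Oi)
            for j in range(n_hidden + 1):
                for i in range(n_out):
                    W_ji[j][i] += a * aj[j] * d[i]
            # Dwk,j = a * Ik * aj * (1 - aj) * S(wj,i * di)
            for k in range(n_in):
                for j in range(n_hidden):
                    W_kj[k][j] += a * I[k] * aj[j] * (1 - aj[j]) * back[j]

    # sum-of-squared-errors over the training set after training
    sse = sum((T[i] - forward(I)[1][i]) ** 2
              for I, T in examples for i in range(n_out))
    return forward, sse

net, sse = train_xor()
```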
OTHER ISSUES
Use a tuning set or cross-validation to determine experimentally the number of units that minimizes error.
How many training examples are needed? A rule of thumb: with n weights and a desired test-set error rate e, use about n/e training examples, and train to classify 1 - e/2 of the training set correctly. E.g., if n = 80 and e = 0.1 (i.e. 10% error on the test set), the training set size is 800.
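The rule of thumb worked as a quick calculation, with n and e as in the example:

```python
n = 80       # number of weights in the network
e = 0.1      # desired error rate on the test set
train_size = n / e           # rule of thumb: about n/e training examples
stop_accuracy = 1 - e / 2    # train until this fraction is classified correctly
print(train_size, stop_accuracy)
```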
Train the network until the error rate on a tuning set begins increasing rather than training until the error (i.e. SSE) is minimized.
APPLICATIONS
APPLICATIONS: ALVINN
ALVINN (Pomerleau, 1988) learns to control vehicle steering to stay in the middle of its lane.
Topology: a two-layered feed-forward network using back-propagation learning.
topology: input
- the input is a 480*512 pixel image, 15 times per second
- the color image is preprocessed to obtain a 30*32 pixel image
- each pixel is one byte, an integer from 0 to 255 corresponding to the brightness of the image
topology: output
- the output is one of 30 discrete steering positions
- output unit 1 means sharp left; output unit 30 means sharp right
topology: hidden
Learning:
- continuously learns on the fly by observing a human driver (takes ~5 minutes from random initial weights)
- solutions:
- generate negative examples by synthesizing views of the road that are incorrect for the current steering
- maintain a buffer of 200 real and synthesized images that keeps some images for many different steering directions
Results:
- has driven at speeds up to 70 mph
- has driven continuously for distances up to 90 miles
- has driven across the continent, during different times of the day and with different traffic conditions
- can drive on: single-lane roads and highways, multi-lane highways, paved bike paths, and dirt roads
SUMMARY
Advantages:
- parallel processing architecture
- robust with respect to node failure
Disadvantages:
- slow training (i.e. takes many epochs)
- poor interpretability (i.e. difficult to extract rules)