Lec 7

The document discusses non-Bayes classifiers, including linear discriminants and neural networks. It describes discriminant functions and how linear discriminant functions use hyperplanes as decision surfaces, covers the perceptron cost function and perceptron algorithm, and shows how linear discriminants can be modeled as perceptrons. It then covers artificial neurons, multilayer perceptrons, their discriminating ability, and training by gradient descent and the backpropagation algorithm.

Non-Bayes classifiers.

Linear discriminants,
neural networks.
Discriminant functions (1)

Bayes classification rule: decide $\omega_1$ if $P(\omega_1 \mid x) - P(\omega_2 \mid x) > 0$, otherwise decide $\omega_2$.

Instead we might try to find a function $f_{\omega_1,\omega_2}(x)$ such that we decide $\omega_1$ if $f_{\omega_1,\omega_2}(x) > 0$, otherwise $\omega_2$.

$f_{\omega_1,\omega_2}(x)$ is called the discriminant function.

$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$ is the decision surface.
Discriminant functions (2)

[Figure: two example decision surfaces separating Class 1 from Class 2]

Linear discriminant function:

$f_{\omega_1,\omega_2}(x) = w^T x + w_0$

The decision surface is the hyperplane $w^T x + w_0 = 0$.
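As a quick illustration, here is a minimal Python sketch of classifying a point by the sign of $w^T x + w_0$; the weight vector and bias values below are hypothetical, not taken from the lecture.

```python
import numpy as np

# Hypothetical example weights for a 2-D linear discriminant.
w = np.array([2.0, -1.0])   # normal vector of the decision hyperplane
w0 = 0.5                    # bias term

def classify(x):
    """Decide class 1 if f(x) = w^T x + w0 > 0, otherwise class 2."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([1.0, 0.0])))   # f(x) = 2.5  -> class 1
print(classify(np.array([0.0, 2.0])))   # f(x) = -1.5 -> class 2
```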
Linear discriminant – perceptron cost function

Use augmented vectors: replace $w \leftarrow [w^T, w_0]^T$ and $x \leftarrow [x^T, 1]^T$.

The decision function is now $f_{\omega_1,\omega_2}(x) = w^T x$ and the decision surface is $w^T x = 0$.

Perceptron cost function:

$J(w) = \sum_{x} \delta_x \, w^T x$

where
$\delta_x = -1$, if $x \in \omega_1$ and $w^T x < 0$,
$\delta_x = +1$, if $x \in \omega_2$ and $w^T x > 0$,
$\delta_x = 0$, if $x$ is correctly classified.
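A minimal sketch of this cost, assuming augmented sample vectors and labels $+1$ for $\omega_1$ and $-1$ for $\omega_2$ (the data layout is an assumption made for the example, not part of the slide):

```python
import numpy as np

def perceptron_cost(w, X_aug, labels):
    """J(w) = sum over misclassified x of delta_x * w^T x.
    X_aug: N x (l+1) array of augmented samples [x, 1].
    labels: +1 for class omega_1, -1 for class omega_2."""
    cost = 0.0
    for x, lab in zip(X_aug, labels):
        s = w @ x
        if lab == +1 and s < 0:    # omega_1 sample on the wrong side: delta_x = -1
            cost += -s
        elif lab == -1 and s > 0:  # omega_2 sample on the wrong side: delta_x = +1
            cost += s
    return cost                    # 0 when all samples are correctly classified
```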

Linear discriminant – perceptron cost function

[Figure: Class 1 and Class 2 samples separated by a hyperplane, with a misclassified sample x]

Perceptron cost function:

$J(w) = \sum_{x} \delta_x \, w^T x$

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$.

Otherwise, $J(w) > 0$ and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use a gradient descent algorithm.
Linear discriminant – perceptron algorithm

Gradient descent:

$w(t+1) = w(t) - \rho_t \left. \dfrac{\partial J(w)}{\partial w} \right|_{w = w(t)}$

At points where $J(w)$ is differentiable,

$\dfrac{\partial J(w)}{\partial w} = \sum_{\text{misclassified } x} \delta_x \, x$

Thus

$w(t+1) = w(t) - \rho_t \sum_{\text{misclassified } x} \delta_x \, x$

The perceptron algorithm converges when the classes are linearly separable, under some conditions on the step size $\rho_t$.
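A sketch of the resulting training loop, assuming a constant step size rho and a simple stopping rule (both are simplifications of the conditions on $\rho_t$ mentioned above):

```python
import numpy as np

def perceptron_train(X_aug, labels, rho=0.1, max_iter=1000):
    """Iterate w(t+1) = w(t) - rho * sum_{misclassified x} delta_x * x.
    X_aug: augmented samples [x, 1]; labels: +1 for omega_1, -1 for omega_2."""
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_iter):
        grad = np.zeros_like(w)
        for x, lab in zip(X_aug, labels):
            s = w @ x
            if lab == +1 and s <= 0:    # misclassified omega_1 sample: delta_x = -1
                grad -= x
            elif lab == -1 and s >= 0:  # misclassified omega_2 sample: delta_x = +1
                grad += x
        if not grad.any():              # nothing misclassified: stop
            break
        w -= rho * grad
    return w
```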
Sum of error squares estimation

Let $y(x) = \pm 1$ denote the desired output: $+1$ for one class and $-1$ for the other.

We want to find a discriminant function $f_{\omega_1,\omega_2}(x) = w^T x$ whose output is similar to $y(x)$.

Use the sum of squared errors as the similarity criterion:

$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$

$\hat{w} = \arg\min_{w} J(w)$
Sum of error squares estimation

Minimize the sum of squared errors:

$\dfrac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i (y_i - w^T x_i) = 0$

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

Thus

$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \left( \sum_{i=1}^{N} x_i y_i \right)$
Neurons

Artificial neuron.

[Figure: inputs $x_1, \dots, x_l$ weighted by $w_1, \dots, w_l$, combined with a bias $w_0$ and passed through a threshold function $f$]

The figure above represents an artificial neuron computing:

$y = f\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$
Artificial neuron

Threshold functions $f$:

[Figure: plots of the step function and the logistic function]

Step function:

$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$

Logistic function:

$f(x) = \dfrac{1}{1 + e^{-ax}}$
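A sketch of a single artificial neuron using the two threshold functions above; the input and weight values are made up for illustration:

```python
import numpy as np

def step(v):
    """Step threshold: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0 else 0.0

def logistic(v, a=1.0):
    """Logistic threshold: 1 / (1 + exp(-a v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron_output(x, w, w0, f):
    """y = f(sum_i w_i x_i + w0) for a single artificial neuron."""
    return f(w @ x + w0)

x = np.array([0.5, -1.0, 2.0])             # hypothetical inputs
w = np.array([1.0, 0.3, -0.2])             # hypothetical weights
print(neuron_output(x, w, 0.1, step))      # 0.0
print(neuron_output(x, w, 0.1, logistic))  # about 0.475
```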
Combining artificial neurons

[Figure: inputs $x_1, x_2, \dots, x_l$ feeding a network of artificial neurons]

Multilayer perceptron with 3 layers.
Discriminating ability of the multilayer perceptron

Since a 3-layer perceptron can approximate any smooth function, it can approximate $F(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$, the optimal discriminant function for two classes.
Training of the multilayer perceptron

[Figure: neuron $j$ in layer $r$ receives the outputs $y_k^{r-1}$ of layer $r-1$ through weights $w_{jk}^r$, forms the activation $v_j^r$, and outputs $y_j^r = f(v_j^r)$]
Training and cost function

Desired network output: $x(i) \rightarrow y(i)$
Trained network output: $x(i) \rightarrow \hat{y}(i)$

Cost function for one training sample:

$E(i) = \dfrac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2$

Total cost function:

$J = \sum_{i=1}^{N} E(i)$

Goal of the training: find the values of the weights $w_{jk}^r$ which minimize the cost function $J$.
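These two cost functions are straightforward to express in code; the sketch below assumes the desired and trained outputs are already available as arrays:

```python
import numpy as np

def sample_cost(y_desired, y_hat):
    """E(i) = 1/2 * sum_m (y_m(i) - y_hat_m(i))^2 for one training sample."""
    return 0.5 * np.sum((y_desired - y_hat) ** 2)

def total_cost(Y_desired, Y_hat):
    """J = sum_i E(i) over all N training samples (one row per sample)."""
    return sum(sample_cost(y, yh) for y, yh in zip(Y_desired, Y_hat))
```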
Gradient descent

Denote $w_j^r = [w_{j0}^r, w_{j1}^r, \dots, w_{j k_{r-1}}^r]^T$.

Gradient descent:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial J}{\partial w_j^r}$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial E(i)}{\partial w_j^r}$
Gradient descent

Chain rule for differentiating composite functions:

$\dfrac{\partial E(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)} \, \dfrac{\partial v_j^r(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)} \, y^{r-1}(i)$

Denote:

$\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$
Backpropagation

If $r = L$, then

$\delta_j^L(i) = \dfrac{\partial E(i)}{\partial v_j^L(i)} = \dfrac{\partial}{\partial v_j^L(i)} \left[ \dfrac{1}{2} \sum_{m=1}^{k_L} \left( f(v_m^L(i)) - y_m(i) \right)^2 \right]$
$\qquad = \left( f(v_j^L(i)) - y_j(i) \right) f'(v_j^L(i)) = e_j(i) \, f'(v_j^L(i))$

If $r < L$, then

$\delta_j^{r-1}(i) = \dfrac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \dfrac{\partial E(i)}{\partial v_k^r(i)} \, \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
$\qquad = \sum_{k=1}^{k_r} \delta_k^r(i) \, \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \left( \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r \right) f'(v_j^{r-1}(i))$
Backpropagation algorithm

• Initialization: initialize all weights with random values.
• Forward computations: for each training vector $x(i)$ compute all $v_j^r(i)$, $y_j^r(i)$.
• Backward computations: for each $i$, $j$ and $r = L, L-1, \dots, 2$ compute $\delta_j^{r-1}(i)$.
• Update weights:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial E(i)}{\partial w_j^r} = w_j^r(\text{old}) - \mu \, \delta_j^r(i) \, y^{r-1}(i)$
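A compact sketch of one per-sample update for a network with a single hidden layer and logistic activations. The layer layout, the learning rate value, and the choice to store the bias as the last column of each weight matrix are assumptions made for the example, not prescriptions from the slides:

```python
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_deriv(v, a=1.0):
    s = logistic(v, a)
    return a * s * (1.0 - s)

def backprop_step(x, y, W1, W2, mu=0.1):
    """One weight update for a two-layer perceptron on a single sample (x, y).
    W1: hidden-layer weights, shape (n_hidden, n_in + 1); last column is the bias.
    W2: output-layer weights, shape (n_out, n_hidden + 1); last column is the bias."""
    # Forward computations: activations v^r and outputs y^r of every layer.
    x_aug = np.append(x, 1.0)
    v1 = W1 @ x_aug
    y1 = logistic(v1)
    y1_aug = np.append(y1, 1.0)
    v2 = W2 @ y1_aug
    y_hat = logistic(v2)

    # Backward computations: delta^L at the output layer, then delta^{L-1}.
    delta2 = (y_hat - y) * logistic_deriv(v2)               # e_j(i) * f'(v_j^L)
    delta1 = (W2[:, :-1].T @ delta2) * logistic_deriv(v1)   # (sum_k delta_k w_kj) * f'(v_j^{L-1})

    # Update weights: w(new) = w(old) - mu * delta^r * y^{r-1}.
    W2 -= mu * np.outer(delta2, y1_aug)
    W1 -= mu * np.outer(delta1, x_aug)
    return W1, W2
```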
MLP issues

• What is the best network configuration?
• How do we choose a proper learning parameter $\mu$?
• When should training be stopped?
• Should we choose another threshold function $f$ or another cost function $J$?