What Is Neural Network Technology?

The document discusses neural networks. It begins by defining neural networks and their key components, including processing elements analogous to neurons and interconnections like dendrites and axons. The document then contrasts the architectures of neural networks and von Neumann computers. Finally, it discusses concepts in inductive learning with neural networks including training algorithms like backpropagation and error correction procedures. The goal of neural networks is to learn patterns from examples through training rather than being explicitly programmed.

Neural Network

• Introduction
  - What is a neural network?
  - What is learning?
  - Symbolic learning vs. neural-net learning

• Training a Perceptron
  - Gradient Descent Method
  - Widrow-Hoff Procedure
  - Generalized Delta Procedure
  - Error-Correction Procedure

• Training a Multi-Layer Perceptron
  - Backpropagation Method

• Hopfield Net and Hamming Net

What is Neural Network Technology?

• A new method of computing
• Based on research into the human brain
• Systems are trained rather than programmed to accomplish tasks
• Has proven successful at solving problems that are difficult or impossible to solve with conventional computing techniques

* Formal definition: A neural network is a non-programmed, adaptive information-processing system based on research into how the brain encodes and processes information.
What Makes Up A Neural Network?

  Neural Net                Neurobiology
  Processing Element        Neuron
  Interconnection Scheme    Dendrites and Axons
  Learning Law              Neurotransmitters
<Neuron Model>  ->  Processing Element

(A figure, not reproduced, shows a neuron-like processing element: dendrites carry the inputs x_i across synapses with weights w_i into the soma, where a summer computes

    s = Σ_{i=1}^{N} w_i x_i,

a threshold unit T compares the sum against a threshold, and the axon carries the output.)

Neural Net vs. Von Neumann Computer

  Neural Net                                         Von Neumann
  Non-algorithmic                                    Algorithmic
  Trained                                            Programmed with instructions
  Memory and processing elements the same            Memory and processing separate
  Pursues multiple hypotheses simultaneously         Pursues one hypothesis at a time
  Fault tolerant                                     Not fault tolerant
  Non-logical operation                              Highly logical operation
  Adaptation or learning                             Only algorithmic parameter modification
  Seeks answers by finding minima in solution space  Seeks answers by following a logical tree structure
Inductive learning: learning from examples

Concepts of inductive learning

• Example: a pair (x, f(x))
• Inductive inference
  - from a collection of examples of f, produce a hypothesis h that approximates f
• Hypothesis: h
  - an approximation of f; the agent's belief about f
  - hypothesis space: the set of all expressible hypotheses
• Bias
  - any preference for one hypothesis over another
  - needed because there are many hypotheses consistent with the examples

Inductive learning: concepts of inductive learning

- Examples of biased hypotheses
  (A figure, not reproduced, shows panels (a)-(d): the same set of examples fitted by three different hypotheses.)

• Representation of functions
  - expressiveness: a perceptron can't learn XOR
  - efficiency: the number of examples needed for good generalization
  - the goal is 'a good set of sentences'
Learning procedure
1. Collect a large set of examples.
2. Divide it into two disjoint sets: a training set and a test set.
3. Use the learning algorithm, with the training set as examples, to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1-4 for different sizes of training sets, and for different randomly selected training sets of each size.
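The five steps above can be sketched in a few lines of Python. This is an illustrative harness, not from the text: the function names (`evaluate_learner`, `majority_learner`) and the toy majority-label learner are my own inventions to make the procedure concrete.

```python
import random

def evaluate_learner(examples, train_fraction, learn, trials=10):
    """Steps 1-5: split examples into disjoint training/test sets,
    learn a hypothesis H from the training set, and measure H's
    accuracy on the held-out test set, over several random splits."""
    accuracies = []
    for _ in range(trials):
        shuffled = examples[:]                 # step 5: different random selections
        random.shuffle(shuffled)
        n_train = int(train_fraction * len(shuffled))
        train, test = shuffled[:n_train], shuffled[n_train:]   # step 2: disjoint sets
        h = learn(train)                       # step 3: generate hypothesis H
        correct = sum(1 for x, d in test if h(x) == d)         # step 4: measure
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# toy example: a trivially biased learner that always predicts the majority label
examples = [((i,), i % 2) for i in range(100)]

def majority_learner(train):
    ones = sum(d for _, d in train)
    label = 1 if ones * 2 >= len(train) else 0
    return lambda x: label

acc = evaluate_learner(examples, train_fraction=0.7, learn=majority_learner)
```

Plotting `acc` against training-set size is what produces the learning curves step 5 alludes to.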

Introduction:

S-R Learning with TLU/NN

• Experiences:  E = {(X, a) | X ∈ Ξ, a ∈ {0, 1}}
  - a sensory data set Ξ = {X | X = (x_1, ..., x_i, ..., x_n)}, each X paired with its proper action a

• Knowledge
  - a function f : X → a from sensory data to the proper action

• Representation
  1. a single TLU with adjustable weights
  2. a multi-layer perceptron
Training Single TLU
• TLU geometry
  - TLU definition
    • an internal knowledge representation
    • an abstract computation tool that calculates

        f(X) = f_s(W · X − θ)

      input: X, output: f(X), transfer function: f_s(s), weight vector: W, threshold: θ

    (A figure, not reproduced, shows the 0/1 step transfer function f_s(s): 0 below the threshold, 1 above it.)
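The definition above translates directly into code. A minimal sketch, with the 0/1 step transfer; the function name `tlu` and the AND example are mine:

```python
def tlu(weights, theta, x):
    """Threshold logic unit: f(X) = f_s(W·X − θ) with a 0/1 step transfer."""
    s = sum(w * xi for w, xi in zip(weights, x)) - theta
    return 1 if s >= 0 else 0

# example: with W = (1, 1) and θ = 1.5 the TLU computes logical AND
out = tlu([1.0, 1.0], 1.5, [1, 1])
```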

Training Single TLU:

- Geometric interpretation of TLU computation

  • f(X) = 0 if X lies on one side of a hyperplane,
    f(X) = 1 if X lies on the other side of the hyperplane.

  A hyperplane in R^n space:

      W · X − θ = 0

  * A hyperplane in R^{n+1} space:

      Y = W · X − θ
Training Single TLU:

- 3.2.4 The Widrow-Hoff Procedure
  • Uses the linear transfer function f_s(s) ≡ s

      ∂ε/∂W = −2(d − f)X
      W' ← W − (1/2)c ∂ε/∂W
      W' ← W + c(d − f)X

- 3.2.5 The Generalized Delta Procedure
  • Uses the sigmoid transfer function f_s(s) ≡ 1/(1 + e^{−s}), whose derivative is f' = f(1 − f)

      ∂ε/∂W = −2(d − f)f(1 − f)X
      W' ← W − (1/2)c ∂ε/∂W
      W' ← W + c(d − f)f(1 − f)X
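The two update rules differ only in the transfer function and its derivative. A minimal sketch of one incremental step of each (function names are mine; thresholds are omitted, i.e. W·X is taken on augmented vectors):

```python
import math

def widrow_hoff_step(w, x, d, c):
    """One incremental update with linear transfer f_s(s) = s:
    W' = W + c (d − f) X."""
    f = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + c * (d - f) * xi for wi, xi in zip(w, x)]

def generalized_delta_step(w, x, d, c):
    """One incremental update with sigmoid transfer f_s(s) = 1/(1+e^−s):
    W' = W + c (d − f) f (1 − f) X, using the derivative f' = f(1 − f)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    f = 1.0 / (1.0 + math.exp(-s))
    return [wi + c * (d - f) * f * (1 - f) * xi for wi, xi in zip(w, x)]

w_wh = widrow_hoff_step([0.0, 0.0], [1.0, 1.0], 1.0, 0.1)
w_gd = generalized_delta_step([0.0, 0.0], [1.0, 1.0], 1.0, 0.5)
```

Note the sigmoid version scales the correction by f(1 − f), so updates shrink as the unit saturates near 0 or 1.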

Training Single TLU:

The Error-Correction Procedure

• Uses the 0/1 threshold transfer function
• Adjusts weights only when (d − f) is 1 or −1, i.e. only on misclassified examples
• Uses the same weight-update rule:

    W' ← W + c(d − f)X

Known theorem:

- If there exists some weight vector W that produces a correct output for all input vectors, the error-correction procedure will find such a weight vector and terminate.
- If no such vector W exists, the error-correction procedure will never terminate.
- The Widrow-Hoff and Generalized Delta procedures will find minimum-squared-error solutions even when no perfect solution W exists.
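The error-correction procedure is the classic perceptron training loop. A minimal sketch on augmented vectors (function name and the AND example are mine); per the theorem, it terminates on linearly separable data:

```python
def error_correction_train(examples, n, c=1.0, max_epochs=100):
    """Error-correction procedure: the weights change only when the
    0/1 threshold output is wrong, via W' = W + c (d − f) X on
    augmented vectors.  Terminates when every example is classified
    correctly (guaranteed if the examples are linearly separable)."""
    w = [0.0] * (n + 1)                      # last weight plays the role of −θ
    for _ in range(max_epochs):
        errors = 0
        for x, d in examples:
            xa = list(x) + [1.0]             # augmented input (x1, ..., xn, 1)
            f = 1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else 0
            if f != d:                       # here (d − f) is +1 or −1
                w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:
            return w                         # converged: all examples correct
    return w

# logical AND is linearly separable, so the procedure terminates
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = error_correction_train(and_examples, n=2)
```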
Training Single TLU:

• 3.2.2 Augmented-Vector Notation

  - For simplicity of the mathematics, define

      s = W · X − θ = (w_1, w_2, ..., w_n) · (x_1, x_2, ..., x_n) − θ
        = W' · X' = (w_1, w_2, ..., w_n, −θ) · (x_1, x_2, ..., x_n, 1)

• 3.2.3 Gradient Descent Methods

  - Learning is a search over internal representations for one that maximizes/minimizes some evaluation function:

      e = Σ_{X_i ∈ Ξ} (d_i − f_i)²

    where f_i = f_s(W · X_i) is the TLU output and d_i is the desired output.

  - How do we find a weight vector W that minimizes e?
    • Gradient descent (greedy optimization) using ∂ε/∂W
    • Incremental learning: adjust W to slightly reduce ε = (d − f)² for one X_i at a time
    • Batch learning: adjust W to reduce e over all X_i

Training Single TLU:

- Gradient descent learning rule (single TLU, incremental), with ε = (d − f)² = (d − f_s(W · X))² and s = W · X:

    ∂ε/∂W ≝ (∂ε/∂w_1, ..., ∂ε/∂w_i, ..., ∂ε/∂w_{n+1})

    ∂ε/∂W = (∂ε/∂s)(∂s/∂W)

    ∂s/∂W = ∂(W · X)/∂W = X,   so   ∂ε/∂W = (∂ε/∂s) X

    ∂ε/∂s = ∂(d − f_s)²/∂s = −2(d − f_s) ∂f_s/∂s

    ∂ε/∂W = −2(d − f) (∂f_s/∂s) X

  Case 1) f_s(s) ≡ s:  ∂f_s/∂s = 1,  so

    ∂ε/∂W = −2(d − f) X

  Case 2) f_s(s) ≡ 1/(1 + e^{−s}):  ∂f_s/∂s = f_s(1 − f_s),  so

    ∂ε/∂W = −2(d − f) f (1 − f) X
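The derivation for case 2 can be checked numerically: the analytic gradient −2(d − f)f(1 − f)X should match a finite-difference estimate of ∂ε/∂W. A sketch (function names are mine):

```python
import math

def eps_sigmoid(w, x, d):
    """Squared error ε = (d − f_s(W·X))² with sigmoid transfer."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    f = 1.0 / (1.0 + math.exp(-s))
    return (d - f) ** 2

def grad_analytic(w, x, d):
    """The derived gradient for case 2: ∂ε/∂W = −2(d − f) f (1 − f) X."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    f = 1.0 / (1.0 + math.exp(-s))
    return [-2 * (d - f) * f * (1 - f) * xi for xi in x]

def grad_numeric(w, x, d, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    g = []
    for i in range(len(w)):
        wp = w[:]; wp[i] += h
        wm = w[:]; wm[i] -= h
        g.append((eps_sigmoid(wp, x, d) - eps_sigmoid(wm, x, d)) / (2 * h))
    return g

w, x, d = [0.5, -0.3], [1.0, 2.0], 1
ga, gn = grad_analytic(w, x, d), grad_numeric(w, x, d)
```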
Training Single TLU:

- Example problem

  The 0/1 situations are linearly separable!

  Homework!!

Neural Networks:

• 3.3 Neural Networks

- 3.3.1 Motivation

  • A single TLU is not enough!
  • There are sets of stimuli and responses that cannot be learned by a single TLU (non-linearly-separable functions).
  • So let's use a network of TLUs!

• Networking TLUs
  - Feedforward net: there is no circuit in the net; the output value depends only on the input values.
  - Recurrent net: there are circuits in the net; the output value depends on the input and on history.
  * Layer of a net: a group of TLUs that take input from, and send output to, TLUs in other groups.

• Example: a 3-layer feedforward network
  - input nodes
  - hidden nodes
  - output node

  It computes f = x1·¬x2 + ¬x1·x2 (XOR).

  * Sometimes this is called a 2-layer network, counting only the layers of TLUs and not the input nodes.
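The XOR example above can be realized with two hidden TLUs and one output TLU. A sketch with hand-chosen weights (the weights and function names are mine, picked for illustration; they are not the only possibility):

```python
def step(s):
    """0/1 threshold transfer."""
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    """The 3-layer feedforward example: two hidden TLUs compute
    x1·¬x2 and ¬x1·x2, and the output TLU ORs them, giving
    f = x1·¬x2 + ¬x1·x2 (XOR) — which no single TLU can represent."""
    h1 = step(1 * x1 - 1 * x2 - 0.5)    # fires only for (1, 0)
    h2 = step(-1 * x1 + 1 * x2 - 0.5)   # fires only for (0, 1)
    return step(h1 + h2 - 0.5)          # OR of the two hidden outputs

truth_table = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```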
Neural Networks:
- 3.3.2 Notation
  • j-th layer output vector: X^(j)
  • Input vector: X^(0) = the input
  • Final (k-th) layer output vector: X^(k) = f
  • Weight vector of the i-th TLU in the j-th layer: W_i^(j) = [w_{l,i}^(j)],  l = 1, 2, ..., m_{j−1} + 1
  • Input (activation) of the i-th TLU in the j-th layer: s_i^(j) = X^(j−1) · W_i^(j)
  • Number of TLUs in the j-th layer: m_j

  (A figure, not reproduced, shows a general k-layer feedforward network.)

- 3.3.3 The Backpropagation Method

  • The gradient of the squared-error function ε with respect to a weight vector W_i^(j) = [w_{l,i}^(j)] is

      ∂ε/∂W_i^(j) ≝ (∂ε/∂w_{1,i}^(j), ..., ∂ε/∂w_{l,i}^(j), ..., ∂ε/∂w_{m_{j−1}+1,i}^(j))

  • Using the activation variable s_i^(j) = X^(j−1) · W_i^(j) and the chain rule,

      ∂ε/∂W_i^(j) = (∂ε/∂s_i^(j)) (∂s_i^(j)/∂W_i^(j)),   ∂s_i^(j)/∂W_i^(j) = X^(j−1)

      ∂ε/∂W_i^(j) = (∂ε/∂s_i^(j)) X^(j−1)

  • Using the derivative of the squared error,

      ∂ε/∂s_i^(j) = ∂(d − f)²/∂s_i^(j) = −2(d − f) ∂f/∂s_i^(j)

      ∂ε/∂W_i^(j) = −2(d − f) (∂f/∂s_i^(j)) X^(j−1)

  • Defining a new variable δ_i^(j) (the activation-error influence)

      δ_i^(j) = (d − f) ∂f/∂s_i^(j)

    we get

      ∂ε/∂s_i^(j) = −2δ_i^(j)   and   ∂ε/∂W_i^(j) = −2δ_i^(j) X^(j−1)

  • A new weight-update rule (gradient descent, W' ← W − (1/2)c ∂ε/∂W):

      W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j−1)
Neural Networks:

- 3.3.4 Computing Weight Changes in the Final Layer (computing δ^(k))

  • By definition,

      δ_i^(j) = (d − f) ∂f/∂s_i^(j)

  • At the final layer there is only one output TLU, so

      δ^(k) = (d − f) ∂f/∂s^(k)

  • Since f is the sigmoid function of s^(k), i.e. f = sigmoid(s^(k)), we have ∂f/∂s^(k) = f(1 − f), so

      δ^(k) = (d − f) f (1 − f)

  • So the backpropagation weight-adjustment rule for the single TLU in the final layer is

      W^(k) ← W^(k) + c^(k) (d − f) f (1 − f) X^(k−1)
- 3.3.5 Computing Changes to the Weights in Intermediate Layers (computing δ_i^(j))

  • Using the chain rule over all m_{j+1} TLUs of layer j+1,

      δ_i^(j) = (d − f) ∂f/∂s_i^(j)
              = (d − f) [ (∂f/∂s_1^(j+1))(∂s_1^(j+1)/∂s_i^(j)) + ... + (∂f/∂s_l^(j+1))(∂s_l^(j+1)/∂s_i^(j)) + ... + (∂f/∂s_{m_{j+1}}^(j+1))(∂s_{m_{j+1}}^(j+1)/∂s_i^(j)) ]
              = Σ_{l=1}^{m_{j+1}} (d − f) (∂f/∂s_l^(j+1)) (∂s_l^(j+1)/∂s_i^(j))
              = Σ_{l=1}^{m_{j+1}} δ_l^(j+1) (∂s_l^(j+1)/∂s_i^(j))

    using (d − f) ∂f/∂s_l^(j+1) = δ_l^(j+1) = −(1/2) ∂ε/∂s_l^(j+1).

  • Using the relationship between activations, s_l^(j+1) = X^(j) · W_l^(j+1) = Σ_v f_v^(j) w_{v,l}^(j+1), and the fact that ∂f_v^(j)/∂s_i^(j) = 0 if v ≠ i,

      ∂s_l^(j+1)/∂s_i^(j) = Σ_v w_{v,l}^(j+1) (∂f_v^(j)/∂s_i^(j)) = w_{i,l}^(j+1) (∂f_i^(j)/∂s_i^(j)) = w_{i,l}^(j+1) f_i^(j) (1 − f_i^(j))

  • Combining these relations,

      δ_i^(j) = Σ_{l=1}^{m_{j+1}} δ_l^(j+1) w_{i,l}^(j+1) f_i^(j) (1 − f_i^(j)) = f_i^(j) (1 − f_i^(j)) Σ_{l=1}^{m_{j+1}} δ_l^(j+1) w_{i,l}^(j+1)

  • The weight-update rule is the same as before (W' ← W − (1/2)c ∂ε/∂W):

      W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j−1)
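The final-layer and intermediate-layer delta rules above can be put together into a small training loop. This is a minimal sketch, not the text's implementation: a 2-input sigmoid network with one hidden layer learning XOR, with all names, the hidden-layer width, and the learning rate chosen by me. Inputs are augmented with a constant 1 so −θ is just another weight.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_xor_backprop(epochs=10000, c=0.5, n_hidden=3, seed=0):
    """Backpropagation per the rules above, for one hidden layer:
      delta_out   = (d − f) f (1 − f)                 (final layer, 3.3.4)
      delta_hid_i = h_i (1 − h_i) · delta_out · w_i   (intermediate, 3.3.5)
      W ← W + c · delta · X                           (update rule)"""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), d in data:
            x = [x1, x2, 1.0]                           # augmented input
            h = [sigmoid(sum(w * xi for w, xi in zip(W1[i], x)))
                 for i in range(n_hidden)]
            ha = h + [1.0]                              # augmented hidden output
            f = sigmoid(sum(w * hi for w, hi in zip(W2, ha)))
            delta_out = (d - f) * f * (1 - f)
            for i in range(n_hidden):                   # backpropagate deltas
                delta_hid = h[i] * (1 - h[i]) * delta_out * W2[i]
                W1[i] = [w + c * delta_hid * xi for w, xi in zip(W1[i], x)]
            W2 = [w + c * delta_out * hi for w, hi in zip(W2, ha)]
    def predict(x1, x2):
        x = [x1, x2, 1.0]
        h = [sigmoid(sum(w * xi for w, xi in zip(W1[i], x)))
             for i in range(n_hidden)] + [1.0]
        return sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return predict

predict = train_xor_backprop()
```

Note the hidden-layer deltas are computed from the output weights before those weights are updated, matching the layer-by-layer recursion.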
(A figure, not reproduced, illustrates the recursive equation for ∂ε/∂W_i^(j) and the dynamic programming used to calculate it: each δ_i^(j) in layer j is computed from the deltas δ_l^(j+1) of layer j+1 through the weights w_{i,l}^(j+1), so the deltas are filled in backward from the k-th layer, and ∂ε/∂W_i^(j) = −2δ_i^(j) X^(j−1) then gives each layer's weight change via W' ← W − (1/2)c ∂ε/∂W.)

Hopfield Net
• Appropriate when exact binary representations are possible.
• Can be used as an associative memory or to solve optimization problems.
• As an associative memory, the Hopfield net has a limitation: the number of classes (M) must be kept smaller than 0.15 times the number of nodes (N).

    M < 0.15 N   (e.g. N = 100  →  M < 15)
Hopfield Neural Net

(A figure, not reproduced, shows a fully connected net: inputs x_0 ... x_{N−1} applied at time zero at the bottom, and outputs x'_0 ... x'_{N−1}, valid after convergence, at the top.)

A Hopfield neural net can be used as a content-addressable memory. An unknown binary input pattern is applied at time zero, and the net then iterates until convergence, when the node outputs remain unchanged. The output is the pattern produced by the node outputs after convergence.

• Hopfield Net Algorithm

Step 1: Assign connection weights

    T_ij = Σ_{s=0}^{M−1} x_i^s x_j^s    for i ≠ j
    T_ij = 0                            for i = j

  where T_ij is the connection weight from node i to node j, and x_i^s = 1 or −1 is the i-th element of the exemplar for class s.

Step 2: Initialize with the unknown input pattern

    m_i(0) = x_i,   0 ≤ i ≤ N − 1

  where m_i(t) is the output of node i at time t, and x_i is the i-th element of the input pattern.
Hopfield Net Algorithm

Step 3: Iterate until convergence

    m_i(t + 1) = F_h( Σ_{j=0}^{N−1} T_ij m_j(t) )

  where F_h is the hard limiter (a figure, not reproduced, shows F_h saturating at +1 and −1).

Step 4: Go to step 2 (repeat for the next unknown pattern).
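Steps 1-3 can be sketched directly in Python. Function names are mine; the tie-breaking choice F_h(0) = +1 is an assumption (some formulations instead keep the previous state at zero input):

```python
def hopfield_weights(patterns):
    """Step 1: T_ij = Σ_s x_i^s x_j^s for i ≠ j, and T_ii = 0.
    Patterns are lists of ±1 values."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def hopfield_recall(T, x, max_iters=100):
    """Steps 2-3: start from the unknown pattern and iterate
    m_i(t+1) = F_h(Σ_j T_ij m_j(t)) until the outputs stop changing."""
    m = list(x)
    for _ in range(max_iters):
        new = [1 if sum(T[i][j] * m[j] for j in range(len(m))) >= 0 else -1
               for i in range(len(m))]
        if new == m:
            return m                   # converged: outputs unchanged
        m = new
    return m

# store one class exemplar (M = 1, N = 8, comfortably within M < 0.15 N scale)
stored = [[1, 1, 1, 1, -1, -1, -1, -1]]
T = hopfield_weights(stored)
# recall from a corrupted probe: one bit flipped relative to the exemplar
recalled = hopfield_recall(T, [1, 1, 1, -1, -1, -1, -1, -1])
```

The net acts as a content-addressable memory: the corrupted probe converges back to the stored exemplar.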
Hamming Net
• An optimum minimum-error classifier: it calculates the Hamming distance to the exemplar of each class and selects the class with the minimum Hamming distance.

• Advantages over the Hopfield net:
  - The Hopfield net performs worse than, or at best equivalently to, the Hamming net.
  - The Hamming net requires fewer connections.
  - The number of connections in the Hamming net grows linearly:

      N²  vs.  NM + M² = M(N + M)
      N = 100, M = 10:  10,000  vs.  1,100 ≈ NM (1,000) when N >> M

Hamming Net
• Network structure

(A figure, not reproduced, shows the two subnets: a lower subnet with weights w_ij that calculates matching scores from the inputs x_0 ... x_{N−1} (data, applied at time zero), and an upper subnet, MAXNET, with weights t_kl that picks the maximum. The outputs y_0 ... y_{M−1} (the class) are valid after MAXNET converges.)
Hamming Net Algorithm
• Step 1: Assign connection weights and offsets

  In the lower subnet:

      w_ij = x_i^j / 2,   θ_j = N / 2,   0 ≤ i ≤ N − 1,  0 ≤ j ≤ M − 1

  In the upper subnet:

      t_kl = 1    if k = l
      t_kl = −ε   if k ≠ l,  with ε < 1/M,   0 ≤ k, l ≤ M − 1

  where w_ij is the connection weight from input i to node j in the lower subnet, and t_kl is the connection weight from node k to node l in the upper subnet.

• Step 2: Initialize with the unknown input pattern

      μ_j(0) = f_t( Σ_{i=0}^{N−1} w_ij x_i + θ_j ),   0 ≤ j ≤ M − 1

  where μ_j(t) is the output of node j in the upper subnet at time t, and x_i is the i-th element of the input. (With these weights and offsets, μ_j(0) equals N minus the Hamming distance to exemplar j.)

• Step 3: Iterate until convergence

      μ_j(t + 1) = f_t( μ_j(t) − ε Σ_{k≠j} μ_k(t) ),   0 ≤ j, k ≤ M − 1

  This process is repeated until convergence.

• Step 4: Go to step 2.

  f_t is the threshold-logic nonlinearity (shown in a figure, not reproduced): zero for negative inputs, linear above.
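The four steps above fit in one function. A sketch under stated assumptions: the function name is mine, exemplars and inputs use ±1 values, and the matching score is computed as N minus the Hamming distance:

```python
def hamming_classify(exemplars, x, eps=None, max_iters=1000):
    """Hamming net sketch: the lower subnet computes each class's
    matching score (N minus the Hamming distance to its exemplar),
    then MAXNET iterates mu_j ← f_t(mu_j − eps·Σ_{k≠j} mu_k)
    until one score survives (lateral inhibition)."""
    N, M = len(x), len(exemplars)
    if eps is None:
        eps = 1.0 / (2 * M)                      # step 1: must satisfy eps < 1/M
    ft = lambda s: max(0.0, s)                   # threshold-logic nonlinearity
    # step 2: lower subnet with w_ij = x_i^j / 2 and offset N/2 gives N − HD
    mu = [ft(sum(e[i] * x[i] for i in range(N)) / 2.0 + N / 2.0)
          for e in exemplars]
    for _ in range(max_iters):                   # step 3: MAXNET iteration
        total = sum(mu)
        new = [ft(mu[j] - eps * (total - mu[j])) for j in range(M)]
        if new == mu:
            break                                # converged
        mu = new
    return max(range(M), key=lambda j: mu[j])    # index of the surviving node
```

For example, with exemplars [1,1,1,1] and [−1,−1,−1,−1], an input one bit away from the first exemplar is assigned to class 0.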
