Dr. Chandrabose Aravindan
<[email protected]>
Presented at:
Workshop on Machine Learning for Image Analysis
SSN, Chennai
Outline
1 Introduction
5 Back-Propagation Algorithm
7 Summary
[Figure: training data in feature space (f1, f2): positive (+) and negative (−) examples]
g(x) = w^T x + w_0 = 0
In two dimensions this is a line; in three dimensions it is a plane; and in general it is a hyperplane
Thus, we are looking for a geometric model (hyperplane) defined by
weights as model parameters
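The decision function above can be sketched in a few lines of Python. The weight vector and bias below are made-up values for illustration only:

```python
import numpy as np

# A sketch of the linear discriminant g(x) = w^T x + w_0.
# These particular weights are hypothetical, chosen just to illustrate.
w = np.array([2.0, -1.0])   # weight vector (normal to the hyperplane)
w0 = -1.0                   # bias term

def g(x):
    """Signed score of x relative to the hyperplane g(x) = 0."""
    return w @ x + w0

def classify(x):
    """Predict +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if g(x) >= 0 else -1

print(classify(np.array([2.0, 1.0])))   # g = 2*2 - 1 - 1 = 2, so +1
print(classify(np.array([0.0, 1.0])))   # g = 0 - 1 - 1 = -2, so -1
```

Learning then amounts to choosing w and w_0 from the labeled examples.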
Binary Classification — Boundary
[Figure: a linear boundary in feature space (f1, f2) separating positive (+) from negative (−) examples; query points marked "?" are classified by the side of the boundary on which they fall]
For any two points x1 and x2 on the boundary: w^T x1 + w_0 = w^T x2 + w_0 = 0
Note that the vector x1 − x2 lies on the hyperplane, and its dot product with the weight vector is 0.
Hence, the weight vector w is orthogonal to the hyperplane and points in the positive direction.
The distance of the hyperplane from the origin is:
d = |w_0| / ||w||
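Both geometric facts can be checked numerically. The weight vector and points below are made-up values chosen so that both points lie exactly on the hyperplane:

```python
import numpy as np

# Hypothetical hyperplane 3*x + 4*y - 10 = 0, for illustration only.
w = np.array([3.0, 4.0])
w0 = -10.0

# Two points on the hyperplane w^T x + w0 = 0, constructed by hand:
x1 = np.array([2.0, 1.0])    # 3*2 + 4*1 - 10 = 0
x2 = np.array([-2.0, 4.0])   # 3*(-2) + 4*4 - 10 = 0

print(w @ (x1 - x2))                # 0: w is orthogonal to the hyperplane
print(abs(w0) / np.linalg.norm(w))  # distance from origin = 10 / 5 = 2.0
```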
[Diagram: input x is fed to a Model h(x) (defined by model parameters); the model's output is compared with the target f(x), and the error is fed back to adjust the parameters]
Major Issue
Will this feedback loop converge?
Major Issue
Will the model generalize beyond the training samples?
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 21 / 62
Start with some initial weight vector (including the bias component w_0)
If a positive example x_i is misclassified, i.e. g(x_i) < 0 leading to a negative response −1 while the target y_i = +1, we need to increase the weight: w′ = w + η x_i
If a negative example x_j is misclassified, i.e. g(x_j) > 0 leading to a positive response +1 while the target y_j = −1, we need to decrease the weight: w′ = w − η x_j
These can be combined into a single update rule. When an example x_i is misclassified, update the weights as follows: w′ = w + η y_i x_i
This process is repeated until there are no more misclassified examples.
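The steps above can be sketched as a small Python implementation. The toy data set is made up for illustration; the epoch cap is an added safeguard, since the loop only terminates on its own for linearly separable data:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: on a misclassified example x_i,
    update w' = w + eta * y_i * x_i. Repeats until no example is
    misclassified, or max_epochs passes have been made."""
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 to carry the bias w_0
    w = np.zeros(X.shape[1])                  # initial weight vector
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:            # misclassified (or on the boundary)
                w += eta * yi * xi
                mistakes += 1
        if mistakes == 0:                     # converged: every example classified
            break
    return w

# Toy linearly separable data, made up for illustration:
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w[1:] + w[0]))  # all four training labels recovered
```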
∂E/∂w_j = Err × ∂Err/∂w_j
        = Err × ∂/∂w_j [ y − f( Σ_{j=0}^{n} w_j x_j ) ]
        = −Err × f′(inp) × x_j
Since the gradient shows the direction in which the error function is growing, we "descend" in the opposite direction
But, what should be the quantum of change in that direction?
We use a parameter called learning rate to control this and arrive at the following rule:
w′_j = w_j + η × Err × f′(inp) × x_j
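This update rule can be sketched for a single sigmoid unit, where f′(inp) = f(inp)(1 − f(inp)). The weights, the input, and η below are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(w, x, y, eta=0.5):
    """One gradient-descent step per the rule above:
    w_j' = w_j + eta * Err * f'(inp) * x_j, with Err = y - f(inp)."""
    inp = w @ x                     # weighted sum; bias carried by x[0] = 1
    out = sigmoid(inp)              # unit activation f(inp)
    err = y - out                   # Err term
    w = w + eta * err * out * (1 - out) * x
    return w, err

# Hypothetical single training example, for illustration only.
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])      # x[0] = 1 carries the bias weight
for _ in range(200):
    w, err = delta_rule_step(w, x, y=1.0)
print(round(err, 3))                # the error shrinks toward 0 as w adapts
```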
Any continuous function can be represented with two layers and any
function with three layers [Hornik et al., 1989]
Combine two opposite facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
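The ridge-and-bump construction can be sketched numerically with sigmoids as smooth threshold functions. The steepness k and the 1.5 threshold are made-up values chosen to make the effect visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge(t, k=10.0):
    # Two opposite-facing sigmoids: high for |t| < 1, near zero outside.
    return sigmoid(k * (t + 1)) - sigmoid(k * (t - 1))

def bump(x, y, k=10.0):
    # The sum of two perpendicular ridges is close to 2 only where both
    # are high; thresholding that sum isolates a localized bump.
    return sigmoid(k * (ridge(x) + ridge(y) - 1.5))

print(round(bump(0.0, 0.0), 2))   # inside the bump: close to 1
print(round(bump(3.0, 0.0), 2))   # outside the bump: close to 0
```

Summing such bumps of various sizes and locations is what lets the surface fit an arbitrary function.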
[Diagram: feed-forward network with a hidden layer and an output layer; output-layer deltas ∆_i are propagated back through the weights W_{j,i} to form hidden-layer deltas ∆_j]
Derivation of back-propagation weight update rules
E = (1/2) Σ_i (y_i − a_i)²
∂E/∂W_{J,I} = −(y_I − a_I) ∂a_I/∂W_{J,I} = −(y_I − a_I) ∂g(inp_I)/∂W_{J,I}
            = −(y_I − a_I) g′(inp_I) ∂inp_I/∂W_{J,I}
            = −(y_I − a_I) g′(inp_I) ∂/∂W_{J,I} [ Σ_j W_{j,I} a_j ]
            = −∆_I a_J
∂E/∂W_{K,J} = −Σ_i (y_i − a_i) ∂a_i/∂W_{K,J} = −Σ_i (y_i − a_i) ∂g(inp_i)/∂W_{K,J}
            = −Σ_i (y_i − a_i) g′(inp_i) ∂inp_i/∂W_{K,J} = −Σ_i ∆_i ∂/∂W_{K,J} [ Σ_j W_{j,i} a_j ]
            = −Σ_i ∆_i W_{J,i} ∂a_J/∂W_{K,J} = −Σ_i ∆_i W_{J,i} ∂g(inp_J)/∂W_{K,J}
            = −Σ_i ∆_i W_{J,i} g′(inp_J) ∂inp_J/∂W_{K,J} = −Σ_i ∆_i W_{J,i} g′(inp_J) ∂/∂W_{K,J} [ Σ_k W_{k,J} a_k ]
            = −[ g′(inp_J) Σ_i ∆_i W_{J,i} ] a_K = −∆_J a_K
Output Layer: W_{j,i} ← W_{j,i} + η × a_j × ∆_i
where ∆_i = Err_i × g′(inp_i)
Hidden Layer: Back-propagate the error from the output layer and use it for updating the weights:
∆_j = g′(inp_j) Σ_i W_{j,i} ∆_i
W_{k,j} ← W_{k,j} + η × a_k × ∆_j
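The two update rules can be sketched for a network with a single hidden layer, taking g as the sigmoid so that g′(inp) = a(1 − a). The layer sizes, training example, and η are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_kj, W_ji, x, y, eta=0.5):
    """One back-propagation step implementing the update rules above."""
    # Forward pass
    inp_j = W_kj @ x               # hidden-layer weighted sums
    a_j = sigmoid(inp_j)           # hidden activations
    inp_i = W_ji @ a_j             # output-layer weighted sums
    a_i = sigmoid(inp_i)           # output activations

    # Output layer: Delta_i = Err_i * g'(inp_i)
    delta_i = (y - a_i) * a_i * (1 - a_i)
    # Hidden layer: Delta_j = g'(inp_j) * sum_i W_ji * Delta_i
    delta_j = a_j * (1 - a_j) * (W_ji.T @ delta_i)

    # Weight updates W <- W + eta * a * Delta, as outer products
    W_ji = W_ji + eta * np.outer(delta_i, a_j)
    W_kj = W_kj + eta * np.outer(delta_j, x)
    return W_kj, W_ji

rng = np.random.default_rng(0)
W_kj = rng.normal(size=(3, 2))     # 2 inputs -> 3 hidden units
W_ji = rng.normal(size=(1, 3))     # 3 hidden units -> 1 output
x, y = np.array([1.0, -1.0]), np.array([1.0])

for _ in range(500):
    W_kj, W_ji = backprop_step(W_kj, W_ji, x, y)
out = sigmoid(W_ji @ sigmoid(W_kj @ x))
print(out)                          # the output approaches the target y = 1
```

Repeating the step drives the squared error on this example toward zero; with many examples, the same step is applied per example or per batch.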
7 Summary
Machine learning is about using the right features to build the right
models that achieve the right tasks [Flach, 2012]
In this talk, we have focused on finding a linear discriminant model (a hyperplane in the feature space) for the binary classification problem
The model has to be constructed from examples (inductive learning)
that are properly labeled (supervised learning). Further, the model
has to be used for predicting the class of a new instance (predictive
analytics)
A hyperplane in the feature space that properly separates positive and
negative examples can be constructed by Perceptron or LMS learning
algorithms
It has been shown that these algorithms converge for linearly
separable problems
These algorithms can only find linear models and are not suitable for
problems that are not linearly separable
From a neural networks perspective, we need hidden layers to handle such problems
Back-propagation of errors is the basic mechanism used to arrive at
algorithms for learning weights for such neural networks
However, the back-propagation algorithm has several issues: it is not guaranteed to converge, it may get trapped in local minima, etc.
We have highlighted a few points for overcoming these limitations and suggested a process for applying ANNs to solve a problem
Fahlman, S. E. (1988).
An empirical study of learning speed in back-propagation networks.
Technical Report CMU-CS-88-162, Carnegie Mellon University.
Flach, P. (2012).
Machine Learning: The art and science of algorithms that make sense
of data.
Cambridge University Press.
Hagan, M. T. and Menhaj, M. B. (1994).
Training feedforward networks with the Marquardt algorithm.
IEEE Transactions on Neural Networks, 5:989–993.
Hassoun, M. H. (1995).
Fundamentals of Artificial Neural Networks.
The MIT Press.
LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D.
(1990).
Optimal brain damage.
In Advances in Neural Information Processing Systems, volume II.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Moller, M. F. (1993).
A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6:525–533.
Tollenaere, T. (1990).
SuperSAB: Fast adaptive backpropagation with good scaling
properties.
Neural Networks, 3:561–573.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon,
D. L. (1988).
Accelerating the convergence of the backpropagation method.
Biological Cybernetics, 58:257–263.
Yam, J. Y. F. and Chow, T. W. S. (2001).
Feed forward networks training speed enhancement by optimal
initialization of the synaptic coefficients.
IEEE Transactions on Neural Networks, 12(2):430–434.