Dr. Chandrabose Aravindan
<[email protected]>
Presented at:
Workshop on Machine Learning for Image Analysis
SSN, Chennai
Outline
1 Introduction
5 Back-Propagation Algorithm
7 Summary
[Figure: training data in feature space (f1, f2): positive (+) and negative (−) examples]
g(x) = w^T x + w_0 = 0
In two dimensions this is a line; in three dimensions it is a plane; and in general it is a hyperplane
Thus, we are looking for a geometric model (hyperplane) defined by
weights as model parameters
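The decision function above can be sketched in a few lines of Python. The weight vector and bias below are made-up values for illustration only:

```python
import numpy as np

# A sketch of the linear discriminant g(x) = w^T x + w_0.
# These particular weights are hypothetical, chosen just to illustrate.
w = np.array([2.0, -1.0])   # weight vector (normal to the hyperplane)
w0 = -1.0                   # bias term

def g(x):
    """Signed score of x relative to the hyperplane g(x) = 0."""
    return w @ x + w0

def classify(x):
    """Predict +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if g(x) >= 0 else -1

print(classify(np.array([2.0, 1.0])))   # g = 2*2 - 1 - 1 = 2, so +1
print(classify(np.array([0.0, 1.0])))   # g = 0 - 1 - 1 = -2, so -1
```

Learning then amounts to choosing w and w_0 from the labeled examples.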
Binary Classification — Boundary
[Figure: a linear boundary in feature space (f1, f2) separating positive (+) from negative (−) examples; query points marked "?" are classified by the side of the boundary on which they fall]
For any two points x1 and x2 on the boundary: w^T x1 + w_0 = w^T x2 + w_0 = 0
Note that the vector x1 − x2 lies on the hyperplane, and its dot product with the weight vector is 0.
Hence, the weight vector w is orthogonal to the hyperplane and points in the positive direction.
The distance of the hyperplane from the origin is:
d = |w_0| / ||w||
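Both geometric facts can be checked numerically. The weight vector and points below are made-up values chosen so that both points lie exactly on the hyperplane:

```python
import numpy as np

# Hypothetical hyperplane 3*x + 4*y - 10 = 0, for illustration only.
w = np.array([3.0, 4.0])
w0 = -10.0

# Two points on the hyperplane w^T x + w0 = 0, constructed by hand:
x1 = np.array([2.0, 1.0])    # 3*2 + 4*1 - 10 = 0
x2 = np.array([-2.0, 4.0])   # 3*(-2) + 4*4 - 10 = 0

print(w @ (x1 - x2))                # 0: w is orthogonal to the hyperplane
print(abs(w0) / np.linalg.norm(w))  # distance from origin = 10 / 5 = 2.0
```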
[Diagram: input x is fed to a Model h(x) (defined by model parameters); the model's output is compared with the target f(x), and the error is fed back to adjust the parameters]
Major Issue
Will this feedback loop converge?
Major Issue
Will the model generalize beyond the training samples?
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 21 / 62
Start with some initial weight vector (including the bias component w_0)
If a positive example x_i is misclassified, i.e. g(x_i) < 0 leading to a negative response −1 while the target y_i = +1, we need to increase the weight: w′ = w + η x_i
If a negative example x_j is misclassified, i.e. g(x_j) > 0 leading to a positive response +1 while the target y_j = −1, we need to decrease the weight: w′ = w − η x_j
These can be combined into a single update rule. When an example x_i is misclassified, update the weights as follows: w′ = w + η y_i x_i
This process is repeated until there are no more misclassified examples.
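The steps above can be sketched as a small Python implementation. The toy data set is made up for illustration; the epoch cap is an added safeguard, since the loop only terminates on its own for linearly separable data:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: on a misclassified example x_i,
    update w' = w + eta * y_i * x_i. Repeats until no example is
    misclassified, or max_epochs passes have been made."""
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 to carry the bias w_0
    w = np.zeros(X.shape[1])                  # initial weight vector
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:            # misclassified (or on the boundary)
                w += eta * yi * xi
                mistakes += 1
        if mistakes == 0:                     # converged: every example classified
            break
    return w

# Toy linearly separable data, made up for illustration:
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w[1:] + w[0]))  # all four training labels recovered
```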
∂E/∂w_j = Err × ∂Err/∂w_j
        = Err × ∂/∂w_j [ y − f( Σ_{j=0}^{n} w_j x_j ) ]
        = −Err × f′(inp) × x_j
Since the gradient shows the direction in which the error function is growing, we "descend" in the opposite direction
But, what should be the quantum of change in that direction?
We use a parameter called learning rate to control this and arrive at the following rule:
w′_j = w_j + η × Err × f′(inp) × x_j
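This update rule can be sketched for a single sigmoid unit, where f′(inp) = f(inp)(1 − f(inp)). The weights, the input, and η below are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(w, x, y, eta=0.5):
    """One gradient-descent step per the rule above:
    w_j' = w_j + eta * Err * f'(inp) * x_j, with Err = y - f(inp)."""
    inp = w @ x                     # weighted sum; bias carried by x[0] = 1
    out = sigmoid(inp)              # unit activation f(inp)
    err = y - out                   # Err term
    w = w + eta * err * out * (1 - out) * x
    return w, err

# Hypothetical single training example, for illustration only.
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])      # x[0] = 1 carries the bias weight
for _ in range(200):
    w, err = delta_rule_step(w, x, y=1.0)
print(round(err, 3))                # the error shrinks toward 0 as w adapts
```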
Any continuous function can be represented with two layers and any
function with three layers [Hornik et al., 1989]
Combine two opposite facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
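The ridge-and-bump construction can be sketched numerically with sigmoids as smooth threshold functions. The steepness k and the 1.5 threshold are made-up values chosen to make the effect visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge(t, k=10.0):
    # Two opposite-facing sigmoids: high for |t| < 1, near zero outside.
    return sigmoid(k * (t + 1)) - sigmoid(k * (t - 1))

def bump(x, y, k=10.0):
    # The sum of two perpendicular ridges is close to 2 only where both
    # are high; thresholding that sum isolates a localized bump.
    return sigmoid(k * (ridge(x) + ridge(y) - 1.5))

print(round(bump(0.0, 0.0), 2))   # inside the bump: close to 1
print(round(bump(3.0, 0.0), 2))   # outside the bump: close to 0
```

Summing such bumps of various sizes and locations is what lets the surface fit an arbitrary function.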
[Diagram: feed-forward network with a hidden layer and an output layer; output-layer deltas ∆_i are propagated back through the weights W_{j,i} to form hidden-layer deltas ∆_j]
Derivation of back-propagation weight update rules
E = (1/2) Σ_i (y_i − a_i)²
∂E/∂W_{J,I} = −(y_I − a_I) ∂a_I/∂W_{J,I} = −(y_I − a_I) ∂g(inp_I)/∂W_{J,I}
            = −(y_I − a_I) g′(inp_I) ∂inp_I/∂W_{J,I}
            = −(y_I − a_I) g′(inp_I) ∂/∂W_{J,I} [ Σ_j W_{j,I} a_j ]
            = −∆_I a_J
∂E/∂W_{K,J} = −Σ_i (y_i − a_i) ∂a_i/∂W_{K,J} = −Σ_i (y_i − a_i) ∂g(inp_i)/∂W_{K,J}
            = −Σ_i (y_i − a_i) g′(inp_i) ∂inp_i/∂W_{K,J} = −Σ_i ∆_i ∂/∂W_{K,J} [ Σ_j W_{j,i} a_j ]
            = −Σ_i ∆_i W_{J,i} ∂a_J/∂W_{K,J} = −Σ_i ∆_i W_{J,i} ∂g(inp_J)/∂W_{K,J}
            = −Σ_i ∆_i W_{J,i} g′(inp_J) ∂inp_J/∂W_{K,J} = −Σ_i ∆_i W_{J,i} g′(inp_J) ∂/∂W_{K,J} [ Σ_k W_{k,J} a_k ]
            = −[ g′(inp_J) Σ_i ∆_i W_{J,i} ] a_K = −∆_J a_K
Output Layer: W_{j,i} ← W_{j,i} + η × a_j × ∆_i
where ∆_i = Err_i × g′(inp_i)
Hidden Layer: Back-propagate the error from the output layer and use it for updating the weights:
∆_j = g′(inp_j) Σ_i W_{j,i} ∆_i
W_{k,j} ← W_{k,j} + η × a_k × ∆_j
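The two update rules can be sketched for a network with a single hidden layer, taking g as the sigmoid so that g′(inp) = a(1 − a). The layer sizes, training example, and η are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_kj, W_ji, x, y, eta=0.5):
    """One back-propagation step implementing the update rules above."""
    # Forward pass
    inp_j = W_kj @ x               # hidden-layer weighted sums
    a_j = sigmoid(inp_j)           # hidden activations
    inp_i = W_ji @ a_j             # output-layer weighted sums
    a_i = sigmoid(inp_i)           # output activations

    # Output layer: Delta_i = Err_i * g'(inp_i)
    delta_i = (y - a_i) * a_i * (1 - a_i)
    # Hidden layer: Delta_j = g'(inp_j) * sum_i W_ji * Delta_i
    delta_j = a_j * (1 - a_j) * (W_ji.T @ delta_i)

    # Weight updates W <- W + eta * a * Delta, as outer products
    W_ji = W_ji + eta * np.outer(delta_i, a_j)
    W_kj = W_kj + eta * np.outer(delta_j, x)
    return W_kj, W_ji

rng = np.random.default_rng(0)
W_kj = rng.normal(size=(3, 2))     # 2 inputs -> 3 hidden units
W_ji = rng.normal(size=(1, 3))     # 3 hidden units -> 1 output
x, y = np.array([1.0, -1.0]), np.array([1.0])

for _ in range(500):
    W_kj, W_ji = backprop_step(W_kj, W_ji, x, y)
out = sigmoid(W_ji @ sigmoid(W_kj @ x))
print(out)                          # the output approaches the target y = 1
```

Repeating the step drives the squared error on this example toward zero; with many examples, the same step is applied per example or per batch.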
7 Summary
Machine learning is about using the right features to build the right
models that achieve the right tasks [Flach, 2012]
In this talk, we have focused on finding a linear discriminant model (a hyperplane in the feature space) for the binary classification problem
The model has to be constructed from examples (inductive learning)
that are properly labeled (supervised learning). Further, the model
has to be used for predicting the class of a new instance (predictive
analytics)
A hyperplane in the feature space that properly separates positive and
negative examples can be constructed by Perceptron or LMS learning
algorithms
It has been shown that these algorithms converge for linearly
separable problems
These algorithms can only find linear models and are not suitable for
problems that are not linearly separable
From a neural networks perspective, we need hidden layers to handle such problems
Back-propagation of errors is the basic mechanism used to arrive at
algorithms for learning weights for such neural networks
However, the back-propagation algorithm has several issues: it is not guaranteed to converge, it may get trapped in local minima, etc.
We have highlighted a few points for overcoming these limitations and suggested a process for applying ANNs to solve a problem
Fahlman, S. E. (1988).
An empirical study of learning speed in back-propagation networks.
Technical Report CMU-CS-88-162, Carnegie Mellon University.
Flach, P. (2012).
Machine Learning: The art and science of algorithms that make sense
of data.
Cambridge University Press.
Hagan, M. T. and Menhaj, M. B. (1994).
Training feedforward networks with the Marquardt algorithm.
IEEE Transactions on Neural Networks, 5:989–993.
Hassoun, M. H. (1995).
Fundamentals of Artificial Neural Networks.
The MIT Press.
LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D.
(1990).
Optimal brain damage.
In Advances in Neural Information Processing Systems, volume II.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Moller, M. F. (1993).
A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6:525–533.
Tollenaere, T. (1990).
SuperSAB: Fast adaptive backpropagation with good scaling
properties.
Neural Networks, 3:561–573.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon,
D. L. (1988).
Accelerating the convergence of the backpropagation method.
Biological Cybernetics, 58:257–263.
Yam, J. Y. F. and Chow, T. W. S. (2001).
Feed forward networks training speed enhancement by optimal
initialization of the synaptic coefficients.
IEEE Transactions on Neural Networks, 12(2):430–434.