
Classification Vs Regression

Classification algorithms predict/classify into discrete values such as Male or Female,
True or False, Spam or Not Spam, etc. Regression algorithms predict continuous values
such as price, salary, age, etc.

Classification algorithms:
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification

Regression algorithms:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression

Simple linear model: y = f(x)
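As a quick illustration, a minimal scikit-learn sketch on tiny made-up data (the dataset and numbers are for demonstration only):

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])      # one input feature
y_class = np.array([0, 0, 1, 1])                # discrete labels (e.g. Not Spam / Spam)
y_reg = np.array([10.0, 19.5, 31.0, 39.8])      # continuous target (e.g. price)

clf = LogisticRegression().fit(X, y_class)      # classification: discrete output
reg = LinearRegression().fit(X, y_reg)          # regression: continuous output

print(clf.predict([[2.5]]))                     # a class label, 0 or 1
print(reg.predict([[2.5]]))                     # a continuous value, roughly 25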
Neural networks for classification and regression
Brief History of Neural Networks
Biological Neuron
Artificial Neuron
Activation functions
Exercise
Consider a 2-input neuron with a bias connection, where the weights and bias are given as:
w1 = 2, w2 = -4 and b = -5. Compute the output of the neuron for the input x1 = 15 and x2 = 1.
OR
A single neuron receives inputs from two nodes having weights w1 = 2, w2 = -4 and the bias
b = -5. The inputs to the neuron are x1 = 15 and x2 = 1.
Calculate the output if the activation is:
i) Threshold (hard limit)
net = 30 - 4 - 5 = 21 >= 0
y = f(net) = 1
ii) Linear (purelin)
net = 21
y = f(net) = net = 21
iii) Log sigmoid with sloping value λ = 1
y = 1/(1 + exp(-λ·net)) = 1/(1 + exp(-21)) ≈ 0.999
iv) Hyperbolic tangent with sloping value λ = 0.5
y = 2/(1 + exp(-2λ·net)) - 1 = 2/(1 + exp(-21)) - 1 ≈ 0.999
v) Saturating linear (satlin)
y = 1, since net = 21 > 1
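A small numpy sketch of this exercise (variable names are my own) reproduces the numbers above:

import numpy as np

w = np.array([2.0, -4.0])                        # w1, w2
b = -5.0
x = np.array([15.0, 1.0])                        # x1, x2

net = w @ x + b                                  # 2*15 - 4*1 - 5 = 21

print(1.0 if net >= 0 else 0.0)                  # i) hard limit  -> 1
print(net)                                       # ii) purelin    -> 21
print(1 / (1 + np.exp(-1.0 * net)))              # iii) logsig, slope 1   -> ~0.999
print(2 / (1 + np.exp(-2 * 0.5 * net)) - 1)      # iv) tansig, slope 0.5  -> ~0.999
print(np.clip(net, 0, 1))                        # v) satlin saturates at 1 for net > 1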
Different Network Topologies
How to decide on network topology?
McCulloch-Pitts Model

• The inputs of the McCulloch-Pitts neuron could be either 0 or 1.


• It has only two types of inputs — Excitatory and Inhibitory. Excitatory inputs
have weights of positive magnitude, while inhibitory inputs have weights of
negative magnitude.
• It has a threshold function as an activation function
Implementing Logical functions
Simple McCulloch-Pitts neurons can be used to design logical operations. For that purpose,
the connection weights and the threshold value need to be chosen correctly.
Geometrical interpretation of McCulloch-Pitts model - ANN classifier
Consider the AND function. Since there are two inputs, we can treat the input space as a
2-dimensional Cartesian plane and plot the 4 input combinations on a 2D graph.

All that's left to find is the weights and the threshold value. Given the weights and threshold
value, a decision boundary is plotted using the AND function's aggregation equation:
x1·w1 + x2·w2 + w0 = 0
x1 + x2 - 1.5 = 0

A single neuron is needed to do this "classification".

Take any line that separates the black and the red points, then find its equation. From it
you obtain the weights (the coefficients of x1 and x2) and the threshold value T (the constant
in the line's equation).
x1·w1 + x2·w2 + w0 = 0
x2 = -(w1/w2)·x1 - (w0/w2)
x2 = (slope)·x1 + (bias)

w1 = 1, w2 = 1, w0 = -1.5
x1 + x2 - 1.5 = 0
Binary coding:            Bipolar coding:
x1  x2  y                 x1  x2  y
0   0   -1 (0)            -1  -1  -1
0   1   -1 (0)            -1   1  -1
1   0   -1 (0)             1  -1  -1
1   1    1 (1)             1   1   1

If the net sum >= 0, the output is 1; otherwise the output is 0 (or -1 in the bipolar coding).
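A minimal sketch of this AND classifier as a McCulloch-Pitts neuron (the helper name is mine):

import numpy as np

def mcculloch_pitts(x, w, w0):
    """Fire (1) when the aggregation x1*w1 + x2*w2 + w0 crosses 0, else 0."""
    return 1 if x @ w + w0 >= 0 else 0

w = np.array([1.0, 1.0])        # w1 = w2 = 1
w0 = -1.5                       # threshold folded in as a bias: x1 + x2 - 1.5 >= 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcculloch_pitts(np.array([x1, x2]), w, w0))
# Only (1, 1) fires: the line x1 + x2 - 1.5 = 0 separates it from the other points.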
Rosenblatt's Perceptron built around the McCulloch-Pitts neural model

• The linear combiner or adder node computes the linear combination of the inputs
applied to the synapses, with synaptic weights w1, w2, ..., wn.
• Then, the hard limiter checks whether the resulting sum is positive or negative.
• If the input of the hard limiter node is positive, the output is +1, and if the input is
negative, the output is -1.
• Mathematically, the hard limiter input is the weighted sum: net = Σi wi·xi + b.
Rosenblatt's Perceptron
• The objective of the perceptron is to classify a set of inputs into two classes c1 and c2.
• For the two input signals denoted by the variables x1 and x2, the decision boundary is a
straight line of the form:

So, for a perceptron having the values of the synaptic weights w0, w1 and w2 as -2, 1/2 and 1/4,
respectively, the linear decision boundary will be of the form:

(1/2)·x1 + (1/4)·x2 - 2 = 0

So, any point (x1, x2) which lies above the decision
boundary, as depicted by the graph, will be assigned to
class C1, and the points which lie below the boundary are
assigned to class C2.
Rosenblatt's Perceptron
• For a data set with linearly separable classes, perceptrons can always be employed to
solve classification problems using decision lines (for 2-dimensional space), decision
planes (for 3-dimensional space) or decision hyperplanes (for n-dimensional space).
• Appropriate values of the synaptic weights can be obtained by training a perceptron.
However, one assumption for perceptron to work properly is that the two classes should
be linearly separable i.e. the classes should be sufficiently separated from each other.
• Otherwise, if the classes are non-linearly separable, then the classification problem
cannot be solved by perceptron.
Single Layer Perceptron
Linearly separable problems

W.X + b > 0

W.X+ b < 0
x1 + x2 – 1.5 = 0

W.X + b > 0
W.X+ b < 0

x1 + x2 – 0.5 = 0
Multilayer Perceptron
• A basic perceptron works very successfully for data sets which possess linearly separable
patterns.
• However, in practical situations, Minsky and Papert showed in their 1969 work that a
basic perceptron is not able to learn to compute even a simple 2-bit XOR.
• So, let us understand the reason.
The data is not linearly separable. Only a curved decision
boundary can separate the classes properly. To address this
issue, the other option is to use two decision boundary lines in
place of one. This is the philosophy used to design the multi-
layer perceptron model.

Classification with two decision lines in the XOR function output


Multilayer Perceptron
XOR problem - Linearly not separable
The two decision lines:
-x1 + x2 - 0.5 = 0
x1 - x2 - 0.5 = 0
Single layer Perceptron Learning Rule
• When Rosenblatt introduced the perceptron, he also introduced the perceptron
learning rule(the algorithm used to calculate the correct weights for a perceptron
automatically).

The perceptron algorithm aims to minimize the misclassifications. The objective
function being minimized is also referred to as a loss function.

The weight vector is updated as follows (for an example x with desired output d,
predicted output y and learning rate η):

w(t+1) = w(t) + η·(d - y)·x
Perceptron Algorithm
(t is iteration counter)
Single layer Perceptron Learning algorithm
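A minimal sketch of the learning algorithm (assuming the standard update w ← w + η(d − y)x applied when an example is misclassified), trained here on the linearly separable AND data:

import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron learning rule on bias-augmented inputs."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend the bias input x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):                      # t: iteration counter
        for x, target in zip(Xb, d):
            y = 1 if w @ x >= 0 else 0           # hard-limit output
            w += eta * (target - y) * x          # update only on a misclassification
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])                       # AND: linearly separable
print(train_perceptron(X, d))                    # one of many separating weight vectors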
Learning Rule for Multi Layer Perceptron
• The rule didn’t generalize well for multi-layered networks of perceptrons, thus making
the training process of these machines a lot more complex and, most of the time, an
unknown process.
• This limitation ended up being responsible for a huge disinterest in, and lack of funding
for, neural networks research for more than 10 years.
• In 1986, a paper entitled Learning representations by back-propagating errors by
David Rumelhart and Geoffrey Hinton changed the history of neural networks research.
• It introduced a ground-breaking learning procedure: the backpropagation algorithm.
• The paper proposed the usage of a differentiable function instead of the step function
as the activation for the perceptron. With this modification, a multi-layered network of
perceptrons would become differentiable. Hence gradient descent could be applied to
minimize the network’s error and the chain rule could “back-propagate” proper error
derivatives to update the weights from every layer of the network.
Activation Functions
• And today we have the most used model of a modern perceptron. Even though it
doesn't look much different, it was only in 2012 that Alex Krizhevsky was able to train a
big network of artificial neurons that changed the field of computer vision and started a
new era in neural networks research.

The only noticeable difference from Rosenblatt’s model to the one above is the
differentiability of the activation function. Since 1986, a lot of different activation
functions have been proposed. See some of the most popular examples in the next slides.
Commonly used activation functions are chosen based on a few desirable properties,
such as:
•Nonlinear — When the activation function is non-linear, then a two-layer
neural network can be proven to be a universal function approximator. The
identity activation function does not satisfy this property. When multiple layers
use the identity activation function, the entire network is equivalent to a
single-layer model.
•Range — When the range of the activation function is finite, gradient-based
training methods tend to be more stable, because pattern presentations
significantly affect only limited weights. When the range is infinite, training is
generally more efficient because pattern presentations significantly affect most
of the weights. In the latter case, smaller learning rates are typically necessary.
•Continuously differentiable — This property is desirable (ReLU is not
continuously differentiable and has some issues with gradient-based
optimization, but it is still possible) for enabling gradient-based optimization
methods. The binary step activation function is not differentiable at 0, and it
differentiates to 0 for all other values, so gradient-based methods can make no
progress with it.
•Monotonic — When the activation function is monotonic, the error surface
associated with a single-layer model is guaranteed to be convex.
The Activation Functions can be basically divided into 2 types-
1.Linear Activation Function
2.Non-linear Activation Functions
Linear or Identity Activation Function

Equation : f(x) = x
Range : (-infinity to infinity)
It doesn’t help with the complexity or
various parameters of usual data that
is fed to the neural networks.

• A linear function has limited power and limited ability to handle complexity. It can be used
for a simple task where interpretability is desired.
• The derivative of a linear function is a constant.
• If all layers of the neural network use a linear activation, the network is equivalent to a
single-layer network: composing linear layers just yields another linear function of the
input, as the check below shows.
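A short numpy check of the last point: two stacked linear layers collapse into a single linear layer (the weight matrices here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))            # "layer 1" weights
W2 = rng.normal(size=(3, 2))            # "layer 2" weights
x = rng.normal(size=4)

y_two_layers = (x @ W1) @ W2            # two linear layers in sequence
y_one_layer = x @ (W1 @ W2)             # one equivalent linear layer

print(np.allclose(y_two_layers, y_one_layer))   # True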
Non-linear Activation Function
Sigmoid or Logistic Activation Function- The Sigmoid Function curve looks like a S-shape.

The main reason why we use sigmoid function is because it exists between (0 to 1). Therefore, it is
especially used for models where we have to predict the probability as an output. Since probability
of anything exists only between the range of 0 and 1, sigmoid is the right choice.
• The function is monotonic and differentiable; we can find the slope of the sigmoid
curve at any point.
• The function is centred at 0.5 rather than being zero-centred, which reduces the efficiency
of updating the weights.
• It gives rise to the problem of vanishing gradients: sigmoids saturate and kill gradients,
and convergence is slow.
• The softmax function is a more generalized logistic activation function which is used for
multiclass classification.
Non-linear Activation Function
Tanh or hyperbolic tangent Activation Function - tanh is also like the logistic sigmoid, but
better. The range of the tanh function is (-1 to 1). tanh is also sigmoidal (s-shaped).

• The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be
mapped near zero in the tanh graph.
• The function is monotonic and differentiable. Output is zero centered hence convergence is usually
faster if the average of each input variable over the training set is close to zero.
• The function suffers from the vanishing gradient problem and is computationally expensive
due to its exponential operation. It saturates and kills the gradients.
• The tanh function is mainly used for classification between two classes.
• Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
Non-linear Activation Function
ReLU (Rectified Linear Unit) Activation Function
The ReLU is the most used activation function. It is used in almost all the convolutional
neural networks or deep learning.

• As you can see, the ReLU is half rectified (from the bottom): f(z) is zero when z is less
than zero, and f(z) is equal to z when z is greater than or equal to zero.
• The function is monotonic and continuous. It is not differentiable at the point 0, and the
derivative for negative values is 0.
• ReLU overcomes the vanishing gradient problem and is computationally faster, but it
suffers from the dying ReLU problem for negative values.
• The issue is that all negative values become zero immediately, which decreases the
ability of the model to fit or train from the data properly: any negative input given to the
ReLU activation function turns into zero immediately, so negative values are not mapped
appropriately.
Non-linear Activation Function
Leaky ReLU - It is an attempt to solve the dying ReLU problem.

• The leak extends the range of the ReLU function. Usually, the value of a is 0.01 or so.
• When a is not fixed at 0.01 but chosen randomly, it is called Randomized ReLU.
• The range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and their
derivatives are also monotonic in nature.
• For negative inputs the gradient is small (equal to a), so learning can still be slow for
negative input values.
Non-linear Activation Function

Softmax
• a type of sigmoid function but very useful to handle multi-class classification
problems.
• It can be described as a combination of multiple sigmoidal functions.
• It returns the probability for a data-point belonging to each individual class. The sum
of all the probability values is 1. Hence the output is the probability distribution.
• The softmax function is often used in the final layer of a neural network-based
classifier. Such networks are commonly trained under a log loss (or cross-entropy)
regime, giving a non-linear variant of multinomial logistic regression.
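A minimal sketch of the softmax function (subtracting the max is a standard numerical-stability trick; it does not change the result):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])     # raw outputs of a 3-class final layer
p = softmax(scores)
print(p, p.sum())                      # a probability distribution that sums to 1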
When to use which activation functions
• Usually, if the output ranges between (0,1) or (-1, 1) then sigmoid or tanh can be
used. On the other hand, to predict output values larger than 1, ReLU is commonly
used.
• In the case of a binary classifier, the Sigmoid activation function should be used.
While predicting a probability for a multiclass problem, the softmax activation
function should be used in the last layer.
• Again, tanh or sigmoid usually don't work well in hidden layers. ReLU or Leaky
ReLU should be used in hidden layers. The Swish activation function is used when the
number of hidden layers is large (close to 30).
• However, the use of activation functions mostly depends on the data, problem in hand
and the range of the expected output.
Multi-class Vs Binary Classification
Multi-class Classification: One-vs-All classification

• In multi-class classification, each example belongs to precisely one class. Therefore a
dataset is annotated with the correct class label using a one-hot target output vector
containing zeros, except for the target class, which has a value of one.
• One-vs-All (OvA) classification involves training K different binary
classifiers (output units), each designed to discriminate an instance of a given class
relative to all other classes. (one binary classifier per class)
• To do this, a softmax activation function is used in the output layer, and the weights of
the deep neural network are optimized using the cross-entropy loss function and a
particular optimizer.

The categorical cross-entropy loss JOvA for a single training example is (with one-hot
targets yk and predicted probabilities ŷk):

JOvA = - Σk yk·log(ŷk)

Training: one binary classifier for each class against the rest.
Testing: apply all classifiers; the highest-scoring one wins.
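A small numpy sketch of OvA testing and the categorical cross-entropy loss (the scores and target are made-up values):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([0.2, 1.7, -0.4])     # outputs of K = 3 one-vs-all units
probs = softmax(scores)

target = np.array([0.0, 1.0, 0.0])      # one-hot: the example belongs to class 1

loss = -np.sum(target * np.log(probs))  # categorical cross-entropy J_OvA
print(int(np.argmax(probs)), loss)      # testing: the highest-scoring class wins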
Multi-class Classification: One-vs-One classification

• In OvO classification, instead of using a one-hot target vector that assigns a one to the target
class and zeros to all other classes, we need to construct a method that allows for pairwise
classification.
• Therefore, for K classes, we need to construct a target vector consisting of L =
K(K-1)/2 values (one binary classifier per pair of classes).
• The output units in the deep neural network represent binary classifiers with outputs in the
range [-1, 1] or [0, 1]. A sigmoid activation function is used for the L output units.

Example: Given a dataset with 4 classes (A, B, C, D), the required number of output
neurons is L = 4·3/2 = 6, forming 6 pairwise binary classifiers.

The multi-output binary cross-entropy loss JOvO for an example is computed by summing
over the L output units:

JOvO = - Σl [ yl·log(ŷl) + (1 - yl)·log(1 - ŷl) ]
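A quick check of the L = K(K-1)/2 count for the 4-class example, enumerating the class pairs:

from itertools import combinations

classes = ["A", "B", "C", "D"]              # K = 4
pairs = list(combinations(classes, 2))      # one binary classifier per pair of classes
print(pairs)      # [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]
print(len(pairs))   # L = 4*3/2 = 6 output units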
Loss functions:
Loss functions are helpful to train a neural network. Given an input
and a target, they calculate the loss, i.e. the difference between the output and the
target variable. Loss functions fall under four major categories:

(i) Regressive loss functions:

They are used in the case of regression problems, that is, when the target variable is
continuous. The most widely used regressive loss function is the Squared Error.

Other loss functions are:

1. Absolute Error — measures the mean absolute value of the element-wise
difference between the input and the target;
2. Smooth Absolute Error — a smooth version of the Absolute Error criterion.
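A minimal numpy sketch of these three regressive losses (the smooth version below is the Huber-style form, offered as one reasonable reading of "Smooth Absolute Error"):

import numpy as np

def squared_error(y, t):
    return np.mean((y - t) ** 2)             # MSE

def absolute_error(y, t):
    return np.mean(np.abs(y - t))            # MAE

def smooth_absolute_error(y, t, beta=1.0):
    """Quadratic near zero, linear for large errors (Huber-style)."""
    d = np.abs(y - t)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

y = np.array([2.5, 0.0, 2.0])    # predictions
t = np.array([3.0, -0.5, 2.0])   # continuous targets
print(squared_error(y, t), absolute_error(y, t), smooth_absolute_error(y, t))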
Regressive loss functions: Example

Final activation: Linear or ReLU

Loss function: Mean Squared Error


Loss functions:
(ii) Classification loss functions:
The output variable in a classification problem is usually a probability value f(x),
called the score for the input x. Generally, the magnitude of the score represents
the confidence of our prediction. The target variable y is a binary variable, 1 for
true and -1 for false.

On an example (x, y), the margin is defined as y·f(x). The margin is a measure of
how correct we are. Most classification losses mainly aim to maximize the margin.
Some classification loss functions are:
1. Binary Cross Entropy
2. Negative Log Likelihood
3. Margin Classifier
4. Soft Margin Classifier
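As an illustration, a sketch of two margin-based losses from this list, written in terms of the margin y·f(x) (the function names are mine):

import numpy as np

def hinge_loss(f_x, y):
    """Margin classifier: zero loss only once the margin reaches 1."""
    return np.maximum(0.0, 1.0 - y * f_x)

def soft_margin_loss(f_x, y):
    """Smooth (logistic) variant that keeps pushing the margin up."""
    return np.log(1.0 + np.exp(-y * f_x))

f_x = np.array([2.0, -0.3, 0.8])   # scores
y = np.array([1, 1, -1])           # targets in {-1, +1}
print(hinge_loss(f_x, y))          # larger margin -> smaller loss
print(soft_margin_loss(f_x, y))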
Loss functions:
Predicting a binary outcome

Final activation: sigmoid

Loss function: binary cross-entropy
Loss functions:

Predicting a single label from multiple classes (One-vs-All)

Final activation: softmax

Loss function: categorical cross-entropy

K neurons are required in the output layer for K classes.
Loss functions:

Predicting multiple labels from multiple classes

With 3 sigmoid output units, the possible label patterns are:
0 0 0 - C1
0 0 1 - C2
...
1 1 1 - C8

Final activation: sigmoid

Loss function: binary cross-entropy
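A minimal sketch of this multi-label setup: independent sigmoid outputs scored with binary cross-entropy (the logits and targets are made-up values):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = np.array([2.1, -1.3, 0.4])     # 3 output units -> up to 2**3 label patterns
p = sigmoid(logits)                     # independent per-label probabilities
t = np.array([1.0, 0.0, 1.0])           # multi-hot target: several labels can be 1

bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
print(p.round(3), bce)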
Summary of activation functions and loss functions:

Difference between label encoding and one-hot encoding


Variants of single layer Perceptron

• Perceptron: ŷ = sign(W·X) (classification)
• Linear model: ŷ = W·X, a smooth approximation of the desired goal; the sign function can
be applied for the final prediction (classification). Fig. c: the Widrow-Hoff method, LMS,
ADALINE.
• Logistic output: ŷ = 1/(1 + exp(-W·X)) (classification)
Fisher Discriminant function
Variants of Perceptron

Hinge loss

The sign function can be applied for the final prediction.

Summary
Multilayer Perceptron

Zj = f( Σ(i=1..m) Vij·Xi )        Yk = f( Σ(j=1..n) Wjk·Zj )
Gradient Descent

Optimization refers to the task of minimizing/maximizing an objective
function f(x) parameterized by x. In machine/deep learning terminology, it is the
task of minimizing the cost/loss function J(w) parameterized by the model's
parameters w ∈ Rd. Optimization algorithms (in the case of minimization) have one of
the following goals:
• Find the global minimum of the objective function. This is feasible if the objective
function is convex, i.e. any local minimum is a global minimum.
• Find the lowest possible value of the objective function within its neighborhood.
That is usually the case if the objective function is not convex, as in most
deep learning problems.
Gradient Descent

[Figure: the error curve plotted as a function of a weight m]
Gradient Descent

There are three kinds of optimization algorithms:


• Optimization algorithm that is not iterative and simply solves for one point.
• Optimization algorithm that is iterative in nature and converges to acceptable
solution regardless of the parameters initialization such as gradient descent
applied to logistic regression.
• Optimization algorithm that is iterative in nature and applied to a set of
problems that have non-convex cost functions such as neural networks.
Therefore, parameters’ initialization plays a critical role in speeding up
convergence and achieving lower error rates.

Gradient descent algorithm and its variants:


Batch Gradient Descent,
Mini-batch Gradient Descent, and
Stochastic Gradient Descent.
Gradient Descent
W=W+L
How gradient descent works on logistic regression: Assume that the logistic
regression model has only two parameters: weight w and bias b.
1. Initialize weight w and bias b to any random numbers.
2. Pick a value for the learning rate α. The learning rate determines how big the
step would be on each iteration.
If α is very small, it will take a long time to converge and become computationally
expensive.
If α is too large, it may overshoot the minimum and fail to converge.

The most commonly used rates are: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.

Gradient descent with different learning rates


Gradient Descent (model: ŷ = f(W·X))

3. Make sure to scale the data if the features are on very different scales. If we don't scale
the data, the level curves (contours) will be narrower and taller, which means it will take
a longer time to converge.

Scale the data to have μ = 0 and σ = 1. The formula for scaling each example x is:
x_scaled = (x - μ) / σ
Gradient Descent
4. On each iteration, take the partial derivative of the cost function J(w) w.r.t. each
parameter (the gradient):

If the slope at the current value of w is > 0, this means that we are to the right of the
optimal w*. Therefore, the update will be negative, and w will start getting closer to the
optimal value w*. However, if the slope is negative, the update will be positive and will
increase the current value of w so as to converge to the optimal value w*.

• Continue the process until the cost function converges, that is, until the error curve
becomes flat and doesn't change.
• In addition, on each iteration the step is taken in the direction of maximum change,
since it is perpendicular to the level curves at each step.
Gradient Descent

Batch Gradient Descent
• Batch Gradient Descent sums over all examples on each iteration when performing the
updates to the parameters. Therefore, for each update, we have to sum over all examples.

Mini-batch Gradient Descent
• Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller
number of examples given by the batch size. Therefore, learning happens on each
mini-batch of b examples.

Stochastic Gradient Descent
• Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the
parameter update on each single example (xi, yi). Therefore, learning happens on every
example.
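A schematic sketch of the three variants; only the mini-batch loop is active below, with batch GD and SGD as the b = len(t) and b = 1 special cases (the quadratic loss is just a stand-in):

import numpy as np

def grad(w, X, t):
    """Gradient of the MSE loss for a linear model y = X @ w."""
    return 2 * X.T @ (X @ w - t) / len(t)

rng = np.random.default_rng(0)
X, t = rng.normal(size=(100, 3)), rng.normal(size=100)
alpha, b = 0.05, 10

w = np.zeros(3)
for epoch in range(50):
    # Batch GD would do one update per epoch over all examples:
    #   w -= alpha * grad(w, X, t)
    # Mini-batch GD: one update per slice of b examples.
    for i in range(0, len(t), b):
        w -= alpha * grad(w, X[i:i + b], t[i:i + b])
    # SGD is the special case b = 1: one update per example.
print(w)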
Gradient Descent

gradient descent’s variants and their direction towards the minimum:


XOR Problem

• XOR cannot be modeled with a single-layer feed-forward perceptron.
• This limitation of the perceptron can be eliminated by stacking multiple perceptrons,
called a Multilayer Perceptron (MLP).

MLP can solve XOR problem

Wnew=wold+J
Multi layer Perceptron and Back Propagation
LTU - Linear threshold unit

yj = aj : output of the jth neuron in the output layer
aj = f(netj), where netj = Σi Wij·ai
Wij : output layer weights
ai : output of the ith neuron in the hidden layer
ai = f(neti), where neti = Σk Vki·xk
Vki : hidden layer weights
Multi layer Perceptron and Back Propagation
Back propagation algorithm is a way to train the MLP
Step 1: Forward pass
For each training instance feed the input to the network and compute the output of every
neuron in every consecutive layer.
Step 2: Measure the network output error
Find the difference between the desired output and predicted output.
Step 3: Error back propagation
Compute how much each neuron in the last hidden layer contributed to each
output neuron's error.
Then measure how much of these error contributions came from each neuron in the
previous hidden layer, and so on, until the algorithm reaches the input layer.
This backward pass measures the error gradient across all the connection weights by
propagating the error gradient backward through the network.
Step 4: Adjust the weights to reduce the error- (Gradient descend step)
Multi layer Perceptron and Back Propagation

For the backpropagation algorithm to work, the step function is replaced with the logistic
function. (The step function is not differentiable at 0, and its gradient is zero on the flat
segments.)

Other activation functions:

Hyperbolic tangent activation

ReLU function
Multi layer Perceptron and Back Propagation

Activation functions and their derivatives
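A small sketch of these activations and their derivatives (taking the ReLU derivative to be 0 at z = 0 is a common convention, not the only choice):

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def d_logistic(z):
    s = logistic(z)
    return s * (1 - s)                 # f'(z) = f(z)(1 - f(z))

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)       # 0 for negatives (and, by convention, at 0)

z = np.array([-2.0, 0.0, 2.0])
for f, df in [(logistic, d_logistic), (tanh, d_tanh), (relu, d_relu)]:
    print(f.__name__, f(z).round(3), df(z).round(3))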


Multi layer Perceptron and Back Propagation

Step 4: Adjust the weights to reduce the error (gradient descent step)

The squared error at output node j is E = (t - aj)² = (t - f(Σi wij·ai))²,
where t is the target and aj is the actual output.
Gradient descent is used to minimize this loss function w.r.t. the weights wij:

Δwij = η·δj·ai

where ai is the output of neuron i in the hidden layer, and aj = f(netj) with netj = Σi wij·ai.

δj is the error signal term of output node j:

δj = (t - aj)·f'(netj)    (error × f'(netj); for the logistic function f'(net) = f(net)·(1 - f(net)))

For a hidden node i, the error signal is the weighted sum of the error terms from the next
layer times f'(neti):

δi = (Σj δj·wij)·f'(neti)
Multi layer Perceptron and Back Propagation

Step 1: Forward pass – predict the output


Multi layer Perceptron and Back Propagation
Step 2 & 3: Compute the output error and the error gradient for the
Neurons in all layers
Multi layer Perceptron and Back Propagation

Exercise
The given network is to be trained on the data for XOR.
For the input vector [0, 1], learning rate 0.25 and the logistic activation function, compute
the weight corrections.

Wnew = Wold + ΔW
Solution

Network: inputs xk → (hidden weights vki) → hidden outputs zi → (output weights wij) → output yj.

Δvki = -η·∂E/∂vki = η·δi·xk        Δwij = -η·∂E/∂wij = η·δj·zi

δi = f'(zneti)·Σj δj·wij           δj = (t - y)·f'(ynet)

Δwij = η·(tj - yj)·f'(ynetj)·zi
Δvki = η·f'(zneti)·xk·Σj δj·wij
Multi layer Perceptron and Back Propagation
Solution
Let w1 = -4.5, w2 = 5.3
v11 = -2.0, v12 = 4.3
v21 = 9.2, v22 = 8.8
b1 = 2.0, b2 = -0.1
b3 = -0.8

Step 1 - feed forward pass

z1net = b1 + x1·v11 + x2·v21 = 2.0 + 0 + 9.2 = 11.2
z1 = f(z1net) = 1/(1 + exp(-11.2)) = 0.9999
z2net = b2 + x1·v12 + x2·v22 = -0.1 + 0 + 8.8 = 8.7
z2 = f(z2net) = 1/(1 + exp(-8.7)) = 0.9998
ynet = b3 + z1·w1 + z2·w2 = -0.8 + (0.9999)(-4.5) + (0.9998)(5.3) = -0.0006
y = f(ynet) = 1/(1 + exp(0.0006)) = 0.4998

Error term: δj = (tj - yj)·f'(ynetj) = (1 - 0.4998)·(0.4998)·(1 - 0.4998) = 0.1250

Step 2 - output layer weight corrections

Δwij = η·(tj - yj)·f'(ynetj)·zi = η·δj·zi
Δw1 = 0.25·(0.1250)·(0.9999) = 0.0312
Δw2 = 0.25·(0.1250)·(0.9998) = 0.0312
Δb3 = 0.25·(0.1250)·(1) = 0.0312
Multi layer Perceptron and Back Propagation

Step 3: hidden layer weight corrections

Δvki = η·f'(zneti)·xk·Σj δj·wij

Δv11 = 0.25·(0.9999)(1 - 0.9999)·(0)·[(0.1250)(-4.5)] = 0
Δv12 = 0.25·(0.9998)(1 - 0.9998)·(0)·[(0.1250)(5.3)] = 0
Δv21 = 0.25·(0.9999)(1 - 0.9999)·(1)·[(0.1250)(-4.5)] = -0.000014
Δv22 = 0.25·(0.9998)(1 - 0.9998)·(1)·[(0.1250)(5.3)] = 0.000033
Δb1 = 0.25·(0.9999)(1 - 0.9999)·(1)·[(0.1250)(-4.5)] = -0.000014
Δb2 = 0.25·(0.9998)(1 - 0.9998)·(1)·[(0.1250)(5.3)] = 0.000033
Multi layer Perceptron and Back Propagation

Step 4: update the weights: wnew = wold + Δw, vnew = vold + Δv

v11 = -2.0 + 0 = -2.0
v12 = 4.3 + 0 = 4.3
v21 = 9.2 - 0.000014 = 9.199986
v22 = 8.8 + 0.000033 = 8.800033
b1 = 2.0 - 0.000014 = 1.999986
b2 = -0.1 + 0.000033 = -0.099967
w1 = -4.5 + 0.0312 = -4.4688
w2 = 5.3 + 0.0312 = 5.3312
b3 = -0.8 + 0.0312 = -0.7688
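The whole worked example can be checked with a few lines of numpy. The hand calculation above rounds z1 and z2 to four decimals, so the tiny hidden-layer corrections come out slightly smaller here, but the output-layer numbers match:

import numpy as np

f = lambda z: 1 / (1 + np.exp(-z))       # logistic activation
eta = 0.25
x = np.array([0.0, 1.0]); t = 1.0        # input [0, 1], target XOR(0, 1) = 1

v = np.array([[-2.0, 4.3],               # v11 v12
              [ 9.2, 8.8]])              # v21 v22
bz = np.array([2.0, -0.1])               # b1, b2
w = np.array([-4.5, 5.3]); by = -0.8     # w1, w2, b3

z = f(x @ v + bz)                        # forward pass: hidden outputs
y = f(z @ w + by)                        # output ~ 0.4998

delta_y = (t - y) * y * (1 - y)          # output error term ~ 0.125
delta_z = z * (1 - z) * delta_y * w      # hidden error terms

print(eta * delta_y * z)                 # Δw1, Δw2 ~ 0.0312
print(eta * delta_y)                     # Δb3     ~ 0.0312
print(eta * np.outer(x, delta_z))        # Δv: the x1 row is zero since x1 = 0
print(eta * delta_z)                     # Δb1 (negative, tiny), Δb2 (positive, tiny)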
Multi layer Perceptron and Back Propagation

Let’s assume we are training a network to differentiate between cats and dogs. We therefore
only need two output neurons — one for each classification. We feed a cat image into the
network.
For now, imagine that each pixel of the image corresponds to one ‘input’ (we’ll see later how
we can improve on this for images). Here, it’s assigned a probability of 62% that the image is a
dog, and 38% that it’s a cat. Ideally, we want it to say this image is 100% cat.
Multi layer Perceptron and Back Propagation

So, we go backwards through the network, nudging the weights and biases to increase the
chance that the network would classify this as a cat.
Multi layer Perceptron and Back Propagation

Training the network


How do we know how wrong the network is? We measure the difference between the
network’s output and the correct output using the ‘loss function’. The loss function is
also sometimes called the error function, energy function or cost function. The goal
of training is to find the weights and biases that minimise the loss function. Our aim
is to find the lowest point of the loss function, and then see what weight values that
corresponds to. To find the lowest point, we use a technique called Gradient Descent.
Multi layer Perceptron and Back Propagation
Feature scaling
In order to make the life of gradient descent algorithms easier, there are some
techniques that can be applied to your data in the training/test phase. If the features of
your input vector are out of scale, your loss space will be stretched. This
will make gradient descent convergence harder, or at least slower. In the
example below, your input X has 2 features (house size and number of bedrooms).
The problem is that the house size feature ranges from 0...2000, while the number of
bedrooms ranges from 0...5.
Feature scaling- Normalizing the input data
Centralize data and normalize

• Consider your input data, where N is the number of samples in your input data
(batch size) and D the dimensions (in the previous example D is 2: house size,
number of bedrooms).
• The first thing to do is to subtract the mean value of the input data; this
centralizes the data dispersion around zero (i.e. the average of each input variable over
the training set should be zero).
• In the prediction phase it is common to store this mean value to be subtracted from a
test example.
• After your data is centralized around zero, you can make all features have the
same range by dividing X by its standard deviation (i.e. scale input variables so
that their covariances are about the same).
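A minimal sketch of this centralize-and-normalize recipe (toy numbers; note that μ and σ come from the training set and are reused at prediction time):

import numpy as np

# Toy design matrix: column 0 = house size, column 1 = number of bedrooms.
X = np.array([[1400.0, 3], [2000.0, 4], [800.0, 2], [1700.0, 5]])

mu = X.mean(axis=0)              # store these from the training set...
sigma = X.std(axis=0)

X_scaled = (X - mu) / sigma      # zero mean, unit standard deviation per feature

x_test = np.array([1200.0, 3])   # ...and reuse them on a test example
print((x_test - mu) / sigma)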
Multi layer Perceptron and Back Propagation
Weight initialization

• If you initialize your weights to zero, your gradient descent will never converge.

• A better idea is to initialize your weights with values close to zero (but not zero),
e.g. 0.01.
• A weight should be randomly drawn from a distribution (e.g. a uniform or normal
distribution) with mean zero and unit standard deviation. The problem with this
initialization is that the variance of the outputs grows with the number of inputs. To solve
this issue we can divide the random term by the square root of the number of inputs.
• Therefore the initial weights are close to zero, with mean zero
and standard deviation 1/√m,
where m is the number of connections into the node.
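A small numpy sketch contrasting naive initialization with the 1/√m scaling (m and the seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
m = 256                                                 # connections into the node

w_naive = rng.normal(0.0, 1.0, size=m)                  # output variance grows with m
w_scaled = rng.normal(0.0, 1.0, size=m) / np.sqrt(m)    # std ~ 1/sqrt(m)

x = rng.normal(size=m)
print(w_naive @ x)               # net input is O(sqrt(m)): easily saturates the activation
print(w_scaled @ x)              # net input stays O(1)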


Multi layer Perceptron and Back Propagation

Activation function
Multi layer Perceptron and Back Propagation
Using learning rates
Multi layer Perceptron
Multi layer Perceptron Fine tuning neural network hyper-parameters

Number of hidden layers


Multi layer Perceptron Fine tuning neural network hyper-parameters

Number of neurons per hidden layer


Multi layer Perceptron Fine tuning neural network hyper-parameters

MNIST data
Multi layer Perceptron Fine tuning neural network hyper-parameters

MNIST data

[Figure: sample MNIST digit images for the classes 0-9]
Multi layer Perceptron Fine tuning neural network hyper-parameters

Number of hidden layers and hidden neurons

XOR problem

1. What is the required number of hidden layers?
2. What is the number of hidden neurons in each hidden layer?
Multi layer Perceptron Fine tuning neural network hyper-parameters

First draw the decision boundaries

There is more than one possible decision boundary.

Any ANN is built using the single layer perceptron as a building block. The
single layer perceptron is a linear classifier which separates the classes using a
line created according to the following equation:
Multi layer Perceptron Fine tuning neural network hyper-parameters

In this case the decision boundary is replaced by a set of two lines

• Knowing that just two lines are required to represent the decision
boundary tells us that the first hidden layer will have two hidden neurons.
• So we have a single hidden layer with two hidden neurons.
• Each hidden neuron can be regarded as a linear classifier that is represented
as a line, as in the figure.
• There will be two outputs, one from each classifier (i.e. hidden neuron). But we
are to build a single classifier with one output representing the class label, not
two classifiers.
Multi layer Perceptron Fine tuning neural network hyper-parameters

Required architecture
Multi layer Perceptron Fine tuning neural network hyper-parameters

Example 2
Multi layer Perceptron Fine tuning neural network hyper-parameters

Example 2

The next step is to split the decision boundary into a set of lines, where each
line will be modeled as a perceptron in the ANN. Before drawing the lines, the
points at which the boundary changes direction should be marked, as shown in
the figure.
Multi layer Perceptron Fine tuning neural network hyper-parameters

Example 2

How many lines are required?


Multi layer Perceptron Fine tuning neural network hyper-parameters

Example 2
Multi layer Perceptron Fine tuning neural network hyper-parameters

Final architecture
Techniques to prevent overfitting
1. Hold-out (data)
Rather than using all of our data for training, we can simply split our dataset into two
sets: training and testing. A common split ratio is 80% for training and 20% for testing.
We train our model until it performs well not only on the training set but also for the
testing set. This indicates good generalization capability since the testing set
represents unseen data that were not used for training. However, this approach would
require a sufficiently large dataset to train on even after splitting.
2. Cross-validation (data)
We can split our dataset into k groups (k-fold cross-validation). We let one of the
groups be the testing set (please see the hold-out explanation) and the others be the
training set, and repeat this process until each individual group has been used as the
testing set (i.e., k repeats). Unlike hold-out, cross-validation allows all data to be
eventually used for training, but it is also more computationally expensive than hold-out.
3. Data augmentation (data)
A larger dataset would reduce overfitting. If we cannot gather more data and are
constrained to the data we have in our current dataset, we can apply data
augmentation to artificially increase the size of our dataset. For example, if we are
training for an image classification task, we can perform various image
transformations to our image dataset (e.g., flipping, rotating, rescaling, shifting).
Techniques to prevent overfitting
4. Feature selection (data)
If we have only a limited amount of training samples, each with a large number of
features, we should select only the most important features for training, so that our
model doesn't need to learn from so many features and eventually overfit. We can
simply test out different features, train individual models for these features and
evaluate generalization capabilities, or use one of the various widely used feature
selection methods.
5. L1 / L2 regularization (learning algorithm)
Regularization is a technique to constrain our network from learning a model that is
too complex, which may therefore overfit. In L1 or L2 regularization, we can add a
penalty term on the cost function to push the estimated coefficients towards zero
(and not take more extreme values). L2 regularization allows weights to decay
towards zero but not to zero, while L1 regularization allows weights to decay to zero.
Techniques to prevent overfitting
6. Remove layers / number of units per layer (model)
As mentioned in L1 or L2 regularization, an over-complex model may more likely
overfit. Therefore, we can directly reduce the model's complexity by removing layers
and reducing the size of our model. We may further reduce complexity by decreasing
the number of neurons in the fully-connected layers. We should have a model with a
complexity that sufficiently balances between underfitting and overfitting for our task.
Techniques to prevent overfitting
7. Dropout (model)
By applying dropout, which is a form of regularization, to our layers, we ignore a
subset of units of our network with a set probability. Using dropout, we can reduce
interdependent learning among units, which may have led to overfitting. However,
with dropout, we would need more epochs for our model to converge.
Techniques to prevent overfitting
8. Early stopping (model)
We can first train our model for an arbitrarily large number of epochs and plot the
validation loss graph (e.g., using hold-out). Once the validation loss begins to degrade
(e.g., stops decreasing but rather begins increasing), we stop the training and save the
current model. We can implement this either by monitoring the loss graph or by setting an
early stopping trigger. The saved model would be the optimal model for generalization
among different training epoch values.
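A framework-agnostic sketch of such a trigger; train_one_epoch, validation_loss and the model methods are hypothetical stand-ins for your own training code:

# Early stopping with a patience counter (all names below are hypothetical).
best_loss, patience, bad_epochs = float("inf"), 5, 0
best_state = None

for epoch in range(1000):                # arbitrarily large number of epochs
    train_one_epoch(model)
    loss = validation_loss(model)        # monitored on the hold-out set
    if loss < best_loss:
        best_loss, bad_epochs = loss, 0
        best_state = model.state_copy()  # save the current best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # loss stopped improving: stop training
            break

model.load_state(best_state)             # keep the model that generalized best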
