DOC-20241108-WA0006.

The document discusses the limitations of single-layer perceptrons in handling non-linear data and introduces multi-layer perceptrons (MLPs) as a solution, highlighting their architecture and training process. It explains the importance of backpropagation in updating weights to minimize error and describes various activation functions used in MLPs. Additionally, it covers gradient descent techniques for optimizing model parameters and considerations for training MLPs effectively, including the number of training examples and hidden layers.

Uploaded by

gedelasuvarnaraju47


 The single-layer perceptron classifiers discussed previously can only deal with linearly separable sets of patterns.
 The multi-layer networks introduced here are the most widespread neural network architecture and can deal with patterns that are not linearly separable.
 The main limitation of the single-layer perceptron is that it cannot capture non-linearity in a dataset, and hence it does not give good results on non-linear data. The multi-layer perceptron addresses this limitation and performs well on non-linear datasets.
 Multilayer perceptrons (MLPs) train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs.
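As a small illustration of the limitation described above, XOR is a classic pattern set that is not linearly separable. The sketch below is not taken from the slides: the step activation, the hand-picked weights and the helper names (`xor_mlp`, `single_layer_can_fit_xor`) are assumptions chosen for illustration. It shows a two-layer network computing XOR, while a brute-force search over a small grid of single-layer weights finds no solution.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


# Truth table for XOR: the target is 1 only when exactly one input is 1.
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def xor_mlp(x1, x2):
    """Two-layer network with hand-picked (not learned) weights."""
    # Hidden neuron 1 behaves like OR, hidden neuron 2 behaves like AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output neuron: "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)


def single_layer_can_fit_xor(grid):
    """Try every (w1, w2, b) on the grid for a single threshold neuron."""
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                if all(step(w1 * a + w2 * c + b) == target
                       for (a, c), target in XOR_TABLE.items()):
                    return True
    return False


# Illustrative grid of candidate weights from -3.0 to 3.0 in steps of 0.5.
GRID = [i / 2 for i in range(-6, 7)]
```

The grid search is only a demonstration, not a proof. The general result is that no single line can separate the XOR classes, which is why a hidden layer is needed.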

 A multilayer perceptron (MLP) is an artificial neural network with multiple layers of neurons between input and output. MLPs are also called feedforward neural networks. Feedforward means that data flows in one direction, from the input layer to the output layer. Typically, every neuron’s output is connected to every neuron in the next layer. Layers that come between the input and output layers are referred to as hidden layers.
 MLPs are widely used for pattern classification, recognition, prediction, and approximation, and can learn complicated patterns that are not separable using lines or other easily articulated curves. The capacity of an MLP to learn complicated patterns (complex problems) increases with the number of neurons and layers.
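The feedforward flow described above can be sketched as a loop over fully connected layers. This is a minimal sketch, assuming sigmoid activations in every layer and a hypothetical representation in which each layer is a `(weights, biases)` pair. Neither choice is prescribed by the slides.

```python
import math


def sigmoid(z):
    """Logistic activation, used here for every neuron (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Propagate the input vector x through fully connected layers.

    Each layer is a (weights, biases) pair, where weights[j][i] connects
    input i of that layer to neuron j of that layer.
    """
    activation = list(x)
    for weights, biases in layers:
        # Every neuron sees every output of the previous layer.
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation
```

For example, `forward([0.05, 0.10], layers)` with one hidden layer and one output layer returns one value per output neuron. Adding more `(weights, biases)` pairs to `layers` adds more hidden layers without changing the code.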

 A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. MLPs use backpropagation to train the network.

 Backpropagation is one of the important concepts of a neural network. Our task is to classify our data as well as possible. For this, we have to update the weights and biases.
 The backpropagation algorithm calculates the gradient of the error function with respect to the weights.
 The main feature of backpropagation is the iterative, recursive and efficient method through which it calculates the updated weights, improving the network until it is able to perform the task for which it is being trained.
 Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through it mathematically to understand exactly how the weights are updated using backpropagation.
 Input values
x1 = 0.05
x2 = 0.10

 Initial weights
w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55

 Bias values
b1 = 0.35    b2 = 0.60
 Target values
T1 = 0.01
T2 = 0.99
 Now, we first calculate the values of H1 and H2 by a forward pass.
Part 1: Calculate Forward Propagation Error
Input layer → Hidden layer → Output layer
 To find the value of H1 (In and Out), we first multiply the input values by the weights and add the bias:
 H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775 (In)
 We have updated all the weights. We found a total error of 0.298371109 on the network when we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets of 0.01 and 0.99.
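The worked example can be sketched end to end in a few lines. The slides only show the H1 computation, so several details below are assumptions: w3 and w4 feed H2, w5 and w6 feed the first output, w7 and w8 feed the second output, b1 is shared by the hidden neurons and b2 by the output neurons, all neurons use the sigmoid, the error is the sum of ½(target − output)², the learning rate is 0.5, and the biases are left unchanged. Under these assumptions the sketch is expected to reproduce the total errors quoted above (0.298371109 before the first update and 0.291027924 after it).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


X = [0.05, 0.10]          # inputs x1, x2
T = [0.01, 0.99]          # targets T1, T2
B1, B2 = 0.35, 0.60       # biases (kept fixed in this sketch)
LEARNING_RATE = 0.5       # assumed value, not stated in the slides
INITIAL_W = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]  # w1..w8


def forward(w):
    """Return hidden outputs, network outputs and total squared error."""
    h = [sigmoid(w[0] * X[0] + w[1] * X[1] + B1),
         sigmoid(w[2] * X[0] + w[3] * X[1] + B1)]
    o = [sigmoid(w[4] * h[0] + w[5] * h[1] + B2),
         sigmoid(w[6] * h[0] + w[7] * h[1] + B2)]
    error = sum(0.5 * (t - out) ** 2 for t, out in zip(T, o))
    return h, o, error


def backprop_step(w):
    """One round of backpropagation: return the eight updated weights."""
    h, o, _ = forward(w)
    # Error signal at each output neuron: dE/d(net input of output k).
    d_out = [(o[k] - T[k]) * o[k] * (1.0 - o[k]) for k in range(2)]
    # Error signal at each hidden neuron, using the old output weights.
    d_hid = [(d_out[0] * w[4 + j] + d_out[1] * w[6 + j]) * h[j] * (1.0 - h[j])
             for j in range(2)]
    grad = [d_hid[0] * X[0], d_hid[0] * X[1],
            d_hid[1] * X[0], d_hid[1] * X[1],
            d_out[0] * h[0], d_out[0] * h[1],
            d_out[1] * h[0], d_out[1] * h[1]]
    return [wi - LEARNING_RATE * g for wi, g in zip(w, grad)]


error_before = forward(INITIAL_W)[2]
weights_after_one_round = backprop_step(INITIAL_W)
error_after = forward(weights_after_one_round)[2]
```

Calling `backprop_step` repeatedly in a loop corresponds to the repeated rounds described in the text.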
 The MLP algorithm suggests that the weights are initialised to small random numbers, both positive and negative.
 If the initial weight values are large (close to 1 or -1), then the inputs to the sigmoid are also likely to be large in magnitude (close to ±1 or beyond), and so the output of the neuron saturates towards 0 or 1.
 If the weights are very small (close to zero), then the input is still close to 0, where the sigmoid is approximately linear, so we get a linear model.
 Choosing the size of the initial values therefore needs a little more thought. Each neuron receives input from n different places (either input nodes if the neuron is in the hidden layer, or hidden neurons if it is in the output layer).
 If we view the values of these inputs as having uniform variance, then the typical input to the neuron will be w√n, where w is the initialisation value of the weights.
 So a common trick is to set the weights in the range −1/√n < w < 1/√n, where n is the number of nodes in the layer that feeds into those weights.
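The initialisation range described above can be sketched as follows. The function name `init_weights` and the choice of a uniform distribution within the range are assumptions made for illustration.

```python
import math
import random


def init_weights(n_inputs, n_neurons, rng=random):
    """Draw each weight uniformly between -1/sqrt(n) and 1/sqrt(n).

    n_inputs is n, the number of nodes feeding into each neuron;
    the result has one row of n_inputs weights per neuron.
    """
    limit = 1.0 / math.sqrt(n_inputs)
    return [[rng.uniform(-limit, limit) for _ in range(n_inputs)]
            for _ in range(n_neurons)]
```

For a layer with 16 inputs, every weight then lies within ±0.25. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.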
 In neural networks, the activation function is a mathematical “gate” between the input feeding the current neuron and its output going to the next layer.
 Activation functions are at the very core of neural networks. They determine the output of a model and influence its accuracy and computational efficiency. In some cases, activation functions have a major effect on the model’s ability to converge and on the convergence speed.
 The following are the most popular activation functions in machine learning algorithms:

 Sigmoid (Logistic)
 Hyperbolic Tangent (Tanh)
 Rectified Linear Unit (ReLU)
 Leaky ReLU
 Parametric Leaky ReLU (PReLU)
 Exponential Linear Units (ELU)
 Scaled Exponential Linear Unit (SELU)
Sigmoid (Logistic)
 The sigmoid function (also known as the logistic function) is one of the most widely used activation functions. The function is defined as:
f(x) = 1 / (1 + exp(-x))
Hyperbolic Tangent (Tanh)
 Another very popular and widely used activation function is the hyperbolic tangent, also known as tanh. It is defined as:
f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Rectified Linear Unit (ReLU)
 The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning. The function returns 0 if the input is negative, and returns the input unchanged if it is positive. The function is defined as:
f(x) = max(0, x)
Leaky ReLU
 Leaky ReLU is an improvement over the ReLU activation function. It has all the properties of ReLU, but it avoids the dying ReLU problem because it keeps a small non-zero slope α for negative inputs. Leaky ReLU is defined as:
f(x) = max(αx, x)
Parametric Leaky ReLU (PReLU)
 Parametric leaky ReLU (PReLU) is a variation of Leaky ReLU in which α is learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
Exponential Linear Unit (ELU)
 The Exponential Linear Unit (ELU) is a variation of ReLU with a better output for x < 0. The function is defined as:
f(x) = x , x > 0
= α * (exp(x) - 1) , x <= 0
Scaled Exponential Linear Unit (SELU)
 The Scaled Exponential Linear Unit (SELU) activation function is another variation of ReLU, proposed by Günter Klambauer et al. [4] in 2017. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize. In such networks, this activation function often outperforms other activation functions significantly.
f(x) = scale * x , x > 0
= scale * α * (exp(x) - 1) , x <= 0
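The activation functions listed above can be written as short scalar functions. This is a minimal sketch using only the standard library. The default α values for Leaky ReLU and ELU are illustrative assumptions, and the two SELU constants are the commonly quoted values from the SELU paper rather than numbers given in the slides.

```python
import math

# Commonly quoted SELU constants (not stated in the slides).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Valid for 0 < alpha < 1. PReLU uses the same formula, but alpha is
    # learned by backpropagation instead of being fixed in advance.
    return max(alpha * x, x)


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))
```

For negative inputs, `relu` returns exactly 0, while `leaky_relu`, `elu` and `selu` return small negative values, which is what keeps their gradients from vanishing entirely there.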
 The MLP is designed to be a batch algorithm.
 All of the training examples are presented to the neural network, the average sum-of-squares error is then computed, and this is used to update the weights.
 Thus there is only one set of weight updates for each epoch (pass through all the training examples). This means that we only update the weights once for each iteration of the algorithm, so the weights are moved in the direction that most of the inputs want them to move, rather than being pulled around by each input individually.
 The algorithm that was described earlier was the sequential version, where the errors are computed and the weights updated after each input.
 This is not guaranteed to be as efficient in learning, but it is simpler to program when using loops, and it is therefore much more common.
 Since it does not converge as well, it can also sometimes avoid local minima, thus potentially reaching better solutions.
 The driving force behind the learning rule is the minimization of the network error by gradient descent (using the derivative of the error function to make the error smaller).
 This means that we are performing an optimization: we are adapting the values of the weights in order to minimize the error function.
 As should be clear by now, we do this by approximating the gradient of the error and following it downhill until we end up at the bottom of the slope. However, following the slope downhill only guarantees that we end up at a local minimum, a point that is lower than those close to it.
 Think of a ball rolling downhill: there is no guarantee that it will have stopped at the lowest point overall, only at the lowest point locally. There may be a much lower point over the next hill, but the ball can’t see that, and it doesn’t have enough energy to climb over the hill and find the global minimum (have another look at Figure 4.3 to see a picture of this).
 Gradient descent works in the same way in two or more dimensions, and has similar (and worse) problems: efficient downhill directions in two dimensions and higher are harder to compute locally.
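The local-minimum behaviour can be seen on a one-dimensional example. The function below is a hypothetical error curve chosen for illustration (it does not come from the slides). It has a shallower minimum near x ≈ 1.13 and a deeper one near x ≈ −1.30, and plain gradient descent ends up in whichever one lies downhill from its starting point.

```python
def f(x):
    """Illustrative error curve with two minima (an assumed example)."""
    return x ** 4 - 3 * x ** 2 + x


def df(x):
    """Derivative of f, used as the gradient."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=1000):
    """Repeatedly step against the gradient, starting from x."""
    for _ in range(steps):
        x -= lr * df(x)
    return x


x_from_right = gradient_descent(2.0)    # settles in the shallower minimum
x_from_left = gradient_descent(-2.0)    # settles in the deeper minimum
```

Both runs stop where the gradient is close to zero, but only the run started on the left reaches the lower of the two points.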
 Let’s go back to the analogy of the ball rolling down the hill. The ball stops rolling because it runs out of energy at the bottom of the dip. If we give the ball some weight, then it will generate momentum as it rolls, and so it is more likely to overcome a small hill on the other side of the local minimum, and so more likely to find the global minimum. We can implement this idea in our neural network learning by adding some contribution from the previous weight change to the current one. In two dimensions it will mean that the ball rolls more directly towards the valley bottom, since on average that will be the correct direction, rather than being controlled by the local changes.
 There is another benefit to momentum. It makes it possible to use a smaller learning rate, which means that the learning is more stable.
 Another thing that can be added is known as weight decay. This reduces the size of the weights as the number of iterations increases.
 The argument goes that small weights are better since they lead to a network that is closer to linear (since they are close to zero, they are in the region where the sigmoid is increasing linearly), and only those weights that are essential to the non-linear learning should be large.
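Momentum and weight decay both amount to small changes in the weight update. The sketch below is one possible formulation, not necessarily the one the slides intend: the new change is the gradient step plus a fraction `momentum` of the previous change, and weight decay is implemented by shrinking the weight by a factor `(1 - decay)` at every update. The function names and default values are assumptions.

```python
def update_weight(w, grad, prev_delta, lr=0.1, momentum=0.9, decay=0.0):
    """Return the new weight and the change that was applied.

    delta   : gradient step plus a contribution from the previous change
    decay   : fraction by which the weight is shrunk on every update
    """
    delta = -lr * grad + momentum * prev_delta
    new_w = (1.0 - decay) * w + delta
    return new_w, delta


def minimise_quadratic(w=5.0, steps=300):
    """Minimise the illustrative error E(w) = w**2 using momentum."""
    delta = 0.0
    for _ in range(steps):
        grad = 2.0 * w                      # dE/dw for E(w) = w**2
        w, delta = update_weight(w, grad, delta)
    return w
```

The caller has to carry `delta` from one update to the next, since it is the memory that gives the ball its momentum.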
 In machine learning, gradient descent is an optimization technique used for computing the model parameters (coefficients and bias) for algorithms like linear regression, logistic regression, neural networks, etc. In this technique, we repeatedly iterate through the training set and update the model parameters in accordance with the gradient of the error computed on the training data. Depending on the number of training examples considered in updating the model parameters, we have three types of gradient descent:
 Batch Gradient Descent: Parameters are updated after computing the gradient of the error over the entire training set.
 Stochastic Gradient Descent: Parameters are updated after computing the gradient of the error for a single training example.
 Mini-Batch Gradient Descent: Parameters are updated after computing the gradient of the error over a subset of the training set.
 Thus, mini-batch gradient descent makes a compromise between speedy convergence and the noise associated with the gradient updates, which makes it a more flexible and robust algorithm.
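The three variants differ only in how many examples contribute to each update, so one function with a `batch_size` parameter can cover all of them. The sketch below fits a one-variable linear model y = w·x + b to noise-free data. The model, the data, the learning rate and the function name are assumptions chosen to keep the example small.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = w*x + b by minimising the mean of 0.5 * error**2.

    batch_size == len(xs) : batch gradient descent
    batch_size == 1       : stochastic gradient descent
    anything in between   : mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                 # new example order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Average gradient over the examples in this batch only.
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Illustrative noise-free data generated from y = 2x + 1.
XS = [i / 10 for i in range(20)]
YS = [2.0 * x + 1.0 for x in XS]
```

With `batch_size=len(XS)` there is one update per epoch, with `batch_size=1` there are twenty, and with `batch_size=4` there are five. On noisy data the smaller batches would give noisier but more frequent updates.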
 For the MLP with one hidden layer there are (L + 1) ×
M + (M + 1) × N weights, where L, M, N are the
number of nodes in the input, hidden, and output layers,
respectively. The extra +1s come from the bias nodes,
which also have adjustable weights.
 This is a potentially huge number of adjustable
parameters that we need to set during the training
phase. Setting the values of these weights is the job of
the back-propagation algorithm, which is driven by the
errors coming from the training data.
 Clearly, the more training data there is, the better for
learning, although the time that the algorithm takes to
learn increases. Unfortunately, there is no way to
compute what the minimum amount of data required is,
since it depends on the problem.
 A rule of thumb that has been around for almost as long
as the MLP itself is that you should use a number of
training examples that is at least 10 times the number of
weights.
 This is probably going to be a very large number of
examples, so neural network training is a fairly
computationally expensive operation, because we need
to show the network all of these inputs lots of times.
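The weight count and the rule of thumb above are simple arithmetic, sketched here as two helper functions. The function names and the example layer sizes are illustrative assumptions.

```python
def count_weights(n_in, n_hidden, n_out):
    """Weights in a one-hidden-layer MLP: (L + 1) * M + (M + 1) * N."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def suggested_min_examples(n_in, n_hidden, n_out, factor=10):
    """Rule of thumb: at least `factor` training examples per weight."""
    return factor * count_weights(n_in, n_hidden, n_out)
```

For instance, a network with 784 inputs, 100 hidden nodes and 10 outputs has 785 × 100 + 101 × 10 = 79,510 weights, so the rule of thumb suggests roughly 795,100 training examples.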
Number of Hidden Layers
 There are two other considerations concerning the number of weights that are inherent in the calculation above: the choice of the number of hidden nodes, and the number of hidden layers.
 Making these choices is obviously fundamental to the
successful application of the algorithm.
 We will shortly see a pictorial demonstration of the fact
that two hidden layers is the most that you ever need
for normal MLP learning.
 We can use the back-propagation algorithm for a
network with as many layers as we like, although it
gets progressively harder to keep track of which
weights are being updated at any given time.
 Two hidden layers are sufficient to compute these
bump functions for different inputs, and so if the
function that we want to learn (approximate) is
continuous, the network can compute it.
When to Stop Learning
 The training of the MLP requires that the algorithm runs over the entire dataset many times, with the weights changing as the network makes errors in each iteration. The question is how to decide when to stop learning, and this is a question that we are now ready to answer. Unfortunately, the most obvious options are not sufficient. Setting some predefined number N of iterations and running until that is reached runs the risk that the network has overfitted by then, or has not learnt sufficiently. Stopping only when some predefined minimum error is reached might mean that the algorithm never terminates, or that it overfits.
 However, the validation set gives us something rather more useful, since we can use it to monitor the generalisation ability of the network at its current stage of learning. If we plot the sum-of-squares error during training, it typically reduces fairly quickly during the first few training iterations, and then the reduction slows down as the learning algorithm performs small changes to find the exact local minimum. We don’t want to stop training until the local minimum has been found, but, as we’ve just discussed, keeping on training too long leads to overfitting of the network.
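One common way to use the validation set for this purpose is early stopping: training stops once the validation error has failed to improve for a few consecutive checks. The sketch below assumes that the validation error is recorded once per epoch, and it introduces a hypothetical `patience` parameter. Neither detail is specified in the slides.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    best_epoch : index of the lowest validation error seen before stopping
    stop_epoch : index at which training would be halted
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1
```

In practice the weights saved at `best_epoch` would be the ones kept, since later epochs have started to fit the training data at the expense of generalisation.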
