DOC-20241108-WA0006.

The document discusses the limitations of single-layer perceptrons in handling non-linear data and introduces multi-layer perceptrons (MLPs) as a solution, highlighting their architecture and training process. It explains the importance of backpropagation in updating weights to minimize error and describes various activation functions used in MLPs. Additionally, it covers gradient descent techniques for optimizing model parameters and considerations for training MLPs effectively, including the number of training examples and hidden layers.
 The single-layer perceptron classifiers discussed
previously can only deal with linearly separable sets of
patterns.
 The multi-layer networks introduced here are the most
widespread neural network architecture and can deal with
patterns that are not linearly separable.
 The main problem with single-layer perceptrons is that
they cannot capture non-linearity in a dataset and hence
do not give good results on non-linear data. This
problem can be addressed by the multi-layer perceptron,
which can perform well on non-linear datasets.
 Multilayer perceptrons (MLPs) train on a set of
input-output pairs and learn to model the correlation (or
dependencies) between those inputs and outputs.

 A multilayer perceptron (MLP) is an artificial neural
network with multiple layers of neurons between input
and output. MLPs are also called feedforward neural
networks. Feedforward means that data flows in one
direction, from the input layer to the output layer.
Typically, every neuron’s output is connected to every
neuron in the next layer. Layers that come between the
input and output layers are referred to as hidden layers.
 MLPs are widely used for pattern classification,
recognition, prediction, and approximation, and can learn
complicated patterns that are not separable using linear or
other easily described curves. The capacity of an MLP
network to learn complex patterns increases with the
number of neurons and layers.

 A multilayer perceptron (MLP) is a feedforward
artificial neural network that generates a set of
outputs from a set of inputs. An MLP is characterized
by several layers of nodes connected as a directed
graph between the input and output layers.
MLPs use backpropagation to train the network.

 Backpropagation is one of the most important concepts
in neural networks. Our task is to classify our data as
well as possible. For this, we have to update the weights
and biases of the network.
 The backpropagation algorithm calculates the gradient of
the error function with respect to the weights.
 The main feature of backpropagation is that it is an
iterative, recursive and efficient method for calculating
the weight updates, improving the network until it is
able to perform the task for which it is being trained.
 Now, how is the error function used in backpropagation,
and how does backpropagation work? Let us start with an
example and work through it mathematically to understand
exactly how backpropagation updates the weights.
 Input values
x1=0.05
x2=0.10

 Initial weights
w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55

 Bias values
b1=0.35 b2=0.60
 Target values
T1=0.01
T2=0.99
 Now, we first calculate the values of H1 and H2 by a
forward pass.
Part 1: Calculate the forward propagation error
Input layer → Hidden layer → Output layer
 To find the value of H1 (In and Out), we first multiply
the input values by the weights and add the bias:
 H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775 (In)
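The forward pass can be sketched in a few lines of Python. This is only an illustration under assumptions that the slides do not state: a sigmoid activation, the wiring w1, w2 into H1, w3, w4 into H2, w5, w6 into the first output and w7, w8 into the second, and a total error of half the sum of squared differences. Under these assumptions the total error agrees with the 0.298371109 quoted in this example.

```python
import math

def sigmoid(z):
    """Logistic activation (assumed; the slides do not name it at this step)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer (assumed wiring: w1, w2 feed H1; w3, w4 feed H2).
h1_in = x1 * w1 + x2 * w2 + b1
h2_in = x1 * w3 + x2 * w4 + b1
h1_out, h2_out = sigmoid(h1_in), sigmoid(h2_in)

# Output layer (assumed wiring: w5, w6 feed output 1; w7, w8 feed output 2).
y1_out = sigmoid(h1_out * w5 + h2_out * w6 + b2)
y2_out = sigmoid(h1_out * w7 + h2_out * w8 + b2)

# Assumed error function: half the sum of squared differences.
total_error = 0.5 * (t1 - y1_out) ** 2 + 0.5 * (t2 - y2_out) ** 2
print(h1_in, total_error)
```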
 Suppose all the weights have now been updated by
backpropagation. The total error on the network was
0.298371109 when we first fed forward the inputs 0.05
and 0.1. After the first round of backpropagation, the
total error is down to 0.291027924. After repeating this
process 10,000 times, the total error is down to
0.0000351085. At this point, the output neurons generate
0.015912196 and 0.984065734, i.e., close to our target
values of 0.01 and 0.99, when we feed forward 0.05 and
0.1.
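The whole training loop for this example can be sketched as follows. Several details are assumptions rather than something the slides state: sigmoid activations, the same 2-2-2 wiring as in the forward pass, a learning rate of 0.5, biases that are held fixed, and hidden-layer gradients computed from the output weights as they were before the update. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """Forward pass through the assumed 2-2-2 network; w holds w1..w8."""
    h = [sigmoid(x[0] * w[0] + x[1] * w[1] + b[0]),
         sigmoid(x[0] * w[2] + x[1] * w[3] + b[0])]
    o = [sigmoid(h[0] * w[4] + h[1] * w[5] + b[1]),
         sigmoid(h[0] * w[6] + h[1] * w[7] + b[1])]
    return h, o

def total_error(o, t):
    return sum(0.5 * (ti - oi) ** 2 for oi, ti in zip(o, t))

def backprop_step(x, t, w, b, lr=0.5):
    """One gradient-descent update of w1..w8 (biases held fixed: an assumption)."""
    h, o = forward(x, w, b)
    # Output deltas: derivative of the error w.r.t. each output neuron's net input.
    d_o = [(o[k] - t[k]) * o[k] * (1 - o[k]) for k in range(2)]
    # Hidden deltas: propagate the output deltas back through the old weights.
    d_h = [(d_o[0] * w[4 + j] + d_o[1] * w[6 + j]) * h[j] * (1 - h[j])
           for j in range(2)]
    grad = [d_h[0] * x[0], d_h[0] * x[1], d_h[1] * x[0], d_h[1] * x[1],
            d_o[0] * h[0], d_o[0] * h[1], d_o[1] * h[0], d_o[1] * h[1]]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

x, t = [0.05, 0.10], [0.01, 0.99]
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
b = [0.35, 0.60]

error_before = total_error(forward(x, w, b)[1], t)
w = backprop_step(x, t, w, b)
error_after_one = total_error(forward(x, w, b)[1], t)
for _ in range(9999):
    w = backprop_step(x, t, w, b)
final_outputs = forward(x, w, b)[1]
final_error = total_error(final_outputs, t)
print(error_before, error_after_one, final_error, final_outputs)
```

Because only a single input pattern is used here, the run shows the mechanics of the update rather than realistic training.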
 The MLP algorithm suggests that the weights are
initialised to small random numbers, both positive and
negative.
 If the initial weight values are large (close to 1 or -1),
then the inputs to the sigmoid are also likely to be close
to ±1, and so the output of the neuron is pushed towards
0 or 1 (the sigmoid saturates).
 If the weights are very small (close to zero), then the
input is still close to 0 and so the output of the neuron
is approximately linear, so we get a linear model.
 Choosing the size of the initial values needs a little
more thought, then. Each neuron is getting input from n
different places (either input nodes if the neuron is in
the hidden layer, or hidden neurons if it is in the output
layer).
 If we view the values of these inputs as having uniform
variance, then the typical input to the neuron will be
w√n, where w is the initialization value of the weights.
 So a common trick is to set the weights in the range
−1/√n < w < 1/√n, where n is the number of nodes in
the layer that feeds those weights.
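A minimal sketch of this initialisation rule is given below. The function name `init_weights`, the nested-list layout and the fixed seed are assumptions made for the illustration.

```python
import math
import random

def init_weights(n_in, n_out, seed=0):
    """Draw an n_in x n_out weight matrix uniformly from [-1/sqrt(n_in), 1/sqrt(n_in)]."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    limit = 1.0 / math.sqrt(n_in)      # n_in = number of nodes feeding these weights
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

weights = init_weights(n_in=16, n_out=4)   # here the limit is 1/sqrt(16) = 0.25
print(weights[0])
```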
 In neural networks, the activation function is a
mathematical “gate” between the input feeding the
current neuron and its output going to the next layer.
 Activation functions are central to neural networks.
They affect the output of a model, its accuracy, and
its computational efficiency. In some cases, activation
functions have a major effect on the model’s ability to
converge and on the convergence speed.
 The following are the most popular activation
functions in Machine Learning algorithms:

 Sigmoid (Logistic)
 Hyperbolic Tangent (Tanh)
 Rectified Linear Unit (ReLU)
 Leaky ReLU
 Parametric Leaky ReLU (PReLU)
 Exponential Linear Units (ELU)
 Scaled Exponential Linear Unit (SELU)
Sigmoid (Logistic)
 The Sigmoid function (also known as the Logistic
function) is one of the most widely used activation
functions. The function is defined as:
f(x) = 1 / (1 + exp(-x))
Hyperbolic Tangent (Tanh)
 Another very popular and widely used activation
function is the Hyperbolic Tangent, also known
as Tanh. It is defined as:
f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Rectified Linear Unit (ReLU)
 The Rectified Linear Unit (ReLU) is the most
commonly used activation function in deep learning.
The function returns 0 if the input is negative, and for
any positive input it returns that value unchanged. The
function is defined as:
f(x) = max(0, x)
Leaky ReLU
 Leaky ReLU is an improvement over the ReLU activation
function. It keeps the main properties of ReLU, and because
it has a small non-zero slope α for negative inputs, it
avoids the dying ReLU problem. Leaky ReLU is defined as:
f(x) = max(αx, x), where α is a small positive constant
less than 1
Parametric Leaky ReLU (PReLU)
 Parametric Leaky ReLU (PReLU) is a variation of
Leaky ReLU, where α is allowed to be learned during
training (instead of being a hyperparameter, it becomes a
parameter that can be modified by backpropagation like
any other parameter). This was reported to strongly
outperform ReLU on large image datasets, but on smaller
datasets it runs the risk of overfitting the training set.
Exponential Linear Unit (ELU)
 Exponential Linear Unit (ELU) is a variation
of ReLU with a smooth, non-zero output for x < 0. The
function is defined as:
f(x) = x , x > 0
= α * (exp(x) - 1) , x <= 0
Scaled Exponential Linear Unit (SELU)
 The Scaled Exponential Linear Unit (SELU) activation
function is another variation of ReLU proposed by Günter
Klambauer et al. [4] in 2017. The authors showed that
if you build a neural network composed exclusively of
a stack of dense layers, and if all hidden layers use
the SELU activation function, then the network will
self-normalize. In such networks, this activation function
can outperform other activation functions significantly.
f(x) = scale * x , x > 0
= scale * α * (exp(x) - 1) , x <= 0
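The activation functions listed here can be written as plain scalar functions, as in the sketch below. The default values α = 0.01 for Leaky ReLU and α = 1.0 for ELU are assumptions chosen for illustration, and the two SELU constants are the values commonly quoted for the SELU paper; none of these numbers appear in the slides. PReLU has the same form as Leaky ReLU, with α learned during training instead of fixed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed small positive slope for negative inputs.
    return max(alpha * x, x)

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Constants commonly quoted for SELU (assumed here, not given in the slides).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

for fn in (sigmoid, tanh, relu, leaky_relu, elu, selu):
    print(fn.__name__, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```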
 The MLP is designed to be a batch algorithm.
 All of the training examples are presented to the neural
network, the average sum-of-squares error is then
computed, and this is used to update the weights.
 Thus there is only one set of weight updates for each
epoch (pass through all the training examples). The
weights are therefore moved in the direction that most
of the inputs want them to move, rather than being
pulled around by each input individually.
 The algorithm that was described earlier was the
sequential version, where the errors are computed and
the weights updated after each input.
 This is not guaranteed to be as efficient in learning, but
it is simpler to program when using loops, and it is
therefore much more common.
 Since it does not converge as well, it can also
sometimes avoid local minima, thus potentially
reaching better solutions. While the description of the
algorithm is sequential.
 The driving force behind the learning rule is the
minimization of the network error by gradient descent
(using the derivative of the error function to make the
error smaller).
 This means that we are performing an optimization: we
are adapting the values of the weights in order to
minimize the error function.
 As should be clear by now, the way that we are doing
this is by approximating the gradient of the error and
following it downhill so that we end up at the bottom of
the slope. However, following the slope downhill only
guarantees that we end up at a local minimum, a point
that is lower than those close to it.
 Think of a ball rolling downhill: there is no guarantee
that it will have stopped at the lowest point, only at the
lowest point locally. There may be a much lower point
over the next hill, but the ball can’t see that, and it
doesn’t have enough energy to climb over the hill and
find the global minimum (have another look at Figure 4.3
to see a picture of this).
 Gradient descent works in the same way in two or more
dimensions, and has similar (and worse) problems. The
problem is that efficient downhill directions in two
dimensions and higher are harder to compute locally.
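The local-minimum problem can be seen on a one-dimensional toy function. The function f(x) = x⁴ − 3x² + x, the learning rate and the starting points below are all assumptions chosen for the illustration; this function has a shallow minimum near x ≈ 1.13 and a deeper one near x ≈ −1.30.

```python
def f(x):
    """Toy error surface with one local and one global minimum (assumed example)."""
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    # Repeatedly step against the gradient, i.e. downhill.
    for _ in range(steps):
        x -= lr * df(x)
    return x

x_right = gradient_descent(2.0)    # starts on the right-hand slope
x_left = gradient_descent(-2.0)    # starts on the left-hand slope
print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs stop where the gradient is zero, but only the run started on the left reaches the lower of the two minima, so the result depends on the starting point.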
 Let’s go back to the analogy of the ball rolling down the
hill. The reason that the ball stops rolling is that it runs
out of energy at the bottom of the dip. If we give the ball
some weight, then it will generate momentum as it rolls,
and so it is more likely to overcome a small hill on the other
side of the local minimum, and so more likely to find the
global minimum. We can implement this idea in our neural
network learning by adding in some contribution from the
previous weight change that we made to the current one. In
two dimensions it will mean that the ball rolls more directly
towards the valley bottom, since on average that will be the
correct direction, rather than being controlled by the local
changes.
 There is another benefit to momentum. It makes it
possible to use a smaller learning rate, which means
that the learning is more stable.
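A minimal sketch of the momentum update is shown below. The toy error function E(w) = w², the learning rate 0.01 and the momentum constant 0.9 are assumptions chosen for the illustration. Each weight change is the usual gradient step plus a fraction of the previous weight change.

```python
def grad(w):
    """Gradient of the toy error function E(w) = w**2 (assumed example)."""
    return 2.0 * w

def descend(w, lr, momentum, steps):
    delta = 0.0  # previous weight change
    for _ in range(steps):
        # Current change = gradient step + contribution from the previous change.
        delta = -lr * grad(w) + momentum * delta
        w += delta
    return w

w_plain = descend(1.0, lr=0.01, momentum=0.0, steps=100)
w_momentum = descend(1.0, lr=0.01, momentum=0.9, steps=100)
print(w_plain, w_momentum)
```

With the same small learning rate, the run with momentum ends much closer to the minimum at w = 0 after the same number of steps.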
 Another thing that can be added is known as weight
decay. This reduces the size of the weights as the
number of iterations increases.
 The argument goes that small weights are better since
they lead to a network that is closer to linear (since they
are close to zero, they are in the region where the
sigmoid is increasing linearly), and only those weights
that are essential to the non-linear learning should be
large.
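One simple way to realise weight decay, assumed here for illustration, is to multiply every weight by a constant slightly below 1 after each iteration. The decay constant 0.001 and the starting weights are made-up values.

```python
def apply_weight_decay(weights, decay=0.001):
    """Shrink every weight towards zero by the factor (1 - decay)."""
    return [w * (1.0 - decay) for w in weights]

initial = [0.8, -1.5, 0.05]
weights = list(initial)
for _ in range(1000):
    # A real training loop would also apply the gradient update here,
    # which is what keeps the essential weights large.
    weights = apply_weight_decay(weights)
print(weights)
```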
 In machine learning, gradient descent is an
optimization technique used for computing the model
parameters (coefficients and bias) for algorithms like
linear regression, logistic regression, neural networks,
etc. In this technique, we repeatedly iterate through the
training set and update the model parameters in
accordance with the gradient of the error computed on
the training data. Depending on the number of training
examples considered in updating the model parameters,
we have three types of gradient descent:
 Batch Gradient Descent: Parameters are updated after
computing the gradient of the error over the entire
training set.
 Stochastic Gradient Descent: Parameters are updated
after computing the gradient of the error on a single
training example.
 Mini-Batch Gradient Descent: Parameters are
updated after computing the gradient of the error on
a subset of the training set.
 Thus, mini-batch gradient descent strikes a compromise
between speed of convergence and the noise associated
with the gradient updates, which makes it a flexible and
robust algorithm.
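The three variants differ only in how many examples contribute to each update, so one loop with a `batch_size` argument can sketch all of them. The linear model, the noise-free data y = 2x + 1, the learning rate and the function name `train` are assumptions made for the illustration.

```python
import random

def train(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = w*x + b by gradient descent on half the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # new example order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errs = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient averaged over the examples in this batch only.
            grad_w = sum(e * xs[i] for e, i in zip(errs, batch)) / len(batch)
            grad_b = sum(errs) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

batch_fit = train(xs, ys, batch_size=len(xs))   # batch gradient descent
sgd_fit = train(xs, ys, batch_size=1)           # stochastic gradient descent
mini_fit = train(xs, ys, batch_size=4)          # mini-batch gradient descent
print(batch_fit, sgd_fit, mini_fit)
```

On this noise-free toy problem all three settings recover roughly w = 2 and b = 1; the differences in noise and speed only become visible on larger, noisier datasets.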
 For the MLP with one hidden layer there are (L + 1) ×
M + (M + 1) × N weights, where L, M, N are the
number of nodes in the input, hidden, and output layers,
respectively. The extra +1s come from the bias nodes,
which also have adjustable weights.
 This is a potentially huge number of adjustable
parameters that we need to set during the training
phase. Setting the values of these weights is the job of
the back-propagation algorithm, which is driven by the
errors coming from the training data.
 Clearly, the more training data there is, the better for
learning, although the time that the algorithm takes to
learn increases. Unfortunately, there is no way to
compute what the minimum amount of data required is,
since it depends on the problem.
 A rule of thumb that has been around for almost as long
as the MLP itself is that you should use a number of
training examples that is at least 10 times the number of
weights.
 This is probably going to be a very large number of
examples, so neural network training is a fairly
computationally expensive operation, because we need
to show the network all of these inputs lots of times.
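The weight count and the rule of thumb are easy to check numerically. In the sketch below the function names and the layer sizes (10 inputs, 5 hidden nodes, 3 outputs) are made up for the example.

```python
def count_weights(n_in, n_hidden, n_out):
    """(L + 1) x M + (M + 1) x N weights for one hidden layer, biases included."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def min_training_examples(n_in, n_hidden, n_out, factor=10):
    """Rule of thumb: at least `factor` training examples per weight."""
    return factor * count_weights(n_in, n_hidden, n_out)

n_weights = count_weights(10, 5, 3)            # 11 * 5 + 6 * 3
n_examples = min_training_examples(10, 5, 3)
print(n_weights, n_examples)
```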
Number of Hidden Layers
 There are two other considerations concerning the
number of weights that are inherent in the calculation
above: the choice of the number of hidden nodes, and
the number of hidden layers.
 Making these choices is obviously fundamental to the
successful application of the algorithm.
 We will shortly see a pictorial demonstration of the fact
that two hidden layers is the most that you ever need
for normal MLP learning.
 We can use the back-propagation algorithm for a
network with as many layers as we like, although it
gets progressively harder to keep track of which
weights are being updated at any given time.
 Two hidden layers are sufficient to compute localised
bump functions for different inputs, and so if the
function that we want to learn (approximate) is
continuous, the network can compute it.
When to Stop Learning
 The training of the MLP requires that the algorithm runs
over the entire dataset many times, with the weights
changing as the network makes errors in each iteration.
The question is how to decide when to stop learning,
and this is a question that we are now ready to answer.
It is unfortunate that the most obvious options are not
sufficient. Setting some predefined number N of
iterations and running until that is reached runs the risk
that the network has overfitted by then, or has not learnt
sufficiently. Only stopping when some predefined
minimum error is reached might mean that the algorithm
never terminates, or that it overfits.
 However, the validation set gives us something rather
more useful, since we can use it to monitor the
generalisation ability of the network at its current stage
of learning. If we plot the sum-of-squares error during
training, it typically reduces fairly quickly during the
first few training iterations, and then the reduction
slows down as the learning algorithm performs small
changes to find the exact local minimum. We don’t
want to stop training until the local minimum has been
found, but, as we’ve just discussed, keeping on training
too long leads to overfitting of the network.
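One common way to use the validation set for this decision, assumed here since the slides do not spell out a rule, is to stop once the validation error has not improved for a fixed number of epochs and to keep the weights from the best epoch. The `patience` parameter, the function name and the list of validation errors are made up for the sketch.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation errors: they fall at first, then rise as overfitting sets in.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.38, 0.39, 0.41, 0.44, 0.37]
stop_at = early_stopping_epoch(val_errors)
print(stop_at, val_errors[stop_at])
```

In this made-up sequence training stops before the last value is ever seen, which shows the trade-off: a larger `patience` tolerates temporary rises in the validation error but trains for longer.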