
UNIT V (1)

The document provides an overview of neural networks, focusing on perceptrons and multilayer perceptrons, including their structure, activation functions, and training methods such as gradient descent and backpropagation. It discusses the limitations of single-layer perceptrons in handling non-linearly separable problems, the importance of activation functions in introducing non-linearity, and the optimization techniques used in training neural networks. Key concepts such as stochastic gradient descent, weight adjustment, and the role of hidden layers in enhancing representational power are also covered.

Uploaded by

avins0204
Copyright
© © All Rights Reserved

UNIT V Neural Networks

Perceptron - Multilayer perceptron, activation functions, network training - gradient descent
optimization - stochastic gradient descent, error backpropagation, from shallow networks
to deep networks - Unit saturation (aka the vanishing gradient problem) - ReLU,
hyperparameter tuning, batch normalization, regularization, dropout.
Perceptron
• The perceptron is a feed-forward network with one output neuron that learns a separating
hyper-plane in a pattern space.
• The "n" linear Fx neurons feed forward to one threshold output Fy neuron. The perceptron
separates a linearly separable set of patterns.

Single Layer Perceptron


• The single layer perceptron (SLP) is the simplest type of artificial neural network and can only
classify linearly separable cases with a binary target (1, 0).
• We can connect any number of McCulloch-Pitts neurons together in any way we like. An
arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output
layer of McCulloch-Pitts neurons is known as a Perceptron.
• A single layer feed-forward network consists of one or more output neurons, each of which
is connected with a weighting factor Wij to all of the inputs Xi.
• The Perceptron is a kind of single-layer artificial network with only one neuron. The
Perceptron is a network in which the neuron unit calculates the linear combination of its
real-valued or boolean inputs and passes it through a threshold activation function. Fig. shows
the Perceptron.

• The Perceptron is sometimes referred to as a Threshold Logic Unit (TLU) since it
discriminates the data depending on whether the sum is greater than the threshold value.
• In the simplest case the network has only two inputs and a single output. The output of the
neuron is:
$y = f\left(\sum_{i=1}^{2} W_i X_i + b\right)$
• Suppose that the activation function is a threshold; then
$f(s) = \begin{cases} 1 & \text{if } s > 0 \\ -1 & \text{if } s < 0 \end{cases}$
• The Perceptron can represent most of the primitive boolean functions (AND, OR, NAND
and NOR) but cannot represent XOR.
• In a single layer perceptron, initial weight values are assigned randomly because the network
has no previous knowledge. It sums all the weighted inputs. If the sum is greater than the
threshold value then it is activated, i.e. output = 1.
Output
W1X1 + W2X2 +...+ WnXn > 0 ⇒ 1
W1X1 + W2X2 +...+ WnXn ≤ 0 ⇒ 0
• The input values are presented to the perceptron, and if the predicted output is the same as
the desired output, then the performance is considered satisfactory and no changes to the
weights are made.
• If the output does not match the desired output, then the weights need to be changed to
reduce the error.
• The weight adjustment is done as follows:
∆W = η × d × x
Where
x = Input data
d = Difference between the desired output and the predicted output
η = Learning rate
• If the output of the perceptron is correct then we do not take any action. If the output is
incorrect then the weight vector is updated as W → W + ∆W.
• The process of weight adaptation is called learning.
• Perceptron Learning Algorithm:
1. Select a random sample from the training set as input.
2. If the classification is correct, do nothing.
3. If the classification is incorrect, modify the weight vector W using
$W_i = W_i + \eta\, d(n)\, X_i(n)$
Repeat this procedure until the entire training set is classified correctly.
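
The learning algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the learning rate of 0.1, the initial weight range and the 0/1 threshold convention are all assumptions chosen for the example, and the bias is treated as one more weight whose input is always 1.

```python
import random

def step(s):
    """Threshold activation: output 1 if the weighted sum is positive, else 0."""
    return 1 if s > 0 else 0

def predict(w, b, x):
    """Weighted sum of the inputs plus bias, passed through the threshold."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=0):
    """Perceptron learning rule: W_i = W_i + eta * d * X_i on each mistake."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Random initial weights, since the network has no prior knowledge.
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        order = list(samples)
        rng.shuffle(order)                   # step 1: visit samples in random order
        for x, desired in order:
            d = desired - predict(w, b, x)   # 0 when the classification is correct
            if d != 0:                       # step 3: adjust only on a mistake
                w = [wi + eta * d * xi for wi, xi in zip(w, x)]
                b += eta * d                 # bias acts as a weight with input 1
                mistakes += 1
        if mistakes == 0:                    # whole training set classified correctly
            break
    return w, b

# Logical AND is linearly separable, so the rule is expected to converge.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print([predict(w, b, x) for x, _ in AND])
```

Replacing the training set with XOR would never reach a mistake-free epoch, which is the limitation discussed in the next sections.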
Multilayer Perceptron
• A multi-layer perceptron (MLP) has the same structure as a single layer perceptron, but with one
or more hidden layers. An MLP is a network of simple neurons called perceptrons.
• A typical multilayer perceptron network consists of a set of source nodes forming the input
layer, one or more hidden layers of computation nodes, and an output layer of nodes.
• It is not possible to find weights which enable single layer perceptrons to deal with
non-linearly separable problems like XOR.

Limitation of Learning in Perceptron: linear separability


• Consider two-input patterns (X1, X2) being classified into two classes as shown in Fig.
10.1.3. Each point, marked with the symbol x or o, represents a pattern with a set of values
(X1, X2).

• Each pattern is classified into one of two classes. Notice that these classes can be separated
with a single line L. They are known as linearly separable patterns.
• Linear separability refers to the fact that classes of patterns with n-dimensional vector x =
(x1, x2, …, xn) can be separated with a single decision surface. In the case above, the line L
represents the decision surface.
• If two classes of patterns can be separated by a decision boundary represented by a linear
equation, then they are said to be linearly separable. The simple network can correctly classify
any such patterns.
• The decision boundary (i.e., W, b or θ) of linearly separable classes can be determined either by
some learning procedure or by solving linear equation systems based on representative
patterns of each class.
• If such a decision boundary does not exist, then the two classes are said to be linearly
inseparable.
• Linearly inseparable problems cannot be solved by the simple network; a more sophisticated
architecture is needed.
• Examples of linearly separable classes
1. Logical AND function
2. Logical OR function
• Examples of linearly inseparable classes
1. Logical XOR (exclusive OR) function

No line can separate these two classes, as can be seen from the fact that the linear
inequality system for the four XOR patterns has no solution: adding inequalities (1) and (4)
gives b < 0, while adding (2) and (3) gives b >= 0, which is a contradiction.
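
The inequality system itself is not shown in the text. One reconstruction that reproduces the stated contradiction is sketched below. It assumes bipolar inputs and targets (-1, +1), weights $w_1$, $w_2$, bias $b$, and a unit that outputs +1 when $w_1 x_1 + w_2 x_2 + b \ge 0$; these conventions are assumptions chosen to match the argument, not taken from the source.

```latex
\begin{aligned}
(-1,-1) \mapsto -1 &: \quad -w_1 - w_2 + b < 0   \quad (1)\\
(-1,+1) \mapsto +1 &: \quad -w_1 + w_2 + b \ge 0 \quad (2)\\
(+1,-1) \mapsto +1 &: \quad  w_1 - w_2 + b \ge 0 \quad (3)\\
(+1,+1) \mapsto -1 &: \quad  w_1 + w_2 + b < 0   \quad (4)\\[4pt]
(1) + (4) &: \quad 2b < 0 \;\Rightarrow\; b < 0\\
(2) + (3) &: \quad 2b \ge 0 \;\Rightarrow\; b \ge 0
\end{aligned}
```

Since $b$ cannot be both negative and non-negative, no choice of $w_1$, $w_2$, $b$ satisfies all four inequalities.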
Activation Functions
• Activation functions, also known as transfer functions, are used to map input nodes to output
nodes in a certain fashion.
• The activation function is one of the most important factors in a neural network: it decides
whether or not a neuron will be activated and its output transferred to the next layer.
• Activation functions help in normalizing the output to the range 0 to 1 or -1 to 1. Their
differentiability helps in the process of back propagation. During back propagation the
weights are updated to reduce the loss, and the derivative of the activation function is what
lets gradient descent move toward a minimum of the loss.
• In any neural network, the activation function basically decides whether the given input or
received information is relevant or irrelevant.
• These activation functions give the multilayer network greater representational
power than a single layer network only when non-linearity is introduced.
• The input to the activation function is the sum, which is defined by the following equation.
$\text{Sum} = I_1 W_1 + I_2 W_2 + \dots + I_n W_n + b = \sum_{j=1}^{n} I_j W_j + b$
• Activation Function: Logistic Functions

• The logistic function monotonically increases from a lower limit (0 or -1) to an upper limit (+1)
as the sum increases. In the 0-to-1 form, values vary between 0 and 1, with a value of 0.5 when
the sum is zero.
• Activation Function: Arc Tangent
• Activation Function: Hyperbolic Tangent

Identity or Linear Activation Function


• A linear activation is a mathematical equation used for obtaining output vectors with
specific properties.
• It is a simple straight line activation function where our function is directly proportional to
weighted sum of neurons or input.
• Linear activation functions give a wide range of activations, and a line with a
positive slope increases the firing rate as the input rate increases.
• Fig. shows identity function.
• The equation for the linear activation function is:
f(x) = a·x
When a = 1 then f(x) = x, and this special case is known as the identity.
• Properties:
1. Range is -infinity to +infinity.
2. Provides a convex error surface so optimisation can be achieved faster.
3. df(x)/dx = a, which is constant, so the gradient gives gradient descent no information about the input.
• Limitations:
1. Since the derivative is constant, the gradient has no relation with the input.
2. During back propagation the gradient is constant, so the weight change does not depend on the input x.
3. A purely linear activation function does not work well in neural networks in practice, because a
stack of linear layers is equivalent to a single linear layer.

Sigmoid
• A sigmoid function produces a curve with an "S" shape. The example sigmoid function
shown on the left is a special case of the logistic function, which models the growth of some
set.
$\mathrm{sig}(t) = \frac{1}{1 + e^{-t}}$
• In general, a sigmoid function is real-valued and differentiable, with a first derivative that is
non-negative everywhere or non-positive everywhere, and exactly one inflection point.
• The logistic sigmoid function is related to the hyperbolic tangent as follows:
$1 - 2\,\mathrm{sig}(x) = 1 - \frac{2}{1 + e^{-x}} = -\tanh\frac{x}{2}$
• Sigmoid functions are often used in artificial neural networks to introduce nonlinearity in
the model.
• A neural network element computes a linear combination of its input signals, and applies a
sigmoid function to the result.
• A reason for its popularity in neural networks is that the derivative of the sigmoid function can
be written in terms of the function itself, which makes it computationally easy to evaluate:
$\frac{d}{dt}\,\mathrm{sig}(t) = \mathrm{sig}(t)\,(1 - \mathrm{sig}(t))$
• Derivatives of the sigmoid function are usually employed in learning algorithms.
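
The two sigmoid identities above can be checked numerically. The sketch below uses only the Python standard library; the helper names `sig`, `sig_prime` and `numeric_derivative` are made up for this illustration.

```python
import math

def sig(t):
    """Logistic sigmoid: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def sig_prime(t):
    """Derivative written in terms of the function itself: sig(t) * (1 - sig(t))."""
    s = sig(t)
    return s * (1.0 - s)

def numeric_derivative(f, t, h=1e-6):
    """Central finite difference, used only to cross-check the closed form."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

for t in (-2.0, 0.0, 1.5):
    # Columns: t, closed-form derivative, finite-difference derivative,
    # and the two sides of the identity 1 - 2*sig(t) = -tanh(t/2).
    print(t, sig_prime(t), numeric_derivative(sig, t),
          1.0 - 2.0 * sig(t), -math.tanh(t / 2.0))
```

In each row the second and third columns should agree closely, as should the fourth and fifth.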

Gradient Descent Optimization


• Gradient Descent is an optimization algorithm in machine learning used to minimize a function
by iteratively moving towards the minimum value of the function.
• We essentially use this algorithm when we have to find the smallest possible values that
satisfy a given cost function. In machine learning, more often than not we try
to minimize loss functions (like Mean Squared Error). By minimizing the loss function, we
can improve our model, and Gradient Descent is one of the most popular algorithms used for
this purpose.
• The graph above shows how exactly a Gradient Descent algorithm works.
• We first take a point on the cost function and begin moving in steps in the direction of the
minimum point. The size of that step, or how quickly we converge to the minimum
point, is defined by the Learning Rate. We can cover more area with a higher learning rate but at
the risk of overshooting the minimum. On the other hand, small steps (smaller learning
rates) will consume a lot of time to reach the lowest point.
• Now, the direction in which the algorithm has to move (closer to the minimum) is also important.
We calculate this by using derivatives. You need to be familiar with derivatives from
calculus. A derivative is basically calculated as the slope of the graph at any specific
point. We get that by finding the tangent line to the graph at that point. A steeper
tangent suggests that more steps are needed to reach the minimum point;
a less steep tangent suggests fewer steps are required to reach the minimum point.
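
As a minimal sketch of the procedure described above, the code below minimizes the one-dimensional cost function f(w) = (w - 3)^2. The cost function, starting point, learning rate and step count are assumptions picked for illustration only.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def cost(w):
    """Example cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_grad(w):
    """Derivative (slope of the tangent) of the example cost function."""
    return 2.0 * (w - 3.0)

w_min = gradient_descent(cost_grad, w0=0.0)
print(w_min, cost(w_min))
```

Each step moves `w` a fraction of the way toward the minimum; the slope shrinks as the minimum is approached, so the steps shrink as well.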

Stochastic Gradient Descent


• The word 'stochastic' means a system or a process that is linked with a random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the
whole data set for each iteration.
• Stochastic Gradient Descent (SGD) is a type of gradient descent that runs one training
example per iteration. Within each training epoch it processes every example in the dataset and
updates the parameters after each example, one at a time.
• As it requires only one training example at a time, it is easier to fit in the allocated
memory. However, it loses some computational efficiency in comparison to batch
gradient descent, because its frequent updates add overhead.
• Further, due to the frequent updates, the gradient estimate is noisy. However, this noise
can sometimes help in escaping a local minimum and finding the global minimum.
• Advantages of Stochastic gradient descent:
a) It fits more easily in the available memory.
b) Each update is faster to compute than in batch gradient descent.
c) It is more efficient for large datasets.
• Disadvantages of Stochastic Gradient Descent:
a) SGD requires a number of hyper parameters such as the regularization parameter and the
number of iterations.
b) SGD is sensitive to feature scaling.
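
A small sketch of SGD, assuming a linear model y = w·x + b trained with squared error on noise-free data generated from y = 2x + 1. The data, learning rate and epoch count are illustrative assumptions; the point is that the parameters are updated after every single example rather than after a pass over the whole dataset.

```python
import random

def sgd_linear_regression(data, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by updating after each individual training example."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                    # random order: the "stochastic" part
        for x, y in data:
            error = (w * x + b) - y          # prediction error on one example
            w -= learning_rate * error * x   # gradient of 0.5 * error^2 w.r.t. w
            b -= learning_rate * error       # gradient of 0.5 * error^2 w.r.t. b
    return w, b

# Eleven points on the line y = 2x + 1, with x already scaled to [0, 1].
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(11)]
w, b = sgd_linear_regression(data)
print(w, b)
```

The inputs are kept in [0, 1] on purpose: as noted above, SGD is sensitive to feature scaling, and unscaled inputs would call for a much smaller learning rate.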
Error Backpropagation
• Backpropagation is a training method used for a multi-layer neural network. It is also called
the generalized delta rule. It is a gradient descent method which minimizes the total squared
error of the output computed by the net.
• The backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize
the error function are then considered to be a solution to the learning problem.
• Backpropagation is a systematic method for training multiple layer ANN. It is a
generalization of the Widrow-Hoff error correction rule. About 80 % of ANN applications use
backpropagation.
• The figure below shows a backpropagation network.
• Consider a simple neuron:
a. The neuron has a summing junction and an activation function.
b. Any non-linear function which is differentiable everywhere and increases everywhere with
the sum can be used as an activation function.
c. Examples: Logistic function, Arc tangent function, Hyperbolic tangent activation function.
• These activation functions give the multilayer network greater representational
power than a single layer network only when non-linearity is introduced.
• Need of hidden layers:
1. A network with only two layers (input and output) can only represent the input with
whatever representation already exists in the input data.
2. If the data is discontinuous or non-linearly separable, the innate representation is
inconsistent, and the mapping cannot be learned using two layers (Input and Output).
3. Therefore, hidden layer(s) are used between input and output layers
• Weights connect units (neurons) in one layer only to those in the next higher layer. The
output of the unit is scaled by the value of the connecting weight, and it is fed forward to
provide a portion of the activation for the units in the next higher layer.
• Backpropagation can be applied to an artificial neural network with any number of hidden
layers. The training objective is to adjust the weights so that the application of a set of inputs
produces the desired outputs.
• Training procedure: The network is usually trained with a large number of input-output
pairs.
1. Initialize the weights to small random values (both positive and negative) to ensure
that the network is not saturated by large values of weights.
2. Choose a training pair from the training set.
3. Apply the input vector to network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 - 6 for each pair of input-output in the training set until the error for the
entire system is acceptably low.
Forward pass and backward pass:
• Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input to the output.
2. In the backward pass, the calculated error signals propagate backward through the network,
where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out, layer by layer, in the
forward direction. The output of one layer is the input to the next layer.
• In the reverse pass,
a. The weights of the output neuron layer are adjusted first since the target value of each
output neuron is available to guide the adjustment of the associated weights, using the delta
rule.
b. Next, we adjust the weights of the middle layers. As the middle layer neurons have no
target values, their error signals must be obtained by propagating the output errors backward,
which makes the problem more complex.
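
The forward and backward passes can be sketched for a tiny network with two inputs, two sigmoid hidden units and one sigmoid output, trained on squared error. The network size, the fixed weight values, the learning rate and all function names are assumptions made for this example; it shows one training step rather than a complete training loop.

```python
import math

def sig(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, W1, b1, W2, b2):
    """Forward pass: the output of the hidden layer is the input to the output unit."""
    h = [sig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sig(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def loss(x, target, W1, b1, W2, b2):
    _, y = forward(x, W1, b1, W2, b2)
    return 0.5 * (y - target) ** 2

def backward(x, target, W1, b1, W2, b2):
    """Backward pass: gradients of the squared error with respect to every weight."""
    h, y = forward(x, W1, b1, W2, b2)
    # Output layer: the target is available, so the delta rule applies directly.
    delta_out = (y - target) * y * (1.0 - y)
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Hidden layer: no targets, so the output delta is propagated back through W2.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2

# Hypothetical starting weights and a single training pair.
W1, b1 = [[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]
W2, b2 = [0.7, -0.6], 0.05
x, target = [1.0, 0.0], 1.0

grad_W1, grad_b1, grad_W2, grad_b2 = backward(x, target, W1, b1, W2, b2)

# One weight adjustment step against the gradient.
eta = 0.5
new_W1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, grad_W1)]
new_b1 = [b - eta * g for b, g in zip(b1, grad_b1)]
new_W2 = [w - eta * g for w, g in zip(W2, grad_W2)]
new_b2 = b2 - eta * grad_b2

old_loss = loss(x, target, W1, b1, W2, b2)
new_loss = loss(x, target, new_W1, new_b1, new_W2, new_b2)
print(old_loss, new_loss)
```

Repeating the last step over many training pairs is what the training procedure in steps 2 to 7 above describes.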
• Selection of number of hidden units: The number of hidden units depends on the number of
input units.
1. Never choose h to be more than twice the number of input units.
2. You can load p patterns of I elements into log2 p hidden units.
3. Ensure that we must have at least 1/e times as many training examples.
4. Feature extraction requires fewer hidden units than inputs.
5. Learning many examples of disjointed inputs requires more hidden units than inputs.
6. The number of hidden units required for a classification task increases with the number of
classes in the task. Large networks require longer training times.
Factors influencing Backpropagation training
• The training time can be reduced by using:
1. Bias: Networks with biases can represent relationships between inputs and outputs more
easily than networks without biases. Adding a bias to each neuron is usually desirable to
offset the origin of the activation function. The bias weight is trainable like any other
weight, except that its input is always +1.
2. Momentum: The use of momentum enhances the stability of the training process.
Momentum is used to keep the training process going in the same general direction, analogous
to the way that the momentum of a moving object behaves. In backpropagation with momentum,
the weight change is a combination of the current gradient and the previous weight change.
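
A minimal sketch of a momentum update, assuming the common form in which a running "velocity" blends the previous weight change with the current gradient. The momentum coefficient 0.9, the learning rate and the example cost function f(w) = (w - 3)^2 are assumptions for illustration.

```python
def gd_with_momentum(grad, w0, learning_rate=0.1, momentum=0.9, steps=500):
    """Gradient descent where each weight change reuses part of the previous one."""
    w = w0
    velocity = 0.0                    # previous weight change
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return w

def cost_grad(w):
    """Gradient of the example cost f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_final = gd_with_momentum(cost_grad, w0=0.0)
print(w_final)
```

Setting `momentum=0.0` recovers plain gradient descent, which makes the two easy to compare on the same problem.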

Advantages and Disadvantages


Advantages of backpropagation:
1. It is simple, fast and easy to program.
2. It has no parameters to tune apart from the number of inputs.
3. No need to have prior knowledge about the network.
4. It is flexible.
5. A standard approach and works efficiently.
6. It does not require the user to learn special functions.
Disadvantages of backpropagation:
1. Backpropagation can be sensitive to noisy data and irregularities.
2. Its performance is highly reliant on the input data.
3. It needs excessive time for training.
4. It needs a matrix-based method for backpropagation instead of mini-batch.

Shallow Networks
• The terms shallow and deep refer to the number of layers in a neural network. Shallow
neural networks are neural networks that have a small number of layers, usually
regarded as having a single hidden layer, and deep neural networks are neural networks
that have multiple hidden layers. Each type of network performs certain tasks better than the
other, and selecting the right network depth is important for creating a successful model.
• In a shallow neural network, the values of the feature vector of the data to be classified (the
input layer) are passed to a hidden layer of nodes (neurons) each of which generates a
response according to some activation function, g, acting on the weighted sum of those
values, z.
• The responses of the units in the hidden layer are then passed to a final output layer (which
may consist of a single unit), whose activation produces the classification prediction output.
Deep Network
• Deep learning is a new area of machine learning research, which has been introduced with
the objective of moving machine learning closer to one of its original goals. Deep learning is
about learning multiple levels of representation and abstraction that help to make sense of
data such as images, sound, and text.
• 'Deep learning' means using a neural network with several layers of nodes between input
and output. It is generally better than other methods on image, speech and certain other types
of data because the series of layers between input and output do feature identification and
processing in a series of stages, just as our brains seem to.
• Deep Learning emphasizes the network architecture of today's most successful machine
learning approaches. These methods are based on "deep" multi-layer neural networks with many
hidden layers.

TensorFlow
• TensorFlow is one of the most popular frameworks used to build deep learning models. The
framework is developed by Google Brain Team.
• Languages like C++, R and Python are supported by the framework to create the models as
well as the libraries. The framework can be used on both desktop and mobile.
• The translator used by Google is a well-known example of a TensorFlow application. In it, the
model is created by combining the functionalities of text classification, natural language
processing, speech or handwriting recognition, image recognition, etc.
• The framework has its own visualization toolkit, named TensorBoard, which helps in
powerful data visualization of the network along with its performance.
• One more tool in TensorFlow, TensorFlow Serving, can be used for quick and easy
deployment of newly developed algorithms without introducing any change in the
existing API or architecture.
• The TensorFlow framework comes with detailed documentation that lets users
adopt it quickly and easily, making it one of the most preferred frameworks for
modelling deep learning algorithms.
• Some of the characteristics of TensorFlow are:
• Multiple GPUs are supported.
• One can visualize graphs and queues easily using TensorBoard.
• Powerful documentation and larger support from the community.

Keras
• If you are comfortable programming in Python, then learning Keras will not prove
hard for you. It is a highly recommended framework for creating deep learning
models for those with a sound knowledge of Python.
• Keras is built purely in Python and can run on top of TensorFlow. Due to its complexity
and use of low level libraries, TensorFlow can be comparatively harder for new
users to adopt than Keras. Users who are beginners in deep learning, and who find
models difficult to understand in TensorFlow, generally prefer Keras because it lets them
build complex models with little code.
• Keras has been developed keeping in mind the complexities of deep learning models,
and hence it lets users get results in minimum time. Convolutional as well as
Recurrent Neural networks are supported in Keras. The framework can run easily on CPU
and GPU.
• The models in Keras can be classified into 2 categories:
1. Sequential model:
The layers in the deep learning model are defined in a sequential manner. Hence the
implementation of the layers in this model will also be done sequentially.
2. Keras functional API:
Deep learning models that have multiple outputs or shared layers, i.e. more complex
models, can be implemented with the Keras functional API.

Difference between Deep Network and Shallow Network

Vanishing Gradient Problem


• The vanishing gradient problem is a problem that we face when training Neural
Networks with gradient-based methods like backpropagation. This problem makes it
difficult to learn and tune the parameters of the earlier layers in the network.
• The vanishing gradient problem is essentially a situation in which a deep multilayer
feed-forward network or a Recurrent Neural Network (RNN) does not have the ability to
propagate useful gradient information from the output end of the model back to the layers
near the input end of the model.
• It results in models with many layers being rendered unable to learn on a specific dataset. It
could even cause models with many layers to prematurely converge to a substandard solution.
• When the backpropagation algorithm advances downwards or backward going from the
output layer to the input layer, the gradients tend to shrink, becoming smaller and smaller till
they approach zero. This ends up leaving the weights of the initial or lower layers practically
unchanged. In this situation, the gradient descent does not ever end up converging to the
optimum.
• Vanishing gradient does not necessarily imply that the gradient vector is all zero. It implies
that the gradients are minuscule, which would cause the learning to be very slow.
• For recurrent networks, one of the most important solutions to the vanishing gradient problem is
a specific type of neural network called Long Short-Term Memory Networks (LSTMs).
• Indication of vanishing gradient problem:
a) The parameters of the higher layers change to a great extent, while the parameters of lower
layers barely change.
b) The model weights could become 0 during training.
c) The model learns at a particularly slow pace and the training could stagnate at a very early
phase after only a few iterations.
• Some methods that are proposed to overcome the vanishing gradient problem:
a) Residual neural networks (ResNets)
b) Multi-level hierarchy
c) Long short term memory (LSTM)
d) Faster hardware
e) ReLU
f) Batch normalization
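
The shrinking of gradients described above can be illustrated with a back-of-the-envelope calculation. The sketch assumes a chain of sigmoid units, one per layer, each with a weight of 1 and a pre-activation of 0, where the sigmoid derivative takes its largest value of 0.25. These are simplifying assumptions; real networks have many units per layer and varying weights.

```python
import math

def sig(t):
    return 1.0 / (1.0 + math.exp(-t))

def sig_prime(t):
    s = sig(t)
    return s * (1.0 - s)

def gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Factor by which the error signal is scaled after passing back through
    num_layers sigmoid units (one weight and one derivative per layer)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sig_prime(pre_activation)
    return scale

for n in (1, 5, 10, 20):
    print(n, gradient_scale(n))
```

Even in this best case for the sigmoid, the factor falls by a quarter per layer, so the layers nearest the input receive almost no learning signal in a deep stack.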
ReLU
• The Rectified Linear Unit (ReLU) helps to mitigate the vanishing gradient problem. ReLU is a
non-linear, piecewise linear function that will output the input directly if it is positive;
otherwise, it will output zero.
• It is the most commonly used activation function in neural networks, especially in
Convolutional Neural Networks (CNNs) and multilayer perceptrons.
• Mathematically, it is expressed as
f(x) = max (0, x)
where x : input to neuron

• The derivative of an activation function is required when updating the weights during the
back-propagation of the error. The slope of ReLU is 1 for positive values and 0 for negative
values. ReLU is non-differentiable when the input x is zero, but the derivative there can be
safely taken to be zero, and this causes no problem in practice.
• ReLU is used in the hidden layers instead of Sigmoid or tanh. The ReLU function avoids the
computational cost of the exponentials in the Logistic Sigmoid and Tanh functions.
• A ReLU activation unit is known to be less likely to create a vanishing gradient problem
because its derivative is always 1 for positive values of the argument.
• Advantages of ReLU function
a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the
error.
b) Easy to implement and very fast.
c) The calculation speed is very fast, because for positive inputs the ReLU function is a simple linear relationship.
d) It can be used for deep network training.
• Disadvantages of ReLU function
a) When the input is negative, ReLU outputs zero and its gradient is zero, so a neuron whose
input stays negative stops updating its weights and "dies". This problem is also known as the
Dead Neurons problem.
b) The ReLU function is generally used only within the hidden layers of a Neural Network Model.
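
ReLU and the slope used during backpropagation are short enough to write out directly. The function names below are made up for this sketch, and the slope at exactly zero is taken to be 0, following the convention stated above.

```python
def relu(x):
    """f(x) = max(0, x): pass positive inputs through, output zero otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Slope used in backpropagation: 1 for positive x, 0 otherwise (including x = 0)."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```

The zero slope for negative inputs is what produces the dead-neuron behaviour listed among the disadvantages.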

LReLU and ERELU


1. LReLU
• The Leaky ReLU (LReLU) is one of the most well-known activation functions. It is the same as
ReLU for positive numbers. But instead of being 0 for all negative values, it has a constant slope
(less than 1).
• Leaky ReLU is a type of activation function that helps to prevent the function from
becoming saturated at 0. For negative inputs it has a small non-zero slope, whereas the
standard ReLU has a slope of zero there.
• Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Fig. shows LReLU
function.

• The leak helps to increase the range of the ReLU function. Usually, the value of the leak slope α is
0.01 or so.
• The motivation for using LReLU instead of ReLU is that constant zero gradients can also
result in slow learning, as when a saturated neuron uses a sigmoid activation function.
2. EReLU
• An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution
during training for the positive inputs to control the amount of non-linearity.
• The EReLU is defined as EReLU(x) = max(Rx, 0), with output range [0, ∞), where R is a
random number.
• At test time, the EReLU becomes the identity function for positive inputs.
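
Both variants can be sketched as follows. The leak slope of 0.01 comes from the text; the range from which the EReLU slope R is drawn (uniform around 1, controlled by a hypothetical `spread` parameter) is an assumption, since the text only says that the distribution is uniform.

```python
import random

def leaky_relu(x, alpha=0.01):
    """Same as ReLU for positive x; small constant slope alpha for negative x."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """The gradient never becomes exactly zero, unlike standard ReLU."""
    return 1.0 if x > 0 else alpha

def erelu(x, training, rng, spread=0.1):
    """EReLU sketch: max(R * x, 0) with a random slope R during training.
    The interval [1 - spread, 1 + spread] for R is an illustrative assumption."""
    r = rng.uniform(1.0 - spread, 1.0 + spread) if training else 1.0
    return max(r * x, 0.0)

rng = random.Random(0)
print(leaky_relu(-2.0), leaky_relu(3.0))
print(erelu(2.0, training=True, rng=rng), erelu(2.0, training=False, rng=rng))
```

With `training=False` the slope is fixed at 1, so positive inputs pass through unchanged, matching the test-time behaviour described above.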

Hyperparameter Tuning
• Hyperparameters are parameters whose values control the learning process and determine
the values of model parameters that a learning algorithm ends up learning.
• While designing a machine learning model, one always has multiple choices for the
architectural design of the model. This makes it unclear which design is
optimal, so finding a good machine learning model always involves some trial and error.
• The parameters that are used to define these machine learning models are known as the
hyperparameters and the rigorous search for these parameters to build an optimized model is
known as hyperparameter tuning.
• Hyperparameters are not model parameters, which can be directly trained from data. Model
parameters usually specify the way to transform the input into the required output, whereas
hyperparameters define the actual structure of the model that gives the required data.

Layer Size
• Layer size is defined by the number of neurons in a given layer. Input and output layers are
relatively easy to figure out because they correspond directly to how our modeling problem
handles input and output.
• For the input layer, this will match up to the number of features in the input vector. For the
output layer, this will either be a single output neuron or a number of neurons matching the
number of classes we are trying to predict.
• A neural network with 3 layers will often give better performance than one with 2
layers, while increasing beyond 3 layers frequently helps much less in plain fully connected
networks. In the case of CNNs, an increasing number of layers tends to make the model better.

Magnitude: Learning Rate


• The amount that the weights are updated during training is referred to as the step size or the
learning rate. Specifically, the learning rate is a configurable hyper-parameter used in the
training of neural networks that has a small positive value, often in the range between 0.0 and
1.0.
• For example, if the learning rate is 0.1, then the weights in the network are updated by 0.1 ×
(estimated weight error), or 10% of the estimated weight error, each time the weights are
updated. The learning rate hyper-parameter controls the rate or speed at which the model
learns.
• Learning rates are tricky because they end up being specific to the dataset and even to other
hyper-parameters. This creates a lot of overhead for finding the right setting for
hyper-parameters.
• Large learning rates make the model learn faster, but they may cause it to
miss the minimum of the loss function and only reach its surroundings. In cases where the
learning rate is too large, the optimizer overshoots the minimum and the updates
lead to divergent behaviour.
• On the other hand, choosing lower learning rate values gives a better chance of finding a
local minimum, with the trade-off of needing a larger number of epochs and more time.
• Momentum can accelerate learning on those problems where the high-dimensional weight
space that is being navigated by the optimization process has structures that mislead the
gradient descent algorithm, such as flat regions or steep curvature.
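The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration: the loss f(w) = w², the function name gradient_descent and the three learning rates are assumptions chosen for the example, not values from these notes.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize the toy loss f(w) = w**2 (minimum at w = 0), starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # step size is set by the learning rate
    return w


small = gradient_descent(0.01)      # moves toward 0, but slowly
moderate = gradient_descent(0.1)    # gets close to 0 in the same number of steps
too_large = gradient_descent(1.1)   # overshoots on every step and diverges

print(small, moderate, too_large)
```

With the same number of steps, the small rate is still far from the minimum, the moderate rate is close to it, and the too-large rate ends up further away than where it started.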

Normalization
• Normalization is a data preparation technique that is frequently used in machine learning.
The process of transforming the columns in a dataset to the same scale is referred to as
normalization. Not every dataset needs to be normalized for machine learning.
• Normalization makes the features more consistent with each other, which allows the model
to predict outputs more accurately. The main goal of normalization is to make the data
homogeneous over all records and fields.
• Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range. Data
normalization is used in machine learning to make model training less sensitive to the scale
of features.
• Normalization is important in such algorithms as k-NN, support vector machines, neural
networks, and principal components. The type of feature preprocessing and normalization
that's needed can depend on the data.
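Rescaling to the 0 to 1 range is commonly done with min-max scaling. The following is a minimal sketch; the function name min_max_normalize and the two example columns are hypothetical, and a constant column is mapped to zeros as an assumed convention.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range 0 to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale (assumed convention: all zeros).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Two features on very different scales (made-up values).
ages = [18, 30, 42, 66]
incomes = [20000, 50000, 80000, 140000]

scaled_ages = min_max_normalize(ages)
scaled_incomes = min_max_normalize(incomes)

print(scaled_ages)
print(scaled_incomes)
```

After scaling, both columns lie in the same 0 to 1 range, so neither dominates a distance or gradient computation simply because of its units.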

Batch Normalization
• It is a method of adaptive reparameterization, motivated by the difficulty of training very
deep models. In deep networks, the weights of every layer are updated during training, so the
output of a layer will no longer be on the same scale as its input.
• When we feed data to a machine learning or deep learning algorithm, we tend to rescale the
values to a balanced scale so that the model can generalize appropriately.
• Batch normalization is a technique for standardizing the inputs to layers in a neural
network. Batch normalization was designed to address the problem of internal covariate shift,
which arises as a consequence of updating multiple-layer inputs simultaneously in deep
neural networks.
• Batch normalization is applied to individual layers, or optionally, to all of them: In each
training iteration, we first normalize the inputs by subtracting their mean and dividing by
their standard deviation, where both are estimated based on the statistics of the current mini-
batch.
• Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom. It is
precisely due to this normalization based on batch statistics that batch normalization derives
its name.
• We take the output a[i-1] from the preceding layer, multiply it by the weights W and add
the bias b of the current layer. The index i denotes the current layer.
Z[i] = W[i] a[i-1] + b[i]
• Next, we usually apply the non-linear activation function that results in the output a[i] of the
current layer. When applying batch norm, we correct our data before feeding it to the
activation function.
• To apply batch norm, calculate the mean as well as the variance of the current z over the m
examples in the mini-batch.
$\mu = \frac{1}{m} \sum_{j=1}^{m} z_j$
• When calculating the variance, we add a small constant ε to the variance to prevent potential
divisions by zero.
$\sigma^2 = \frac{1}{m} \sum_{j=1}^{m} (z_j - \mu)^2 + \epsilon$
• To normalize the data, we subtract the mean and divide the expression by the standard
deviation.
$\hat{z}_j = \frac{z_j - \mu}{\sqrt{\sigma^2}}$
• This operation scales the inputs to have a mean of 0 and a standard deviation of 1.
• Advantages of Batch Normalisation:
a) The model is less sensitive to hyperparameter tuning.
b) Reduces internal covariate shift.
c) Reduces the dependence of gradients on the scale of the parameters or their initial
values.
d) Dropout can sometimes be reduced or removed, because batch normalization itself has a
regularizing effect.
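The batch normalization steps described above (mini-batch mean, variance plus a small constant, normalization, then scale and offset) can be sketched for a single unit as follows. This is an illustrative training-time sketch only: the names batch_norm, gamma (scale coefficient), beta (offset) and eps are assumed, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(z_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's pre-activations over a mini-batch.

    gamma and beta are the learnable scale and offset (assumed names);
    eps is the small constant added to the variance.
    """
    m = len(z_batch)
    mu = sum(z_batch) / m                                   # mini-batch mean
    var = sum((z - mu) ** 2 for z in z_batch) / m + eps     # variance + eps
    z_hat = [(z - mu) / math.sqrt(var) for z in z_batch]    # mean 0, std about 1
    return [gamma * zh + beta for zh in z_hat]              # scale and offset


z = [2.0, 4.0, 6.0, 8.0]        # made-up pre-activations for one unit
out = batch_norm(z)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance very close to 1. During training the network can learn other values of gamma and beta, which is how the lost degrees of freedom are recovered.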
Regularization
• As the figure above illustrates, when a model tries to capture every minute detail of the
input data, irregularities in the extracted features can introduce noise in the output. This
is referred to as "overfitting".
• The opposite can also happen: when too few features are extracted, some important details
may be missed, which hurts the accuracy of the outputs produced. This is referred to as
"underfitting".
• The complexity of processing the input elements also increases with overfitting. Because
neural networks are a complex interconnection of nodes, overfitting may arise frequently.
• To reduce this, regularization is used: a slight modification to the design or training of
the neural network that can give better outcomes.

Regularization in Machine Learning


• One of the most important factors that affect the machine learning model is overfitting.
• A machine learning model may perform poorly if it tries to capture even the noise present
in the training dataset, which ultimately results in overfitting. In this context, noise does
not mean ambiguous or false data, but inputs that do not carry the features the model needs.
• Analyzing these data inputs may make the model more flexible, but the risk of overfitting
will also increase accordingly.
• One of the ways to avoid this is to cross-validate on the training dataset and decide
accordingly which parameters to include to improve the efficiency and performance of the
model.
• Let this be the simple relation for linear regression:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
where
Y = learned relation
β = coefficient estimates for the different variables and/or predictors (X)
• Now, we introduce a loss function for the fitting procedure, which is referred to as the
"Residual Sum of Squares" or RSS.
• The coefficients are chosen so that they minimize this loss function:
$RSS = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2$
• The above equation adjusts the coefficients depending on the training dataset.
• If noise is present in the training dataset, the adjusted coefficients will not generalize
well when future datasets are introduced. At this point, regularization comes into the
picture and shrinks these coefficients towards zero.
• One of the methods to implement this is ridge regression, also known as L2
regularization. Let's have a quick overview of it.

Ridge Regression (L2 Regularization)


• Ridge regression, also known as L2 regularization, is a regularization technique for
avoiding overfitting to the training dataset. It introduces a small bias into the trained
model in exchange for better long-term predictions.
• In this method, a penalty term is added to the cost function. The amount of bias added to
the cost function in the model is known as the ridge regression penalty.
• Hence, the equation for the cost function, after introducing the ridge regression penalty,
is as follows:
$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
Here, λ is multiplied by the sum of the squared weights set for the individual features of
the input data. This term is the ridge regression penalty.
• It regularizes the coefficients of the model: the ridge regression term reduces the
values of the coefficients, which ultimately helps in reducing the complexity of the
machine learning model.
• From the above equation, we can observe that if the value of λ tends to zero, the last term
on the right-hand side tends to zero, making the equation a representation of a simple
linear regression model.
• Hence, the lower the value of λ, the more the model tends towards linear regression.
• This model matters in machine learning, including neural networks, because ordinary linear
regression models risk failing when there are dependencies between their variables. Ridge
regression is used in such cases.
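The role of λ can be checked numerically on a deliberately simplified case. The sketch below assumes a single feature and no intercept, so that the minimizer of the ridge cost has the closed form β = Σxy / (Σx² + λ). The data values and the names ridge_cost and ridge_fit are illustrative only.

```python
def ridge_cost(xs, ys, beta, lam):
    """RSS plus the ridge penalty, for a one-feature model y = beta * x."""
    rss = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    return rss + lam * beta ** 2


def ridge_fit(xs, ys, lam):
    """Closed-form minimizer of ridge_cost for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]       # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8]       # made-up targets, roughly y = 2x

beta_ols = ridge_fit(xs, ys, 0.0)      # lambda = 0: plain least squares
beta_mid = ridge_fit(xs, ys, 10.0)     # moderate penalty
beta_big = ridge_fit(xs, ys, 100.0)    # strong penalty

print(beta_ols, beta_mid, beta_big)
```

With λ = 0 the result is the ordinary least-squares coefficient. As λ grows, the coefficient shrinks towards zero but does not reach it.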

Lasso Regression (L1 Regularization)


• One more technique for reducing overfitting, and thus the complexity of the model, is
lasso regression.
• Lasso stands for Least Absolute Shrinkage and Selection Operator; the method is also
known as L1 regularization.
• The equation for lasso regression is almost the same as that of ridge regression, except
that the penalty term uses the absolute values of the weights instead of their squares.
• The advantage of taking absolute values is that a coefficient can shrink exactly to 0,
whereas in ridge regression the coefficients only shrink close to 0.
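• The following equation gives the cost function defined in lasso regression:
$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$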
• Because the penalty uses absolute values, the coefficients of some features of the input
dataset can become exactly zero, so those features are ignored completely when the model is
evaluated. Lasso therefore performs feature selection and can substantially reduce
overfitting.
• On the other hand, ridge regression does not ignore any feature in the model and includes
them all for model evaluation. The complexity of the model is reduced by shrinking the
coefficients in the ridge regression model.
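The difference between shrinking exactly to zero and shrinking close to zero can be seen in a simplified setting. The sketch below again assumes one feature and no intercept. For the cost Σ(y − βx)² + λ|β| the minimizer is the soft-thresholding formula used in lasso_fit. The data and function names are made up for illustration.

```python
import math


def ridge_fit(xs, ys, lam):
    """Minimizer of sum((y - b*x)**2) + lam * b**2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_fit(xs, ys, lam):
    """Minimizer of sum((y - b*x)**2) + lam * abs(b) (soft thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)   # clipped at zero
    return math.copysign(shrunk, sxy) / sxx


# A weakly informative feature (made-up values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.3, -0.1, 0.2, 0.0]

lam = 2.0
beta_ridge = ridge_fit(xs, ys, lam)   # small but non-zero
beta_lasso = lasso_fit(xs, ys, lam)   # exactly zero: the feature is dropped

print(beta_ridge, beta_lasso)
```

For this weak feature the lasso coefficient is exactly zero, which removes the feature from the model, while the ridge coefficient is small but still non-zero.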

Dropout
• Dropout was introduced by Hinton et al. and is now very popular. It consists of setting
the output of each hidden neuron in a chosen layer to zero with some probability, and it has
proven very effective in reducing overfitting.
• Fig. shows dropout regularization.

• To achieve dropout regularization, some neurons in the artificial neural network are
randomly disabled. That prevents them from being too dependent on one another as they
learn the correlations. Thus, the neurons work more independently, and the artificial neural
network learns multiple independent correlations in the data based on different configurations
of the neurons.
• It is used to improve the training of neural networks by omitting a hidden unit. It also
speeds training.
• Dropout is driven by randomly dropping a neuron so that it will not contribute to the
forward pass and back-propagation.
• Dropout is an inexpensive but powerful method of regularizing a broad family of models.
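The random disabling of neurons can be sketched as a mask applied to a layer's outputs. The version below uses the common "inverted dropout" convention, in which surviving outputs are rescaled by 1 / (1 − drop_prob) during training. That rescaling is an assumption here, not something stated in these notes, and the names dropout, drop_prob and hidden are illustrative.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob (requires drop_prob < 1).

    Kept activations are scaled by 1 / (1 - drop_prob) so that their expected
    value is unchanged (inverted dropout, an assumed convention).
    At inference time (training=False) the activations pass through unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)            # seeded for repeatability
hidden = [1.0] * 10               # made-up hidden-layer outputs

dropped = dropout(hidden, 0.5, rng)
unchanged = dropout(hidden, 0.5, rng, training=False)
print(dropped)
```

Each training pass draws a new random mask, so the network effectively trains a different thinned configuration of neurons every time.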

DropConnect
• DropConnect, known as the generalized version of Dropout, is a method used for
regularizing deep neural networks. Fig. 10.11.3 shows DropConnect.
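As DropConnect is commonly described, it randomly drops individual connections (weights) rather than whole neuron outputs. The sketch below illustrates that idea for one fully connected layer without bias or activation. The function name, the weight layout and the omission of any rescaling are assumptions made for the example.

```python
import random


def dropconnect_layer(inputs, weights, drop_prob, rng):
    """Compute a layer's weighted sums with each connection dropped independently.

    weights[j][k] is the weight from input k to output unit j (assumed layout).
    """
    outputs = []
    for row in weights:
        total = 0.0
        for x, w in zip(inputs, row):
            if rng.random() >= drop_prob:   # keep this single connection
                total += w * x
        outputs.append(total)
    return outputs


rng = random.Random(0)
inputs = [1.0, 2.0]
weights = [[0.5, -1.0],
           [2.0, 0.25]]

full = dropconnect_layer(inputs, weights, 0.0, rng)      # nothing dropped
thinned = dropconnect_layer(inputs, weights, 0.5, rng)   # random connections dropped
print(full, thinned)
```

Dropout zeroes a neuron's whole output, which removes all of its outgoing connections at once. DropConnect masks connections one by one, which is why it is described as the more general method.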

Difference between L1 and L2 Regularization
