DL_Unit2

The document provides an overview of deep feedforward networks, detailing their architecture, learning processes, and the significance of hidden layers in capturing complex relationships. It explains gradient-based learning, particularly gradient descent, and its role in optimizing machine learning models. Additionally, it covers the importance of activation functions, cost functions, and the structure of computational graphs in neural networks.

Deep Feedforward Network

• Feed-forward Networks,
• Gradient-based Learning,
• Hidden Units,
• Architecture Design,
• Computational Graphs,
• Back-Propagation,
• Regularization,
• Parameter Penalties,
• Data Augmentation,
• Multi-task Learning,
• Bagging,
• Dropout and Adversarial Training and Optimization.
Neural networks
• The term "artificial neural network" refers to a biologically inspired sub-field of artificial intelligence
modeled after the brain. An artificial neural network is usually a computational network based on
the biological neural networks that make up the structure of the human brain. Just as the human brain
has neurons interconnected with each other, artificial neural networks also have neurons that are linked
to each other in the various layers of the network. These neurons are known as nodes.
Feedforward networks

• The process of receiving an input and producing an output in order to make some
kind of prediction is known as feed-forward.
• The feed-forward neural network is the core of many other important neural networks,
such as the convolutional neural network.
• In a feed-forward neural network, there are no feedback loops or recurrent
connections. There is simply an input layer, one or more hidden layers, and an
output layer.
• There can be multiple hidden layers, depending on what kind of data you are
dealing with.
• The number of hidden layers determines the depth of the neural network.
• A deeper neural network can represent more complex functions.
• The input layer first provides the neural network with data, and the output layer then
makes predictions on that data based on a series of functions.
• ReLU is the most commonly used activation function in deep neural
networks.
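As a minimal sketch (in Python with NumPy; the layer sizes and values below are arbitrary illustrations, not taken from any particular library), one forward pass through a network with a single hidden layer looks like this:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0, z)

# Arbitrary sizes for illustration: 3 input features, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden -> output

x = np.array([0.5, -1.0, 2.0])    # one input example

h = relu(W1 @ x + b1)    # hidden layer: affine transformation + ReLU
y = W2 @ h + b2          # output layer: affine transformation
print(y)                 # the network's prediction
```

Note that data flows strictly forward here, from input to hidden to output, with no feedback connections.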
Gradient-based Learning
Gradient Descent in Machine Learning
• Gradient Descent is known as one of the most commonly used optimization
algorithms to train machine learning models by means of minimizing errors
between actual and expected results. Further, gradient descent is also used to
train Neural Networks.
• In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters.
• The main objective of gradient descent is to minimize a convex function
through iterative parameter updates.
• Once these machine learning models are optimized, these models can be used as
powerful tools for Artificial Intelligence and various computer science applications.
What is Gradient Descent or Steepest Descent?

• Gradient descent was first introduced by Augustin-Louis Cauchy in 1847, in the
middle of the 19th century. Gradient descent is one of the most
commonly used iterative optimization algorithms of machine learning, used to
train machine learning and deep learning models. It helps in finding
the local minimum of a function.
• The best way to define the local minimum or local maximum of a function
using gradients is as follows:
 If we move towards the negative gradient, or away from the gradient of the
function at the current point, we will reach the local minimum of that
function.
 Whenever we move towards the positive gradient, or towards the gradient of
the function at the current point, we will reach the local maximum of that
function.
• The latter procedure is known as gradient ascent; gradient descent itself is also known
as steepest descent. The main objective of a gradient descent
algorithm is to minimize the cost function by iteration. To achieve this
goal, it performs two steps repeatedly:
 Calculate the first-order derivative of the function to compute the gradient
or slope of that function at the current point.
 Move away from the direction of the gradient, i.e. step from the current point
by alpha times the gradient, where alpha is defined as the
learning rate. It is a tuning parameter in the optimization process which
helps to decide the length of the steps.
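These two steps can be sketched in a few lines of Python. The function below is a toy convex example, f(x) = (x - 3)², chosen only for illustration; its derivative is written out by hand:

```python
def gradient(x):
    # Derivative of f(x) = (x - 3)^2 is f'(x) = 2 * (x - 3)
    return 2 * (x - 3)

x = 0.0        # arbitrary starting point
alpha = 0.1    # learning rate (step size)

for step in range(100):
    x = x - alpha * gradient(x)   # step against the gradient

print(x)   # converges toward the minimum at x = 3
```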
What is Cost-function?

• The cost function is defined as the measurement of the difference, or error, between actual
values and predicted values at the current position, expressed as a single real
number.
• It helps to improve machine learning efficiency by providing feedback to the
model so that it can minimize error and find the local or global minimum. The algorithm
continuously iterates along the direction of the negative gradient until the cost function
approaches its minimum.
• At this point, the model stops learning further. Although cost function and
loss function are often treated as synonymous, there is a minor difference between them.
• The slight difference between the loss function and the cost function concerns how the error
is measured during the training of machine learning models: the loss function refers to the error of one
training example, while the cost function calculates the average error across the entire
training set.
• The cost function is calculated after making a hypothesis with initial parameters; these
parameters are then modified using the gradient descent algorithm over known data to reduce the
cost function.
How does Gradient Descent work?

• Before examining the working principle of gradient descent, we should review
some basic concepts for finding the slope of a line, starting from linear regression.
The equation for simple linear regression is given as:

Y = mX + c

• where 'm' represents the slope of the line, and 'c' represents the intercept
on the y-axis.
• The starting point is just an arbitrary point used to evaluate the initial performance.
From this starting point, we take the first derivative, or slope, and use a tangent
line to measure the steepness of the slope. This slope informs the updates to the
parameters (weights and bias).
• The slope is steepest at the starting or arbitrary point, but
whenever new parameters are generated, the steepness gradually reduces,
until it reaches the lowest point, which is called the
point of convergence.
• The main objective of gradient descent is to minimize the cost function, that is,
the error between the expected and actual values. Minimizing the cost function requires two
factors:
• Direction & Learning Rate
These two factors determine the partial derivative calculation
of each future iteration and allow the algorithm to arrive at the point of convergence, i.e.
the local or global minimum. Let's discuss the learning rate in brief:
• Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point.
This is typically a small value that is evaluated and updated based on the
behavior of the cost function. A high learning rate results in larger
steps but risks overshooting the minimum. Conversely,
a low learning rate gives small step sizes, which compromises
overall efficiency but gives the advantage of more precision.
• Vanishing and Exploding Gradients
In a deep neural network trained with gradient descent and
backpropagation, two more issues can occur besides local minima and saddle
points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During
backpropagation, the gradient becomes progressively smaller, causing the earlier
layers of the network to learn more slowly than the later layers. When this happens, the
weight updates become so small that they are insignificant, and learning stalls.
Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the
gradient is too large, creating an unstable model. In this scenario, the model weights
grow too large and may eventually be represented as NaN. This problem can be mitigated using
dimensionality reduction techniques, which help to minimize complexity within the
model.
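A small numeric sketch (assuming sigmoid activations and pre-activations near zero) shows why gradients vanish: backpropagation multiplies one activation-derivative factor per layer, and the sigmoid's derivative never exceeds 0.25, so the product shrinks exponentially with depth:

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)    # at most 0.25, reached at z = 0

# Backpropagation multiplies one such factor per layer, so the
# gradient shrinks roughly as 0.25^depth in this idealized case.
grad = 1.0
for layer in range(20):             # a 20-layer chain of sigmoids
    grad *= sigmoid_derivative(0.0)

print(grad)    # about 9.1e-13: the gradient has effectively vanished
```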
Types of Layers

Generally, every neural network consists of vertically stacked components that
are called layers. There are three types of layers:
• An Input Layer that takes the raw data as input and passes it to the rest of the
network.
• One or more Hidden Layers that are intermediate layers between the input and
output layer and process the data by applying complex non-linear functions to
them. These layers are the key component that enables a neural network to
learn complex tasks and achieve excellent performance.
• An Output Layer that takes as input the processed data and produces the final
results.
In such a network, each neuron of the first hidden layer takes as
input the three input values and computes its output as follows:

y = f(w1*x1 + w2*x2 + w3*x3 + b) = f(Σi wi*xi + b)

where the "xi" are the input values, the "wi" the weights, "b" the bias, and "f()"
an activation function. The neurons of the second hidden layer then
take as input the outputs of the neurons of the first hidden layer, and so on.
Importance of Hidden Layers

• Hidden layers are the reason why neural networks are able to capture very complex
relationships and achieve exciting performance in many tasks.
• To better understand this concept, we should first examine a neural network
without any hidden layers, one with just 3 input features and 1 output:
• Based on the previous equation, the output value "y" is simply a linear
combination of the inputs, possibly followed by a single non-linearity. The model is therefore
essentially a linear regression model. As we already know, linear regression attempts to fit a
linear equation to the observed data.
• In most machine learning tasks, a linear relationship is not enough to capture the
complexity of the task, and the linear regression model fails. Here comes the
importance of the hidden layers, which enable the neural network to learn very
complex non-linear functions.
Examples

• Next, we’ll discuss two examples that illustrate the importance of hidden
layers in training a neural network for a given task.
1. Logical Functions
• Let's imagine that we want to use a neural network to predict the output of
an XOR logical gate given two binary inputs. According to the truth table
of x1 XOR x2, the output is true whenever the inputs differ:

x1 | x2 | x1 XOR x2
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

• To better understand our classification task, we can plot the four possible
outputs: the green points correspond to an output equal to 1, while the red
points are the zero outputs.
• At first, the problem seems pretty easy. The first approach could be to use a neural
network without any hidden layer. However, such a linear architecture can only separate
the input data points using a single line. As we can see from the plot, the two classes
cannot be separated using a single line: the XOR problem is not linearly separable.
• Whatever line a simple linear model learns for the XOR
problem, in every case there is at least one input that is misclassified.
The solution to this problem is to learn a non-linear function by adding a hidden
layer with two neurons to our neural network. The final decision is then made based on
the outputs of these two neurons, each of which learns a linear function:
one line makes sure that at least one feature of the input is true, and the other line
makes sure that not all features are true. So, the hidden layer manages to transform
the input features into processed features that can then be correctly classified by the
output layer.
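A minimal sketch of this solution in Python, using hand-picked rather than learned weights: the first hidden unit implements the "at least one feature is true" line (OR), the second implements the "not all features are true" line (NAND), and the output unit fires only when both agree:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)    # threshold activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden unit 1 fires when at least one input is 1 (OR).
# Hidden unit 2 fires when not both inputs are 1 (NAND).
W_hidden = np.array([[1.0, 1.0],      # OR weights
                     [-1.0, -1.0]])   # NAND weights
b_hidden = np.array([-0.5, 1.5])

# The output unit fires only when both hidden units fire (AND).
w_out = np.array([1.0, 1.0])
b_out = -1.5

h = step(X @ W_hidden.T + b_hidden)   # hidden-layer features
y = step(h @ w_out + b_out)           # final decision
print(y)                              # [0 1 1 0] -- XOR
```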
2. Images

• Another way to realize the importance of hidden layers is to look into the computer vision domain.
Deep neural networks that consist of many hidden layers have achieved impressive results in face
recognition by learning features in a hierarchical way.
• Specifically, the first hidden layers of a neural network learn to detect small
edges and corners in the image. These features are easy to detect given the raw image but are not very useful by
themselves for recognizing the identity of the person in the image. The middle hidden layers then combine
the detected edges from the previous layers to detect parts of the face, like the eyes, the nose and the ears.
Finally, the last layers combine the detectors of the nose, eyes, etc. to detect the overall face of the
person.
• In this way, each layer helps us to go from the raw pixels to our final goal.
Neural Network Architecture and
Operation
• A weight is assigned to each input to an artificial neuron. First, the inputs are multiplied
by their weights, and then a bias is added to the outcome. This is called the
weighted sum. After that, the weighted sum is passed through an activation function,
which is a non-linear function.
• The first layer is the input layer, through which the data is fed into the neural
network. The output layer is the final layer. The dataset and the type of problem
determine the number of neurons in the first and final layers. The number of neurons
in the hidden layers, and the number of hidden layers themselves, are determined by
trial and error.
• The first neuron of the first hidden layer will be connected to all of the inputs from
the previous layer. The second neuron in the first hidden layer will likewise be connected to all of
the preceding layer's inputs, and so forth for all of the first hidden layer's neurons. The
outputs of the previous hidden layer are regarded as inputs for the neurons in the second
hidden layer, and each of these neurons is connected to all of the preceding neurons.
What is a Feed-Forward Neural Network and how does it work?

• In its most basic form, a feed-forward neural network is a single-layer perceptron. A
sequence of inputs enters the layer and is multiplied by the weights in this model. The
weighted input values are then summed together to form a total.
• If the sum of the values is above a predetermined threshold, which is normally set at zero,
the output value is usually 1, and if the sum is below the threshold, the output value is
usually -1.
• The single-layer perceptron is a popular feed-forward neural network model that is frequently
used for classification. Single-layer perceptrons can also be trained with machine learning methods.
• The neural network can compare the outputs of its nodes with the desired values
using a rule known as the delta rule, allowing the network to adjust its weights
through training to produce more accurate output values.
• This training and learning procedure is a form of gradient descent. The technique for
updating weights in multi-layered perceptrons is virtually the same; however, the
process is referred to as back-propagation. In such circumstances, the output values
provided by the final layer are used to adjust the weights of each hidden layer inside the network.
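A minimal sketch of such a perceptron trained with the delta rule, on a hypothetical AND-gate dataset (using 0/1 outputs rather than 1/-1 for simplicity):

```python
import numpy as np

# Toy linearly separable data: the AND gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])    # desired outputs

w = np.zeros(2)
b = 0.0
eta = 0.1                     # learning rate

for epoch in range(20):
    for x, target in zip(X, t):
        y = 1 if (w @ x + b) > 0 else 0    # threshold the weighted sum
        error = target - y                 # delta rule: compare with desired value
        w += eta * error * x               # adjust weights toward the target
        b += eta * error

print(w, b)    # weights and bias that separate the AND classes
```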
Hidden Units
• In a deep feedforward neural network, hidden units refer to the nodes or neurons in layers between the
input and output layers. These hidden units are responsible for learning and representing complex patterns
and relationships in the input data. The term "hidden" comes from the fact that the values in these units are
not directly observed in the input or output data but play a crucial role in the network's ability to generalize
and make predictions.
• Each hidden unit in a deep feedforward network applies a linear transformation to its inputs, followed by a
non-linear activation function. The non-linear activation function is crucial for the network's ability to
learn complex mappings. Common activation functions include sigmoid, hyperbolic tangent (tanh), and
rectified linear unit (ReLU).
• The deep feedforward network learns by adjusting the weights and biases associated with the connections
between nodes during the training process. This is typically done using optimization algorithms like
gradient descent and backpropagation.
• In summary, hidden units in a deep feedforward network are the neurons in the layers between the input
and output layers, responsible for capturing and representing the underlying patterns in the input data.
• A hidden unit takes in a vector/tensor, computes an affine transformation z = WᵀX + b, and then applies an
element-wise non-linear function g(z).
The way hidden units are differentiated from each other is based on their activation function, g(z):
• ReLU
• ELU
• GELU
• Maxout
• PReLU
• Absolute value rectification
• LeakyReLU
• Logistic Sigmoid
• Hyperbolic Tangent
• Hard Hyperbolic Tangent
• Identity
• Softplus
• Softmax
• RBF
• etc
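For concreteness, a few of these activation functions written out in NumPy (a sketch; the leaky slope of 0.01 is a common default, not a fixed requirement):

```python
import numpy as np

def relu(z):                  # max(0, z)
    return np.maximum(0, z)

def leaky_relu(z, a=0.01):    # small slope instead of zero for z < 0
    return np.where(z > 0, z, a * z)

def sigmoid(z):               # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):              # smooth approximation of ReLU
    return np.log1p(np.exp(z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (relu, leaky_relu, sigmoid, np.tanh, softplus):
    print(g.__name__, g(z))
```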
Computational Graphs

• A computational graph is a way to represent a math function in the language of
graph theory. Recall the premise of graph theory: nodes are connected by edges,
and everything in the graph is either a node or an edge.
• In a computational graph nodes are either input values or functions for combining
values. Edges receive their weights as the data flows through the graph. Outbound
edges from an input node are weighted with that input value; outbound nodes from
a function node are weighted by combining the weights of the inbound edges using
the specified function.
• For example, consider the relatively simple expression: f(x, y, z) = (x + y) * z. This
is how we would represent that function as a computational graph:
• There are three input nodes, labeled X, Y, and Z. The two other nodes are function
nodes. In a computational graph we generally compose many simple functions into
a more complex function. We can do composition in mathematical notation as well,
but I hope you'll agree the following isn't as clean as the graph above:

f(x, y, z) = h(g(x, y), z), where g(x, y) = x + y and h(q, z) = q * z
• In both of these notations we can compute the answers to each function separately,
provided we do so in the correct order. Before we know the answer to f(x, y, z), first
we need the answer to g(x, y), and then h(g(x, y), z). In the mathematical
notation we resolve these dependencies by computing the deepest parenthetical
first; in computational graphs we have to wait until all the edges pointing into a
node have a value before computing the output value for that node. Consider
computing f(1, 2, 3): first g(1, 2) = 3, then h(3, 3) = 9.
• And in the graph, we use the output from each node as the weight of the
corresponding edge:
• Either way, graph or function notation, we get the same answer because these are just two
ways of expressing the same thing.
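The same evaluation can be written as composed Python functions, one per function node of the graph:

```python
def g(x, y):      # the addition node
    return x + y

def h(q, z):      # the multiplication node
    return q * z

# Evaluate in dependency order: g(1, 2) = 3 first, then h(3, 3) = 9.
print(h(g(1, 2), 3))    # 9
```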
• In this simple example it might be hard to see the advantage of using a computational graph
over function notation. After all, there isn’t anything terribly hard to understand about the
function f(x, y, z) = (x + y) * z. The advantages become more apparent when we reach the
scale of neural networks.
• Even relatively “simple” deep neural networks have hundreds of thousands of nodes and
edges; it’s quite common for a neural network to have more than one million edges. Try to
imagine the function expression for such a computational graph… can you do it? How much
paper would you need to write it all down? This issue of scale is one of the reasons
computational graphs are used.
• Let’s look at one concrete example: suppose we’re building a deep neural network for
predicting if someone is single or in some sort of relationship; a binary predictor. Furthermore,
assume that we’ve gathered a dataset that tells us four things about a person: their age, gender,
what city they live in, and if they are single or in some sort of relationship.
• When we say we want to "build a neural network" to make this prediction, we're really
saying that we want to find a mathematical function of the form:

f(age, gender, city) = relationship status

• Where the output value is 0 if that person is in a relationship, and 1 if that person is
not in a relationship.
• We are making a huge (and wrong) assumption here that age, gender, and city tell
us everything we need to know about whether or not someone is in a relationship.
But that’s okay — all models are wrong and we can use statistics to find out if this
one is useful or not. Don’t focus on how much this toy model oversimplifies human
relationships, focus on what this means for the neural network we want to build.
• As an aside, but before we move on: encoding the value for “city” can be tricky.
It’s not at all clear what the numerical value of “Berkeley” or “Salt Lake City”
should be in our mathematical function. One-hot encoding is a popular tactic.
• In fact, a one-hot encoded vector would be used as the output layer for this network
as well.
Neural Networks as Computational Graphs

• When we define the architecture of a neural network we’re laying out the series of sub-
functions and specifying how they should be composed. When we train the neural network
we’re experimenting with the parameters of these sub-functions. Consider this function as an
example:
f(x, y) = ax² + bxy + cy²;
where a, b, and c are scalars

• The component sub-functions of this function are all of the operators: two squares, two
additions, and four multiplications.
• The tunable parameters of this function are a, b, and c, in neural network parlance these
are called weights.
• The inputs to the function are X and Y — we can’t tune those values in machine learning
because they are the values from the dataset.
• By changing the values of our weights (a, b, and c) we can dramatically impact the output of
the function. On the other hand, regardless of the values of a, b, and c there will always be
an x², a y² and an xy term — so our function has a limited range of possible configurations.
• Here is a computational graph representing this function:
• This isn’t technically a neural network, but it’s very close in all the ways that count.
It’s a graph that represents a function; we could use it to predict some kinds of
trends; and we could train it using gradient descent and backpropagation if we had
a dataset that mapped two inputs to an output. This particular computational graph
will be good at modeling some quadratic trends involving exactly 2 variables, but
bad at modeling anything else.
• In this example, training the network would amount to changing the weights until
we find some combination of a, b, and c that causes the function to work well as a
predictor for our dataset. If you’re familiar with linear regression, this should feel
similar to tuning the weights of the linear expression.
• This graph is still quite simple compared to even the simplest neural networks that
are used in practice, but the main idea — that a, b, and c can be adjusted to improve
the model’s performance — remains the same.
• The reason this neural network would not be used in practice is that it isn’t very
flexible. This function only has 3 parameters to tune: a, b, and c. Making matters
worse, we’ve only given ourselves room for 2 features per input (x and y).
• Fortunately, we can easily solve this problem by using more complex functions and
allowing for more complex input.
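Before moving on, here is a hedged sketch of what "training" the quadratic graph above means in practice: fitting a, b, and c by gradient descent on synthetic data generated from assumed true weights (2, -1, 0.5). The dataset, learning rate, and step count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset generated from "true" weights a=2, b=-1, c=0.5.
X = rng.normal(size=(200, 2))
x, y = X[:, 0], X[:, 1]
targets = 2 * x**2 - 1 * x * y + 0.5 * y**2

a = b = c = 0.0    # the tunable weights, starting from zero
lr = 0.01

for step in range(500):
    pred = a * x**2 + b * x * y + c * y**2
    err = pred - targets
    # Gradients of the mean squared error with respect to a, b, c.
    a -= lr * np.mean(2 * err * x**2)
    b -= lr * np.mean(2 * err * x * y)
    c -= lr * np.mean(2 * err * y**2)

print(a, b, c)    # approaches the true weights (2, -1, 0.5)
```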
Recall two facts about deep neural networks:
 DNNs are a special kind of graph, a “computational graph”.
 DNNs are made up of a series of “fully connected” layers of nodes.
• “Fully connected” means that the output from each node in the first layer becomes one of
the inputs for every node in the second layer. In a computational graph the edges are the
output values of functions — so in a fully connected layer the output for each sub-function is
used as one of the inputs for each of the sub-functions in the next layer. But, what are those
functions?
• The function performed by each node in the neural net is called a transfer function (which is
also called the activation function). There are two steps in every transfer function. First, all of
the input values are combined in some way, usually this is a weighted sum. Second, a
“nonlinear” function is applied to that sum; this second function might change from layer to
layer within a single neural network.
• Popular nonlinear functions for this second step are tanh, log, max(0, x) (called Rectified
Linear Unit, or ReLU), and the sigmoid function. At the time of this writing, ReLU is the most
popular choice of nonlinearity, but things change quickly.
• If we zoom in on a neural network, we’d notice that each “node” in the network was actually
2 nodes in our computational graph:
• In this case, the transfer function is a sum followed by a sigmoid.
Typically, all the nodes in a layer have the same transfer and activation
function. Indeed it is common for all the layers in the same network to use
the same activation functions, though it is not a requirement by any means.
• The last sources of complexity in our neural network are biases and
weights. Every incoming edge has a unique weight; the output value from
the previous node is multiplied by this weight before it is given to the
transfer function. Each transfer function also has a single bias which is
added before the nonlinearity is applied. Let's zoom in one more
time:
• In this diagram we can see that each input to the sum is first weighted via multiplication
then it is summed. The bias is added to that sum as well, and finally the total is sent to our
nonlinear function (sigmoid in this case). These weights and biases are the parameters that
are ultimately fine-tuned during training.
• In the previous example, we didn’t have enough flexibility because we only had 3
parameters to fine tune. So just how many parameters are there in a deep neural network
for us to tune?
• If we define a neural net to predict binary classification (in/not in a relationship) with 2
hidden layers each with 512 nodes and an input vector with 20 features we will have
20*512 + 512*512 + 512*2 = 273,408 weights that we can fine tune plus 1024 biases (one
for each node in the hidden layers). This is a “simple” neural network. “Complex” neural
networks frequently have several million tunable weights and biases.
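That count is easy to verify in code (a sketch using the sizes assumed above; biases are counted only for the hidden nodes, as in the text):

```python
# 20-feature input, two hidden layers of 512 nodes, 2-node output layer.
layer_sizes = [20, 512, 512, 2]

weights = sum(n_in * n_out
              for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:-1])    # one bias per hidden node

print(weights)    # 273408
print(biases)     # 1024
```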
• This extraordinary flexibility is what allows neural nets to find and model complex
relationships. It’s also why they require lots of data to train. Using backpropagation and
gradient descent we can purposely change the millions of weights until the output
becomes more correct, but because we’re doing calculations involving millions of variables
it takes a lot of time and a lot of data to find the right combination of weights and biases.
• While they are sometimes called a “black box”, neural networks are really
just a way of representing very complex mathematical functions. The
neural nets we build are particularly useful functions because they have so
many parameters that can be fine tuned. The result of the fine tuning is
that rich complexities between different components of the input can be
plucked out of the noise.
• Ultimately, the “architecture” of our computational graph will have a big
impact on how well our network can perform. Questions like: how many
nodes per layer, which activation functions are used at each layer, and how
many layers to use, are the subject of research and might change
dramatically from neural network to neural network. The architecture will
depend on the type of prediction being made and the kind of data being
fed into the system — just like we shouldn’t use a linear function to model
parabolic data, we shouldn’t use any neural net to solve every problem.
Back-Propagation

• Backpropagation is the essence of neural network training. It is the method of
fine-tuning the weights of a neural network based on the error rate obtained in the
previous epoch (i.e., iteration). Proper tuning of the weights reduces
error rates and makes the model reliable by increasing its generalization.
• "Backpropagation" in neural networks is short for "backward propagation of
errors." It is a standard method of training artificial neural networks. This method
helps calculate the gradient of a loss function with respect to all the weights in the
network.
How Backpropagation Algorithm Works
• The backpropagation algorithm computes the gradient of the
loss function for a single weight by the chain rule. It efficiently computes one layer
at a time, unlike a naive direct computation. It computes the gradient, but it does
not define how the gradient is used. It generalizes the computation in the delta
rule.
• Consider the following backpropagation example to
understand the steps:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron, from the input layer through the hidden
layers to the output layer.
4. Calculate the error in the outputs:

Error = Actual Output – Desired Output

5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
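A minimal sketch of these steps for a network with one hidden layer and sigmoid activations (the sizes, data, and learning rate below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Steps 1-2: input x and randomly selected weights W.
x = np.array([0.5, 1.0])
t = np.array([1.0])    # desired output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
lr = 0.5

for epoch in range(1000):
    # Step 3: forward pass, input -> hidden -> output.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)

    # Step 4: error at the output.
    error = y - t

    # Step 5: travel back, adjusting weights via the chain rule.
    delta_out = error * y * (1 - y)                 # output-layer gradient
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # hidden-layer gradient
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(y)    # step 6: repeated updates drive y toward the desired output
```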
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:
• Backpropagation is fast, simple and easy to program
• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge
about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the
function to be learned.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation
• Static back-propagation
• It is a kind of backpropagation network which produces a mapping from a
static input to a static output. It is useful for solving static classification problems
like optical character recognition.
• Recurrent backpropagation
• In recurrent backpropagation, activations are fed forward until a fixed value
is achieved. After that, the error is computed and propagated backward.
• The main difference between the two methods is that the mapping is
immediate in static back-propagation, while it is not static in recurrent
backpropagation.
Regularization
• A universal problem in machine learning has been making an algorithm that performs equally well on training data and
any new samples or test dataset. Techniques used in machine learning that have specifically been designed to
reduce test error, often at the expense of increased training error, are collectively known as regularization.
What is Regularization?
• Regularization may be defined as any modification or change in the learning algorithm that helps reduce its error
over a test dataset (commonly known as generalization error), though not necessarily on the supplied or training dataset.
• In learning algorithms, there are many variants of regularization techniques, each of which tries to cater to different
challenges. These can be listed down straightforwardly based on the kind of challenge the technique is trying to deal
with:
 Some try to put extra constraints on the learning of an ML model, like adding restrictions on the range/type of
parameter values.
 Some add more terms in the objective or cost function, like a soft constraint on the parameter values. More often than
not, a careful selection of the right constraints and penalties in the cost function contributes to a massive boost in the
model's performance, specifically on the test dataset.
• These extra terms can also be encoded based on some prior information that closely relates to the dataset or the
problem statement.
• One of the most commonly used regularization techniques is creating ensemble models, which take into account the
collective decision of multiple models, each trained with different samples of data.
The main aim of regularization is to reduce the over-complexity of the machine
learning models and help the model learn a simpler function to promote
generalization.
L1 Parameter Regularization:

• L1 regularization is one method of regularization. It adds a penalty term to the
objective, but the result is still a gradient descent optimization problem.
• The penalized objective takes the form:

Cost = Loss + λ * Σ|wi|

• Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute
value of magnitude" of the coefficients as a penalty term to the loss function.
• Lasso shrinks the less important features' coefficients to zero, thus removing some
features altogether. This works well for feature selection in cases where we have a huge
number of features.
• The L1 regularizer basically looks for the parameter vectors that minimize the norm of
the parameter vector (the length of the vector). This is essentially the problem of how to
optimize the parameters of a single neuron, a single layer neural network in general, and
a single layer feed-forward neural network in particular.
L2 Parameter Regularization:

• The regression model that uses L2 regularization is called Ridge Regression.
• Regularization adds a penalty as model complexity increases. The
regularization parameter (lambda) penalizes all the parameters except the
intercept, so that the model generalizes the data and won't overfit. Ridge
regression adds the "squared magnitude of the coefficients" as a penalty term to
the loss function:

Cost = Loss + λ * Σwi²

Lambda is a hyperparameter.
If lambda is zero, then ridge regression is equivalent to OLS.
• Ordinary Least Square or OLS, is a stats model which also helps us in identifying
more significant features that can have a heavy influence on the output.
• But if lambda is very large, it will penalize the weights too heavily, and it will lead to
under-fitting. Important points about L2 regularization are listed below:
 Ridge regularization forces the weights to be small but does not make them
zero; it does not give a sparse solution.
 Ridge is not robust to outliers, as the squared terms blow up the error differences of
the outliers, and the regularization term tries to compensate by penalizing the weights.
 Ridge regression performs better when all the input features influence the
output and all the weights are of roughly equal size.
 L2 regularization can learn complex data patterns.
Differences, Usage and Importance:
• It is important to understand the demarcation between both these methods. In
comparison to L2 regularization, L1 regularization results in a solution that is
more sparse.
• Sparsity in this context refers to the fact that some parameters have an optimal
value of zero. The sparsity of L1 regularization is a qualitatively different behavior
from what arises with L2 regularization.
• The sparsity produced by L1 regularization has been used extensively as a
feature selection mechanism in machine learning. Feature selection is a
mechanism which inherently simplifies a machine learning problem by efficiently
choosing which subset of the available features should be used by the model.
• Lasso integrates an L1 penalty with a linear model and a least-squares cost
function. The L1 penalty causes a subset of the weights to become zero, which
suggests that the corresponding features associated with those
weights may safely be discarded.
Parameter Penalties

• In the context of deep learning, parameter penalties, also known as regularization techniques,
are methods employed to prevent overfitting and improve the generalization performance of a
model. Overfitting occurs when a model learns the training data too well, including its noise
and outliers, which can lead to poor performance on new, unseen data.
• Two common types of parameter penalties used in deep learning are:
L1 Regularization (Lasso Regularization):
• Objective Function Modification: L1 regularization adds a penalty term to the model's
objective function, which is a combination of the loss function and the regularization term.
The objective function is modified to include the sum of the absolute values of the model's
weights (parameters) multiplied by a regularization parameter (λ).
• Effect on Model: L1 regularization tends to shrink some of the weights to exactly zero,
effectively performing feature selection by encouraging sparsity in the model. This makes it
suitable for situations where only a subset of features is relevant.
• Objective Function with L1 Regularization: J(θ) = Loss(θ) + λ Σ|wi|
L2 Regularization (Ridge Regularization):
• Objective Function Modification: L2 regularization adds a penalty term to the model's
objective function, which is the sum of the squared values of the model's weights multiplied
by a regularization parameter (λ).
• Effect on Model: L2 regularization encourages smaller, more evenly distributed weights
across all features. It helps prevent any single feature from having too much influence on the
model, reducing the risk of overfitting.
• Objective Function with L2 Regularization: J(θ) = Loss(θ) + λ Σwi²

• In both cases, λ (lambda) is the regularization parameter, which controls the strength of the
regularization. A higher value of λ results in stronger regularization.
• Practical implementations often use a combination of L1 and L2 regularization, known as
Elastic Net regularization, to benefit from the strengths of both methods.
• These regularization techniques are essential tools in the deep learning practitioner's
toolbox, helping to create more robust models that generalize well to unseen data.
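As a small sketch of the two penalties described above (the loss value, weights, and λ below are made up for illustration):

```python
import numpy as np

def objective(loss, weights, lam, kind="l2"):
    """Loss plus a parameter penalty, as described above."""
    if kind == "l1":    # sum of absolute values -> encourages sparsity
        penalty = lam * np.sum(np.abs(weights))
    else:               # sum of squares -> small, evenly spread weights
        penalty = lam * np.sum(weights ** 2)
    return loss + penalty

w = np.array([0.5, -2.0, 0.0, 3.0])
print(objective(1.25, w, lam=0.1, kind="l1"))   # 1.25 + 0.1 * 5.5   = 1.80
print(objective(1.25, w, lam=0.1, kind="l2"))   # 1.25 + 0.1 * 13.25 = 2.575
```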
Data Augmentation
• Data augmentation is a technique of artificially increasing the training set by creating
modified copies of a dataset using existing data. It includes making minor changes to the
dataset or using deep learning to generate new data points.
Augmented vs. Synthetic data
• Augmented data is driven from original data with some minor changes. In the case of
image augmentation, we make geometric and color space transformations (flipping,
resizing, cropping, brightness, contrast) to increase the size and diversity of the training
set.
• Synthetic data is generated artificially without using the original dataset. It often uses
DNNs (Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate
synthetic data.

When Should You Use Data Augmentation?
• To prevent models from overfitting.
• When the initial training set is too small.
• To improve the model accuracy.
• To reduce the operational cost of labeling and cleaning the raw dataset.
https://www.datacamp.com/tutorial/complete-guide-data-augmentation
Limitations of Data Augmentation
• The biases in the original dataset persist in the augmented data.
• Quality assurance for data augmentation is expensive.
• Research and development are required to build a system with advanced
applications. For example, generating high-resolution images using GANs
can be challenging.
• Finding an effective data augmentation approach can be challenging.
Data Augmentation Techniques

Audio Data Augmentation


• Noise injection: add Gaussian or random noise to the audio dataset to improve
the model performance.
• Shifting: shift audio left (fast forward) or right with random seconds.
• Changing the speed: stretches the time series by a fixed rate.
• Changing the pitch: randomly change the pitch of the audio.

Text Data Augmentation


• Word or sentence shuffling: randomly changing the position of a word or
sentence.
• Word replacement: replace words with synonyms.
• Syntax-tree manipulation: paraphrase the sentence using the same words.
• Random word insertion: inserts words at random.
• Random word deletion: deletes words at random.
• Image Augmentation
1. Geometric transformations: randomly flip, crop, rotate, stretch, and zoom
images. You need to be careful about applying multiple transformations on the
same images, as this can reduce model performance.
2. Color space transformations: randomly change RGB color channels, contrast,
and brightness.
3. Kernel filters: randomly change the sharpness or blurring of the image.
4. Random erasing: delete some part of the initial image.
5. Mixing images: blending and mixing multiple images.
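As a hedged sketch of how several of the image techniques above could be wired together, here is a torchvision-based pipeline (assuming torchvision is installed; the parameter values are illustrative, not prescribed):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),    # geometric: flip
    transforms.RandomRotation(degrees=15),     # geometric: rotate
    transforms.RandomResizedCrop(size=224),    # geometric: crop and zoom
    transforms.ColorJitter(brightness=0.2,     # color space: brightness
                           contrast=0.2),      # and contrast
    transforms.GaussianBlur(kernel_size=3),    # kernel filter: blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),          # random erasing (on tensors)
])

# augmented = augment(pil_image)   # apply to a PIL image at load time
```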
Advanced Techniques
• Generative adversarial networks (GANs): used to generate new data points or
images. It does not require existing data to generate synthetic data.
• Neural Style Transfer: a series of convolutional layers trained to deconstruct
images and separate context and style.
Data Augmentation Applications
• Data augmentation can apply to all machine learning applications where
acquiring quality data is challenging. Furthermore, it can help improve model
robustness and performance across all fields of study.
• Healthcare
• Acquiring and labeling medical imaging datasets is time-consuming and
expensive. You also need a subject matter expert to validate the dataset before
performing data analysis. Using geometric and other transformations can help
you train robust and accurate machine-learning models.
• For example, in the case of Pneumonia Classification, you can use random
cropping, zooming, stretching, and color space transformation to improve the
model performance. However, you need to be careful about certain
augmentations, as they can have the opposite effect. For example, random
rotation and reflection along the x-axis are not recommended for X-ray
imaging datasets.
• Self-Driving Cars
• There is limited data available on self-driving cars, and companies are
using simulated environments to generate synthetic data using
reinforcement learning. It can help you train and test machine learning
applications where data security is an issue.
The possibilities of augmented data as a simulation are endless, as it can be
used to generate real-world scenarios.
• Natural Language Processing
• Text data augmentation is generally used in situations with limited quality
data, and improving the performance metric takes priority. You can apply
synonym augmentation, word embedding, character swap, and random
insertion and deletion. These techniques are also valuable for low-resource
languages.

• Researchers use text augmentation for language models in high error
recognition scenarios, sequence-to-sequence data generation, and other text tasks.
• Automatic Speech Recognition
In sound classification and speech recognition, data augmentation works wonders. It improves the model
performance even on low-resource languages.

The random noise injection, shifting, and changing the pitch can help you produce state-of-the-art speech-to-
text models. You can also use GANs to generate realistic sounds for a particular application.
Multi-task Learning

• Multi-task learning (MTL) is a model training technique where you train a
single deep neural network on multiple tasks at the same time.
• A variant of MTL, transfer learning, where a model is trained first on a
large general dataset and then tuned on data specific to a particular task,
has become the standard for both natural language processing (NLP), with
BERT, GPT-3 and Google's Multitask Unified Model, and
computer vision (CV), with models such as EfficientNet.
• Multi-Task Learning (MTL) is a type of machine learning technique where a model is trained to
perform multiple tasks simultaneously. In deep learning, MTL refers to training a neural network to
perform multiple tasks by sharing some of the network’s layers and parameters across tasks.
• Multi-Task learning is a sub-field of Machine Learning that aims to solve multiple different tasks at the same
time, by taking advantage of the similarities between different tasks. This can improve the learning efficiency
and also act as a regularizer which we will discuss in a while.
• Formally, if there are n tasks (conventional deep learning approaches aim to solve just 1 task using 1
particular model), where these n tasks or a subset of them are related to each other but not exactly identical,
Multi-Task Learning (MTL) will help in improving the learning of a particular model by using the knowledge
contained in all the n tasks.
• In MTL, the goal is to improve the generalization performance of the model by leveraging the information
shared across tasks. By sharing some of the network’s parameters, the model can learn a more efficient and
compact representation of the data, which can be beneficial when the tasks are related or have some
commonalities.
• There are different ways to implement MTL in deep learning, but the most common approach is to use a
shared feature extractor and multiple task-specific heads.
• The shared feature extractor is a part of the network that is shared across tasks and is used to extract features
from the input data. The task-specific heads are used to make predictions for each task and are typically
connected to the shared feature extractor.
• Another approach is to use a shared decision-making layer, where the decision-making layer is shared across
tasks, and the task-specific layers are connected to the shared decision-making layer.
• MTL can be useful in many applications such as natural language processing, computer
vision, and healthcare, where multiple tasks are related or have some commonalities. It is also
useful when data is limited: MTL can help to improve the generalization performance of
the model by leveraging the information shared across tasks.
• Intuition behind Multi-Task Learning (MTL): By using Deep learning models, we usually
aim to learn a good representation of the features or attributes of the input data to predict a
specific value. Formally, we aim to optimize for a particular function by training a model and
fine-tuning the hyperparameters till the performance can’t be increased further. By using
MTL, it might be possible to increase performance even further by forcing the model to learn
a more generalized representation as it learns (updates its weights) not just for one specific
task but a bunch of tasks. Biologically, humans learn in the same way. We learn better if we
learn multiple related tasks instead of focusing on one specific task for a long time.
• MTL as a regularizer: In Machine Learning, MTL can also be looked at as a way of inducing
bias. It is a form of inductive transfer, using multiple tasks induces a bias that prefers
hypotheses that can explain all the n tasks. MTL acts as a regularizer by introducing inductive
bias as stated above. It significantly reduces the risk of overfitting and also reduces the
model’s ability to accommodate random noise during training. Now, let’s discuss the major
and prevalent techniques to use MTL.
• Hard Parameter Sharing – A common hidden layer is used for all tasks but several
task specific layers are kept intact towards the end of the model. This technique is
very useful as by learning a representation for various tasks by a common hidden
layer, we reduce the risk of overfitting.
• Soft Parameter Sharing – Each model has their own sets of
weights and biases and the distance between these
parameters in different models is regularized so that the
parameters become similar and can represent all the tasks.
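A minimal PyTorch sketch of hard parameter sharing, with a hypothetical shared trunk and two invented task heads (the sizes and tasks are for illustration only):

```python
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Hard parameter sharing: one common trunk, one head per task."""

    def __init__(self, in_features=32, hidden=64):
        super().__init__()
        # Shared hidden layers used by every task.
        self.shared = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific layers kept intact towards the end of the model.
        self.head_a = nn.Linear(hidden, 10)   # e.g. a classification task
        self.head_b = nn.Linear(hidden, 1)    # e.g. a regression task

    def forward(self, x):
        z = self.shared(x)                      # one shared representation
        return self.head_a(z), self.head_b(z)   # two task-specific outputs
```

During training, each task's loss is computed from its own head, and the (possibly weighted) sum of the losses is backpropagated through the shared trunk, which is what pushes the common layers toward a representation useful for all tasks.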
Applications: MTL techniques have found various uses, some of
the major applications are-
• Object detection and Facial recognition
• Self Driving Cars: Pedestrians, stop signs and other obstacles
can be detected together
• Multi-domain collaborative filtering for web applications
• Stock Prediction
• Language Modelling and other NLP applications
How does deep learning work in a nutshell?

• To start off, let’s talk about how deep learning works at a very high level. Imagine there’s a new restaurant
in town and you’re wondering if it’s a place that you might like. Now, you’ve never been there yourself, but
you have three friends who have. You ask all their opinions on the place, and one says it’s okay, the other
says it’s not worth it and the third says that it’s the best food ever. You go to the restaurant and it’s terrible,
so terrible that you get food poisoning. This is definitely going to affect your restaurant decision-making for
the future.

Well, next time you're deciding whether to check out a new
restaurant, you're likely going to trust the input of the friend
who told you to stay away a little more and trust the friend
who insisted the food was divine a whole lot less. If you were
to repeat this paradigm over and over for many, many
restaurants, eventually you’d learn exactly how much you
should trust each friend. In other words, over several
iterations, you’d know the optimal weight to put on each
friend’s input such that you could predict with a good degree
of certainty whether you’d like a restaurant just from what
they’ve told you. How does this have anything to do with
MTL or deep learning in the slightest?
This is actually a very simple, but useful analogy for how deep learning works, without
needing all the intimidating language like backpropagation, gradients and neurons.
Putting it into a little more algorithm-y but still friendly format:

• First, we need observations in the world, i.e., input data. In this case, the data are visits to
the restaurant, but it could be images, texts, tabular data, whatever. Each data point needs
a label on it, e.g., this restaurant was good, this image has a dog in it, so that the model
can eventually learn to associate input data with its proper label.
• Then, a series of intermediaries examine the data and make decisions on it using their
own weights for different features of the input. In this example, these are your
restauranteur friends who visit a restaurant and make their own decisions.
• Next, those intermediary decisions are evaluated by a final decision function and a
prediction is made. In this example, you are the final decider, and you make your
decision by how much you trust each of your friends’ recommendations to try out a new
place to eat.
• Finally, we learn. After making your prediction, you change the network based on how
close your prediction was to the true label for the data. In this example, this is trusting
your friends less who encouraged you to try a place where you got food poisoning.
• This general framework describes deep learning, albeit very broadly: the hidden layers
learn interesting patterns in the input (your friends), while the final layer (or final layers)
learn how to use those interesting patterns to solve a particular problem the network has
been trained on (your trust of each of your friends).
Bagging

Ensemble Modeling is a technique that combines multiple machine learning models to improve overall
predictive performance. The basic idea is that a group of weak learners can come together to form one
strong learner.
An ensemble model typically consists of two steps:
1. Multiple machine learning models are trained independently.
2. Their predictions are aggregated in some way, such as by voting, averaging, or weighting. This ensemble
is then used to make the overall prediction.

• Ensembles tend to yield better results because the different models complement each other and overcome
their individual weaknesses. They also reduce variance and prevent overfitting.
• Some popular ensemble methods are bagging, boosting, and stacking. Ensemble learning is used
extensively across machine learning tasks like classification, regression, and clustering to enhance
accuracy and robustness.
What is Bagging?

• Bagging (bootstrap aggregating) is an ensemble method that involves training multiple models
independently on random subsets of the data, and aggregating their predictions through voting
or averaging.

• In detail, each model is trained on a random subset of the data sampled with replacement,
meaning that the individual data points can be chosen more than once. This random subset is
known as a bootstrap sample. By training models on different bootstraps, bagging reduces the
variance of the individual models. It also avoids overfitting by exposing the constituent
models to different parts of the dataset.
• The predictions from all the sampled models are then combined through a simple averaging to
make the overall prediction. This way, the aggregated model incorporates the strengths of the
individual ones and cancels out their errors.
• Bagging is particularly effective in reducing variance and
overfitting, making the model more robust and accurate,
especially in cases where the individual models are prone to
high variability.
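A minimal sketch with scikit-learn (assuming it is available; the dataset and hyperparameters are illustrative): many decision trees, each trained on its own bootstrap sample, vote on the final prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagger = BaggingClassifier(
    DecisionTreeClassifier(),   # an unstable, high-variance base learner
    n_estimators=25,            # number of bootstrap-trained models
    bootstrap=True,             # sample the data with replacement
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:5]))    # aggregated (majority-vote) predictions
```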
Bagging Vs. Boosting

• Boosting is another popular ensemble method that is often compared to bagging.
The main difference lies in how the constituent models are trained.
• In bagging, models are trained independently in parallel on different random
subsets of the data. Whereas in boosting, models are trained sequentially, with each
model learning from the errors of the previous one. Additionally, bagging typically
involves simple averaging of models, while boosting assigns weights based on
accuracy.
• Bagging reduces variance while boosting reduces bias. Bagging can be
used with unstable models like decision trees while boosting works better
for stable models like linear regression.
• Both methods have their strengths and weaknesses. Bagging is simpler to
run parallelly while boosting can be more powerful and accurate. In
practice, it helps to test both on a new problem to see which performs
better.
Advantages of Bagging
• Here are some key advantages of bagging:
• Reduces overfitting: It can reduce the chance of an overfit model, resulting in improved model accuracy
on unseen data.
• Decreases model variance: Multiple models trained on different subsets of data average out their
predictions, leading to lower variance than a single model.
• Improves stability: Changes in the training dataset have less impact on bagged models, making the
overall model more stable.
• Handles high variability: Especially effective for algorithms like decision trees, which tend to have high
variance.
• Parallelizable computation: Each model in the ensemble can be trained independently, allowing for
parallel processing and efficient use of computational resources.
• Easy to understand and implement: The concept behind bagging is straightforward and can be
implemented without complex modifications to the learning algorithm.
• Good with noisy data: The averaging process helps in reducing the noise in the final prediction.
• Handles imbalanced data: Bagging can help in scenarios where the dataset is imbalanced, improving the
performance of the model in such situations.
Dropout

• The term "dropout" refers to dropping out nodes (in the input and hidden layers) of a
neural network. All the forward and backward connections
of a dropped node are temporarily removed, thus creating a new network
architecture out of the parent network. Nodes are dropped with a dropout
probability of p.
• Let's try to understand this with a given input x: {1, 2, 3, 4, 5} to a fully connected
layer. Suppose we have a dropout layer with probability p = 0.2 (or keep probability = 0.8).
During the forward propagation (training) from the input x, 20% of the nodes
would be dropped, i.e. x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so
on. The same applies to the hidden layers.
• For instance, if a hidden layer has 1000 neurons (nodes) and dropout is
applied with drop probability = 0.5, then 500 neurons would be randomly dropped
in every iteration (batch).
• Generally, for the input layer, the keep probability, i.e. 1 - drop probability, is
kept closer to 1, with 0.8 suggested as best by the authors. For the hidden layers,
the greater the drop probability, the sparser the model; a drop probability of 0.5, i.e.
dropping 50% of the nodes, is reported as the most optimized choice.
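A sketch of a dropout layer in NumPy, using the common "inverted dropout" variant, which rescales the surviving activations by the keep probability so that no rescaling is needed at test time:

```python
import numpy as np

def dropout(x, p=0.2, training=True):
    """Zero each unit with probability p during training and rescale
    the survivors so the expected activation stays unchanged."""
    if not training:
        return x                  # no dropout at test time
    keep = 1.0 - p
    mask = np.random.rand(*x.shape) < keep
    return x * mask / keep

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(dropout(x, p=0.2))    # roughly 20% of entries zeroed, rest scaled
```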
How does it solve the Overfitting problem?
• In the overfitting problem, the model learns the statistical noise. To be
precise, the main motive of training is to decrease the loss function,
given all the units (neurons). So in overfitting, a unit may change in a
way that fixes up the mistakes of the other units. This leads to complex
co-adaptations, which in turn leads to the overfitting problem because
this complex co-adaptation fails to generalise on the unseen dataset.
• Now, if we use dropout, it prevents these units from fixing up the mistakes of
other units, thus preventing co-adaptation, because in every iteration the
presence of a unit is highly unreliable. By randomly dropping a few
units (nodes), dropout forces the layers to take more or less responsibility for
the input by taking a probabilistic approach.
• This ensures that the model is getting generalised and hence reducing
the overfitting problem.
• From Figure 2, we can easily make out that the hidden layer with dropout learns
more of the generalised features than the co-adaptations seen in the layer without dropout.
It is quite apparent that dropout breaks such inter-unit relations and focuses more on
generalisation.
Adversarial Training and Optimization.

• Adversarial training is a machine learning technique that involves training models to be robust against adversarial
examples. Adversarial examples are intentionally designed inputs created to mislead the model into making
incorrect predictions. For instance, in computer vision tasks, an adversarial example could be an image that
has been manipulated in a way that is barely noticeable to the human eye but leads to misclassification by
the model.

• It is based on the idea that models trained on adversarial examples are more robust to real-world
variations and distortions in the data, because training exposes the model to a wide range of such
variations and distortions.
Examples of Adversarial Learning

• Adversarial Image Examples
An attacker can manipulate an image classification model to misclassify an image by adding carefully
crafted perturbations to it. For example, adding imperceptible noise to an image of a
panda can cause a model to classify it as a different animal, like a gibbon.
• Adversarial Text Examples
An attacker can fool natural language processing models by making subtle modifications to a
piece of text. For instance, changing a few words in a spam email can bypass email filters and
make it appear legitimate.
• Adversarial Malware Examples
Attackers can generate malicious code that evades detection by antivirus software. By altering
the structure or content of the code, they can create malware that is difficult to identify and
block.
• Adversarial Attacks on Reinforcement Learning
In reinforcement learning, attackers can manipulate the reward signals or input observations to
mislead the learning process. This can lead to unexpected or undesirable behaviors in
autonomous systems such as autonomous vehicles or game-playing agents.
Adversarial Attacks on Machine Learning Models
Adversarial attacks on machine learning models can be classified into two categories: white-box
attacks and black-box attacks.
• White-Box Attacks
• In a white-box attack, the attacker has complete knowledge of the targeted machine
learning model, including its architecture, parameters, and training data. They can
directly access and analyze the model to craft adversarial examples. With this
information, the attacker can exploit vulnerabilities in the model and generate specific
inputs that deceive the model’s predictions. White-box attacks are typically more
powerful because of the extensive knowledge available to the attacker.
• Black-Box Attacks
• In a black-box attack, the attacker has limited or no knowledge of the targeted model’s
internal details. They can only query the model with inputs and observe the
corresponding outputs. The attacker aims to generate adversarial examples without
having access to the model’s parameters or training data. Black-box attacks often involve
techniques such as transferability, where adversarial examples crafted for one model are
transferred to a different but similar model. The attacker leverages the observed
behavior of the target model to generate inputs that can fool it.
Types of Adversarial Attacks
There are several types of adversarial attacks that can be launched against machine learning
models. Here are some common types:
• Evasion Attacks: These attacks aim to manipulate input data in a way that causes
misclassification or alters the model’s output. Examples include the Fast Gradient Sign Method
(FGSM) and Iterative FGSM (I-FGSM).
• Poisoning Attacks: In poisoning attacks, an attacker introduces malicious data into the training
set to manipulate the model’s behavior. This can be done by injecting specially crafted samples
or by modifying existing training data.
• Model Inversion Attacks: Model inversion attacks attempt to reconstruct sensitive information
about the training data or inputs by exploiting the model’s output. These attacks can be used to
extract private information or reveal confidential data.
• Membership Inference Attacks: Membership inference attacks determine whether a specific
sample was part of the training data used by a model. By exploiting the model’s output
probabilities, an attacker can infer the membership status of a given sample.
• Model Extraction Attacks: In model extraction attacks, an adversary attempts to obtain a copy
or approximation of the target model by querying it and generating a substitute model. This can
be used to steal proprietary models or proprietary information embedded within the model.
Popular Adversarial Attack Methods
Why is Adversarial Learning Important for Improving Model Robustness?
• Adversarial learning is important for improving model robustness because it helps the model to
generalize better. This is because the model is exposed to a wide range of variations
and distortions during training, which makes it more robust to unseen data.
Additionally, adversarial learning helps the model to identify and adapt to the
structure of the data, which is essential for robustness.
• It is also essential because it helps to detect weaknesses in the model and provides
insights into how the model can be improved. For example, if a specific type of
adversarial example consistently causes a model to misclassify, it may indicate that the model is not
robust to that type of variation. This information can be used to improve the
model’s robustness.
How to Incorporate Adversarial Learning into a Machine Learning Model?
Incorporating adversarial learning into a machine learning model requires two steps: generating adversarial examples and
incorporating these examples into the training process.
• Generating Adversarial Examples
• Adversarial examples can be generated using many methods, including gradient-based methods, genetic algorithms, and reinforcement learning.
Gradient-based methods are the most commonly used. They involve computing the gradient of the loss function with respect to
the input and then modifying the input in the direction that increases the loss (see the FGSM sketch after this list).
• Incorporating Adversarial Examples into the Training Process
• Adversarial examples can be incorporated into the training process in two ways: adversarial training and adversarial augmentation. Adversarial training involves using
adversarial examples during training to update the model parameters. In contrast, adversarial augmentation involves adding
adversarial examples to the training data to improve the robustness of the model.
• Adversarial augmentation is a simple and effective approach widely used in practice. The idea is to add adversarial examples to the
training data and then train the model on the augmented data. The model is trained to predict the correct class label for both
the original and adversarial examples, making it more robust to variations and distortions in the data.
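As a sketch of a gradient-based method, the following implements the Fast Gradient Sign Method (FGSM) in PyTorch; model is assumed to be a trained classifier taking inputs in [0, 1], and the epsilon value is illustrative:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    # FGSM: x_adv = x + epsilon * sign(grad_x loss), i.e. a small step in the
    # direction that increases the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep inputs in a valid range

# Adversarial training step (sketch): also train on the perturbed inputs.
# x_adv = fgsm_attack(model, x_batch, y_batch)
# loss = F.cross_entropy(model(x_adv), y_batch)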
• Adversarial learning has been applied to various machine learning tasks, including computer vision, speech
recognition, and natural language processing.
• In computer vision, it has been used to improve the robustness of image classification
models. For example, it has been used to improve the robustness of convolutional neural
networks (CNNs), leading to improved accuracy on unseen data.
• In speech recognition, adversarial learning has improved the robustness of automatic
speech recognition (ASR) systems. Adversarial examples in this domain are designed to
alter the input speech signal in a way that is imperceptible to humans but leads to
incorrect transcriptions by the ASR system. Adversarial training has been shown to
improve the robustness of ASR systems to these types of adversarial examples, resulting in
improved accuracy and reliability.
• In natural language processing, adversarial learning has been used to improve the
robustness of sentiment analysis models. Adversarial examples in the
NLP domain are designed to manipulate the input text in a way that leads to
incorrect predictions by the model. Adversarial training has been shown to improve the
robustness of the sentiment analysis models to these types of adversarial examples,
resulting in improved accuracy and robustness.
Optimizers in Deep Learning

• Deep learning, a subset of machine learning, tackles intricate tasks like speech
recognition and text classification. Comprising components such as activation
functions, input, output, hidden layers, and loss functions, deep learning models
aim to generalize data and make predictions on unseen data. To optimize these
models, various algorithms, known as optimizers, are employed.
• Optimizers adjust model parameters iteratively during training to minimize a loss
function, enabling neural networks to learn from data.
• Common optimizers include Stochastic Gradient Descent (SGD), Adam, and
RMSprop, each employing specific update rules, learning rates, and momentum for
refining model parameters.
• Optimizers play a pivotal role in enhancing accuracy and speeding up the
training process, shaping the overall performance of deep learning models.
Gradient Descent
• Gradient descent is a simple optimization algorithm that updates the
model's parameters to minimize the loss function. We can write the basic
form of the algorithm as follows:
θ=θ−α⋅∇θL(θ)
• where θ is the model parameter, L(θ) is the loss function, and α is the
learning rate.
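As a small worked example, here is gradient descent on the illustrative loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning rate and iteration count are illustrative:

def grad(theta):
    return 2.0 * (theta - 3.0)  # gradient of L(theta) = (theta - 3)^2

theta, alpha = 0.0, 0.1  # initial parameter and learning rate
for _ in range(100):
    theta = theta - alpha * grad(theta)  # theta = theta - alpha * grad L(theta)
print(theta)  # approaches the minimum at theta = 3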
Pros:
• Simple to implement.
• Can work well with a well-tuned learning rate.
Cons:
• It can converge slowly, especially for complex models or large datasets.
• Sensitive to the choice of learning rate.
Stochastic Gradient Descent
• Stochastic gradient descent (SGD) is a variant of gradient descent that involves
updating the parameters based on a single, randomly selected training example at
each step, rather than the full dataset. We can write the basic form of
the algorithm as follows:
θ = θ − α·∇θL(θ; x(i), y(i))
• where (x(i), y(i)) is a randomly selected training example.

Pros:
• It can be faster than standard gradient descent, especially for large datasets.
• Can escape local minima more easily.
Cons:
• It can be noisy, leading to less stability.
• It may require more hyperparameter tuning to get good performance.
• Stochastic Gradient Descent with Momentum
SGD with momentum is a variant of SGD that adds a "momentum" term to the update
rule, which helps the optimizer to continue moving in the same direction even if the local
gradient is small. The momentum term is typically set to a value between 0 and 1. We
can write the update rule as follows:
v = β·v + (1−β)·∇θL(θ; x(i), y(i))
θ = θ − α·v
• where v is the momentum vector and β is the momentum hyperparameter.
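A minimal NumPy sketch of one momentum update, following the equations above; the gradient and hyperparameter values are illustrative:

import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    v = beta * v + (1.0 - beta) * grad  # decaying average of gradients
    theta = theta - alpha * v           # move along the smoothed direction
    return theta, v

theta, v = np.zeros(3), np.zeros(3)  # momentum vector starts at zero
grad = 2.0 * (theta - 1.0)           # illustrative gradient for one mini-batch
theta, v = sgd_momentum_step(theta, v, grad)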
Pros:
• It can help the optimizer to move more efficiently through "flat" regions of the loss
function.
• It can help to reduce oscillations and improve convergence.
Cons:
• Can overshoot good solutions and settle for suboptimal ones if the momentum is too
high.
• Requires tuning of the momentum hyperparameter.
Mini-Batch Gradient Descent
• Mini-batch gradient descent is similar to SGD, but instead of using a single
sample to compute the gradient, it uses a small, fixed-size "mini-batch"
of samples. The update rule is the same as for SGD, except that the
gradient is averaged over the mini-batch. This can reduce noise in the
updates and improve convergence.
• Pros:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
• Can reduce noise in updates, leading to more stable convergence.
• Cons:
• Can be sensitive to the choice of mini-batch size.
Adagrad
• Adagrad is an optimization algorithm that uses an adaptive learning rate per parameter. The learning
rate is updated based on the historical gradient information so that parameters that receive many updates
have a lower learning rate, and parameters that receive fewer updates have a larger learning rate. The
update rule can be written as follows:
G = G + (∇θL(θ))²
θ = θ − (α / √(G + ϵ)) · ∇θL(θ)
• Where G is a matrix that accumulates the squares of the gradients, and ϵ is a small constant added to avoid
division by zero.

Pros:
• It can work well with sparse data.
• Automatically adjusts learning rates based on parameter updates.
Cons:
• Can converge too slowly for some problems.
• Can stop learning altogether if the learning rates become too small.
• RMSProp
RMSProp is an optimization algorithm similar to Adagrad, but it uses an exponentially decaying average of the
squares of the gradients rather than the sum. This helps to reduce the monotonic learning rate decay of Adagrad
and improve convergence. We can write the update rule as follows:
G = β·G + (1−β)·(∇θL(θ))²
θ = θ − (α / √(G + ϵ)) · ∇θL(θ)
Where G is a matrix that accumulates an exponentially decaying average of the squares of the gradients,
ϵ is a small constant added to avoid division by zero, and β is a decay rate hyperparameter.
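A minimal NumPy sketch of the RMSProp update following the equations above; the hyperparameter values are illustrative:

import numpy as np

def rmsprop_step(theta, grad, G, alpha=0.001, beta=0.9, eps=1e-8):
    G = beta * G + (1.0 - beta) * grad ** 2          # decaying average of squared grads
    theta = theta - alpha * grad / np.sqrt(G + eps)  # per-parameter learning rates
    return theta, G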
• Pros:
• It can work well with sparse data.
• Automatically adjusts learning rates based on parameter updates.
• Can converge faster than Adagrad.
• Cons:
• It can still converge too slowly for some problems.
• Requires tuning of the decay rate hyperparameter.
• AdaDelta
• AdaDelta is an optimization algorithm similar to RMSProp but does not
require a learning rate hyperparameter. Instead, it uses exponentially
decaying averages of the squared gradients and the squared updates to
determine the scale of the updates. We can write the update rule as follows:
G = β·G + (1−β)·(∇θL(θ))²
Δθ = −(√(S + ϵ) / √(G + ϵ)) · ∇θL(θ)
S = β·S + (1−β)·(Δθ)²
θ = θ + Δθ
Where G and S are matrices that accumulate the squares of the gradients and the squares of the updates,
respectively, and ϵ is a small constant added to avoid division by zero.
Pros:
•Can work well with sparse data.
•Automatically adjusts learning rates based on parameter updates.
Cons:
•Can converge too slowly for some problems.
•Can stop learning altogether if the learning rates become too small.
• Adam
• Adam (short for "adaptive moment estimation") is an optimization
algorithm that combines the ideas of SGD with momentum and RMSProp.
It uses exponentially decaying averages of the gradients and the
squares of the gradients to determine the scale of the updates, similar to
RMSProp. It also uses a momentum term to help the optimizer move more
efficiently through the loss function. The update rule can be written as
follows:
m = β1·m + (1−β1)·∇θL(θ)
v = β2·v + (1−β2)·(∇θL(θ))²
m̂ = m / (1−β1^t),  v̂ = v / (1−β2^t)
θ = θ − α·m̂ / (√v̂ + ϵ)
Where m and v are the momentum and velocity vectors, respectively, m̂ and v̂ are their bias-corrected
estimates at iteration t, β1 and β2 are decay rates for the momentum and velocity, and ϵ is a small
constant added to avoid division by zero.
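A minimal NumPy sketch of the Adam update following the equations above, with the commonly used default hyperparameter values; grad is assumed to be the current gradient and t the 1-based iteration count:

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment (velocity)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v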
Pros:
• Can converge faster than other optimization algorithms.
• Can work well with noisy data.
Cons:
• It may require more tuning of hyperparameters than other algorithms.
• Other optimizers may perform better on some types of problems.
How Do Optimizers Work in Deep Learning?
• Optimizers in deep learning adjust the model's parameters to minimize the
loss function. The loss function measures how well the model can make
predictions on a given dataset, and the goal of training a model is to find
the set of model parameters that yields the lowest possible loss.
• The optimizer uses an optimization algorithm to search for the parameters
that minimize the loss function. The optimization algorithm uses the
gradients of the loss function with respect to the model parameters to determine the
direction in which we should adjust the parameters.
• The gradients are computed using backpropagation, which involves
applying the chain rule to compute the gradients of the loss function with respect to
each of the model parameters.
• The optimization algorithm then adjusts the model parameters to
minimize the loss function. This process is repeated until the loss function
reaches a minimum or the optimizer reaches the maximum number of
allowed iterations.
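To tie this together, here is a minimal PyTorch training loop showing where the optimizer fits; the model, data, and hyperparameters are illustrative:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # illustrative one-layer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)  # illustrative batch of inputs
y = torch.randn(64, 1)   # illustrative targets

for step in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass: measure prediction error
    loss.backward()              # backpropagation: compute gradients
    optimizer.step()             # optimizer adjusts the parameters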
