DL_Unit2
• Feed-forward Networks,
• Gradient-based Learning,
• Hidden Units,
• Architecture Design,
• Computational Graphs,
• Back-Propagation,
• Regularization,
• Parameter Penalties,
• Data Augmentation,
• Multi-task Learning,
• Bagging,
• Dropout, Adversarial Training, and Optimization.
Neural networks
• The term "Artificial neural network" refers to a biologically inspired sub-field of artificial intelligence
modeled after the brain. An Artificial neural network is usually a computational network based on
biological neural networks that construct the structure of the human brain. Similar to a human brain
has neurons interconnected to each other, artificial neural networks also have neurons that are linked
to each other in various layers of the networks. These neurons are known as nodes.
Feedforward networks
• The process of passing an input through the network to produce an output, and hence a
prediction, is known as feed-forward.
• The feed-forward neural network is the core of many other important architectures,
such as the convolutional neural network.
• In a feed-forward neural network, there are no feedback loops or connections in the
network. There is simply an input layer, a hidden layer, and an output layer.
• There can be multiple hidden layers, depending on what kind of data you are
dealing with.
• The number of hidden layers is known as the depth of the neural network.
• A deeper neural network can learn a richer family of functions.
• The input layer first provides the neural network with data, and the output layer then
makes predictions on that data based on a series of functions.
• ReLU is the most commonly used activation function in deep neural networks. A
minimal forward-pass sketch follows.
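To make the flow concrete, here is a minimal forward-pass sketch in NumPy; the layer sizes and random weights are illustrative assumptions, not part of any particular architecture:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0, z)

def forward(x, W1, b1, W2, b2):
    # Hidden layer: affine transformation followed by ReLU
    h = relu(W1 @ x + b1)
    # Output layer: another affine transformation (no feedback loops anywhere)
    return W2 @ h + b2

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # 3 input features (assumed)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer with 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # single output unit
print(forward(x, W1, b1, W2, b2))
```

Information only ever flows from the input layer, through the hidden layer, to the output layer, which is exactly the "no feedback loops" property described above.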
Gradient-based Learning
Gradient Descent in Machine Learning
• Gradient Descent is known as one of the most commonly used optimization
algorithms to train machine learning models by means of minimizing errors
between actual and expected results. Further, gradient descent is also used to
train Neural Networks.
• In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters.
• The main objective of gradient descent is to minimize the cost function
through iterative parameter updates.
• Once these machine learning models are optimized, these models can be used as
powerful tools for Artificial Intelligence and various computer science applications.
What is Gradient Descent or Steepest Descent?
• The cost function is defined as the measurement of difference or error between actual
values and expected values at the current position and present in the form of a single real
number.
• It provides feedback to the model so that it can minimize error and find the local or global
minimum, thereby improving the model's efficiency. The algorithm iterates continuously
along the direction of the negative gradient until the cost function approaches its minimum.
• At this point of steepest descent, the model stops learning further. Although "cost function"
and "loss function" are often treated as synonymous, there is a minor difference between them.
• The difference concerns the scope of the error during training: a loss function refers to the
error of one training example, while a cost function calculates the average error across an
entire training set.
• The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce the
cost function.
How does Gradient Descent work?
Consider fitting a straight line:
Y = mX + c
• where 'm' represents the slope of the line, and 'c' represents the intercept
on the y-axis.
• The starting point is just an arbitrary point used to evaluate the
performance. At this starting point, we compute the first derivative (the
slope of the cost function) and use a tangent line to measure the
steepness of this slope. This slope informs the updates to the
parameters (weights and bias).
• The slope is steep at the starting point, but as new parameters are
generated, the steepness gradually reduces until it approaches the lowest
point, which is called the point of convergence.
• The main objective of gradient descent is to minimize the cost function, i.e.
the error between the expected and actual values. Minimizing the cost
function requires two factors:
• Direction & Learning Rate
These two factors determine the partial-derivative calculations of future
iterations, allowing the algorithm to reach the point of convergence (the
local or global minimum). Let's discuss the learning rate in brief:
• Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point.
This is typically a small value that is evaluated and updated based on the
behavior of the cost function. A high learning rate results in larger steps
but risks overshooting the minimum. A low learning rate takes small
steps, which compromises overall efficiency but gives the advantage of
more precision. A small worked sketch of these updates follows.
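As a worked sketch, the following code fits m and c in Y = mX + c by stepping along the negative gradient of a mean-squared-error cost; the toy data, iteration count, and learning rate are assumptions for illustration:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (assumed for illustration)
X = np.linspace(0, 1, 50)
Y = 2 * X + 1

m, c = 0.0, 0.0          # arbitrary starting point
alpha = 0.1              # learning rate (step size)

for _ in range(1000):
    error = (m * X + c) - Y
    # Partial derivatives of the MSE cost with respect to m and c
    grad_m = 2 * np.mean(error * X)
    grad_c = 2 * np.mean(error)
    # Step in the direction of the negative gradient
    m -= alpha * grad_m
    c -= alpha * grad_c

print(m, c)  # approaches m = 2, c = 1 at the point of convergence
```

Raising alpha makes each step larger (risking overshoot); lowering it makes convergence slower but more precise, as described above.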
Vanishing and Exploding Gradients
• In a deep neural network trained with gradient descent and
backpropagation, two more issues can occur, besides local minima and saddle
points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During
backpropagation, the gradient becomes progressively smaller, so the earlier layers of
the network learn more slowly than the later layers. When this happens, the weight
updates become so small that they are insignificant, and the earlier layers effectively
stop learning.
Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the
gradient is too large, creating an unstable model. In this scenario, the model weights
grow too large and may eventually be represented as NaN. This problem can be
addressed with dimensionality reduction techniques, which help to minimize complexity
within the model. A small numeric illustration of both effects follows.
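A quick numeric illustration of why both effects happen: the gradient reaching an early layer is, roughly, a product of per-layer factors, so factors below 1 shrink it exponentially while factors above 1 blow it up. The depth and factor values below are arbitrary assumptions:

```python
depth = 30  # assumed number of layers

# Each backpropagated gradient is scaled by a per-layer factor
vanishing = 0.5 ** depth   # factors < 1: gradient shrinks exponentially
exploding = 1.5 ** depth   # factors > 1: gradient grows exponentially

print(vanishing)  # ~9.3e-10: earlier layers barely learn
print(exploding)  # ~1.9e+05: updates become unstable (can overflow to NaN)
```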
Types of Layers
Each neuron in the first hidden layer computes its output as
y = f(Σᵢ wᵢ⋅xᵢ + b)
where xᵢ are the input values, wᵢ the weights, b the bias, and f() an
activation function. Then, the neurons of the second hidden layer will
take as input the outputs of the neurons of the first hidden layer, and so on.
Importance of Hidden Layers
• Hidden layers are the reason why neural networks are able to capture very complex
relationships and achieve exciting performance in many tasks.
• To better understand this concept, we should first examine a neural network
without any hidden layer like the one below that has 3 input features and 1 output:
• Based on the previous equation, the output value y is just a linear
combination of the inputs, possibly followed by a single non-linearity. Therefore,
the model is similar to a linear regression model. As we already know, linear
regression attempts to fit a linear equation to the observed data.
• In most machine learning tasks, a linear relationship is not enough to capture the
complexity of the task, and the linear regression model fails. Here comes the
importance of the hidden layers, which enable the neural network to learn very
complex non-linear functions.
Examples
• Next, we’ll discuss two examples that illustrate the importance of hidden
layers in training a neural network for a given task.
1. Logical Functions
• Let’s imagine that we want to use a neural network to predict the output of
an XOR logical gate given two binary inputs. According to the truth table
of x1 XOR x2 , the output is true whenever the inputs differ:
• To better understand our classification task, below we plot the four possible
outputs. The green points correspond to an output equal to 1 while the red
points are the zero outputs:
• At first, the problem seems pretty easy. The first approach could be to use a neural
network without any hidden layer. However, this linear architecture can separate the
input data points only with a single line. As we can see in the above graph, the two classes
cannot be separated using a single line: the XOR problem is not linearly separable.
• Below, we can see some lines that a simple linear model may learn to solve the XOR
problem. We observe that in both cases there is an input that is misclassified:
The solution to this problem is to learn a non-linear function by adding a hidden
layer with two neurons to our neural network. So, the final decision is made based on
the outputs of these two neurons that each one learns a linear function like the ones
below:
One line makes sure that at least one feature of the input is true, and the other line
makes sure that not all features are true. So, the hidden layer manages to transform
the input features into processed features that can then be correctly classified by the
output layer, as in the sketch below.
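The sketch below hard-codes one such solution: one hidden neuron computes OR (at least one feature is true), the other computes NAND (not all features are true), and the output neuron combines them with an AND. The weights are set by hand for illustration rather than learned:

```python
import numpy as np

def step(z):
    # Threshold activation: 1 if z > 0, else 0
    return (z > 0).astype(int)

def xor(x1, x2):
    x = np.array([x1, x2])
    h1 = step(np.dot([1, 1], x) - 0.5)    # OR: at least one input is true
    h2 = step(np.dot([-1, -1], x) + 1.5)  # NAND: not both inputs are true
    return step(h1 + h2 - 1.5)            # AND of the two hidden outputs

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))  # prints 0, 1, 1, 0
```

Each hidden neuron draws one of the two lines discussed above; the output neuron then classifies the transformed features, which no single line over the raw inputs could do.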
2. Images
• Another way to realize the importance of hidden layers is to look into the computer vision domain.
Deep neural networks that consist of many hidden layers have achieved impressive results in face
recognition by learning features in a hierarchical way.
• Specifically, the first hidden layers of a neural network learn to detect short pieces of edges and
corners in the image. These features are easy to detect from the raw image but are not very useful by
themselves for recognizing the identity of the person in the image. The middle hidden layers then combine
the detected edges from the previous layers to detect parts of the face, like the eyes, the nose and the ears.
Finally, the last layers combine the detectors of the nose, eyes, etc. to detect the overall face of the
person.
• In the image below, we can see how each layer helps us to go from the raw pixels to our final goal:
Neural Network Architecture and Operation
• A weight is assigned to each input of an artificial neuron. First, the inputs are multiplied
by their weights and summed, and then a bias is added to the outcome; this is called the
weighted sum. After that, the weighted sum is passed through an activation function,
which is a non-linear function.
• The first layer is the input layer, through which data is sent into the neural network;
the output layer is the final layer. The dataset and the type of problem determine the
number of neurons in the final layer and the first layer. The number of hidden layers,
and the number of neurons in each, is determined by trial and error.
• All of the inputs from the previous layer will be connected to the first neuron of the
first hidden layer. The second neuron in the first hidden layer will likewise be connected to
all of the preceding layer’s inputs, and so forth for all of the first hidden layer’s neurons. The
outputs of the previous hidden layer are regarded as inputs for the neurons in the second
hidden layer, and each of these neurons is connected to all of the preceding neurons.
What is a Feed-Forward Neural Network and how does it work?
• In its most basic form, a Feed-Forward Neural Network is a single layer perceptron. A
sequence of inputs enter the layer and are multiplied by the weights in this model. The
weighted input values are then summed together to form a total.
• If the sum of the values is more than a predetermined threshold, which is normally set at zero,
the output value is usually 1, and if the sum is less than the threshold, the output value is
usually -1.
• The single-layer perceptron is a popular feed-forward neural network model that is frequently
used for classification. Single-layer perceptrons can also incorporate machine learning
capabilities, as described next.
• The neural network can compare the outputs of its nodes with the desired values
using a property known as the delta rule, allowing the network to alter its weights
through training to create more accurate output values.
• This training procedure is a form of gradient descent. The technique for
updating weights in multi-layered perceptrons is virtually the same; however, the
process is referred to as back-propagation. In such circumstances, the output values
produced by the final layer are used to adjust the weights of each hidden layer inside
the network. A minimal sketch of the delta rule follows.
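A minimal sketch of such a single-layer perceptron trained with the delta rule, using an assumed AND-gate dataset with targets in {-1, +1} as described above:

```python
import numpy as np

# Assumed toy dataset: the AND gate, with targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])

w = np.zeros(2)
b = 0.0
alpha = 0.1  # learning rate

for _ in range(20):
    for x, target in zip(X, t):
        # Weighted sum thresholded at zero -> output +1 or -1
        y = 1 if np.dot(w, x) + b > 0 else -1
        # Delta rule: adjust weights in proportion to the error
        w += alpha * (target - y) * x
        b += alpha * (target - y)

print(w, b)  # weights and bias that classify the AND gate correctly
```

When the output matches the target, (target - y) is zero and nothing changes; when it is wrong, the weights move toward the desired output, which is the comparison with desired values described above.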
Hidden Units
• In a deep feedforward neural network, hidden units refer to the nodes or neurons in layers between the
input and output layers. These hidden units are responsible for learning and representing complex patterns
and relationships in the input data. The term "hidden" comes from the fact that the values in these units are
not directly observed in the input or output data but play a crucial role in the network's ability to generalize
and make predictions.
• Each hidden unit in a deep feedforward network applies a linear transformation to its inputs, followed by a
non-linear activation function. The non-linear activation function is crucial for the network's ability to
learn complex mappings. Common activation functions include sigmoid, hyperbolic tangent (tanh), and
rectified linear unit (ReLU).
• The deep feedforward network learns by adjusting the weights and biases associated with the connections
between nodes during the training process. This is typically done using optimization algorithms like
gradient descent and backpropagation.
• In summary, hidden units in a deep feedforward network are the neurons in the layers between the input
and output layers, responsible for capturing and representing the underlying patterns in the input data.
• A hidden unit takes in a vector/tensor, computes an affine transformation z, and then applies an element-wise
non-linear function g(z), where:
z = Wᵀx + b
The way hidden units are differentiated from each other is based on their activation function, g(z)
(several of these are sketched in code after the list):
• ReLU
• ELU
• GELU
• Maxout
• PReLU
• Absolute value rectification
• LeakyReLU
• Logistic Sigmoid
• Hyperbolic Tangent
• Hard Hyperbolic Tangent
• Identity
• Softplus
• Softmax
• RBF
• etc
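A few of these activation functions written out in NumPy, as a sketch (each applied element-wise to z):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))  # smooth approximation of ReLU

def softmax(z):
    # Subtract the max for numerical stability; normalizes to probabilities
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), np.tanh(z), softmax(z))
```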
Computational Graphs
• In both of these notations we can compute the answer to each function separately,
provided we do so in the correct order. Before we can know the answer to f(x, y, z),
we first need the answer to g(x, y), and then h(g(x, y), z). With the mathematical
notation we resolve these dependencies by computing the deepest parenthetical
first; in a computational graph we have to wait until all the edges pointing into a
node have a value before computing the output value for that node. Let’s look at the
example of computing f(1, 2, 3): first g(1, 2) = 3, then h(3, 3) = 9.
• And in the graph, we use the output from each node as the weight of the
corresponding edge:
• Either way, graph or function notation, we get the same answer because these are just two
ways of expressing the same thing.
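A tiny sketch of this dependency resolution in code, for the function f(x, y, z) = (x + y) * z discussed below: each node fires only once all of its incoming edges have values:

```python
# f(x, y, z) = (x + y) * z, written as a computational graph:
# the node h cannot fire until its incoming edge from g has a value.
def g(x, y):
    return x + y          # inner node: computed first

def h(g_out, z):
    return g_out * z      # outer node: waits for g's output edge

def f(x, y, z):
    return h(g(x, y), z)  # resolve dependencies in the correct order

print(f(1, 2, 3))  # g(1, 2) = 3, then h(3, 3) = 9
```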
• In this simple example it might be hard to see the advantage of using a computational graph
over function notation. After all, there isn’t anything terribly hard to understand about the
function f(x, y, z) = (x + y) * z. The advantages become more apparent when we reach the
scale of neural networks.
• Even relatively “simple” deep neural networks have hundreds of thousands of nodes and
edges; it’s quite common for a neural network to have more than one million edges. Try to
imagine the function expression for such a computational graph… can you do it? How much
paper would you need to write it all down? This issue of scale is one of the reasons
computational graphs are used.
• Let’s look at one concrete example: suppose we’re building a deep neural network for
predicting if someone is single or in some sort of relationship; a binary predictor. Furthermore,
assume that we’ve gathered a dataset that tells us four things about a person: their age, gender,
what city they live in, and if they are single or in some sort of relationship.
• When we say we want to “build a neural network” to make this prediction — we’re really
saying that we want to find a mathematical function of the form:
• Where the output value is 0 if that person is in a relationship, and 1 if that person is
not in a relationship.
• We are making a huge (and wrong) assumption here that age, gender, and city tell
us everything we need to know about whether or not someone is in a relationship.
But that’s okay — all models are wrong and we can use statistics to find out if this
one is useful or not. Don’t focus on how much this toy model oversimplifies human
relationships, focus on what this means for the neural network we want to build.
• As an aside, but before we move on: encoding the value for “city” can be tricky.
It’s not at all clear what the numerical value of “Berkeley” or “Salt Lake City”
should be in our mathematical function. One-hot encoding is a popular tactic.
• In fact, a one-hot encoded vector would be used as the output layer for this network
as well.
Neural Networks as Computational Graphs
• When we define the architecture of a neural network we’re laying out the series of sub-
functions and specifying how they should be composed. When we train the neural network
we’re experimenting with the parameters of these sub-functions. Consider this function as an
example:
f(x, y) = ax² + bxy + cy²;
where a, b, and c are scalars
• The component sub-functions of this function are all of the operators: two squares, two
additions, and 4 multiplications.
• The tunable parameters of this function are a, b, and c, in neural network parlance these
are called weights.
• The inputs to the function are x and y — we can’t tune those values in machine learning
because they are the values from the dataset.
• By changing the values of our weights (a, b, and c) we can dramatically impact the output of
the function. On the other hand, regardless of the values of a, b, and c there will always be
an x², a y² and an xy term — so our function has a limited range of possible configurations.
• Here is a computational graph representing this function:
• This isn’t technically a neural network, but it’s very close in all the ways that count.
It’s a graph that represents a function; we could use it to predict some kinds of
trends; and we could train it using gradient descent and backpropagation if we had
a dataset that mapped two inputs to an output. This particular computational graph
will be good at modeling some quadratic trends involving exactly 2 variables, but
bad at modeling anything else.
• In this example, training the network would amount to changing the weights until
we find some combination of a, b, and c that causes the function to work well as a
predictor for our dataset. If you’re familiar with linear regression, this should feel
similar to tuning the weights of the linear expression.
• This graph is still quite simple compared to even the simplest neural networks that
are used in practice, but the main idea — that a, b, and c can be adjusted to improve
the model’s performance — remains the same.
• The reason this neural network would not be used in practice is that it isn’t very
flexible. This function only has 3 parameters to tune: a, b, and c. Making matters
worse, we’ve only given ourselves room for 2 features per input (x and y).
• Fortunately, we can easily solve this problem by using more complex functions and
allowing for more complex input.
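As a sketch of what "training" means for this graph, the code below adjusts the weights a, b, and c by gradient descent until f fits a dataset; the data are generated from assumed true values a = 1, b = 2, c = 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
target = 1 * x**2 + 2 * x * y + 3 * y**2   # assumed ground truth

a = b = c = 0.0   # arbitrary starting weights
alpha = 0.05      # learning rate

for _ in range(2000):
    pred = a * x**2 + b * x * y + c * y**2
    err = pred - target
    # Gradients of the mean-squared error with respect to each weight
    a -= alpha * 2 * np.mean(err * x**2)
    b -= alpha * 2 * np.mean(err * x * y)
    c -= alpha * 2 * np.mean(err * y**2)

print(a, b, c)  # converges toward 1, 2, 3
```

Only a, b, and c change during training; the x², xy, and y² structure is fixed by the graph, which is exactly the "limited range of possible configurations" noted above.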
Recall two facts about deep neural networks:
DNNs are a special kind of graph, a “computational graph”.
DNNs are made up of a series of “fully connected” layers of nodes.
• “Fully connected” means that the output from each node in the first layer becomes one of
the inputs for every node in the second layer. In a computational graph the edges are the
output values of functions — so in a fully connected layer the output for each sub-function is
used as one of the inputs for each of the sub-functions in the next layer. But, what are those
functions?
• The function performed by each node in the neural net is called a transfer function (which is
also called the activation function). There are two steps in every transfer function. First, all of
the input values are combined in some way, usually this is a weighted sum. Second, a
“nonlinear” function is applied to that sum; this second function might change from layer to
layer within a single neural network.
• Popular nonlinear functions for this second step are tanh, log, max(0, x) (called Rectified
Linear Unit, or ReLU), and the sigmoid function. At the time of this writing, ReLU is the most
popular choice of nonlinearity, but things change quickly.
• If we zoom in on a neural network, we’d notice that each “node” in the network was actually
2 nodes in our computational graph:
• In this case, the transfer function is a sum followed by a sigmoid.
Typically, all the nodes in a layer have the same transfer and activation
function. Indeed it is common for all the layers in the same network to use
the same activation functions, though it is not a requirement by any means.
• The last sources of complexity in our neural network are biases and
weights. Every incoming edge has a unique weight; the output value from
the previous node is multiplied by this weight before it is given to the
transfer function. Each transfer function also has a single bias, which is
added before the nonlinearity is applied. Let’s zoom in one more
time:
• In this diagram we can see that each input to the sum is first weighted via multiplication
then it is summed. The bias is added to that sum as well, and finally the total is sent to our
nonlinear function (sigmoid in this case). These weights and biases are the parameters that
are ultimately fine-tuned during training.
• In the previous example, we didn’t have enough flexibility because we only had 3
parameters to fine tune. So just how many parameters are there in a deep neural network
for us to tune?
• If we define a neural net to predict binary classification (in/not in a relationship) with 2
hidden layers each with 512 nodes and an input vector with 20 features we will have
20*512 + 512*512 + 512*2 = 273,408 weights that we can fine tune plus 1024 biases (one
for each node in the hidden layers). This is a “simple” neural network. “Complex” neural
networks frequently have several million tunable weights and biases.
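The count above follows directly from the fully connected structure; a short check:

```python
layers = [20, 512, 512, 2]  # input features, two hidden layers, output
weights = sum(a * b for a, b in zip(layers, layers[1:]))
biases = sum(layers[1:-1])  # one bias per hidden node, as stated above
print(weights, biases)      # 273408 weights, 1024 biases
```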
• This extraordinary flexibility is what allows neural nets to find and model complex
relationships. It’s also why they require lots of data to train. Using backpropagation and
gradient descent we can purposely change the millions of weights until the output
becomes more correct, but because we’re doing calculations involving millions of variables
it takes a lot of time and a lot of data to find the right combination of weights and biases.
• While they are sometimes called a “black box”, neural networks are really
just a way of representing very complex mathematical functions. The
neural nets we build are particularly useful functions because they have so
many parameters that can be fine tuned. The result of the fine tuning is
that rich complexities between different components of the input can be
plucked out of the noise.
• Ultimately, the “architecture” of our computational graph will have a big
impact on how well our network can perform. Questions like: how many
nodes per layer, which activation functions are used at each layer, and how
many layers to use, are the subject of research and might change
dramatically from neural network to neural network. The architecture will
depend on the type of prediction being made and the kind of data being
fed into the system — just like we shouldn’t use a linear function to model
parabolic data, we shouldn’t use any neural net to solve every problem.
Back-Propagation
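Back-propagation computes the gradient of the loss with respect to every weight by applying the chain rule backwards through the computational graph, layer by layer. Below is a minimal sketch for a one-hidden-layer network with a sigmoid hidden layer and squared-error loss; the sizes, data, and learning rate are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), 1.0           # one input vector and its target
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4)

# Forward pass, keeping intermediate values for the backward pass
h = sigmoid(W1 @ x)                      # hidden activations
y = W2 @ h                               # scalar output
loss = 0.5 * (y - t) ** 2

# Backward pass: chain rule from the output back to each weight
dy = y - t                               # dL/dy
dW2 = dy * h                             # dL/dW2
dh = dy * W2                             # dL/dh
dz1 = dh * h * (1 - h)                   # through the sigmoid derivative
dW1 = np.outer(dz1, x)                   # dL/dW1

W1 -= 0.1 * dW1                          # gradient descent step
W2 -= 0.1 * dW2
```

Repeating this forward/backward pair over a dataset is exactly the gradient-descent training loop described earlier in this unit.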
L1 Regularization (Lasso):
• Lasso regression (Least Absolute Shrinkage and Selection Operator) adds the absolute
value of the magnitude of each coefficient as a penalty term to the loss function.
• Lasso shrinks the less important features' coefficients to zero, thus removing some
features altogether. This works well for feature selection when we have a huge
number of features.
• The L1 regularizer looks for parameter vectors with a small L1 norm (the sum of the
absolute values of the vector's components). This applies to optimizing the parameters of a
single neuron, a single-layer neural network in general, and a single-layer feed-forward
neural network in particular.
Parameter Penalties:
• In the context of deep learning, parameter penalties, also known as regularization techniques,
are methods employed to prevent overfitting and improve the generalization performance of a
model. Overfitting occurs when a model learns the training data too well, including its noise
and outliers, which can lead to poor performance on new, unseen data.
• Two common types of parameter penalties used in deep learning are:
L1 Regularization (Lasso Regularization):
• Objective Function Modification: L1 regularization adds a penalty term to the model's
objective function, which is a combination of the loss function and the regularization term.
The objective function is modified to include the sum of the absolute values of the model's
weights (parameters) multiplied by a regularization parameter (λ).
• Effect on Model: L1 regularization tends to shrink some of the weights to exactly zero,
effectively performing feature selection by encouraging sparsity in the model. This makes it
suitable for situations where only a subset of features is relevant.
• Objective Function with L1 Regularization:
J(w) = Loss(w) + λ ⋅ Σ|wᵢ|
L2 Regularization (Ridge Regularization):
• Objective Function Modification: L2 regularization adds a penalty term to the model's
objective function, which is the sum of the squared values of the model's weights multiplied
by a regularization parameter (λ).
• Effect on Model: L2 regularization encourages smaller, more evenly distributed weights
across all features. It helps prevent any single feature from having too much influence on the
model, reducing the risk of overfitting.
• Objective Function with L2 Regularization:
J(w) = Loss(w) + λ ⋅ Σwᵢ²
• In both cases, λ (lambda) is the regularization parameter, which controls the strength of the
regularization. A higher value of λ results in stronger regularization.
• Practical implementations often use a combination of L1 and L2 regularization, known as
Elastic Net regularization, to benefit from the strengths of both methods.
• These regularization techniques are essential tools in the deep learning practitioner's
toolbox, helping to create more robust models that generalize well to unseen data.
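As a sketch, here is how both penalties modify a base loss in code; the λ values and weights are illustrative assumptions, and the combined form corresponds to Elastic Net:

```python
import numpy as np

def l1_penalty(w, lam=0.01):
    # Lasso: lambda * sum of absolute weights -> encourages sparsity
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=0.01):
    # Ridge: lambda * sum of squared weights -> small, even weights
    return lam * np.sum(w ** 2)

def elastic_net_loss(base_loss, w, lam1=0.01, lam2=0.01):
    # Elastic Net combines both penalties
    return base_loss + l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = np.array([0.5, -1.2, 0.0, 3.0])
print(elastic_net_loss(1.0, w))
```

Increasing λ strengthens the penalty and pulls the weights harder toward zero, as stated above.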
Data Augmentation
• Data augmentation is a technique of artificially increasing the training set by creating
modified copies of a dataset using existing data. It includes making minor changes to the
dataset or using deep learning to generate new data points.
Augmented vs. Synthetic data
• Augmented data is driven from original data with some minor changes. In the case of
image augmentation, we make geometric and color space transformations (flipping,
resizing, cropping, brightness, contrast) to increase the size and diversity of the training
set.
• Synthetic data is generated artificially without using the original dataset. It often uses
DNNs (Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate
synthetic data.
• Researchers use text augmentation for language models in high-error recognition
scenarios, sequence-to-sequence data generation, and other text tasks.
• Automatic Speech Recognition
In sound classification and speech recognition, data augmentation works wonders. It improves the model
performance even on low-resource languages.
The random noise injection, shifting, and changing the pitch can help you produce state-of-the-art speech-to-
text models. You can also use GANs to generate realistic sounds for a particular application.
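A sketch of the noise-injection and time-shifting augmentations mentioned above, applied to a raw waveform array; the noise level, shift range, and stand-in signal are assumptions:

```python
import numpy as np

def add_noise(wave, noise_level=0.005):
    # Inject random Gaussian noise into the waveform
    return wave + noise_level * np.random.randn(len(wave))

def time_shift(wave, max_shift=1600):
    # Shift the signal left or right by a random number of samples
    shift = np.random.randint(-max_shift, max_shift)
    return np.roll(wave, shift)

wave = np.sin(np.linspace(0, 100, 16000))  # stand-in for 1 s of audio
augmented = time_shift(add_noise(wave))
```

Each pass over the training set can produce a differently perturbed copy, which is how a small dataset is artificially enlarged.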
Multi-task Learning
• To start off, let’s talk about how deep learning works at a very high level. Imagine there’s a new restaurant
in town and you’re wondering if it’s a place that you might like. Now, you’ve never been there yourself, but
you have three friends who have. You ask all their opinions on the place, and one says it’s okay, the other
says it’s not worth it and the third says that it’s the best food ever. You go to the restaurant and it’s terrible,
so terrible that you get food poisoning. This is definitely going to affect your restaurant decision-making for
the future.
• First, we need observations in the world, i.e., input data. In this case, the data are visits to
the restaurant, but it could be images, texts, tabular data, whatever. Each data point needs
a label on it, e.g., this restaurant was good, this image has a dog in it, so that the model
can eventually learn to associate input data with its proper label.
• Then, a series of intermediaries examine the data and make decisions on it using their
own weights for different features of the input. In this example, these are your
restaurateur friends who visit a restaurant and form their own opinions.
• Next, those intermediary decisions are evaluated by a final decision function and a
prediction is made. In this example, you are the final decider, and you make your
decision by how much you trust each of your friends’ recommendations to try out a new
place to eat.
• Finally, we learn. After making your prediction, you change the network based on how
close your prediction was to the true label for the data. In this example, this is trusting
your friends less who encouraged you to try a place where you got food poisoning.
• This general framework describes deep learning, albeit very broadly: the hidden layers
learn interesting patterns in the input (your friends), while the final layer (or final layers)
learn how to use those interesting patterns to solve a particular problem the network has
been trained on (your trust of each of your friends).
Bagging
Ensemble Modeling is a technique that combines multiple machine learning models to improve overall
predictive performance. The basic idea is that a group of weak learners can come together to form one
strong learner.
An ensemble model typically consists of two steps:
1. Multiple machine learning models are trained independently.
2. Their predictions are aggregated in some way, such as by voting, averaging, or weighting. This ensemble
is then used to make the overall prediction.
• Ensembles tend to yield better results because the different models complement each other and overcome
their individual weaknesses. They also reduce variance and prevent overfitting.
• Some popular ensemble methods are bagging, boosting, and stacking. Ensemble learning is used
extensively across machine learning tasks like classification, regression, and clustering to enhance
accuracy and robustness.
What is Bagging?
• Bagging (bootstrap aggregating) is an ensemble method that involves training multiple models
independently on random subsets of the data, and aggregating their predictions through voting
or averaging.
• In detail, each model is trained on a random subset of the data sampled with replacement,
meaning that the individual data points can be chosen more than once. This random subset is
known as a bootstrap sample. By training models on different bootstraps, bagging reduces the
variance of the individual models. It also avoids overfitting by exposing the constituent
models to different parts of the dataset.
• The predictions from all the sampled models are then combined through a simple averaging to
make the overall prediction. This way, the aggregated model incorporates the strengths of the
individual ones and cancels out their errors.
• Bagging is particularly effective in reducing variance and
overfitting, making the model more robust and accurate,
especially in cases where the individual models are prone to
high variability.
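A minimal sketch of bagging for a regression task: each base model is fit on a bootstrap sample drawn with replacement, and predictions are aggregated by averaging. Simple one-feature linear fits stand in for the base learners; the data are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
y = 3 * X + rng.normal(0, 0.5, 200)   # assumed noisy linear data

models = []
for _ in range(25):
    # Bootstrap sample: indices drawn with replacement
    idx = rng.integers(0, len(X), len(X))
    # Fit a degree-1 polynomial (slope, intercept) as a base learner
    models.append(np.polyfit(X[idx], y[idx], 1))

def bagged_predict(x):
    # Aggregate by averaging the individual models' predictions
    return np.mean([np.polyval(m, x) for m in models])

print(bagged_predict(0.5))  # close to the true value 1.5
```

Because each model sees a different bootstrap sample, their individual errors partly cancel when averaged, which is the variance reduction described above.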
Bagging Vs. Boosting
• Bagging trains its models independently and in parallel on bootstrap samples and
aggregates their predictions by voting or averaging; it mainly reduces variance.
• Boosting trains its models sequentially, with each new model concentrating on the
examples the previous models got wrong; it mainly reduces bias.
Dropout
• The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a
neural network. All the forward and backward connections of a dropped node are
temporarily removed, creating a new network architecture out of the parent network.
Each node is dropped with a dropout probability of p.
• Let’s try to understand this with a given input x = {1, 2, 3, 4, 5} to a fully connected
layer, and a dropout layer with probability p = 0.2 (keep probability = 0.8).
During forward propagation (training), 20% of the nodes would be dropped, i.e. x
could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on. The same applies to the
hidden layers.
• For instance, if a hidden layer has 1000 neurons (nodes) and dropout is
applied with drop probability = 0.5, then 500 neurons would be randomly dropped
in every iteration (batch).
• Generally, for the input layer, the keep probability (i.e. 1 − drop probability) is
closer to 1, with 0.8 suggested as best by the authors. For the hidden layers,
the greater the drop probability, the sparser the model; a probability of 0.5
(dropping 50% of the nodes) is suggested as optimal. A code sketch of this
masking follows.
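A sketch of this masking in code, using the common "inverted dropout" convention (survivors are scaled by 1/(1 − p) at training time, so no rescaling is needed at test time):

```python
import numpy as np

def dropout(x, p=0.2, training=True):
    # p is the drop probability; the keep probability is 1 - p
    if not training:
        return x
    mask = (np.random.rand(*x.shape) > p).astype(float)
    # Inverted dropout: scale up survivors so the expected sum is unchanged
    return x * mask / (1 - p)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(dropout(x, p=0.2))   # e.g. [1.25, 0.0, 3.75, 5.0, 6.25]
```

A fresh random mask is drawn for every iteration, so the "presence of a unit is highly unreliable", as described below.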
How does it solve the Overfitting problem?
• In the overfitting problem, the model learns the statistical noise. To be
precise, the main motive of training is to decrease the loss function,
given all the units (neurons). In overfitting, a unit may change in a
way that fixes up the mistakes of the other units. This leads to complex
co-adaptations, which in turn lead to overfitting, because these complex
co-adaptations fail to generalise to the unseen dataset.
• Now, if we use dropout, it prevents these units from fixing up the mistakes of
other units, thus preventing co-adaptation, because in every iteration the
presence of a unit is highly unreliable. By randomly dropping a few
units (nodes), dropout forces the layers to take more or less responsibility for
the input, following a probabilistic approach.
• This ensures that the model generalises better and hence reduces
the overfitting problem.
• From figure 2, we can easily see that the hidden layer with dropout learns
more generalised features than the co-adaptations in the layer without dropout.
It is quite apparent that dropout breaks such inter-unit relations and focuses more on
generalisation.
Adversarial Training and Optimization.
• Adversarial training is a machine learning technique that involves training models to be robust against
adversarial examples: intentionally designed inputs created to mislead the model into making incorrect
predictions. For instance, in computer vision tasks, an adversarial example could be an image that
has been manipulated in a way that is barely noticeable to the human eye but leads to misclassification by
the model.
• It is based on the idea that models trained on adversarial examples are more robust to real-world
variations and distortions in the data, because training covers a wider range of such variations,
making the model more resistant to them.
Examples of Adversarial Learning
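One classic example is the Fast Gradient Sign Method (FGSM), which nudges the input a small step ε in the direction that increases the loss. The sketch below uses a logistic-regression "model" so the input gradient can be written by hand; the weights, input, and ε are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed trained logistic-regression model
w, b = np.array([1.5, -2.0, 0.5]), 0.1
x, label = np.array([0.2, 0.4, 0.6]), 1.0

# Gradient of the cross-entropy loss with respect to the INPUT x
pred = sigmoid(w @ x + b)
grad_x = (pred - label) * w

# FGSM: step of size epsilon in the sign direction of the gradient
epsilon = 0.1
x_adv = x + epsilon * np.sign(grad_x)

print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))  # confidence drops
```

In practice the input gradient comes from a framework's automatic differentiation rather than a hand-derived formula, and adversarial training mixes such perturbed examples into the training set.
Optimizers in Deep Learning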
• Deep learning, a subset of machine learning, tackles intricate tasks like speech
recognition and text classification. Comprising components such as activation
functions, input, output, hidden layers, and loss functions, deep learning models
aim to generalize data and make predictions on unseen data. To optimize these
models, various algorithms, known as optimizers, are employed.
• Optimizers adjust model parameters iteratively during training to minimize a loss
function, enabling neural networks to learn from data.
• Common optimizers include Stochastic Gradient Descent (SGD), Adam, and
RMSprop, each employing specific update rules, learning rates, and momentum for
refining model parameters.
• Optimizers play a pivotal role in enhancing accuracy and speeding up the
training process, shaping the overall performance of deep learning models.
Gradient Descent
• Gradient descent is a simple optimization algorithm that updates the
model's parameters to minimize the loss function. We can write the basic
form of the algorithm as follows:
θ=θ−α⋅∇θL(θ)
• where θ is the model parameter, L(θ) is the loss function, and α is the
learning rate.
Pros:
• Simple to implement.
• Can work well with a well-tuned learning rate.
Cons:
• It can converge slowly, especially for complex models or large datasets.
• Sensitive to the choice of learning rate.
Stochastic Gradient Descent
• Stochastic gradient descent (SGD) is a variant of gradient descent that involves
updating the parameters based on a small, randomly-selected subset of the data
(i.e., a "mini-batch") rather than the full dataset. We can write the basic form of
the algorithm as follows:
θ=θ−α⋅∇θL(θ;x(i);y(i))
• where (x(i),y(i)) is a mini-batch of data.
Pros:
• It can be faster than standard gradient descent, especially for large datasets.
• Can escape local minima more easily.
Cons:
• It can be noisy, leading to less stability.
• It may require more hyperparameter tuning to get good performance.
• Stochastic Gradient Descent with Momentum
SGD with momentum is a variant of SGD that adds a "momentum" term to the update
rule, which helps the optimizer to continue moving in the same direction even if the local
gradient is small. The momentum term is typically set to a value between 0 and 1. We
can write the update rule as follows:
v = β⋅v + (1−β)⋅∇θL(θ;x(i);y(i))
θ = θ − α⋅v
• where v is the momentum vector and β is the momentum hyperparameter.
Pros:
• It can help the optimizer to move more efficiently through "flat" regions of the loss
function.
• It can help to reduce oscillations and improve convergence.
Cons:
• Can overshoot good solutions and settle for suboptimal ones if the momentum is too
high.
• Requires tuning of the momentum hyperparameter.
Mini-Batch Gradient Descent
• Mini-batch gradient descent is similar to SGD, but instead of using a single
sample to compute the gradient, it uses a small, fixed-size "mini-batch"
of samples. The update rule is the same as for SGD, except that the
gradient is averaged over the mini-batch. This can reduce noise in the
updates and improve convergence.
• Pros:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
• Can reduce noise in updates, leading to more stable convergence.
• Cons:
• Can be sensitive to the choice of mini-batch size.
Adagrad
• Adagrad is an optimization algorithm that uses an adaptive learning rate per parameter. The learning
rate is updated based on the historical gradient information so that parameters that receive many updates
have a lower learning rate, and parameters that receive fewer updates have a larger learning rate. The
update rule can be written as follows:
G = G + (∇θL(θ))²
θ = θ − (α / √(G + ϵ)) ⋅ ∇θL(θ)
• Where G is a matrix that accumulates the squares of the gradients, and ϵ is a small constant added to avoid
division by zero.
Pros:
• It can work well with sparse data.
• Automatically adjusts learning rates based on parameter updates.
Cons:
• Can converge too slowly for some problems.
• Can stop learning altogether if the learning rates become too small.
• RMSProp
RMSProp is an optimization algorithm similar to Adagrad, but it uses an exponentially decaying average of the
squares of the gradients rather than the sum. This helps to reduce the monotonic learning rate decay of Adagrad
and improve convergence. We can write the update rule as follows:
G = β⋅G + (1−β)⋅(∇θL(θ))²
θ = θ − (α / √(G + ϵ)) ⋅ ∇θL(θ)
Where G is a matrix that accumulates an exponentially decaying average of the squares of the gradients, ϵ is a
small constant added to avoid division by zero, and β is a decay rate hyperparameter.
• Pros:
• It can work well with sparse data.
• Automatically adjusts learning rates based on parameter updates.
• Can converge faster than Adagrad.
• Cons:
• It can still converge too slowly for some problems.
• Requires tuning of the decay rate hyperparameter.
• AdaDelta
• AdaDelta is an optimization algorithm similar to RMSProp but does not
require a hyperparameter learning rate. Instead, it uses an exponentially
decaying average of the gradients and the squares of the gradients to
determine the updated scale. We can write the update rule as follows:
g = ∇θL(θ)
G = β⋅G + (1−β)⋅g²
Δθ = −(√(S + ϵ) / √(G + ϵ)) ⋅ g
S = β⋅S + (1−β)⋅(Δθ)²
θ = θ + Δθ
Where G and S are matrices that accumulate decaying averages of the squared gradients and the squares of the
updates, respectively, and ϵ is a small constant added to avoid division by zero.
Pros:
• Can work well with sparse data.
• Automatically adjusts learning rates based on parameter updates.
Cons:
• Can converge too slowly for some problems.
• Can stop learning altogether if the learning rates become too small.
• Adam
• Adam (short for "adaptive moment estimation") is an optimization
algorithm that combines the ideas of SGD with momentum and RMSProp.
It uses an exponentially decaying average of the gradients and the
squares of the gradients to determine the updated scale, similar to
RMSProp. It also uses a momentum term to help the optimizer move more
efficiently through the loss function. The update rule can be written as
follows:
m = β₁⋅m + (1−β₁)⋅∇θL(θ)
v = β₂⋅v + (1−β₂)⋅(∇θL(θ))²
m̂ = m / (1 − β₁^t),  v̂ = v / (1 − β₂^t)
θ = θ − α ⋅ m̂ / (√v̂ + ϵ)
Where m and v are the momentum (first-moment) and velocity (second-moment) vectors, respectively, and
β₁ and β₂ are the decay rates for the momentum and velocity.
Pros:
• Can converge faster than other optimization algorithms.
• Can work well with noisy data.
Cons:
• It may require more tuning of hyperparameters than other algorithms.
• May not perform as well as simpler methods on some types of problems.
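A sketch of the Adam update as a standalone function, following the update rule above; the hyperparameter defaults follow common practice, and the toy objective f(θ) = θ² is an assumption for demonstration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially decaying average of gradients (momentum)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: decaying average of squared gradients (RMSProp-style)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2; its gradient is 2 * theta
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.05)
print(theta)  # approaches 0
```

The first moment plays the role of momentum and the second moment rescales each parameter's step, which is how Adam combines the two ideas named above.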
How Do Optimizers Work in Deep Learning?
• Optimizers in deep learning adjust the model's parameters to minimize the
loss function. The loss function measures how well the model can make
predictions on a given dataset, and the goal of training a model is to find
the set of model parameters that yields the lowest possible loss.
• The optimizer uses an optimization algorithm to search for the parameters
that minimize the loss function. The optimization algorithm uses the
gradients of the loss function with respect to the model parameters to
determine the direction in which we should adjust the parameters.
• The gradients are computed using backpropagation, which involves
applying the chain rule to compute the gradient of the loss function with
respect to each of the model parameters.
• The optimization algorithm then adjusts the model parameters to
minimize the loss function. This process is repeated until the loss function
reaches a minimum or the optimizer reaches the maximum number of
allowed iterations.