
Gradient Descent

• What is Gradient Descent?


• Gradient Descent is a cornerstone of model optimization. At its core, it is a numerical optimization algorithm that finds the optimal parameters—weights and biases—of a neural network by minimizing a defined cost function.
• Gradient Descent (GD) is a widely used optimization algorithm in machine learning and deep learning that minimizes the cost function of a neural network model during training. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function until the minimum of the cost function is reached.
• The learning happens during backpropagation while training the neural network-based model.
• Gradient Descent is used to optimize the weights and biases based on the cost function.
• The cost function evaluates the difference between the
actual and predicted outputs.
• Gradient Descent is a fundamental optimization algorithm in
machine learning used to minimize the cost or loss function during
model training.
• It iteratively adjusts model parameters by moving in the direction of
the steepest decrease in the cost function.
• The algorithm calculates gradients, representing the partial derivatives of the cost function with respect to each parameter.
• These gradients guide the updates, ensuring convergence towards
the optimal parameter values that yield the lowest possible cost.
• Gradient Descent is versatile and applicable to various machine learning models, including linear regression and neural networks. It navigates the parameter space efficiently, enabling models to learn patterns and make accurate predictions. Adjusting the learning rate is crucial to balance convergence speed against overshooting the optimal solution.
• The gradient descent algorithm is an optimization algorithm used mostly in machine learning and deep learning. It adjusts parameters to drive particular functions toward their (local) minima. In linear regression it finds the optimal weights and biases, and in deep learning it drives backward propagation.
• The algorithm's objective is to identify model parameters, such as weights and biases, that reduce model error on the training data.
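As a minimal runnable sketch of this update rule (a toy quadratic loss L(w) = (w - 3)^2 rather than a real network; the loss and learning rate are illustrative, not from the slides):

learning_rate = 0.1
w = 0.0                               # starting parameter value

for step in range(100):
    gradient = 2 * (w - 3)            # dL/dw for L(w) = (w - 3)^2
    w = w - learning_rate * gradient  # move against the gradient

print(w)                              # converges toward the minimum at w = 3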
What is a Gradient?

dy = change in y
dx = change in x
gradient = dy/dx (the slope)

1. A gradient measures how much the output of a function changes if you change the inputs a little bit.
2. In machine learning, a gradient is the derivative of a function that has more than one input variable. Known as the slope of a function in mathematical terms, the gradient measures the change in all weights with regard to the change in error.
Learning Rate:
The algorithm designer sets the learning rate. If we use a learning rate that is too small, updates will be very slow, requiring more iterations to reach a good solution; if it is too large, the steps may overshoot the minimum and fail to converge.
Types of Gradient Descent:

• There are three popular types that mainly differ in the amount of data they use:
BATCH GRADIENT DESCENT:

• Batch gradient descent, also known as vanilla gradient descent, calculates the error for each example within the training dataset, but the model is not updated until every training sample has been assessed. This entire procedure is a cycle referred to as a training epoch.
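A minimal NumPy sketch of this one-update-per-epoch behavior, on an illustrative linear-regression problem (the data and hyperparameters are assumptions for the example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for epoch in range(200):
    grad = (2 / len(X)) * X.T @ (X @ w - y)  # gradient averaged over ALL samples
    w -= lr * grad                           # a single update per epoch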

• Some benefits of batch gradient descent are its computational efficiency and the fact that it produces a stable error gradient and a stable convergence. Some drawbacks are that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.
BATCH GRADIENT DESCENT:

• Advantages
1.Fewer model updates mean that this variant of the
steepest descent method is more computationally
efficient than the stochastic gradient descent
method.
2.Reducing the update frequency provides a more
stable error gradient and a more stable convergence
for some problems.
3.Separating the prediction-error calculations from the model updates allows the algorithm to be implemented with parallel processing.
BATCH GRADIENT DESCENT:

• Disadvantages
1.A more stable error gradient can cause the model to
prematurely converge to a suboptimal set of parameters.
2.End-of-training epoch updates require the additional complexity
of accumulating prediction errors across all training examples.
3.The batch gradient descent method typically requires the entire training dataset to be in memory and available to the algorithm.
4.Large datasets can result in very slow model updates or training speeds.
5.It is slow and requires more computational power.
STOCHASTIC GRADIENT DESCENT:

• By contrast, stochastic gradient descent (SGD) updates the parameters one training example at a time. Depending on the problem, this can make SGD faster than batch gradient descent. One benefit is that the frequent updates give us a fairly accurate idea of the rate of improvement.
• However, the batch approach is less computationally expensive than these frequent updates. The frequency of updates can also produce noisy gradients, which can cause the error rate to fluctuate rather than gradually decrease.
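The same illustrative regression setup as the batch sketch above, rewritten with per-example updates, shows where SGD's speed and noise come from:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):        # shuffle, then update per sample
        grad = 2 * X[i] * (X[i] @ w - y[i])  # gradient from ONE example
        w -= lr * grad                       # frequent, noisy updates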
• Advantages
1.You can instantly see your model’s performance and
improvement rates with frequent updates.
2.This variant of the steepest descent method is probably the
easiest to understand and implement, especially for beginners.
3.Increasing the frequency of model updates will allow you to
learn more about some issues faster.
4.The noisy update process allows the model to avoid local
minima (e.g., premature convergence).
5.Faster and requires less computational power.
6.Suitable for larger datasets.
• Disadvantages
1.Frequent model updates are more computationally
intensive than other steepest descent configurations,
and it takes considerable time to train the model with
large datasets.
2.Frequent updates can result in noisy gradient signals. This can cause the model parameters, and in turn the error, to jump around (more variance across training epochs).
3.A noisy learning process along the error gradient can
also make it difficult for the algorithm to commit to the
model’s minimum error.
MINI-BATCH GRADIENT DESCENT:

• Since mini-batch gradient descent combines the ideas of batch gradient descent with SGD, it is the preferred technique. It divides the training dataset into small, manageable batches and performs an update for each batch. This strikes a balance between batch gradient descent's efficiency and stochastic gradient descent's robustness.
• Mini-batch sizes typically range from 50 to 256, although, as with other machine learning techniques, there is no set standard because it depends on the application. It is the most popular variant in deep learning and is the method used when training a neural network.
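And the same illustrative setup once more, with one update per mini-batch (the batch size of 64 is an arbitrary choice within the typical range):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 64
for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]                 # one mini-batch
        grad = (2 / len(b)) * X[b].T @ (X[b] @ w - y[b])  # averaged over the batch
        w -= lr * grad                                    # one update per mini-batch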
• Advantages
1.The model is updated more frequently than with the batch gradient descent method, allowing for more robust convergence and avoiding local minima.
2.Batched updates provide a more computationally efficient process than stochastic gradient descent.
3.Batching is efficient: the algorithm can be implemented without holding all of the training data in memory.
• Disadvantages
1.Mini-batch requires an additional "mini-batch size" hyperparameter to be set for the learning algorithm.
2.Error information must be accumulated over each mini-batch of training samples, as in batch gradient descent.
3.It can introduce additional implementation complexity.
• In statistics, Machine Learning and other Data Science fields, we optimize a lot of things. For example, in linear regression we optimize the intercept and slope, and when we use logistic regression we optimize the squiggle. Moreover, in t-SNE we optimize clusters. The interesting thing is that gradient descent can optimize all these things, and much more. A good example is the Sum of the Squared Residuals in regression: in Machine Learning lingo this is a type of Loss Function. The Residual is the difference between the Observed Value and the Predicted Value.
• The figure above shows the sum of the squared residuals on the y-axis and different values of the intercept on the x-axis. The first point represents the sum of the squared residuals when the intercept is equal to zero; we continue to plot points on the graph for different values of the intercept. The lowest point on the graph has the lowest sum of the squared residuals. Gradient descent identifies the optimal value by taking big steps when we are far away from the optimal sum of the squared residuals and smaller steps as it gets close to the best solution. Then we can calculate the derivative at each point of the curve traced by these points. In other words, we are taking the derivative of the Loss Function.
• Gradient Descent uses the derivative in order to find where the Sum of the Squared Residuals is lowest. The closer we get to the optimal value for the intercept, the closer the slope of the curve gets to zero.
• HOW DOES GRADIENT DESCENT KNOW WHEN TO STOP TAKING STEPS?
Gradient Descent stops when the step size is very close to zero, and the step size is very close to zero when the slope is close to zero.
• In practice, the Minimum Step Size is set to 0.001 or smaller. Moreover, Gradient Descent includes a limit on the number of steps it will take before giving up; in practice, the Maximum Number of Steps is set to 1000 or greater.
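A minimal sketch of this whole procedure, fitting only the intercept with the stopping rules just described (the data points, fixed slope, and learning rate are illustrative assumptions):

import numpy as np

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])
slope = 0.64                   # held fixed; we optimize only the intercept

intercept = 0.0
learning_rate = 0.1
min_step_size = 0.001
max_steps = 1000

for step in range(max_steps):
    residuals = y - (intercept + slope * x)
    d_intercept = -2 * residuals.sum()       # derivative of the SSR w.r.t. the intercept
    step_size = learning_rate * d_intercept
    intercept -= step_size
    if abs(step_size) < min_step_size:       # stop when the steps become tiny
        break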
• We can also estimate the intercept and the slope simultaneously. We use the Sum of the Squared Residuals as the Loss Function, and we can represent a 3D graph of the Loss Function for different values of the intercept and the slope.
We want to find the values for the intercept and slope that give us the minimum Sum of the Squared Residuals.
• So, just like before, we need to take the derivative of the function represented by the graph above with respect to both the intercept and the slope.
When we have two or more derivatives of the same function (in this case the derivatives with respect to both the intercept and the slope) we call this a GRADIENT.
We will use this GRADIENT to DESCEND to the lowest point of the Loss Function, which, in this case, is the Sum of the Squared Residuals.
Momentum-based Gradient

• Gradient Descent is an optimization technique used in Machine Learning frameworks to train different models. The training process relies on an objective function (or error function), which determines the error a Machine Learning model has on a given dataset. While training, the parameters of this algorithm are initialized to random values. As the algorithm iterates, the parameters are updated so that we get closer and closer to the optimal value of the function.
• However, Adaptive Optimization Algorithms are gaining
popularity due to their ability to converge swiftly. All these
algorithms, in contrast to the conventional Gradient Descent,
use statistics from the previous iterations to robustify the
process of convergence.
• Momentum-based Gradient Optimizer is a technique used in
optimization algorithms, such as Gradient Descent, to
accelerate the convergence of the algorithm and overcome
local minima. In the Momentum-based Gradient Optimizer, a
fraction of the previous update is added to the current update,
which creates a momentum effect that helps the algorithm to
move faster towards the minimum.
• The momentum term can be viewed as a moving
average of the gradients. The larger the momentum
term, the smoother the moving average, and the more
resistant it is to changes in the gradients. The
momentum term is typically set to a value between 0
and 1, with a higher value resulting in a more stable
optimization process.
• The update rule for the Momentum-based Gradient Optimizer can be expressed as follows:

v = beta * v - learning_rate * gradient
parameters = parameters + v

// Where v is the velocity vector, beta is the momentum term,
// learning_rate is the step size,
// gradient is the gradient of the cost function with respect to the parameters,
// and parameters are the parameters of the model.
• The Momentum-based Gradient Optimizer has several
advantages over the basic Gradient Descent algorithm,
including faster convergence, improved stability, and
the ability to overcome local minima. It is widely used in
deep learning applications and is an important
optimization technique for training deep neural
networks.
Momentum-based Optimization:

• An Adaptive Optimization Algorithm uses exponentially weighted averages of gradients over previous iterations to stabilize the convergence, resulting in quicker optimization. For example, in most real-world applications of Deep Neural Networks, the training is carried out on noisy data. It is, therefore, necessary to reduce the effect of noise when the data are fed in batches during optimization. This problem can be tackled using Exponentially Weighted Averages (or Exponentially Weighted Moving Averages).
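As a sketch of the idea, with g_t the gradient at iteration t and beta a smoothing factor between 0 and 1 (the symbols are illustrative, not from the slides), the exponentially weighted moving average is:

v_t = beta * v_{t-1} + (1 - beta) * g_t

A larger beta averages over roughly 1 / (1 - beta) past iterations, which smooths out the noise in batch gradients.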
Nesterov Accelerated Gradient

• NAG resolves this problem by adding a look-ahead term to the update equation. The intuition behind NAG can be summarized as 'look before you leap'. Let us try to understand this through an example.
• In the momentum-based gradient method, the steps become larger and larger due to the accumulated momentum, and we overshoot at the 4th step. We then have to take steps in the opposite direction to reach the minimum point.
• The update in NAG happens in two steps. First, a partial step to reach the look-ahead point, and then the final update.
• We calculate the gradient at the look-ahead point and
then use it to calculate the final update.
• If the gradient at the look-ahead point is negative, our
final update will be smaller than that of a regular
momentum-based gradient.
• Like in the above example, the updates of NAG are
similar to that of the momentum-based gradient for the
first three steps because the gradient at that point and
the look-ahead point are positive. But at step 4, the
gradient of the look-ahead point is negative.
• In NAG, the first partial update 4a will be used to go to
the look-ahead point and then the gradient will be
calculated at that point without updating the
parameters. Since the gradient at step 4b is negative,
the overall update will be smaller than the momentum-
based gradient descent.
• We can see in the above example that the momentum-
based gradient descent takes six steps to reach the
minimum point, while NAG takes only five steps.
• This looking ahead helps NAG to converge to the
minimum points in fewer steps and reduce the chances
of overshooting.
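To make the two-step look-ahead update concrete, here is a minimal runnable sketch on a toy quadratic loss (w - 3)^2; the loss, learning rate, and momentum value are illustrative assumptions:

def loss_grad(w):
    return 2 * (w - 3)            # gradient of the toy loss (w - 3)^2

w, v = 0.0, 0.0
beta, lr = 0.9, 0.01
for step in range(200):
    g = loss_grad(w + beta * v)   # gradient evaluated at the look-ahead point
    v = beta * v - lr * g         # velocity update uses the look-ahead gradient
    w = w + v                     # final update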
AdaGrad: Adaptive Gradient Algorithm
• AdaGrad (Adaptive Gradient Algorithm) is one such
algorithm that adjusts the learning rate for each
parameter based on its prior gradients.
The Need for Adaptive Learning Rates
• Gradient Descent and other conventional optimization
techniques use a fixed learning rate throughout the duration of
training. However, this uniform learning rate might not be the
best option for all parameters, which could cause problems with
convergence. Some parameters might need more frequent
updates to hasten convergence, while others might need
smaller changes to prevent overshooting the ideal value.
• One of the earliest adaptive learning rate algorithms
is AdaGrad, which Duchi et al. introduced in 2011. Its main
objective is to hasten convergence for sparse gradient
parameters. Each parameter’s previous gradient information is
tracked by the algorithm, which then modifies the learning rate
as necessary.
How AdaGrad Works

• The steps of the algorithm are as follows:
• Step 1: Initialize variables
1.Initialize the parameters θ and a small constant ϵ to avoid division by zero.
2.Initialize the sum-of-squared-gradients variable G with zeros, with the same shape as θ.
• Step 2: Calculate gradients
1.Compute the gradient of the loss function with respect to
each parameter.
• Step 3: Accumulate squared gradients
1.Update the sum of squared gradients G for each parameter i
• Step 4: Update parameters
1.Update each parameter using the adaptive learning rate
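A minimal runnable sketch of these four steps, on an illustrative toy quadratic loss with per-parameter targets (the loss and hyperparameter values are assumptions for the example):

import numpy as np

target = np.array([3.0, -1.0])
w = np.zeros(2)                        # Step 1: initialize parameters θ
G = np.zeros(2)                        # Step 1: sum of squared gradients
lr, eps = 0.5, 1e-8                    # ϵ avoids division by zero

for step in range(500):
    g = 2 * (w - target)               # Step 2: compute gradients
    G += g ** 2                        # Step 3: accumulate squared gradients
    w -= lr * g / (np.sqrt(G) + eps)   # Step 4: adaptive per-parameter update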
Advantages of AdaGrad

• AdaGrad adjusts the learning rate for each parameter to enable effective updates based on the parameter's significance to the optimization process. This method lessens the requirement for manual tuning of learning rates.
• Robustness: AdaGrad does well with sparse data and
variables of different sizes. It makes sure that
parameters that receive few updates get higher
learning rates, which speeds up convergence.
RMSprop (Root Mean Squared
Propagation)
• RMSprop (Root Mean Squared Propagation) is an
optimization algorithm used in deep learning and
other Machine Learning techniques.
• As data travels through very complicated functions,
such as neural networks, the resulting gradients often
disappear or expand. RMSprop is an innovative
stochastic mini-batch learning method.
• For optimizing the training of neural networks, RMSprop
relies on gradients. Backpropagation has its roots in this
idea.
• It is a variant of the gradient descent algorithm that helps to improve the convergence speed and stability of neural network training.
• Like other gradient descent algorithms, RMSprop works
by calculating the gradient of the loss function with
respect to the model’s parameters and updating the
parameters in the opposite direction of the gradient to
minimize the loss. However, RMSProp introduces a few
additional techniques to improve the performance of
the optimization process.
• One key feature is its use of a moving average of the
squared gradients to scale the learning rate for each
parameter. This helps to stabilize the learning process
and prevent oscillations in the optimization trajectory.
• The algorithm can be summarized by the following RMSProp formula:
v_t = decay_rate * v_{t-1} + (1 - decay_rate) * gradient^2
parameter = parameter - learning_rate * gradient / (sqrt(v_t) + epsilon)
Where:
• v_t is the moving average of the squared gradients;
• decay_rate is a hyperparameter that controls the decay rate of the moving
average;
• learning_rate is a hyperparameter that controls the step size of the update;
• gradient is the gradient of the loss function with respect to the parameter;
and
• epsilon is a small constant added to the denominator to prevent division by
zero.
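A minimal runnable version of the formula above, again on an illustrative toy quadratic loss:

import numpy as np

target = np.array([3.0, -1.0])
w = np.zeros(2)
v = np.zeros(2)                        # moving average of squared gradients
lr, decay_rate, eps = 0.01, 0.9, 1e-8

for step in range(2000):
    g = 2 * (w - target)
    v = decay_rate * v + (1 - decay_rate) * g ** 2
    w -= lr * g / (np.sqrt(v) + eps)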
RMSProp advantages

• Fast convergence. RMSprop is known for its fast convergence speed, which means that it can find good solutions to optimization problems in fewer iterations than some other algorithms. This can be especially useful for training large or complex models, where training time is a critical concern.
• Stable learning. The use of a moving average of the
squared gradients in RMSprop helps to stabilize the learning
process and prevent oscillations in the optimization
trajectory. This can make the optimization process more
robust and less prone to diverging or getting stuck in local
minima.
• Fewer hyperparameters. RMSprop has fewer hyperparameters than some other optimization algorithms, which makes it easier to tune and use in practice. The main hyperparameters in RMSprop are the learning rate and the decay rate, which can be chosen using techniques like grid search or random search.
• Good performance on non-convex problems. RMSprop tends to perform well on non-convex optimization problems, which are common in Machine Learning and deep learning. Non-convex optimization problems have multiple local minima, and RMSprop's fast convergence speed and stable learning can help it find good solutions even in these cases.
Autoencoders

• Autoencoders are neural networks that stack numerous non-linear transformations (layers) to reduce the input into a low-dimensional latent space.
• They use an encoder-decoder system. The encoder
converts the input into latent space, while the decoder
reconstructs it. For accurate input reconstruction, they
are trained through backpropagation.
• Autoencoders may be used to reduce dimensionality
when the latent space has fewer dimensions than the
input.
• The intuition is that, because they can rebuild the input, these low-dimensional latent variables must store the most relevant properties of the data.
Role of Dimensionality Reduction in ML

• We often run into the curse of dimensionality in machine learning projects, when the number of data records is not significantly larger than the number of features.
• This often causes issues since it necessitates training a large
number of parameters with a limited data set, which may
easily lead to overfitting and poor generalization.
• High dimensionality also entails lengthy training periods. To
solve these challenges, dimensionality reduction methods are
often utilized.
• Despite its location in high-dimensional space, feature space
often possesses a low-dimensional structure.
• PCA and auto-encoders are two popular methods for lowering
the dimensionality of the feature space.
PCA vs Autoencoder

• Although PCA is fundamentally a linear transformation, autoencoders can describe complicated non-linear processes.
• Because PCA features are projections onto the orthogonal
basis, they are completely linearly uncorrelated. However,
since autoencoded features are only trained for correct
reconstruction, they may have correlations.
• PCA is quicker and less expensive to compute than
autoencoders.
• PCA is quite similar to a single-layer autoencoder with a linear activation function.
• Because of the large number of parameters, the autoencoder
is prone to overfitting. (However, regularization and proper
planning might help to prevent this).
How to select the models?

• If the features have a non-linear connection, the autoencoder may compress the data more efficiently into a low-dimensional latent space by utilizing its capacity to represent complicated non-linear processes.
• Researchers created a two-dimensional feature space with linear and non-linear relationships between the two features x and y (with some added noise).
• After projecting the input into latent space, we can
compare the capabilities of autoencoders and PCA to
properly reconstruct the input. PCA is a linear
transformation with a well-defined inverse transform, and
the reconstructed input comes from the autoencoder’s
decoder output. For both PCA and autoencoders, we
employ a one-dimensional latent space.
• Autoencoded latent space may be employed for more
accurate reconstruction if there is a nonlinear
connection (or curvature) in the feature space. PCA, on
the other hand, only keeps the projection onto the first
principal component and discards any information that
is perpendicular to it.
Autoencoders - Machine Learning
• At the heart of deep learning lies the neural network,
an intricate interconnected system of nodes that mimics
the human brain’s neural architecture.
• Neural networks excel at discerning intricate patterns
and representations within vast datasets, allowing them
to make predictions, classify information, and generate
novel insights.
• Autoencoders emerge as a fascinating subset of
neural networks, offering a unique approach to
unsupervised learning.
• Autoencoders are an adaptable and strong class of
architectures for the dynamic field of deep learning,
where neural networks develop constantly to identify
complicated patterns and representations.
• With their ability to learn effective representations of
data, these unsupervised learning models have received
considerable attention and are useful in a wide variety
of areas, from image processing to anomaly detection.
What are Autoencoders?

• Autoencoders are a specialized class of algorithms that can learn efficient representations of input data with no need for labels.
• It is a class of artificial neural networks designed for
unsupervised learning.
• Learning to compress and effectively represent input data without specific labels is the essential principle of an autoencoder.
• This is accomplished using a two-fold structure that consists of an
encoder and a decoder.
• The encoder transforms the input data into a reduced-dimensional
representation, which is often referred to as “latent space” or
“encoding”.
• From that representation, a decoder rebuilds the initial input. For the
network to gain meaningful patterns in data, a process of encoding
and decoding facilitates the definition of essential features.
Architecture of Autoencoder in Deep Learning
The general architecture of an autoencoder includes an encoder, decoder, and bottleneck layer.
Encoder

1.The input layer takes the raw input data.
2.The hidden layers progressively reduce the dimensionality of the input, capturing important features and patterns. These layers compose the encoder.
3.The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is significantly reduced. This layer represents the compressed encoding of the input data.
Decoder

1.The bottleneck layer takes the encoded representation and expands it back to the dimensionality of the original input.
2.The hidden layers progressively increase the dimensionality and aim to reconstruct the original input.
3.The output layer produces the reconstructed output, which ideally should be as close as possible to the input data.
• The loss function used during training is typically a
reconstruction loss, measuring the difference between
the input and the reconstructed output. Common
choices include mean squared error (MSE) for
continuous data or binary cross-entropy for binary data.
• During training, the autoencoder learns to minimize the
reconstruction loss, forcing the network to capture the
most important features of the input data in the
bottleneck layer.
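A minimal Keras sketch of this encoder-bottleneck-decoder architecture, assuming 784-dimensional inputs (e.g. flattened 28x28 images); the layer sizes, random stand-in data, and training settings are illustrative:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),    # encoder hidden layer
    layers.Dense(32, activation="relu"),     # bottleneck (latent space)
    layers.Dense(128, activation="relu"),    # decoder hidden layer
    layers.Dense(784, activation="sigmoid"), # reconstructed output
])
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss

x = np.random.rand(256, 784).astype("float32")     # stand-in data
autoencoder.fit(x, x, epochs=5, batch_size=32)     # the input is also the target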
After the training process, only the encoder part of the autoencoder is
retained to encode a similar type of data used in the training process.
The different ways to constrain the network are:

• Keep small Hidden Layers: If the size of each hidden layer is kept
as small as possible, then the network will be forced to pick up only
the representative features of the data thus encoding the data.
• Regularization: In this method, a loss term is added to the cost
function which encourages the network to train in ways other than
copying the input.
• Denoising: Another way of constraining the network is to add noise to
the input and teach the network how to remove the noise from the
data.
• Tuning the Activation Functions: This method involves changing
the activation functions of various nodes so that a majority of the
nodes are dormant thus, effectively reducing the size of the hidden
layers.
Types of Autoencoders
There are diverse types of autoencoders; below we analyze the advantages and disadvantages associated with each variation:
• Denoising Autoencoder
Denoising autoencoder works on a partially corrupted input and trains to recover the
original undistorted image. As mentioned above, this method is an effective way to
constrain the network from simply copying the input and thus learn the underlying structure
and important features of the data.
• Advantages
1.This type of autoencoder can extract important features and reduce the noise or the useless
features.
2.Denoising autoencoders can be used as a form of data augmentation, the restored images
can be used as augmented data thus generating additional training samples.
• Disadvantages
1.Selecting the right type and level of noise to introduce can be challenging and may require
domain knowledge.
2.The denoising process can result in the loss of some information that is needed from the original input. This loss can impact the accuracy of the output.
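A minimal sketch of the denoising setup, assuming a model like the autoencoder in the earlier Keras sketch is in scope; the noise level is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
x_clean = rng.random((256, 784)).astype("float32")     # stand-in clean data
noise_factor = 0.3
x_noisy = x_clean + noise_factor * rng.normal(size=x_clean.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0).astype("float32")

# Train to map the corrupted input back to the clean original:
autoencoder.fit(x_noisy, x_clean, epochs=5, batch_size=32)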
Sparse Autoencoder

This type of autoencoder typically contains more hidden units than the
input but only a few are allowed to be active at once. This property is
called the sparsity of the network. The sparsity of the network can be
controlled by either manually zeroing the required hidden units, tuning
the activation functions or by adding a loss term to the cost function.
• Advantages
1.The sparsity constraint in sparse autoencoders helps in filtering out
noise and irrelevant features during the encoding process.
2.These autoencoders often learn important and meaningful features
due to their emphasis on sparse activations.
• Disadvantages
1.The choice of hyperparameters plays a significant role in the performance of this autoencoder. Different inputs should result in the activation of different nodes of the network.
2.The application of sparsity constraint increases computational
complexity.
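One way to impose the sparsity constraint is an L1 activity penalty on a wide hidden layer; a minimal Keras sketch, with illustrative sizes and penalty strength:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

sparse_ae = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(1024, activation="relu",
                 activity_regularizer=regularizers.l1(1e-5)),  # pushes most activations toward zero
    layers.Dense(784, activation="sigmoid"),
])
sparse_ae.compile(optimizer="adam", loss="mse")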
Contractive Autoencoder (CAE)

• The Contractive Autoencoder was proposed by researchers at the Université de Montréal in 2011 in the paper "Contractive auto-encoders: Explicit invariance during feature extraction".
• The idea behind that is to make the autoencoders
robust to small changes in the training dataset.
• To deal with this challenge posed by basic autoencoders, the authors proposed adding another penalty term to the loss function of autoencoders.
Loss Function of Contractive Autoencoder

• The contractive autoencoder adds an extra term to the reconstruction loss of the autoencoder; as a sketch, with input x, reconstruction x̂, encoder h = f(x), and penalty weight λ, it can be written as:

L = ||x − x̂||² + λ ||J_f(x)||²_F

i.e. the penalty term is the squared Frobenius norm of the Jacobian of the encoder; the Frobenius norm is just a generalization of the Euclidean norm to matrices. To evaluate this penalty we first need to calculate the Jacobian matrix of the hidden layer with respect to the input, which is similar to a gradient calculation.
How Contractive Autoencoders Work

• A Contractive Autoencoder consists of two main components: an encoder and a


decoder.
• The encoder compresses the input into a lower-dimensional representation, and
the decoder reconstructs the input from this representation. The goal is for the
reconstructed output to be as close as possible to the original input.
• The training process involves minimizing a loss function that has two terms. The
first term is the reconstruction loss, which measures the difference between the
original input and the reconstructed output.
• The second term is the regularization term, which measures the sensitivity of the
encoded representations to the input. By penalizing the sensitivity, the CAE learns
to produce encodings that do not change much when the input is perturbed
slightly, leading to more robust features.
Advantages of Contractive Autoencoders

• Robustness to Noise: By design, CAEs are robust to


small perturbations or noise in the input data.
• Improved Generalization: The contractive penalty
encourages the model to learn more general features
that do not depend on the specific noise or variations
present in the training data.
• Stability: The regularization term helps to stabilize the
training process by preventing the model from learning
trivial or overfitted representations.
Challenges with Contractive
Autoencoders
• Despite their advantages, CAEs also present some challenges:
• Computational Complexity: Calculating the Jacobian matrix
for the contractive penalty can be computationally expensive,
especially for large neural networks.
• Hyperparameter Tuning: The strength of the contractive penalty is controlled by a hyperparameter that needs to be carefully tuned to balance the reconstruction loss and the regularization term.
• Choice of Regularization: The effectiveness of the CAE can
depend on the choice of regularization term, and different
problems may require different forms of the contractive penalty.
What is Bias?

• If a machine learning model is performing very badly on a set of data because it is not generalizing to all your data points, the model is said to have high bias and to underfit.
• Bias is the error between the average model prediction and the ground truth.
• The bias of the estimated function tells us the capacity of the underlying model to predict the values.
What is Variance?

• If a machine learning model tries to account for all or nearly all points in a dataset successfully, but then performs poorly when run on other test data sets, it is said to have high variance and the model is said to overfit.
• Variance is the average variability in the model prediction for the given dataset.
• The variance of the estimated function tells you how much the function can adjust to a change in the dataset.
• High Bias
• Overly-Simplified Model
• Under-Fitting
• High error on both test and train data
• High Variance
• Overly-complex model
• Over-Fitting
• Low error on train data
• High error on test data
• Starts modeling the noise in the input
Bias Variance Trade-Off

• Increasing bias reduces variance, and vice-versa.
• Error = Bias² + Variance + irreducible error
• The best model is the one where the total error is minimized.
• It is a compromise between bias and variance.
Regularization
• The method used to tackle high variance in regression is called regularization.
• We try to minimize the error (cost function). Observe that the cost function depends on the coefficients.
• In such cases, the primary objective is to minimize error.
There is no restriction on how small or large the
coefficients can be, to achieve this objective. But in real
life, we need to achieve objectives with some
restrictions imposed.
• For Example, We need to minimize the cost function in
Linear regression but with some constraints on
coefficient values. This is because too high values of
coefficients may be unreliable both for explanation and
prediction as they lead to overfitting.
• Hence, we add to the cost function a constraint term based on the sum of the squared coefficient values or the sum of the absolute values of the coefficients. If this sum is large, the cost function value increases, and hence that solution cannot be optimal.
• The optimal solution will be the one where the sum of the absolute (or squared) coefficients is kept to a minimum.
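As a sketch in this notation (with b_j the coefficients, λ the penalty strength, and y_i, ŷ_i the observed and predicted values), the two common regularized cost functions are:

Ridge (L2): J(b) = Σ (y_i − ŷ_i)² + λ Σ b_j²
Lasso (L1): J(b) = Σ (y_i − ŷ_i)² + λ Σ |b_j|

A larger λ imposes a stronger constraint, shrinking the coefficients toward zero.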
What is Overfitting?

• Overfitting is a common problem in machine learning and deep learning


where a model learns the training data too well. It happens when the
model captures not only the underlying patterns in the data but also the
noise or random fluctuations. As a result, the model performs
exceptionally well on the training data but poorly on unseen data (test
data or real-world data).
• Causes of Overfitting:
1.Complex Model: Using a model with too many parameters relative to the
number of observations can lead to overfitting. The model might capture
the noise in the training data, making it perform poorly on new data.
2.Insufficient Training Data: If the dataset is too small, the model might
not have enough examples to learn the underlying patterns effectively. It
might end up learning the noise in the data.
3.Lack of Regularization: Regularization techniques, such as L1 and L2
regularization, add a penalty to the loss function to prevent the model
from fitting the training data too closely. Without these, the model might
overfit.
Early Stopping
• Early stopping is a strategy for avoiding “overtraining”
your model.
• In reality, we divide our data into two sets for training
machine learning models: the training set and the
validation (or test) set. The first is employed for
training, while the latter is used to evaluate how
effectively the model is functioning.
• Simply put, if the model stops improving and begins to perform poorly during training, we cease training. As a result, we "early stop" the model.
• It is a regularization approach that should be used
with extreme caution. The main goal is to use
early stopping to prevent overfitting.
• Early stopping in machine learning involves preventing your optimization process from fully converging, in the expectation that your predictions will be more accurate at the expense of being more biased (regularization).
• In Regularization by Early Stopping, we stop training the model when the performance on the validation set is getting worse: increasing loss, decreasing accuracy, or poorer scores of the scoring metric.
• By plotting the error on the training dataset and the validation dataset together, we see that both errors decrease with the number of iterations until the point where the model starts to overfit.
• After this point, the training error still decreases but the
validation error increases.
• So, even if training is continued after this point, early
stopping essentially returns the set of parameters that
were used at this point and so is equivalent to stopping
training at that point.
• So, the final parameters returned will enable the model
to have low variance and better generalization.
• The model at the time the training is stopped will have
a better generalization performance than the model
with the least training error.
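A minimal sketch of this logic, assuming hypothetical helpers train_one_epoch and validation_loss and a Keras-style model with get_weights/set_weights:

best_loss = float("inf")
best_weights = None
patience, bad_epochs = 5, 0

for epoch in range(1000):
    train_one_epoch(model)                  # hypothetical training step
    loss = validation_loss(model)           # hypothetical validation metric
    if loss < best_loss:
        best_loss = loss
        best_weights = model.get_weights()  # remember the best point so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # validation stopped improving
            break

model.set_weights(best_weights)             # return the parameters from the best epoch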
Benefits of Early Stopping:
• Helps in reducing overfitting
• It improves generalization
• It requires less amount of training data
• Takes less time compared to other regularization models
• It is simple to implement
Limitations of Early Stopping:
• If the model stops too early, there is a risk of underfitting
• It may not be beneficial for all types of models
• If the validation set is not chosen properly, it may not lead to the optimal stopping point
Data Augmentation

• Data augmentation is the process of modifying, or "augmenting", a dataset with additional data. This additional data can be anything from images to text, and its use in machine learning algorithms helps improve their performance.

• For example, say we wanted to build a model to classify dog breeds, and we have a lot of images of most breeds, except for pugs. As a result, the model wouldn't be able to classify pugs well. We could augment the data by adding some (real or fake) images of pugs, or by multiplying our existing pug images (e.g. by replicating and distorting them to make them artificially unique).
• Data augmentation is crucial for many AI applications,
as accuracy increases with the amount of training data.
• In fact, research studies have found that basic data
augmentation can greatly improve accuracy on image
tasks, such as classification and segmentation.
• Further, large neural networks, or deep learning models,
need a huge amount of data, so they benefit even more
from data augmentation techniques.
• The problem is that most companies don’t have
enough data to train their AI models.
• This is where data augmentation comes in: even if
you’re starting with very little, you can end up with
massive amounts of data to generate insights,
predictions, and recommendations that were previously
unavailable due to a lack of relevant information.
• Additionally, using a small amount of data can increase
the risk of overfitting, while having more data points
helps to counter that.
Types of data augmentation

• Real data augmentation
Real data augmentation is when you add real, additional data to a dataset. This can be anything from text files with additional attributes (e.g. for images that have been labeled), to images of other things that are similar to the original thing, or even videos of the original thing.
• Synthetic data augmentation
Besides adding additional real data, you can also add
synthetic data, or fake data that simply looks real. This is
helpful for complex tasks like neural style transfer, but
it’s useful for any architecture, whether you’re using
GANs (Generative Adversarial Networks), CNNs
(Convolutional Neural Networks), or other deep neural
network architectures.
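A minimal NumPy sketch of simple synthetic augmentation for a batch of images; the transforms and noise level are illustrative choices:

import numpy as np

def augment(images, rng):
    flipped = images[:, :, ::-1]                           # horizontal flip
    noisy = images + 0.05 * rng.normal(size=images.shape)  # mild synthetic noise
    return np.concatenate([images, flipped, noisy])

rng = np.random.default_rng(0)
batch = rng.random((8, 28, 28))    # stand-in batch of 28x28 images
augmented = augment(batch, rng)    # three times as many training samples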
Parameter Sharing and Tying in Machine Learning

• We usually apply limitations or penalties to parameters in relation to a fixed region or point. L2 regularization (or weight decay), for example, penalizes model parameters that deviate from a fixed value of zero.
• However, we may occasionally require alternative
means of expressing our prior knowledge of appropriate
model parameter values. We may not know exactly
what values the parameters should take, but we do
know that there should be some dependencies between
the model parameters based on our knowledge of the
domain and model architecture.
• Deep learning models are known for their ability to learn
complex relationships between inputs and outputs.
However, as the number of parameters in a model grows,
so does the risk of overfitting and the computational cost
of training. One technique to reduce the number of
parameters and improve the generalization of a model is
parameter sharing and tying.
• Parameter sharing refers to the idea of using the same set
of parameters for multiple parts of a model. For example,
in convolutional neural networks (CNNs), the same set of
weights is used to convolve over different patches of an
image. By doing so, the model can learn features that are
translation invariant, meaning they can recognize
patterns regardless of their position in the input.
• Parameter tying, on the other hand, refers to the practice of
constraining different parts of a model to share the same
parameter values. This can be useful in cases where we want to
encourage certain properties of the model, such as symmetry or
sparsity. For example, in autoencoders, the encoder and decoder
can share the same set of weights, leading to a more symmetric
architecture and reduced number of parameters.
• Parameter sharing and tying can be applied in different ways
across different types of models. In CNNs, for example, parameter
sharing is often used in the convolutional layers to learn local
features, while parameter tying can be used in the fully connected
layers to reduce the number of parameters. In recurrent neural
networks (RNNs), parameter sharing can be used to learn shared
representations across different time steps, while parameter tying
can be used to enforce certain properties of the model, such as
weight symmetry.
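A minimal NumPy sketch of parameter tying in an autoencoder, where the decoder reuses the transpose of the encoder's weight matrix (the sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(784, 32))  # single shared weight matrix

def encode(x):
    return np.tanh(x @ W)                  # encoder uses W

def decode(h):
    return h @ W.T                         # decoder reuses W transposed (tied weights)

x = rng.random((4, 784))
x_hat = decode(encode(x))                  # reconstruction with roughly half the weights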
• One of the main benefits of parameter sharing and tying
is the reduction in the number of parameters required to
train a model. This can lead to faster training times,
reduced memory requirements, and better generalization
performance, especially in cases where the data is
limited.
• However, parameter sharing and tying can also have
some drawbacks. For example, tying parameters can lead
to a loss of expressiveness and restrict the model’s ability
to learn complex relationships. In addition, parameter
sharing can introduce dependencies between different
parts of the model, leading to potential performance
issues if the shared parameters are not carefully chosen.
Greedy Layer Wise Pre-Training

• Artificial intelligence has undergone a revolution thanks to neural networks, which have made significant strides possible in a number of areas like speech recognition, computer vision, and natural language processing.
• Deep neural network training, however, may be difficult,
particularly when working with big, complicated
datasets.
• One method that tackles some of these issues is greedy
layer-wise pre-training, which initializes deep neural
network settings layer by layer.
• Greedy layer-wise pre-training is used to initialize the
parameters of deep neural networks layer by layer,
beginning with the first layer and working through each
one that follows. A layer is trained as if it were a stand-
alone model at each step, using input from the layer
before it and output to go to the layer after it. Typically,
developing usable representations of the input data is
the training aim.
• Processes of Greedy Layer-Wise Pre-Training
• The process of greedy layer-wise pre-training can be staged as follows:
• Initialization: The neural network's first layer is trained on its own using autoencoders or other unsupervised learning strategies. The aim is to learn a collection of features that highlight important elements of the input data.
• Extracting Features: After the first layer has been trained, its activations are used as features to train the subsequent layer. This process is repeated layer by layer, with each layer learning to represent the features discovered by the previous layer at a higher level of abstraction.
• Fine-Tuning: The network is adjusted as a whole using supervised learning
methods once every layer has been pretrained in this way. To maximize
performance on a particular job, this entails simultaneously modifying all of
the network's parameters using a labeled dataset.
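A minimal Keras sketch of these stages, with illustrative layer sizes, random stand-in data, and a 10-class supervised head for the fine-tuning step:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x = np.random.rand(256, 784).astype("float32")   # stand-in unlabeled data
pretrained = []                                  # encoder layers trained so far
inputs = x

for size in [256, 64]:
    ae = keras.Sequential([
        layers.Input(shape=(inputs.shape[1],)),
        layers.Dense(size, activation="relu"),              # layer being pre-trained
        layers.Dense(inputs.shape[1], activation="sigmoid"),
    ])
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(inputs, inputs, epochs=3, batch_size=32, verbose=0)  # unsupervised step
    pretrained.append(ae.layers[0])                  # keep the encoder half
    inputs = ae.layers[0](inputs).numpy()            # features for the next layer

# Stack the pre-trained layers and fine-tune the whole network on labels:
model = keras.Sequential([layers.Input(shape=(784,)), *pretrained,
                          layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x, labels, ...) would now adjust all parameters jointly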
Advantages of Greedy Layer Wise Pre-Training

• Here are some of the advantages of Greedy Layer-Wise Pre-Training:
• Feature Learning and Representation: At various
degrees of abstraction, each layer of the network gains
the ability to identify and extract pertinent characteristics
from the incoming data. Pre-training is unsupervised, so
the model may identify underlying structures and
patterns in the data without needing labeled annotations.
Consequently, the acquired representations often exhibit
more information content and generalizability, resulting in
enhanced performance on subsequent supervised tasks.
• Regularization and Generalization: Greedy layer-wise pre-
training forces the model to acquire meaningful representations
of the input data, which functions as a kind of regularization. By
acting as a kind of regularization, the pre-trained weights direct
the learning process to areas of the parameter space where
there is a higher chance of good generalization to new data.
This aids in avoiding overfitting, particularly in situations when
training data is scarce.
• Transfer Learning and Adaptability: Greedy layer-wise pre-
training makes it easier for a pre-trained model to transfer to
new tasks or domains with little further training. This is known
as transfer learning. The model is able to effectively adapt to
new contexts and achieve acceptable performance even with
insufficient labeled data, because of the learned features that
capture general patterns in the data that are frequently
transferable across other tasks or datasets.
• Efficient Training Process: Training every layer
independently makes the entire process more effective
and less prone to convergence problems. Later, the
entire network may be fine-tuned using supervised
learning. Pre-trained weights offer an excellent starting
point for further training, which expedites the training
process by lowering the number of iterations needed for
convergence.
• Disadvantages of Greedy Layer Wise Pre-Training
Greedy Layer Wise Pre-Training has various advantages but
does come up with some limitations. Here are some of the
disadvantages of Greedy Layer Wise Pre-Training:
• Complexity and Training Time: Greedy layer-wise pre-
training teaches the neural network's layers independently
using unsupervised learning, then uses supervised learning to
fine-tune the entire network. This procedure can be costly and
time-consuming in terms of processing, particularly for large-
scale datasets and complex designs. Sequentially training
more layers calls for more processing power and might not
perform well for really deep networks.
• Difficulty in Implementation: It can be difficult to
implement greedy layer-wise pre-training, especially in
deep systems with several layers. Careful design and
implementation are needed to ensure compatibility with
the following fine-tuning processes, manage the
transfer of pre-trained weights between layers, and
coordinate the training process for each layer. The
adoption of layer-wise pre-training may be hampered by
its complexity, particularly for practitioners with little
background in deep learning.
• Dependency on Data Availability: For unsupervised
learning, greedy layer-wise pre-training needs access to
a lot of unlabeled data. While this might not be a big
deal for some domains or datasets, it might be
problematic in situations when there is a lot of labeled
data available but little or expensive unlabeled data to
get. Other pre-training approaches or data
augmentation methods could be more appropriate in
certain circumstances.
