
UNIT 1 LINEAR ALGEBRA REVIEW AND OPTIMIZATION


Brief review of concepts from Linear Algebra, types of errors, bias-variance trade-off,
overfitting, underfitting, brief review of concepts from Vector Calculus and optimization,
variants of gradient descent, momentum.
Brief review of concepts from Linear Algebra

Introduction to Linear Algebra Basics

Linear algebra forms the backbone of many concepts in deep learning. Linear algebra is
a continuous form of mathematics and is applied throughout science and engineering
because it allows you to model natural phenomena and to compute with them efficiently.
Because it is a form of continuous rather than discrete mathematics, many computer
scientists have little experience with it. Linear algebra is also central to almost
all areas of mathematics, like geometry and functional analysis. Its concepts are a crucial
prerequisite for understanding the theory behind machine learning, especially if you are
working with deep learning algorithms.

WHAT IS LINEAR ALGEBRA?


Linear algebra is the branch of mathematics that focuses on linear equations. It is often
applied in science and engineering, and specifically in machine learning.

In linear algebra, data is represented by linear equations, which are presented in the form
of matrices and vectors. Therefore, you are mostly dealing with matrices and vectors
rather than with scalars (we will cover these terms in the following section). When you
have the right libraries, like NumPy, at your disposal, you can compute complex matrix
multiplications very easily with just a few lines of code. (Note: this section ignores
concepts of linear algebra that are not important for machine learning.)

You don’t need to understand linear algebra before getting started with machine learning,
but at some point you may want to gain a better understanding of how the different
machine learning algorithms really work under the hood. This will help you to
make better decisions during a machine learning system’s development. So if you really
want to be a professional in this field, you will have to master the parts of linear algebra
that are important for machine learning.

1. SCALAR

A scalar is simply a single number, for example 24.

2. VECTOR

In deep learning, data is often represented as vectors (arrays of numbers) and scalars
(single numbers). Vectors can represent various data types like images, text
embeddings, or numerical features.

A vector is an ordered array of numbers and can be written as a row or a column. A vector has
just a single index, which can point to a specific value within the vector. For example,
v2 refers to the second value in the vector.


3. MATRIX

A matrix is an ordered 2D array of numbers and it has two indices. The first one points
to the row and the second one to the column. For example, M23 refers to the value in
the second row and the third column. A matrix can have any number of rows and
columns. Note that a vector is also a matrix, but with only one row or one column.

A matrix with two rows and three columns, for example, is a 2-by-3 matrix
(rows x columns).

Matrices: Matrices are 2D arrays of numbers that are used to represent
transformations, weights in neural networks, and various data structures in deep
learning models.
1. Matrix Operations: Linear algebra operations such as matrix multiplication,
addition, and subtraction are extensively used in deep learning. Matrix
multiplication, in particular, is pivotal in neural network computations.
2. Matrix Transpose and Inverse: Transpose of a matrix is fundamental in various
operations, like calculating gradients in optimization algorithms. The matrix
inverse is used in solving systems of linear equations, though it's less common in
deep learning due to computational cost.


4. TENSOR

You can think of a tensor as an array of numbers, arranged on a regular grid, with a
variable number of axes. A third-order tensor has three indices, where the first one points
to the row, the second to the column, and the third one to the axis. For example, T232
points to the second row, the third column, and the second axis.


Tensor is the most general term for all of these concepts, because a tensor is a
multidimensional array: it can be a vector or a matrix, depending on the number of
indices it has. A first-order tensor is a vector (one index), a second-order tensor is a
matrix (two indices), and tensors with three or more indices are called higher-order
tensors.
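
To make these objects concrete, here is a minimal NumPy sketch (the values are
illustrative) that builds a scalar, a vector, a matrix, and a third-order tensor and
indexes them:

import numpy as np

s = 24                              # scalar: a single number
v = np.array([3, -8, 5])            # vector: one index
M = np.array([[2, 5, 8],
              [1, 6, 7]])           # 2-by-3 matrix: two indices
T = np.arange(24).reshape(2, 3, 4)  # third-order tensor: three indices

print(v[1])        # second value of the vector (NumPy indices start at 0): -8
print(M[1, 2])     # second row, third column: 7
print(T[1, 2, 3])  # indexing with three indices: 23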

Computational Rules of Linear Algebra

1. MATRIX-SCALAR OPERATIONS

If you multiply, divide, subtract, or add a scalar to a matrix, you do so with every element
of the matrix. For example, multiplying a matrix by 2 doubles every element.
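
In NumPy this is just broadcasting; a minimal sketch:

import numpy as np

M = np.array([[1, 2],
              [3, 4]])

print(M * 2)   # [[2 4] [6 8]]     - every element is multiplied by 2
print(M + 10)  # [[11 12] [13 14]] - every element is increased by 10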


2. MATRIX-VECTOR MULTIPLICATION

Multiplying a matrix by a vector can be thought of as multiplying each row of the matrix
by the column of the vector. The output will be a vector that has the same number of
rows as the matrix.


To better understand the concept, we will go through a concrete calculation: multiplying
the 3-by-2 matrix with rows (1, 3), (4, 0), and (2, 1) by the vector (1, 5). To get the first
value of the resulting vector (16), we take the numbers of the vector (1 and 5) and
multiply them with the numbers of the first row of the matrix (1 and 3). This looks like this:

1*1 + 3*5 = 16

We do the same for the values within the second row of the matrix:

4*1 + 0*5 = 4

And again for the third row of the matrix:

2*1 + 1*5 = 7
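
The same calculation in NumPy, as a minimal sketch:

import numpy as np

M = np.array([[1, 3],
              [4, 0],
              [2, 1]])
v = np.array([1, 5])

print(M @ v)  # [16  4  7] - one entry per row of the matrix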



3. MATRIX-MATRIX ADDITION AND SUBTRACTION

Matrix-matrix addition and subtraction is fairly easy and straightforward. The
requirement is that the matrices have the same dimensions, and the result is a matrix that
also has the same dimensions. You just add or subtract each value of the first matrix with
its corresponding value in the second matrix.
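
A minimal NumPy sketch of element-wise addition and subtraction (the matrices must
have the same shape):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A + B)  # [[ 6  8] [10 12]] - element-wise addition
print(A - B)  # [[-4 -4] [-4 -4]] - element-wise subtraction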

4. MATRIX-MATRIX MULTIPLICATION

Multiplying two matrices together isn’t that hard either if you know how to multiply a
matrix by a vector. Note that you can only multiply matrices together if the number of
the first matrix’s columns matches the number of the second matrix’s rows. The result
will be a matrix with the same number of rows as the first matrix and the same number
of columns as the second matrix. It works as follows:


You simply split the second matrix into column vectors and multiply the first matrix
separately by each of these vectors. Then you put the results in a new matrix
(without adding them up!).
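
A minimal NumPy sketch of this column-vector view (the matrices are illustrative):

import numpy as np

A = np.array([[1, 3],
              [4, 0]])
B = np.array([[2, 5],
              [7, 1]])

# Multiply A separately by each column of B, then stack the results
C = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

print(C)      # [[23  8] [ 8 20]]
print(A @ B)  # the built-in matrix product gives the same result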



Matrix Multiplication Properties

Matrix multiplication has several properties that allow us to bundle a lot of computation
into one matrix multiplication. We will discuss them one by one below. We will start by
explaining these concepts with scalars and then with matrices, because this will give you
a better understanding of the process.

1. NOT COMMUTATIVE

Scalar multiplication is commutative but matrix multiplication is not. This means that
when we are multiplying scalars, 7*3 is the same as 3*7. But when we multiply matrices
by each other, A*B isn’t the same as B*A.

2. ASSOCIATIVE

Scalar and matrix multiplication are both associative. This means that the scalar
multiplication 3*(5*3) is the same as (3*5)*3, and that the matrix multiplication A(BC) is
the same as (AB)C.

3. DISTRIBUTIVE

Scalar and matrix multiplication are also both distributive. This means that
3(5 + 3) is the same as 3*5 + 3*3 and that A(B+C) is the same as A*B + A*C.

4. IDENTITY MATRIX

The identity matrix is a special kind of matrix, but first we need to define what an identity
is. The number 1 is an identity, because everything you multiply by 1 is equal to itself.
Therefore, every matrix that is multiplied by an identity matrix is equal to itself. For
example, matrix A times the identity matrix equals A.


You can spot an identity matrix by the fact that it has ones along its main diagonal and that
every other value is zero. It is also a square matrix, meaning that its number of rows
matches its number of columns.

We previously discussed that matrix multiplication is not commutative but there is one
exception, namely if we multiply a matrix by an identity matrix. Therefore, the following
equation is true: A*I = I*A = A
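
These properties are easy to check numerically; a minimal NumPy sketch:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [5, 2]])
C = np.array([[2, 0], [1, 3]])
I = np.eye(2, dtype=int)  # 2x2 identity matrix

print(np.array_equal(A @ B, B @ A))                # False - not commutative
print(np.array_equal(A @ (B @ C), (A @ B) @ C))    # True  - associative
print(np.array_equal(A @ (B + C), A @ B + A @ C))  # True  - distributive
print(np.array_equal(A @ I, A))                    # True  - identity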

Inverse and Transpose

The matrix inverse and the matrix transpose are two special kinds of matrix properties.
Again, we will start by discussing how these properties relate to real numbers and then
how they relate to matrices.

1. INVERSE

First of all, what is an inverse? A number that is multiplied by its inverse is equal to 1.
Note that every number except 0 has an inverse. Likewise, if you multiply a matrix by its
inverse, the result is the identity matrix. For scalars, for example, the inverse of 5 is 1/5,
because 5 * (1/5) = 1.


But not every matrix has an inverse. You can only compute the inverse of a square
matrix, and not every square matrix has one. Discussing exactly which matrices have an
inverse is unfortunately out of the scope of this section.

Why do we need an inverse? Because we can’t divide matrices. There is no concept of
dividing by a matrix, but we can multiply a matrix by an inverse, which is essentially
the same thing.

Multiplying a 2-by-2 matrix by its inverse, for example, results in the 2-by-2
identity matrix.
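
A minimal NumPy sketch (assuming the matrix is invertible):

import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)
print(A_inv)      # [[ 0.6 -0.7] [-0.2  0.4]]
print(A @ A_inv)  # the 2x2 identity matrix (up to floating-point error)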

2. TRANSPOSE

And lastly, we will discuss the matrix transpose property. This is basically the mirror
image of a matrix along its main diagonal. It is fairly simple to get the transpose of a
matrix: its first column is the first row of the matrix transpose, and its second column is
the second row of the matrix transpose. An m*n matrix is transformed into an n*m
matrix. Also, the Aij element of A is equal to the Aji element of its transpose.
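
In NumPy the transpose is simply .T; a minimal sketch:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2x3 matrix

print(A.T)                     # a 3x2 matrix: columns become rows
print(A[0, 2] == A.T[2, 0])    # True: Aij equals Aji of the transpose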

3. Eigenvalues and Eigenvectors: These concepts are important for understanding
stability, convergence, and transformations in deep learning models. For instance,
in Principal Component Analysis (PCA), eigenvalues and eigenvectors help in
data dimensionality reduction (see the sketch after this list).
4. Vector Spaces and Linear Transformations: Deep learning involves
transforming data through layers of neural networks. Understanding vector spaces
and linear transformations is crucial in grasping these operations.
5. Dot Products and Inner Products: Dot products and inner products are used to
measure the similarity between vectors, which is crucial in various neural network
layers, such as in computing attention weights.
6. Singular Value Decomposition (SVD): SVD is used for matrix factorization,
dimensionality reduction, and understanding the structure of data in deep learning
applications like recommendation systems and feature extraction.
7. Eigen-decomposition and Diagonalization: These concepts are related to the
decomposition of matrices, which can provide insights into the structure and
behavior of linear transformations in certain neural network layers.
8. Optimization: Linear algebra concepts underpin various optimization techniques
used in deep learning, such as gradient descent, where gradients (derivatives) are
computed through matrix operations to update model parameters.
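
For items 3 and 6 above, NumPy provides ready-made routines; a minimal sketch with
an illustrative matrix:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Eigen-decomposition: A v = lambda v
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)  # [2. 3.]

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)
print(S)        # singular values in descending order: [3. 2.]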


Types of errors in Deep Learning

In deep learning, various types of errors can occur during model training and evaluation.
These errors affect the performance and accuracy of the model in different ways. Here
are some common types of errors:

1. Training Error: This error occurs during the training phase when the model
makes predictions on the training data. It measures the difference between the
predicted output and the actual target for the training examples. The goal of
training is to minimize this error by adjusting the model's parameters.
2. Validation Error: Validation error is calculated by evaluating the model's
performance on a separate dataset (validation set) that the model hasn't seen
during training. It helps in assessing how well the model generalizes to new,
unseen data. High validation error indicates poor generalization.
3. Test Error: Test error measures the performance of the trained model on a
completely independent dataset (test set). This set is kept entirely separate from
both the training and validation sets. Test error provides an estimate of the model's
performance in real-world scenarios.
4. Underfitting: Underfitting occurs when a model is too simple to capture the
complexity of the data, resulting in poor performance on both the training and
validation sets. The model fails to learn the patterns in the data.
5. Overfitting: Overfitting happens when a model learns to perform exceptionally
well on the training data but fails to generalize to new, unseen data. It occurs when
the model becomes too complex, capturing noise or specific details in the training
data that don't generalize well.
6. Bias Error (or Constant Error): Bias error represents the inability of a model
to capture the true relationship between input and output. High bias often leads to
underfitting, indicating that the model is too simple to represent the underlying
patterns.
7. Variance Error: Variance error occurs due to the model's sensitivity to
fluctuations in the training data. High variance can lead to overfitting, indicating
that the model is too sensitive to the specific data points in the training set.
8. Data Leakage: Data leakage happens when information from the validation or
test set unintentionally influences the model during training. This leads to an
overly optimistic estimation of the model's performance.
9. Sampling Error: Sampling error occurs when the training, validation, or test
datasets are not representative of the entire population, leading to biased model
evaluations and inaccurate predictions.


Bias-Variance Trade-Off

Striking a balance between accuracy and the ability to make predictions beyond the
training data in an ML model is called the bias-variance tradeoff.

What is bias?

Bias in machine learning refers to the difference between a model’s predictions and the
actual distribution of the value it tries to predict. Models with high bias oversimplify the
data distribution rule/function, resulting in high errors in both the training outcomes and
test data analysis results.

Bias is typically measured by evaluating the performance of a model on a training
dataset. One common way to calculate bias is to use performance metrics such as mean
squared error (MSE) or mean absolute error (MAE), which determine the difference
between the predicted and real values of the training data.
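
A minimal sketch of both metrics in NumPy (the arrays are illustrative):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # actual training targets
y_pred = np.array([2.8, 5.4, 2.0, 7.5])  # model predictions

mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
print(mse, mae)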

Bias is a systematic error that occurs due to incorrect assumptions in the machine
learning process, leading to the misrepresentation of data distribution.

The level of bias in a model is heavily influenced by the quality and quantity of training
data involved. Using insufficient data will result in flawed predictions. At the same time,
it can also result from the choice of an inappropriate model.


High-bias model features

1. Underfitting. High-bias models often underfit the data, meaning they oversimplify
the solution. As a result, the proposed distribution does not correspond to the
actual distribution.
2. Low training accuracy. The lack of proper processing of training data results in
high training loss and low training accuracy.
3. Oversimplification. The oversimplified nature of high-bias models limits their
ability to identify complex features in the training data, making them inefficient
for solving complicated problems.

Why is bias a problem?

Bias is problematic because it indicates that your model does not accurately represent
the data it is trying to predict. However, in some situations, it may be acceptable to limit
your model to a specific area, such as a specialized medical model for women under 30,
or to add additional human control, such as calling an operator when the model is
uncertain. The primary challenge is that you may be unable to identify all of these
instances during the training and testing phase with complete certainty, which may result
in unforeseen difficulties. Therefore, you should be prepared to address such issues as
they arise.

How to reduce high bias?

There are several ways to overcome high bias:


1. Incorporating additional features from data to improve the model’s accuracy.


2. Increasing the number of training iterations to allow the model to learn more
complex data.
3. Avoiding high-bias algorithms such as linear regression, logistic regression,
discriminant analysis, etc. and instead using nonlinear algorithms such as k-
nearest neighbors, SVM, decision trees, etc.
4. Decreasing regularization at various levels to help the model learn the training set
more effectively and prevent underfitting.

What is variance?

Variance stands in contrast to bias; it measures how much the function a model learns
differs across several different sets of data values. The most common approach to measuring
variance is to perform cross-validation experiments and look at how the model
performs on different random splits of your training data.

A model with a high level of variance depends heavily on the training data and,
consequently, has a limited ability to generalize to new, unseen data. This can result
in excellent performance on training data but significantly higher error rates during
model verification on the test data. Nonlinear machine learning algorithms often have
high variance due to their high flexibility.

A complex model can learn complicated functions, which leads to higher variance.
However, if the model becomes too complex for the dataset, high variance can result in
overfitting. Low variance indicates a limited change in the target function in response to
changes in the training data, while high variance means a significant difference.


High-variance model features

 Low testing accuracy. Despite high accuracy on training data, high variance
models tend to perform poorly on test data.
 Overfitting. A high-variance model often leads to overfitting as it becomes too
complex.
 Overcomplexity. As researchers, we expect that increasing the complexity of a
model will result in improved performance on both training and testing data sets.
However, when a model becomes too complex and a simpler model may provide
the same level of accuracy, it’s better to choose the simpler one.

Model overcomplexity vs. model overfitting

Model overcomplexity and model overfitting are related, but they are not necessarily the
same thing.

To put it simply, model overfitting occurs when a model is too complex for the
distribution it is trying to predict. For instance, fitting a high-degree polynomial to data
that actually follows a simple linear trend.

Model overcomplexity, on the other hand, happens when a model has too many
parameters or a structure that is too complex to be effectively trained on the provided
data. For instance, attempting to fit a neural network with 1.5 million parameters when
only 100 objects are available. Such a model may fit the data well and have little bias,
but it will likely have high variance as there is not enough data to capture the proper
patterns with so many parameters.


Why is variance a problem?

Variance indicates that your model is not reliable. Suppose minor changes in the input
data result in significant changes in the output. In that case, the model has not properly
extracted the underlying patterns, and its decision making cannot be trusted. While
having minor and predictable errors is acceptable, it’s a significant issue when the model
performs well for some objects but completely fails for others. One possible approach to
addressing this issue is to use a rule-based system in post-production, such as
implementing a rule like “do not sell anything with a margin greater than 50%.”

How to reduce high variance?

The following methods can be used to overcome high variance:

 Reducing the number of features in the model.


 Replacing the current model with a simpler one.
 Increasing the training data diversity to balance out the complexity of the model
and the data structure.
 Avoiding high-variance algorithms (support vector machines, decision trees, k-
nearest neighbors, etc.) and opting for low-variance ones such as linear regression,
logistic regression, and linear discriminant analysis.
 Performing hyperparameter tuning to avoid overfitting.
 Increasing regularization on inputs to decrease the complexity of the model and
prevent overfitting.
 Using a new model architecture. (Like with the high bias, this should be
considered a last resort if other methods are not effective.)

What bias-variance scenarios are possible?

Now let’s take a look at the different combinations of bias and variance in machine
learning models and the results they provide.

1. Low bias, low variance: ideal model

A machine learning model with low bias and low variance is considered ideal but is not
often the case in the machine learning practice, so we can speak of “reasonable bias” and
“reasonable variance.”

2. Low bias, high variance: results in overfitting


This combination results in inconsistent predictions that are accurate on average. It
occurs when a model has too many parameters and fits too closely to the training data.

3. High bias, low variance: results in underfitting

Predictions are consistent but inaccurate on average in this scenario. This happens when
the model doesn’t learn well from the training data or has too few parameters, leading to
underfitting issues.

4. High bias, high variance: results in inaccurate predictions

With both high bias and high variance, the predictions are both inconsistent and
inaccurate on average.

How to achieve a bias-variance tradeoff?

As we have learned, bias and variance are interdependent. In other words, lowering a
model’s bias leads to an increase in its variance and vice versa. This relationship between
bias and variance is known as the bias-variance tradeoff.


The balance between bias and variance can be adjusted in specific algorithms by
modifying parameters, as seen in the following examples:

 For k-nearest neighbors, a low bias and high variance can be corrected by
increasing the value of k, which increases the bias and decreases the variance.
 For support vector machines, a low bias and high variance can be altered by
decreasing the value of the C parameter, which increases the bias but decreases
the variance.

Unfortunately, it’s impossible to determine the actual bias and variance error terms while
we are trying to predict the target function. However, bias and variance serve as useful
frameworks to understand the performance of machine learning algorithms in making
predictions.

Model              Underfitting   Compromise   Overfitting

Model Complexity   Low            Medium       High
Bias               High           Low          Low
Variance           Low            Low          High


Brief review of concepts from Vector Calculus and Optimization

Vector calculus and optimization are crucial branches of mathematics with widespread applications in
physics, engineering, computer science, and various other fields. Here's a brief review of key concepts
from each:

Vector Calculus:
1. Vectors and Vector Operations:
 Vectors represent quantities with both magnitude and direction.
 Vector operations include addition, subtraction, scalar multiplication, and dot product.

2. Vector Functions:
 Vector-valued functions map real numbers to vectors.
 Derivatives of vector functions involve partial derivatives for each component.

3. Gradient, Divergence, and Curl:
 Gradient represents the rate of change of a scalar field.
 Divergence measures the tendency of a vector field to spread out.
 Curl measures the rotation or circulation of a vector field.
 In deep learning, the optimization of neural network parameters (weights and
biases) is often done using gradient-based methods: the gradient of the loss
function with respect to the parameters is computed using vector calculus, and
backpropagation is an algorithm that efficiently computes these gradients
layer by layer.
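
As a minimal sketch of what a gradient is numerically (frameworks compute it with
backpropagation instead, but a finite-difference check is instructive; the loss below is
illustrative):

import numpy as np

def loss(w):
    # illustrative scalar field: L(w) = w0^2 + 3*w1^2
    return w[0] ** 2 + 3 * w[1] ** 2

def numerical_gradient(f, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)  # central difference
    return grad

print(numerical_gradient(loss, np.array([1.0, 2.0])))  # approximately [2. 12.]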


4. Line Integrals:
 Integrating a scalar or vector field along a curve.

5. Surface Integrals:
 Integrating a scalar or vector field over a surface.

6. Green's, Stokes', and Gauss' Theorems:
 Fundamental theorems connecting line integrals, surface integrals, and volume
integrals.

Optimization:

Optimization in Deep Learning:


1. Gradient Descent Variants:
 Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, and
Adam are optimization algorithms that use gradients to update model
parameters iteratively.


 Adaptive learning rate methods like Adam adjust the learning rate for each
parameter individually.
2. Loss Functions:
 The objective function in deep learning is often a loss function, which
measures the difference between the predicted and true values.
 Cross-entropy, mean squared error, and hinge loss are common loss
functions used in different types of neural network tasks.
3. Regularization:
 Regularization terms, often based on L1 or L2 norms, are added to the
objective function to prevent overfitting.
 They penalize large weights in the network, promoting a simpler model.
4. Hyperparameter Tuning:
 The optimization of hyperparameters, such as learning rate and momentum,
is a crucial aspect of training deep neural networks.
 Grid search, random search, and more advanced optimization techniques
are used for hyperparameter tuning.
5. Convexity Issues:
 Deep learning problems are often non-convex, meaning they can have
multiple local minima.
 Despite the lack of convexity, stochastic gradient descent variants often
perform well in practice.
6. Automatic Differentiation:
 Automatic differentiation is a computational technique that automatically
calculates gradients of a computational graph.
 It is essential for implementing backpropagation efficiently in deep learning
frameworks.

1. Introduction to Optimization:
 Optimization involves finding the best solution to a problem, often maximizing or
minimizing a function.
2. Local and Global Extrema:
 Local extrema occur at critical points; global extrema are the absolute maximum or
minimum.
3. Optimization Techniques:
 Gradient Descent: Iterative method to minimize a function by moving in the direction
of steepest descent.
 Lagrange Multipliers: Technique for constrained optimization.
4. Convex Optimization:


 Problems with convex objective functions and constraints have efficient algorithms and
unique solutions.
5. Optimization in Machine Learning:
 Widely used in training models to minimize a loss function.
 Regularization techniques to prevent overfitting.
6. Linear Programming:
 Optimization of a linear objective function subject to linear equality and inequality
constraints.


Gradient Descent Algorithms and Variations

Gradient Descent is an iterative algorithm that is used to minimize a function by finding the
optimal parameters. Gradient descent can be applied to functions of any dimension, i.e. 1-D,
2-D, 3-D. In this section, we will work on finding the global minimum of a parabolic function
and implement gradient descent in Python to find the optimal parameters for the linear
regression equation. Before diving into the implementation part, let us review the set of
parameters required to implement the gradient descent algorithm. To implement a gradient
descent algorithm, we require a cost function that needs to be minimized, the number of
iterations, a learning rate to determine the step size at each iteration while moving towards
the minimum, partial derivatives for weight and bias to update the parameters at each
iteration, and a prediction function.

Till now we have seen the parameters required for gradient descent. Now let us map the
parameters to the gradient descent algorithm and work through an example to better
understand it. Let us consider the parabolic equation y = 4x². By looking at the
equation we can identify that the parabolic function is minimum at x = 0, i.e. at x = 0, y = 0.
Therefore x = 0 is the local minimum of the parabolic function y = 4x². Now let us see the
algorithm for gradient descent and how we can obtain the local minimum by applying it:
Algorithm for Gradient Descent
Steps should be made in proportion to the negative of the function gradient (move away
from the gradient) at the current point to find local minima. Gradient Ascent is the
procedure for approaching a local maximum of a function by taking steps proportional to
the positive of the gradient (moving towards the gradient).
repeat until convergence
{
w = w - (learning_rate * (dJ/dw))
b = b - (learning_rate * (dJ/db))
}
Step 1: Initialize all the necessary parameters and derive the gradient function for the
parabolic equation 4x². The derivative of x² is 2x, so the derivative of the parabolic equation
4x² is 8x.
x0 = 3 (random initialization of x)
learning_rate = 0.01 (to determine the step size while moving towards the local minimum)


Step 2: Let us perform 3 iterations of gradient descent. For each iteration, keep updating
the value of x based on the gradient descent formula.
Iteration 1:
x1 = x0 - (learning_rate * gradient)
x1 = 3 - (0.01 * (8 * 3))
x1 = 3 - 0.24
x1 = 2.76

Iteration 2:
x2 = x1 - (learning_rate * gradient)
x2 = 2.76 - (0.01 * (8 * 2.76))
x2 = 2.76 - 0.2208
x2 = 2.5392

Iteration 3:
x3 = x2 - (learning_rate * gradient)
x3 = 2.5392 - (0.01 * (8 * 2.5392))
x3 = 2.5392 - 0.203136
x3 = 2.3360
From the above three iterations of gradient descent, we can notice that the value of x is
decreasing iteration by iteration and will slowly converge to 0 (the local minimum) by
running gradient descent for more iterations. Now you might have a question: for how many
iterations should we run gradient descent?

We can set a stopping threshold, i.e. when the difference between the previous and the
present value of x becomes less than the stopping threshold, we stop the iterations; a minimal
sketch of this stopping rule applied to our parabola is shown below. When it comes to the
implementation of gradient descent for machine learning and deep learning algorithms, we
try to minimize the algorithm's cost function using gradient descent.
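
Here is that sketch: gradient descent on y = 4x², with values matching the worked
example above:

x = 3.0               # random initialization of x
learning_rate = 0.01  # step size
threshold = 1e-6      # stopping threshold

while True:
    gradient = 8 * x  # derivative of 4x^2
    x_new = x - learning_rate * gradient
    if abs(x_new - x) < threshold:
        break
    x = x_new

print(x)  # close to 0, the minimum of y = 4x^2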

Now that we are clear about gradient descent's internal working, let us look into the Python
implementation of gradient descent, where we will minimize the cost function of the linear
regression algorithm and find the best-fit line. In our case the parameters are mentioned below:

Prediction Function
The prediction function for the linear regression algorithm is a linear equation given by
y = wx + b.
prediction_function(y) = (w * x) + b
Here, x is the independent variable
y is the dependent variable


w is the weight associated with the input variable
b is the bias

Cost Function
The cost function is used to calculate the loss based on the predictions made. In linear
regression, we use the mean squared error to calculate the loss. Mean squared error is the
mean of the squared differences between the actual and predicted values:

Cost Function (J) = (1/n) * Σ (y_pred_i - y_i)^2

Here, n is the number of samples

Partial Derivatives (Gradients)

Calculating the partial derivatives of the cost function with respect to the weight and bias,
we get:

dJ/dw = (2/n) * Σ x_i * (y_pred_i - y_i)
dJ/db = (2/n) * Σ (y_pred_i - y_i)
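
Putting the prediction function, cost function, and gradients together, a minimal sketch of
the training loop (the toy data below is illustrative):

import numpy as np

# Toy data roughly following y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

w, b = 0.0, 0.0  # initial weight and bias
learning_rate = 0.01
n = len(X)

for _ in range(5000):
    y_pred = w * X + b                       # prediction function
    dw = (2 / n) * np.sum(X * (y_pred - y))  # dJ/dw
    db = (2 / n) * np.sum(y_pred - y)        # dJ/db
    w -= learning_rate * dw                  # update the parameters
    b -= learning_rate * db

print(w, b)  # close to the best-fit slope and intercept (about 2 and 1)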

The following figure summarizes gradient descent concisely:


Figure 1: The goal of gradient descent is to iteratively take steps towards lower areas of
the loss landscape, similar to descending to the bottom of a parabola, but in multiple
dimensions.

But how does this apply to neural networks and deep learning?

Let’s address that in the next section.


How does gradient descent power neural networks and deep learning?

Figure 2: Forward and backpropagation of a neural network.

A neural network consists of one or more hidden layers. Each layer consists of a set of
parameters. Our goal is to optimize these parameters such that our loss is minimized.

Typical loss functions include binary cross-entropy (two-class classification), categorical
cross-entropy (multi-class classification), mean squared error (regression), etc.

There are many types of loss functions, each of which is used for certain roles. Instead of
getting too caught up in which loss function is being used, think of it this way:

1. We initialize our neural network with a random set of weights

2. We ask the neural network to make a prediction on a data point from our training set

3. We compute the prediction and then the loss/cost function, which tells us how
good/bad of a job we did at making the correct prediction

4. We compute the gradient of the loss

5. And then we ever-so-slightly tweak the parameters of the neural network such that
our predictions are better

We do this over and over again until our model is said to “converge” and is able to make
reliable, accurate predictions.

Here we review the fundamentals of gradient descent and focus primarily on SGD,
including two improvements to SGD: momentum and Nesterov acceleration.

There are many types of gradient descent algorithms, but the types we’ll
be focusing on here today are:

1. Vanilla gradient descent

2. Stochastic Gradient Descent (SGD)

3. Mini-batch SGD

4. SGD with momentum

5. SGD with Nesterov acceleration

Vanilla gradient descent


Consider an image dataset of N=10,000 images. Our goal is to train a
neural network to classify each of these 10,000 images into a total
of T=10 categories.


To train a neural network on this dataset we would utilize gradient descent.

In the most basic form of gradient descent, which I like to call vanilla gradient descent,
we only update the weights of the network once per epoch.

What that means is:

1. We run all 10,000 images through our network

2. We compute the loss and the gradient

3. We update the parameters of the network

In vanilla gradient descent we only update the network’s weights once per
iteration, meaning that the network sees the entire dataset every time a
weight update is performed.

In practice, that’s not very useful.

If the number of training examples is large, then vanilla gradient descent is going to take
a long time to converge due to the fact that a weight update is only happening once per
data cycle.

Furthermore, the larger your dataset gets, the more nuanced your
gradients can become, and if you’re only updating the weights once per
epoch then you’re going to be spending the majority of your time
computing predictions and not much time actually learning (which is
the goal of an optimization problem, right?)

Luckily, there are other variations of gradient descent that address this
problem.


Stochastic Gradient Descent (SGD)


Unlike vanilla gradient descent, which only does one weight update per
epoch, Stochastic Gradient Descent (SGD) instead does multiple weight
updates.

The original formulation of SGD would do N weight updates per epoch, where N is equal
to the total number of data points in your dataset. So, using our example above, if we
have N=10,000 images, then we would have 10,000 weight updates per epoch.

The SGD algorithm becomes:

 Until convergence:
    Randomly select a data point from our dataset
    Make a prediction on it
    Compute the loss and the gradient
    Update the parameters of the network

SGD tends to converge much faster because it’s able to start improving itself after each
and every weight update.

That said, performing N weight updates per epoch (where N is equal to the total number
of data points in our dataset) is also a bit computationally wasteful — we’ve now swung
to the other side of the pendulum.

What we need instead is a middle ground between the two.


Mini-batch SGD

Figure 3: Top: Vanilla gradient descent. Bottom: An illustration of mini-batch SGD with
a batch size of S=3. At each iteration data points are sampled, predictions are made, loss
is computed, and parameters of the network are updated.

While SGD can converge faster for large datasets, we actually run into
another problem — we cannot leverage our vectorized libraries that make
training super fast (again, because we are only passing one data point at
a time through the network).

There is a variation of SGD called mini-batch SGD that solves this problem. When you
hear people talking about SGD, what they are almost always referring to is mini-batch
SGD.

Mini-batch SGD introduces the concept of a batch size, S. Now, given a dataset of size N,
there will be a total of N / S updates to the network per epoch.

We can summarize the mini-batch SGD algorithm as:


 Randomly shuffle the input data
 Until convergence:
    Select the next batch of data of size S
    Make predictions on the subset
    Calculate the loss and mean gradient of the mini-batch
    Update the parameters of the network
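
A minimal NumPy sketch of this loop, using a one-parameter linear model purely for
illustration (N, S, and the data are assumptions):

import numpy as np

rng = np.random.default_rng(0)
N, S = 1000, 32                                   # dataset size and batch size
X = rng.normal(size=N)
y = 3 * X + rng.normal(scale=0.1, size=N)         # illustrative targets

w = 0.0
lr = 0.1

for epoch in range(10):
    idx = rng.permutation(N)                      # randomly shuffle the input data
    for start in range(0, N, S):                  # N / S updates per epoch
        batch = idx[start:start + S]              # select the next batch of size S
        x_b, y_b = X[batch], y[batch]
        y_pred = w * x_b                          # make predictions on the subset
        grad = np.mean(2 * x_b * (y_pred - y_b))  # mean gradient of the mini-batch
        w -= lr * grad                            # update the parameters

print(w)  # close to 3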

If you visualize each mini-batch directly then you’ll see a very noisy plot,
such as the following one:

Figure 4: Plotting the loss of every mini-batch can lead to a very noisy plot.


But when you average out the loss across all mini-batches the plot is
actually quite stable:

Figure 5: Averaging the mini-batch loss over the course of an entire epoch leads to more
stable-looking plots.

Note: Depending on what deep learning library you are using you may see
both types of plots.

When you hear deep learning practitioners talking about SGD they
are more than likely talking about mini-batch SGD.


SGD with momentum

Figure 6: Applying SGD with momentum can improve our ability to navigate ravines in
the loss landscape.

SGD has a problem when navigating areas of the loss landscape that are significantly
steeper in one dimension than in others (which you’ll see around local optima).

When this happens, SGD simply oscillates across the ravine instead of descending into
areas of lower loss (and, ideally, higher accuracy).

By applying momentum (Figure 6) we build up a head of steam in a direction and then
allow gravity to roll us faster and faster down the hill.

Typically you’ll see a momentum value of 0.9 in most SGD applications.

Get used to seeing momentum when using SGD — it is used in the majority
of neural network experiments that apply SGD.


SGD with Nesterov acceleration

Figure 7: Nesterov acceleration is an extension to SGD that may lead to better
optimization in some cases.

The problem with momentum is that once you develop a head of steam,
the train can easily become out of control and roll right over our local
minima and back up the hill again.

Basically, we shouldn’t be blindly following the slope of the gradient.

Nesterov acceleration accounts for this and helps us recognize when the
loss landscape starts sloping back up again.

Nearly all deep learning libraries that contain an SGD implementation also include
momentum and Nesterov acceleration terms.

Momentum is nearly always a good idea. Nesterov acceleration works in some situations
and not in others. You’ll want to treat them as hyperparameters you need to tune when
training your neural networks (i.e., pick values for each, run an experiment, log the
results, update the parameters, and repeat until you find a set of hyperparameters that
yields good results).


Momentum-based Gradient Optimizer

The Momentum-based Gradient Optimizer has several advantages over the basic
Gradient Descent algorithm, including faster convergence, improved stability, and the ability
to overcome local minima. It is widely used in deep learning applications and is an important
optimization technique for training deep neural networks.

Momentum-based Optimization:

An adaptive optimization algorithm uses exponentially weighted averages of gradients over
previous iterations to stabilize the convergence, resulting in quicker optimization. For
example, in most real-world applications of deep neural networks, the training is carried out
on noisy data. It is therefore necessary to reduce the effect of noise when the data are fed in
batches during optimization. This problem can be tackled using exponentially weighted
averages (or exponentially weighted moving averages).

Momentum-based Gradient Optimizer is a technique used in optimization algorithms,
such as Gradient Descent, to accelerate the convergence of the algorithm and overcome local
minima. In the Momentum-based Gradient Optimizer, a fraction of the previous update is
added to the current update, which creates a momentum effect that helps the algorithm to
move faster towards the minimum.

The momentum term can be viewed as a moving average of the gradients. The larger the
momentum term, the smoother the moving average, and the more resistant it is to changes in
the gradients. The momentum term is typically set to a value between 0 and 1, with a higher
value resulting in a more stable optimization process.

The update rule for the Momentum-based Gradient Optimizer can be expressed as follows:
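
With momentum coefficient beta and learning rate alpha, the update performed at each
step (matching the implementation at the end of this section) is:

v = beta * v + (1 - beta) * dJ/dw
w = w - alpha * v

Here v is the exponentially weighted average of the gradients and dJ/dw is the current
gradient. (Another common formulation adds the momentum term as
v = beta * v + alpha * dJ/dw; both produce the same qualitative behavior.)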


Implementing Exponentially Weighted Averages:
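
The exponentially weighted average of a series of values theta_1, theta_2, ... is computed
recursively as:

v_t = beta * v_(t-1) + (1 - beta) * theta_t

where beta (between 0 and 1) controls how strongly past values are weighted; in
momentum-based optimization, the values being averaged are the gradients. The
implementation below applies this to a simple quadratic objective: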


# Hyperparameters of the optimization algorithm
alpha = 0.01  # learning rate
beta = 0.9    # momentum coefficient

# Objective function: f(x) = x^2 - 4x + 4 = (x - 2)^2, minimized at x = 2
def obj_func(x):
    return x * x - 4 * x + 4

# Gradient of the objective function: f'(x) = 2x - 4
def grad(x):
    return 2 * x - 4

x = 0           # parameter of the objective function (initial value)
iterations = 0  # iteration counter
v = 0           # exponentially weighted average of the gradients

while True:
    iterations += 1
    # Update the moving average of the gradients, then take a step
    v = beta * v + (1 - beta) * grad(x)
    x_prev = x
    x = x - alpha * v
    print("Parameter value on iteration", iterations, "is", x)
    # Stop once the update no longer changes x (within float precision)
    if x_prev == x:
        print("Done optimizing the objective function.")
        break
