Gradient Descent
Step 2: Use the current parameter values to predict y for each data point in the training data set.
w1(new) = w1(old) − α * (gradient w.r.t. w1)
w2(new) = w2(old) − α * (gradient w.r.t. w2)
Source: https://www.analyticsvidhya.com/blog/2021/05/gradient-descent-algorithm-understanding-the-logic-behind/
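To make the update rule concrete, here is a minimal sketch of one update step, assuming a simple two-parameter linear model with a mean-squared-error cost. The data, the learning rate α, and the variable names are illustrative assumptions, not part of the original source.

```python
import numpy as np

# Illustrative data: two feature columns in X, target values in y (assumed values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])

w1, w2 = 0.0, 0.0   # current (old) parameter values
alpha = 0.01        # learning rate

# Predict y for each data point with the current parameters (Step 2 above).
y_pred = w1 * X[:, 0] + w2 * X[:, 1]

# Gradients of the mean-squared-error cost with respect to w1 and w2.
grad_w1 = (2 / len(y)) * np.sum((y_pred - y) * X[:, 0])
grad_w2 = (2 / len(y)) * np.sum((y_pred - y) * X[:, 1])

# Update rule from the slide: w(new) = w(old) - alpha * gradient
w1 = w1 - alpha * grad_w1
w2 = w2 - alpha * grad_w2
```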
How to check the function's convergence
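A common way to check convergence is to stop once the cost barely changes between successive iterations. The sketch below assumes the same two-parameter model and mean-squared-error cost as above; the tolerance and iteration cap are arbitrary illustrative values.

```python
import numpy as np

def cost(w1, w2, X, y):
    # Mean-squared-error cost for the two-parameter model above.
    y_pred = w1 * X[:, 0] + w2 * X[:, 1]
    return np.mean((y_pred - y) ** 2)

def gradient_descent(X, y, alpha=0.01, tol=1e-6, max_iters=10_000):
    w1, w2 = 0.0, 0.0
    prev_cost = cost(w1, w2, X, y)
    for _ in range(max_iters):
        y_pred = w1 * X[:, 0] + w2 * X[:, 1]
        w1 -= alpha * (2 / len(y)) * np.sum((y_pred - y) * X[:, 0])
        w2 -= alpha * (2 / len(y)) * np.sum((y_pred - y) * X[:, 1])
        new_cost = cost(w1, w2, X, y)
        # Convergence check: stop once the cost stops changing appreciably.
        if abs(prev_cost - new_cost) < tol:
            break
        prev_cost = new_cost
    return w1, w2
```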
Variants of Gradient Descent Algorithm
Stochastic gradient descent (commonly abbreviated as SGD):
● A single observation is taken randomly from the dataset to calculate the cost
function.
● We pass a single observation at a time, calculate the cost and update the
parameters.
● Each time the parameters are updated, it is known as an iteration.
● In the case of SGD, there will be ‘m’ iterations per epoch, where ‘m’ is the
number of observations in a dataset.
● The path taken by the algorithm to reach the minimum is usually noisier than
with the standard (batch) Gradient Descent algorithm.
● A weight update may reduce the error on the single observation being
presented, yet increase the error on the full training set. Given a large
number of such individual updates, however, the total error decreases.
Let's say we have 5 observations, and each observation has three features (the
feature values are taken completely at random).
Now if we use SGD, we will take the first observation, pass it through the neural network,
calculate the error, and then update the parameters.
We will then take the second observation and perform the same steps with it. This is repeated until all observations have
been passed through the network and the parameters have been updated, as sketched below.
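The walkthrough above might look like the following minimal sketch, assuming a plain linear model in place of a real neural network; the 5×3 random feature matrix, the targets, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))   # 5 observations, 3 completely random feature values each
y = rng.random(5)        # illustrative targets
w = np.zeros(3)          # one parameter per feature
alpha = 0.1              # learning rate

for epoch in range(3):                    # a few passes over the data
    for i in rng.permutation(len(y)):     # pick observations one at a time, in random order
        y_pred = X[i] @ w                 # "pass it through the network" (here: a linear model)
        error = y_pred - y[i]             # calculate the error on this single observation
        w -= alpha * 2 * error * X[i]     # update the parameters: one iteration
# With m = 5 observations, each epoch performs m = 5 parameter updates (iterations).
```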
Variants of Gradient Descent Algorithm
Mini-batch gradient descent:
● It takes a subset of the entire dataset to calculate the cost function.
● If there are 'm' observations, then the number of observations in each subset or mini-batch
will be more than 1 and less than 'm'.
● The number of observations in the subset is called batch size (b).
● The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64,
128, 256, 512, etc., because some hardware such as GPUs achieves better run times with
such common batch sizes.
● Mini-batch gradient descent makes a compromise between speedy convergence and the
noise associated with each gradient update, which makes it a more flexible and robust algorithm.
● With large training datasets, we don’t usually need more than 2–10 passes over all training
examples (epochs).
● Note: with batch size b = m (the number of training examples), we get Batch Gradient
Descent.
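A minimal sketch of the mini-batch loop is given below; the dataset, the linear model, the learning rate, and the batch size b = 64 are illustrative assumptions. Setting b = m would reduce it to Batch Gradient Descent, and b = 1 to SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 1_000, 3            # assumed dataset size and feature count
X = rng.random((m, n_features))
y = rng.random(m)
w = np.zeros(n_features)
alpha, b = 0.05, 64                 # learning rate and batch size (a power of 2)

for epoch in range(5):
    idx = rng.permutation(m)                      # shuffle the data once per epoch
    for start in range(0, m, b):
        batch = idx[start:start + b]              # next mini-batch of (up to) b examples
        y_pred = X[batch] @ w
        grad = (2 / len(batch)) * X[batch].T @ (y_pred - y[batch])
        w -= alpha * grad                         # one parameter update per mini-batch
```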
Summary
● Since we update the parameters using the entire dataset in the case of Batch Gradient Descent, the cost reduces
smoothly.
● The reduction of the cost in the case of SGD is not as smooth. Since we are updating the parameters based on a single
observation, there are a lot of iterations, and the model may also start learning the noise in the data.
● In the case of Mini-batch Gradient Descent, the cost is smoother than with SGD, since we are not updating the
parameters after every single observation but after every subset of the data.
Computational Cost of the 3 Variants of Gradient Descent
● Batch gradient descent processes all the training examples for each iteration, and many epochs are needed to find
the optimal parameters. Hence it is computationally very expensive and slow.
● The computational cost per update in the case of SGD is much lower than for Batch Gradient Descent,
since we process only a single observation at a time. However, the computation time increases for large
datasets, as there will be a greater number of iterations.
● Mini-batch gradient descent works faster than both batch gradient descent and stochastic gradient
descent. Here b examples, where b < m, are processed per iteration. So even if the number of training
examples is large, they are processed in batches of b training examples in one go. Thus, it scales to larger
training sets with a smaller number of iterations per epoch.
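To make the comparison concrete, here is a small worked example of updates per epoch (the dataset size m = 10,000 and the batch size b = 64 are arbitrary assumptions):

```python
import math

m, b = 10_000, 64                       # assumed dataset size and batch size
updates_per_epoch = {
    "Batch GD": 1,                      # one update uses all m examples
    "SGD": m,                           # one update per observation -> 10,000
    "Mini-batch GD": math.ceil(m / b),  # one update per mini-batch -> 157
}
print(updates_per_epoch)
```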
Comparison
● Batch Gradient Descent: computationally very expensive; it makes smooth updates in the model parameters.
● Stochastic Gradient Descent: quite a bit faster than batch gradient descent; it makes very noisy updates in the parameters.
● Mini-batch Gradient Descent: works faster than both batch gradient descent and stochastic gradient descent; depending upon the batch size, the updates can be made less noisy (the greater the batch size, the less noisy the update).