Gradient Descent
Gradient Descent
gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for
finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite
direction of the gradient (or approximate gradient) of the function at the current point, because this is
the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a
local maximum of that function; the procedure is then known as gradient ascent. It is particularly useful
in machine learning for minimizing the cost or loss function. Gradient descent should not be confused
with local search algorithms, although both are iterative methods for optimization.
There are several variations of gradient descent that have been developed to address different challenges
and improve the convergence speed of the algorithm. Here are some common types of gradient descent:
In batch gradient descent, the entire training dataset is used to compute the gradient at each iteration.
It calculates the average gradient over all training examples before updating the parameters.
BGD can be computationally expensive, especially for large datasets, as it requires evaluating the
entire dataset for each iteration.
In stochastic gradient descent, only one randomly selected training example is used to compute the
gradient at each iteration.
It updates the parameters based on the gradient of a single example, resulting in faster iterations.
SGD introduces more noise due to the high variance of the gradient estimate, but it can escape
shallow local minima and handle large datasets more efficiently.
It computes the gradient using a small subset or mini-batch of training examples at each iteration.
Mini-batch size is typically chosen to be between 10 and 1,000, providing a balance between stability
and computational efficiency.
Here is a table that summarizes the key differences between the three types of gradient descent:
The best type of gradient descent to use depends on the specific problem you are trying to solve.
If accuracy is your top priority, then batch gradient descent is the best option.
If you are working with a large dataset and speed is important, then stochastic gradient descent is a
good choice.
And if you want a balance between accuracy and speed, then mini-batch gradient descent is a good
option.
The intuition behind gradient descent is to find the minimum of a function by iteratively moving in the
direction of steepest descent. Let's explore the intuition behind this algorithm:
Consider a scenario where we have a function with multiple parameters (e.g., a machine learning
model with weights and biases).
We can visualize this function as a landscape, where the height of the landscape represents the value
of the function.
The goal is to find the lowest point (minimum) on this landscape, which corresponds to the optimal
parameter values.
2. Steepest descent:
To find the minimum, we want to move downhill in the direction that decreases the function value
the most.
The steepest descent direction is given by the negative gradient of the function, which points in
the direction of the greatest increase.
By taking the negative gradient, we move in the opposite direction, i.e., the direction of steepest
descent.
3. Iterative updates:
The learning rate determines the size of the steps taken in each iteration.
If the learning rate is too large, we might overshoot the minimum and diverge.
If the learning rate is too small, convergence might be slow.
Finding an appropriate learning rate is crucial for the algorithm's success.
5. Convergence:
The key intuition behind gradient descent is that by iteratively moving in the direction of steepest descent,
we can navigate the landscape of the function and find its minimum. This process allows us to optimize
various machine learning models and solve optimization problems efficiently.
Additionally, gradient descent is inspired by the idea of mimicking the behavior of a ball rolling down a hill.
The ball naturally follows the steepest descent, ultimately settling at the bottom of the hill, which
corresponds to the minimum of the function.
In [1]: # Code
from sklearn.datasets import make_regression
import numpy as np
In [2]: X, y = make_regression(
n_samples=4,
n_features=1,
n_informative=1,
n_targets=1,
noise=80,
random_state=13
)
Out[5]: ▾ LinearRegression
LinearRegression()
In [6]: reg.coef_
Out[6]: array([78.35063668])
In [7]: reg.intercept_
Out[7]: 26.15963284313262
In [8]: plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red')
In [11]: m = 78.35
b = 0
# Calculate the slope of the loss function
loss_slope = -2 * np.sum(y - m * X.ravel() - b)
# Print the slope of the loss function
loss_slope
Out[11]: -209.27763408209216
Out[12]: -20.927763408209216
In [13]:
# Calculating the new intercept (b)
b = b - step_size # 0-(-20.927763408209216)
b
Out[13]: 20.927763408209216
plt.legend()
plt.show()
In [15]: # Iteration 2
loss_slope = -2 * np.sum(y - m*X.ravel() - b) # Here b= 20.92
loss_slope
Out[15]: -41.85552681641843
Out[16]: -4.185552681641844
Out[17]: 25.11331608985106
plt.legend()
plt.show()
In [19]: # Iteration 3
loss_slope = -2 * np.sum(y - m*X.ravel() - b1) # b1= 25.11
loss_slope
Out[19]: -8.371105363283675
Out[20]: -0.8371105363283675
Out[21]: 25.95042662617943
plt.legend()
plt.show()
The beauty of gradient descent is that it takes near to the answer no matter how huge or small the
In [25]: m = 78.35
b = 100
# Calculate the slope of the loss function
loss_slope = -2 * np.sum(y - m * X.ravel() - b)
# Print the slope of the loss function
loss_slope
Out[25]: 590.7223659179078
Out[26]: 59.072236591790784
In [27]:
# Calculating the new intercept (b)
b = b - step_size # 100-(59.072236591790784 )
b
Out[27]: 40.927763408209216
In [29]: # Iteration 2
loss_slope = -2 * np.sum(y - m*X.ravel() - b) # Here b= 40.92
loss_slope
Out[29]: 118.14447318358157
Out[30]: 11.814447318358157
Out[31]: 29.11331608985106
In [33]: # Iteration 3
loss_slope = -2 * np.sum(y - m*X.ravel() - b1) # b1= 29.11
loss_slope
Out[33]: 23.62889463671634
Out[34]: 2.362889463671634
Out[35]: 26.750426626179426
plt.plot(X,y_pred1,color='#A3E4D7',label='b = {}'.format(b))
plt.plot(X,y_pred2,color='#A3E4D7',label='b1 = {}'.format(b1))
plt.plot(X,y_pred3,color='#00a65a',label='b2 = {}'.format(b2))
plt.legend()
plt.show()
b = -100
m = 78.35
lr = 0.01
epochs = 100
for i in range(epochs):
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
b = b - (lr * loss_slope)
y_pred = m * X + b
plt.plot(X,y_pred)
plt.scatter(X,y)
In [42]: plt.scatter(X,y)
In [45]: lr.fit(X_train,y_train)
print(lr.coef_)
[28.12597332]
In [46]: print(lr.intercept_)
-2.271014426178382
Out[47]: 0.6345158782661013
In [48]: print(lr.coef_)
[28.12597332]
In [49]: m = 28.12
def __init__(self,learning_rate,epochs):
self.m = 28.12
self.b = -120
self.lr = learning_rate
self.epochs = epochs
In [51]: gd = GDRegressor(0.1,10)
In [52]: gd.fit(X,y) # Here Learning rate is too High , thats the reason , we got this result
-721554187522014.0
In [53]: # lr =0.01
gd = GDRegressor(0.01,10)
In [54]: gd.fit(X,y)
-120.0
def __init__(self,learning_rate,epochs):
self.m = 28.12
self.b = -120
self.lr = learning_rate
self.epochs = epochs
In [56]: gd = GDRegressor(0.01,10)
In [57]: gd.fit(X,y)
-23537.64116001603 115.37641160016031
23537.64116001603 -120.0
-23537.64116001603 115.37641160016031
23537.64116001603 -120.0
-23537.64116001603 115.37641160016031
23537.64116001603 -120.0
-23537.64116001603 115.37641160016031
23537.64116001603 -120.0
-23537.64116001603 115.37641160016031
23537.64116001603 -120.0
-120.0
Problem
solution
In [59]: gd.fit(X,y)
-23537.64116001603 -96.46235883998396
-18830.11292801282 -77.63224591197114
-15064.090342410256 -62.568155569560886
-12051.272273928204 -50.51688329563268
-9641.017819142568 -40.87586547649011
-7712.814255314051 -33.16305122117606
-6170.251404251242 -26.99279981692482
-4936.201123400993 -22.056598693523824
-3948.9608987207935 -18.107637794803033
-3159.168718976635 -14.948469075826399
-2527.3349751813084 -12.42113410064509
-2021.8679801450467 -10.399266120500045
-1617.4943841160375 -8.781771736384007
-1293.99550729283 -7.4877762290911765
-1035.1964058342637 -6.4525798232569125
-828.157124667411 -5.624422698589502
-662.5256997339287 -4.961896998855573
-530.0205597871429 -4.4318764390684295
-424.01644782971437 -4.007859991238715
-339.2131582637714 -3.6686468329749435
-271.37052661101717 -3.3972763063639264
-217.09642128881381 -3.1801798850751126
-173.677137031051 -3.0065027480440616
-138.94170962484083 -2.867561038419221
-111.15336769987276 -2.7564076707193483
-88.92269415989811 -2.66748497655945
-71.13815532791858 -2.5963468212315317
-56.910524262334796 -2.539436296969197
-45.52841940986784 -2.4939078775593293
-36.422735527894424 -2.4574851420314348
-29.138188422315324 -2.4283469536091196
-23.310550737852424 -2.405036402871267
-18.648440590281957 -2.386387962280985
-14.918752472225457 -2.3714692098087595
-11.935001977780324 -2.359534207830979
-9.548001582224288 -2.349986206248755
-7.638401265779464 -2.3423478049829756
-6.1107210126235145 -2.336237083970352
-4.888576810098989 -2.331348507160253
-3.9108614480791175 -2.327437645712174
-3.128689158463253 -2.3243089565537107
-2.502951326770642 -2.3218060052269403
-2.0023610614164724 -2.319803644165524
-1.6018888491333385 -2.3182017553163905
-1.281511079306597 -2.316920244237084
-1.0252088634453287 -2.3158950353736385
-0.8201670907562288 -2.315074868282882
-0.6561336726047671 -2.3144187346102774
-0.5249069380840368 -2.313893827672193
-0.419925550466985 -2.3134739021217263
-0.3359404403737898 -2.3131379616813526
-0.268752352298975 -2.3128692093290537
-0.21500188183929936 -2.3126542074472143
-0.17200150547120074 -2.312482205941743
-0.1376012043770345 -2.312344604737366
-0.11008096350155938 -2.3122345237738644
-0.08806477080128161 -2.312146459003063
-0.07045181664108213 -2.312076007186422
-0.056361453312916865 -2.312019645733109
-0.04508916265030649 -2.311974556570459
-0.03607133012025088 -2.3119384852403386
localhost:8888/notebooks/ Gradient Descent ( Prudhvi Vardhan Notes).ipynb# 27/58
7/1/23, 4:01 PM Gradient Descent ( Prudhvi Vardhan Notes) - Jupyter Notebook
-0.028857064096321494 -2.311909628176242
-0.0230856512768014 -2.3118865425249653
-0.01846852102165286 -2.3118680740039435
-0.014774816817066494 -2.3118532991871263
-0.011819853453864937 -2.3118414793336726
-0.009455882762985368 -2.3118320234509095
-0.007564706210459349 -2.311824458744699
-0.006051764968219686 -2.311818406979731
-0.004841411974787491 -2.311813565567756
-0.0038731295797447274 -2.3118096924381764
-0.0030985036637218855 -2.3118065939345125
-0.0024788029310300885 -2.3118041151315816
-0.001983042344846808 -2.3118021320892366
-0.0015864338758433405 -2.3118005456553608
-0.001269147100600776 -2.31179927650826
-0.001015317680533201 -2.3117982611905794
-0.0008122541442929787 -2.311797448936435
-0.000649803315468489 -2.3117967991331194
-0.0005198426524160027 -2.311796279290467
-0.0004158741217992201 -2.311795863416345
-0.000332699297587169 -2.3117955307170477
-0.00026615943816210574 -2.3117952645576096
-0.00021292755040036582 -2.3117950516300594
-0.0001703420404339795 -2.311794881288019
-0.0001362736322789715 -2.3117947450143865
-0.00010901890571801687 -2.3117946359954806
-8.721512465115211e-05 -2.311794548780356
-6.977209962855113e-05 -2.3117944790082565
-5.581767975826324e-05 -2.3117944231905767
-4.465414388477029e-05 -2.3117943785364328
-3.572331508649995e-05 -2.311794342813118
-2.857865209193733e-05 -2.311794314234466
-2.2862921525756974e-05 -2.3117942913715446
-1.8290337379767152e-05 -2.3117942730812073
-1.4632270023184901e-05 -2.3117942584489373
-1.1705815950335818e-05 -2.3117942467431214
-9.364652811427732e-06 -2.311794237378469
-7.491722101349296e-06 -2.311794229886747
-5.99337784024101e-06 -2.311794223893369
-2.311794223893369
The effect of the learning rate on the training process can be summarized as follows:
1. Large Learning Rate: A large learning rate can cause the algorithm to take large steps during each
iteration. This may result in overshooting the optimal solution, leading to divergence or instability. The
algorithm may fail to converge, and the loss function may oscillate or increase instead of decreasing.
2. Small Learning Rate: On the other hand, a small learning rate means taking small steps during each
iteration. While this can ensure stability and prevent divergence, it can also result in slow convergence.
The algorithm may require a larger number of iterations to reach the optimal solution, increasing the
training time.
3. Appropriate Learning Rate: An appropriate learning rate strikes a balance between convergence
speed and stability. It allows the algorithm to make meaningful progress towards the optimal solution
without overshooting or oscillating too much. Finding the optimal learning rate often involves
experimentation and tuning. Techniques such as learning rate schedules, adaptive learning rates, or
automatic tuning methods can help in finding a suitable learning rate during the training process.
It's important to note that the optimal learning rate depends on various factors, including the specific
algorithm, the problem domain, and the dataset. Therefore, it is recommended to experiment with different
learning rates to find the one that works best for your specific scenario.
loss function
In [63]: plt.scatter(X,y)
In [66]: lr = LinearRegression()
[28.12597332]
-2.271014426178382
Out[68]: 0.6345158782661013
def __init__(self,learning_rate,epochs):
self.m = 100
self.b = -120
self.lr = learning_rate
self.epochs = epochs
def fit(self,X,y):
# calcualte the b using GD
for i in range(self.epochs):
loss_slope_b = -2 * np.sum(y - self.m*X.ravel() - self.b)
loss_slope_m = -2 * np.sum((y - self.m*X.ravel() - self.b)*X.ravel())
In [73]: gd =GDRegressor(0.001,100)
In [74]: gd.fit(X,y)
27.828091872608653 -2.2947448944994893
28.125973319843975 -2.27101442919929
def __init__(self,learning_rate,epochs):
self.m = 100
self.b = -120
self.lr = learning_rate
self.epochs = epochs
def fit(self,X,y):
# calcualte the b using GD
for i in range(self.epochs):
loss_slope_b = -2 * np.sum(y - self.m*X.ravel() - self.b)
loss_slope_m = -2 * np.sum((y - self.m*X.ravel() - self.b)*X.ravel())
def predict(self,X):
return self.m * X + self.b
In [88]: gd = GDRegressor(0.001,50)
In [89]: gd.fit(X_train,y_train)
28.159367347119066 -2.3004574196824854
In [90]: gd.predict(X_test)
Out[83]: 0.6343842836315579
Animation code
In [93]: %matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import matplotlib.animation as animation
[27.82809103]
-2.29474455867698
In [97]: b = -150
m = 27.82
lr = 0.001
all_b = []
all_cost = []
epochs = 30
for i in range(epochs):
slope = 0
cost = 0
for j in range(X.shape[0]):
slope = slope - 2*(y[j] - (m * X[j]) - b)
cost = cost + (y[j] - m * X[j] -b) ** 2
b = b - (lr * slope)
all_b.append(b)
all_cost.append(cost)
y_pred = m * X + b
plt.plot(X,y_pred)
plt.scatter(X,y)
In [98]:
# Flatten the array 'all_b'
all_b = np.array(all_b).ravel()
In [99]: all_b
In [100]:
# Flatten the 'all_cost' array
all_cost = np.array(all_cost).ravel()
all_cost
Gradient-descent-animation(both-m-and-b)
In [104]: from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt
In [107]: plt.scatter(X,y)
In [114]: b = -120
m = 100
lr = 0.001
all_b = []
all_m = []
all_cost = []
epochs = 30
for i in range(epochs):
slope_b = 0
slope_m = 0
cost = 0
for j in range(X.shape[0]):
slope_b = slope_b - 2*(y[j] - (m * X[j]) - b)
slope_m = slope_m - 2*(y[j] - (m * X[j]) - b)*X[j]
cost = cost + (y[j] - m * X[j] -b) ** 2
b = b - (lr * slope_b)
m = m - (lr * slope_m)
all_b.append(b)
all_m.append(m)
all_cost.append(cost)
# animation function
def animate(i):
label = 'epoch {0}'.format(i + 1)
xdata.append(num_epochs[i])
ydata.append(all_cost[i])
line.set_data(xdata, ydata)
axis.set_xlabel(label)
return line,
# animation function
def animate(i):
label = 'epoch {0}'.format(i + 1)
xdata.append(num_epochs[i])
ydata.append(all_b[i])
line.set_data(xdata, ydata)
axis.set_xlabel(label)
return line,
Gradient-descent-3d
In [122]: from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
In [124]: plt.scatter(X,y)
In [127]: fig.show()
fig.write_html("cost_function.html")
b = 150
m = -127.82
lr = 0.001
all_b = []
all_m = []
all_cost = []
epochs = 30
for i in range(epochs):
slope_b = 0
slope_m = 0
cost = 0
for j in range(X.shape[0]):
slope_b = slope_b - 2*(y[j] - (m * X[j]) - b)
slope_m = slope_m - 2*(y[j] - (m * X[j]) - b)*X[j]
cost = cost + (y[j] - m * X[j] -b) ** 2
b = b - (lr * slope_b)
m = m - (lr * slope_m)
all_b.append(b)
all_m.append(m)
all_cost.append(cost)
Cost Function
5M
4M
3M
2M
1M
150
100
50
−50
# animation function
def animate(i):
label = 'epoch {0}'.format(i + 1)
xdata.append(all_m[i])
ydata.append(all_b[i])
line.set_data(xdata, ydata)
axis.set_xlabel(label)
return line,
https://developers.google.com/machine-learning/crash-course/fitter/graph
(https://developers.google.com/machine-learning/crash-course/fitter/graph)
1. Convex Initialization: Convex initialization refers to starting the optimization process with initial
parameter values that result in a convex optimization landscape. In a convex landscape, the objective
function is a convex function, meaning that it has a single global minimum and no local minima.
Convex optimization landscapes are desirable because gradient descent is guaranteed to converge to
the global minimum in such cases, regardless of the initialization. Convex initialization can lead to
faster and more stable convergence.
2. Non-convex Initialization: Non-convex initialization, on the other hand, refers to starting the
optimization process with initial parameter values that result in a non-convex optimization landscape.
In a non-convex landscape, the objective function can have multiple local minima, making the
optimization problem more challenging. Non-convex initialization can lead to slower convergence and
Saddle Point
A saddle point is a critical point in the optimization landscape where the gradients are zero, but it is not a
local minimum or maximum. At a saddle point, the objective function may have a flat region in some
dimensions and a steep region in others. Saddle points can pose challenges for gradient descent because
the algorithm may converge slowly in the flat regions or get stuck due to the presence of zero gradients. In
high-dimensional spaces, saddle points are more prevalent than local minima.
The interplay between non-convex initialization and saddle points can impact the behavior of gradient
descent. If the optimization landscape contains many saddle points, gradient descent may get trapped
in these regions, leading to slower convergence. However, it's important to note that not all saddle
points are problematic. In fact, saddle points can serve as critical points where the algorithm can
explore different directions and potentially escape suboptimal solutions.
Several techniques have been developed to address the challenges associated with non-convex
optimization landscapes and saddle points. These include using advanced optimization algorithms,
such as momentum-based methods, adaptive learning rate techniques, or second-order optimization
methods. Additionally, initialization strategies, such as random initialization, can help to escape saddle
points and find better solutions.
In practice, the specific behavior of gradient descent with respect to convex and non-convex
initializations and saddle points depends on the problem, the data, and the optimization algorithm
used. Exploring different initialization strategies and optimizing techniques can help improve the
convergence and performance of gradient descent in non-convex landscapes.
In mathematics, a local minimum of a function is a point where the function value is smaller than at nearby
points, but possibly greater than at a distant point. A global minimum is a point where the function value is
smaller than at all other feasible points.
A saddle point of a function is a point where the function value is higher in some directions than in others.
Saddle points are not minima or maxima, but they can be important in optimization problems.
Here is a table that summarizes the differences between local and global minima and saddle points:
Here are some examples of local and global minima and saddle points:
Effect of Data
Data scaling is the process of normalizing the range of features in a dataset. This is done to ensure that all
features have a similar scale, which can help to improve the performance of machine learning algorithms.
In gradient descent, the step size of the algorithm is determined by the scale of the features. If the features
are not scaled, the step size may be too large for some features and too small for others. This can lead to
the algorithm converging to a suboptimal solution.
Scaling the features can help to improve the performance of gradient descent in a number of ways. First, it
can help to ensure that the algorithm converges to a global minimum rather than a local minimum. Second,
it can help to improve the stability of the algorithm, making it less likely to diverge or oscillate. Third, it can
help to improve the accuracy of the model, making it more likely to generalize well to new data.
There are two main methods of scaling data: Normalization and Standardization.
Normalization is the process of transforming the data so that it has a mean of 0 and a standard deviation
of 1. This can be done by subtracting the mean from each feature and then dividing by the standard
deviation.
Standardization is the process of transforming the data so that it has a mean of 0 and a standard
deviation of 1, but it also centers the data around the median. This can be done by subtracting the median
from each feature and then dividing by the interquartile range.
The best method of scaling data depends on the specific machine learning algorithm that is being used.
However, in general, normalization is a good choice for most algorithms.
Convergence speed: Scaling the features can help to improve the convergence speed of gradient
descent. This is because the step size of the algorithm will be more consistent when the features are
scaled.
Accuracy: Scaling the features can help to improve the accuracy of the model. This is because the
model will be less sensitive to noise when the features are scaled.
Stability: Scaling the features can help to improve the stability of the algorithm. This is because the
algorithm will be less likely to diverge or oscillate when the features are scaled.
Overall, data scaling can be a helpful technique for improving the performance of gradient descent. If you
are using gradient descent to train a machine learning model, it is a good idea to try scaling the data and
see if it improves the performance of the model.
In [ ]: