05 Gradient Descent
Gradient Descent
Mostafa S. Ibrahim
Teaching, Training and Coaching for more than a decade!
● f(x) = 3x² + 4x + 7
● f'(x) = 6x + 4
Positive slope case
● Intuitively, we would like to move towards
the left side (negative Δx)
● What is the slope sign at x = 0?
○ f'(0) = 4: positive slope
● This means we need to move in the
opposite direction of the slope!
● Then, keep moving towards the left until
reaching a point with zero slope
○ The minimum!
● f(x) = 3x² + 4x + 7
● f'(x) = 6x + 4
Negative slope case
● What if we started from x = -1.3?
● Intuitively, we would like to move towards
the right side (positive Δx)
● What is the slope sign at x = -1.3?
○ f'(-1.3) = -3.8: negative slope
● This means we need to move in the
opposite direction of the slope!
● Then, keep moving towards the right until
reaching a point with zero slope
○ The minimum!
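The same intuition can be checked numerically. Below is a minimal sketch (not the course's code; the helper name fixed_step_minimize is mine) that evaluates the slope of f(x) = 3x² + 4x + 7 and takes fixed steps of Δx = 0.01 against the slope's sign, starting from x = 0 and from x = -1.3.

```python
def f(x):
    return 3 * x**2 + 4 * x + 7

def f_dash(x):
    return 6 * x + 4

def fixed_step_minimize(x, delta=0.01, iterations=500):
    for _ in range(iterations):
        slope = f_dash(x)
        if slope > 0:        # positive slope: move left
            x -= delta
        elif slope < 0:      # negative slope: move right
            x += delta
        else:                # zero slope: we are at the minimum
            break
    return x

print(fixed_step_minimize(0.0))    # starts to the right of the minimum
print(fixed_step_minimize(-1.3))   # starts to the left of the minimum
```

Both runs end up near x = -2/3, where the slope of this parabola is zero.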
● f(x) = 3x² + 4x + 7
● f'(x) = 6x + 4
Gradient Descent
● An iterative algorithm to find the local minimum
○ Start from an initial location
○ Keep moving in the opposite direction of the gradient
● So far we used fixed Δx = 0.01
○ But this constant can cause issues depending on the steepness of the curve
○ It can be very slow far from the minimum, yet make moves that are too big close to the minimum!
● How can we make it dynamic? Use the gradient value itself!
● However, the gradient value itself can be very large
○ Let’s multiply by a small value
○ Let’s call it the learning rate (lr). It is a hyperparameter
■ A hyperparameter is a parameter whose value is used to control the learning process
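A minimal sketch of this update, assuming the same f(x) = 3x² + 4x + 7 from the earlier slides (the function name gradient_descent is mine): the step becomes lr · f'(x), so it is large where the curve is steep and shrinks as the slope flattens.

```python
def f_dash(x):
    return 6 * x + 4            # derivative of f(x) = 3x² + 4x + 7

def gradient_descent(x0, lr=0.01, iterations=1000):
    x = x0
    for _ in range(iterations):
        x = x - lr * f_dash(x)  # move opposite to the gradient, scaled by lr
    return x

print(gradient_descent(0.0))    # ≈ -0.6667, the minimum at x = -2/3
```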
Parameter and Hyperparameter
● Our model for a simple line has 2 parameters: m and c
○ We would like to learn these 2 parameters
● We needed an extra parameter, the learning rate (and precision)
○ We call these hyperparameters
○ We typically don’t learn them
○ But we experimentally try different values to find suitable ones
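One hedged sketch of that trial-and-error process, with assumed candidate values (not from the slides): run the same gradient descent with a few learning rates and compare where each one ends up.

```python
def f(x):
    return 3 * x**2 + 4 * x + 7

def f_dash(x):
    return 6 * x + 4

def gradient_descent(x0, lr, iterations=200):
    x = x0
    for _ in range(iterations):
        x = x - lr * f_dash(x)
    return x

# try a few candidate learning rates and compare the final points
for lr in [0.001, 0.01, 0.1]:
    x_min = gradient_descent(x0=0.0, lr=lr)
    print(f"lr = {lr}: x = {x_min:.4f}, f(x) = {f(x_min):.4f}")
```

With these assumed values, the smallest rate has not reached the minimum after 200 iterations, illustrating why the choice matters.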
Stopping Criteria
● One simple criterion is the number of iterations (e.g. 100 iterations)
○ But what if we need more?
○ However, it's still good to force an end!
● Another way is the precision
○ At each iteration we have the old x and the new x
○ Once the 2 values are almost the same, stop the program
● The best is to use both
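A small sketch combining both criteria on the same quadratic (the precision and max_iterations values below are assumptions, not from the slides): stop when old x and new x are almost equal, but always force an end after a fixed number of iterations.

```python
def f_dash(x):
    return 6 * x + 4

def gradient_descent(x0, lr=0.01, precision=1e-6, max_iterations=10_000):
    x = x0
    for it in range(max_iterations):        # criterion 1: forced upper bound
        new_x = x - lr * f_dash(x)
        if abs(new_x - x) < precision:      # criterion 2: old and new x almost equal
            return new_x, it + 1
        x = new_x
    return x, max_iterations

x_min, iterations = gradient_descent(0.0)
print(x_min, iterations)                    # stops near x = -2/3 well before 10,000
```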
Effect of large step size (learning rate)
● We may suffer from an oscillating behaviour around the minimum value
○ This means it keeps overshooting the minimum, jumping back and forth to points on either side of it
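A short sketch with an assumed, deliberately large learning rate (lr = 0.3, not a value from the slides) makes the overshooting visible on f(x) = 3x² + 4x + 7:

```python
def f_dash(x):
    return 6 * x + 4

x = 0.0
for step in range(6):
    x = x - 0.3 * f_dash(x)   # lr = 0.3 is too large for this curve
    print(f"step {step + 1}: x = {x:.4f}")
# x lands alternately to the left and right of the minimum (x = -2/3)
# instead of settling on it
```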
Code Tour
Question!
● Do we need to decrease the step size (LR) over time for the algorithm to
converge?
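One way to probe the question is a small experiment, sketched here under the assumption of the same quadratic f: keep lr fixed and print the step size lr · |f'(x)| at each iteration.

```python
def f_dash(x):
    return 6 * x + 4

x, lr = 0.0, 0.05                 # lr stays fixed the whole time
for step in range(5):
    move = lr * f_dash(x)
    print(f"step {step + 1}: x = {x:.4f}, step size = {abs(move):.4f}")
    x = x - move
# the printed step sizes shrink on their own, because the slope shrinks
# as x approaches the minimum
```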
● If you slice the graph with a plane that is parallel to the xy-plane,
then all z-values will be equal
○ For example, if z represents some cost function, then all
such (x, y) solutions will have the same cost
○ This is how we create contour plots
f(x, y) = 2x² - 4xy + y⁴ + 2
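A hedged sketch of how such a contour plot could be generated with matplotlib (the plot ranges and number of contour levels are assumptions): each contour line connects the (x, y) points whose z-values, i.e. costs, are equal.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 400)
y = np.linspace(-3, 3, 400)
X, Y = np.meshgrid(x, y)
Z = 2 * X**2 - 4 * X * Y + Y**4 + 2   # z = f(x, y)

# each contour line connects (x, y) points that share the same z value (same cost)
plt.contour(X, Y, Z, levels=30)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Contours of 2x² - 4xy + y⁴ + 2")
plt.show()
```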