Chapter 4
Training Models
Instructor:
Dr. Furqan Shoukat
Gradient Descent
• Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.
▪ Any function has one or more optima (maxima, minima), and possibly saddle points
▪ The magnitude of the derivative at a point is the rate of change of the function at that point
$$\frac{df(x)}{dx} = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x}$$
The sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing.
Understanding how 𝑓 changes its value as we change 𝑥 is helpful for understanding optimization (minimization/maximization) algorithms.
▪ The derivative becomes zero at stationary points (optima or saddle points)
▪ The function becomes “flat” at such points ($\Delta f(x) \approx 0$ if we change $x$ by a very small amount)
▪ These are the points where the function attains its maxima/minima (unless they are saddles)
Rules of Derivatives
Some basic rules of taking derivatives
▪ Sum Rule: $(f(x) + g(x))' = f'(x) + g'(x)$
▪ Scaling Rule: $(a \cdot f(x))' = a \cdot f'(x)$, if $a$ is not a function of $x$
▪ Product Rule: $(f(x) \cdot g(x))' = f'(x) \cdot g(x) + g'(x) \cdot f(x)$
▪ Quotient Rule: $(f(x)/g(x))' = \left(f'(x) \cdot g(x) - g'(x) \cdot f(x)\right)/g(x)^2$
▪ Chain Rule: $(f(g(x)))' \overset{\text{def}}{=} (f \circ g)'(x) = f'(g(x)) \cdot g'(x)$
We already used some of these rules (sum, scaling, and chain) when calculating the derivative for the linear regression model. A quick check of two of the rules follows.
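As a small sanity check (not part of the original slides), the sketch below verifies the product and chain rules symbolically with SymPy; the example functions are arbitrary choices.

```python
# Verify the product and chain rules on example functions using SymPy.
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x**2

# Product rule: (f*g)' == f'*g + g'*f
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + sp.diff(g, x) * f
print(sp.simplify(lhs - rhs))  # 0

# Chain rule: (f(g(x)))' == f'(g(x)) * g'(x)
lhs = sp.diff(sp.sin(g), x)
rhs = sp.cos(g) * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # 0
```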
Derivatives
▪ How the derivative itself changes tells us about the function’s optima
▪ First-derivative test, around a stationary point $x$ where $f'(x) = 0$:
▪ $f'(x) > 0$ just before $x$ and $f'(x) < 0$ just after $x$: $x$ is a maximum
▪ $f'(x) < 0$ just before $x$ and $f'(x) > 0$ just after $x$: $x$ is a minimum
▪ $f'(x) \approx 0$ just before and just after $x$: $x$ may be a saddle
▪ Second-derivative test:
▪ $f'(x) = 0$ and $f''(x) < 0$: $x$ is a maximum
▪ $f'(x) = 0$ and $f''(x) > 0$: $x$ is a minimum
▪ $f'(x) = 0$ and $f''(x) = 0$: $x$ may be a saddle; may need higher-order derivatives to decide
A saddle point is a point of inflection where the derivative is also zero.
▪ Saddle points are very common for loss functions of deep learning models
▪ They need to be handled carefully during optimization; a small classification sketch follows
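A minimal sketch of the second-derivative test above, classifying the stationary points of an example function with SymPy (the function $x^3 - 3x$ is an illustrative choice):

```python
# Classify stationary points via the second-derivative test.
import sympy as sp

x = sp.symbols('x', real=True)
f = x**3 - 3*x  # example function with one maximum and one minimum

stationary = sp.solve(sp.diff(f, x), x)  # points where f'(x) = 0
for p in stationary:
    second = sp.diff(f, x, 2).subs(x, p)
    if second > 0:
        kind = "minimum"
    elif second < 0:
        kind = "maximum"
    else:
        kind = "possible saddle (need higher derivatives)"
    print(f"x = {p}: {kind}")
```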
▪ Constrained optimization problems can be converted into unconstrained ones (we will see this later)
▪ For now, assume we have an unconstrained optimization problem
▪ First-order optimality: the gradient 𝒈 must be equal to zero at the optima
$$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \mathbf{0}$$
▪ Sometimes, setting 𝒈 = 𝟎 and solving for 𝒘 gives a closed-form solution (see the sketch below for linear regression)
▪ If a closed-form solution is not available, the gradient vector 𝒈 can still be used in iterative optimization algorithms such as gradient descent
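A minimal sketch of the closed-form route for linear least squares: setting the gradient of the squared loss to zero yields the normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{w} = \mathbf{X}^\top\boldsymbol{y}$. The data and variable names here are illustrative.

```python
# Solve linear least squares in closed form via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Solve X^T X w = X^T y (more stable than inverting X^T X explicitly)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w_closed)  # close to [1.5, -2.0, 0.5]
```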
Method 2: Iterative Optimization via Gradient Descent
Gradient descent is iterative: it requires several steps/iterations to find the optimal solution. Fact: the gradient gives the direction of steepest change in a function’s value, so each descent step moves against it.
Gradient Descent
▪ Initialize 𝒘 as 𝒘⁽⁰⁾
▪ For iteration 𝑡 = 0, 1, 2, … (or until convergence)
▪ Calculate the gradient 𝒈⁽ᵗ⁾ using the current iterate 𝒘⁽ᵗ⁾
▪ Set the learning rate 𝜂ₜ (we will see the justification shortly)
▪ Move in the opposite direction of the gradient:
$$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \, \boldsymbol{g}^{(t)}$$
Notes:
▪ Can this approach solve maximization problems? Yes: for maximization we can use gradient ascent, $\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} + \eta_t \, \boldsymbol{g}^{(t)}$, which moves in the direction of the gradient
▪ For convex functions, gradient descent will converge to the global minimum; for non-convex functions, a good initialization is needed
▪ The learning rate 𝜂ₜ is very important and should be set carefully (fixed or chosen adaptively); we will discuss some strategies later
▪ It can sometimes be tricky to assess convergence; we will see some methods later
A sketch of this loop in code follows.
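A minimal sketch of the gradient descent loop above, applied to the squared loss for linear regression; the names `eta` and `n_iters` and their default values are illustrative.

```python
# Gradient descent for linear regression with mean squared loss.
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    N, D = X.shape
    w = np.zeros(D)                      # initialize w as w^(0)
    for t in range(n_iters):
        residual = y - X @ w
        g = -2 * X.T @ residual / N      # gradient of the mean squared loss
        w = w - eta * g                  # move opposite to the gradient
    return w
```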
Gradient Descent: An Illustration
At a point where the gradient is negative ($\partial L / \partial w < 0$), we move in the positive direction of $w$. The learning rate is very important here: it controls the size of each step.
If the parameter starts on the right of a long plateau, it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
Figure 4-6. Gradient Descent pitfalls
Gradient Descent: Feature Scaling
• The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales, which slows Gradient Descent down considerably; standardizing the features first avoids this, as sketched below.
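A minimal sketch of standardizing features before running gradient descent, using scikit-learn's StandardScaler; the toy data is illustrative.

```python
# Standardize features (zero mean, unit variance) before gradient descent.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])            # features on very different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)       # each column: zero mean, unit variance
print(X_scaled)
```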
• Instead of computing these partial derivatives individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost function (one for each model parameter), as sketched below.
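Assuming the book's Equation 4-6, the vectorized gradient of the MSE cost is $\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\mathbf{X}^\top(\mathbf{X}\theta - \mathbf{y})$, where $m$ is the number of instances and $\mathbf{X}$ includes the bias column. A minimal sketch of Batch Gradient Descent using it, with illustrative data:

```python
# Batch Gradient Descent for linear regression using the vectorized
# MSE gradient: grad = (2/m) * X^T (X theta - y).
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)

X_b = np.c_[np.ones((m, 1)), X]          # add bias (x0 = 1) column
theta = rng.normal(size=2)               # random initialization
eta = 0.1                                # learning rate

for _ in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients
print(theta)                             # close to [4, 3]
```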
Batch Gradient Descent
Batch Gradient Descent: Learning Rates
• The primary drawback of Batch Gradient Descent is that it calculates the gradients
using the entire training set at each step, which can be very slow, especially with
large datasets.
• On the other hand, Stochastic Gradient Descent (SGD) selects a random instance from the training set at each step and computes the gradients based solely on that single example.
• This significantly speeds up the algorithm, since it processes only a small amount of data per iteration; a sketch appears below.
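A minimal sketch of Stochastic Gradient Descent for linear regression: each step uses one randomly chosen training instance. The simple decaying learning schedule (`t0`, `t1`) is an illustrative choice, and `X_b` is assumed to include a bias column.

```python
# Stochastic Gradient Descent: one random instance per update.
import numpy as np

def sgd(X_b, y, n_epochs=50, t0=5, t1=50):
    m = X_b.shape[0]
    theta = np.zeros(X_b.shape[1])
    for epoch in range(n_epochs):
        for i in range(m):
            idx = np.random.randint(m)           # pick one random instance
            xi, yi = X_b[idx], y[idx]
            gradient = 2 * xi * (xi @ theta - yi)
            eta = t0 / (epoch * m + i + t1)      # decaying learning rate
            theta = theta - eta * gradient
    return theta
```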
Stochastic Gradient Descent
▪ Consider a loss function of the form $L(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(\boldsymbol{w})$
▪ The (sub)gradient in this case can be written as
$$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \frac{1}{N} \nabla_{\boldsymbol{w}} \left[ \sum_{n=1}^{N} \ell_n(\boldsymbol{w}) \right] = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{g}_n$$
where $\boldsymbol{g}_n$ is the (sub)gradient of the loss on the $n$-th training example. This is expensive to compute: it requires visiting all the training examples in each iteration.
▪ Instead, we can use $B > 1$ uniformly randomly chosen training examples with indices $i_1, i_2, \ldots, i_B \in \{1, 2, \ldots, N\}$
▪ Using this “minibatch” of examples, we can compute a minibatch gradient
$$\boldsymbol{g} \approx \frac{1}{B} \sum_{b=1}^{B} \boldsymbol{g}_{i_b}$$
▪ Averaging helps in reducing the variance of the stochastic gradient
▪ Time complexity is $O(BD)$ per iteration in this case, where $D$ is the feature dimension; a sketch of this estimate follows
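A minimal sketch of the minibatch gradient estimate above for the squared loss; the minibatch size `B` is illustrative.

```python
# Minibatch gradient estimate: average of B per-example gradients.
import numpy as np

def minibatch_gradient(X, y, w, B=32):
    N = X.shape[0]
    idx = np.random.choice(N, size=B, replace=False)  # indices i_1, ..., i_B
    Xb, yb = X[idx], y[idx]
    # average of per-example gradients g_n = -2 (y_n - w^T x_n) x_n
    return -2 * Xb.T @ (yb - Xb @ w) / B
```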
Mini-batch Gradient Descent
The last Gradient Descent algorithm we will
look at is called Mini-batch Gradient Descent.
It is quite simple to understand once you know
Batch and Stochastic Gradient Descent: at each
step, instead of computing the gradients based
on the full training set (as in Batch GD) or
based on just one instance (as in Stochastic
GD), Mini- batch GD computes the gradients
on small random sets of instances called mini-
batches.
The main advantage of Mini-batch GD over
Stochastic GD is that you can get a
performance boost from hardware optimization Figure 4-11. Gradient Descent paths in parameter space
of matrix operations, especially when using
GPUs.
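A minimal sketch of a Mini-batch Gradient Descent loop for linear regression; each update uses a vectorized gradient over one mini-batch, which is exactly where the matrix-operation speedup comes from. `batch_size` and `eta` are illustrative, and `X_b` is assumed to include a bias column.

```python
# Mini-batch Gradient Descent: shuffle each epoch, update per mini-batch.
import numpy as np

def minibatch_gd(X_b, y, eta=0.05, batch_size=32, n_epochs=50):
    m = X_b.shape[0]
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_epochs):
        perm = np.random.permutation(m)          # shuffle each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xi, yi = X_b[idx], y[idx]
            grad = 2 / len(idx) * Xi.T @ (Xi @ theta - yi)
            theta = theta - eta * grad           # vectorized over the batch
    return theta
```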
Chapter 4 (2nd Lecture)
Training Models
Instructor:
Dr. Furqan Shoukat
Previous Lecture Recap: Gradient Descent
▪ Ridge regression objective:
$$\boldsymbol{w}_{\text{ridge}} = \arg\min_{\boldsymbol{w}} L_{\text{reg}}(\boldsymbol{w}) = \arg\min_{\boldsymbol{w}} \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \lambda\,\boldsymbol{w}^\top\boldsymbol{w}$$
▪ Its gradient:
$$\boldsymbol{g} = -2 \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)\,\boldsymbol{x}_n + 2\lambda\,\boldsymbol{w}$$
▪ Minibatch approximation: $\boldsymbol{g} \approx \frac{1}{B} \sum_{b=1}^{B} \boldsymbol{g}_{i_b}$
▪ Single-example (SGD) approximation: $\boldsymbol{g} \approx \boldsymbol{g}_i = \nabla_{\boldsymbol{w}} \ell_i(\boldsymbol{w})$
Polynomial Regression
• What if your data is actually more complex than a simple straight line?
• Surprisingly, you can actually use a linear model to fit nonlinear data.
• A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression; a quick sketch follows.
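A minimal sketch of Polynomial Regression with scikit-learn: add polynomial features, then fit a plain linear model on them. The quadratic toy data is illustrative.

```python
# Polynomial Regression: fit a linear model on expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0]**2 + X[:, 0] + 2 + rng.normal(size=m)  # quadratic data

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]

lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # close to 2, [1, 0.5]
```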
Regularized Linear Models
1. Ridge Regression
2. Lasso Regression
3. Elastic Net (a usage sketch of all three follows)
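A minimal sketch of the three regularized linear models listed above, using scikit-learn; the `alpha` and `l1_ratio` values and the toy data are illustrative.

```python
# Ridge (L2), Lasso (L1), and Elastic Net (mixed) on the same data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

for model in (Ridge(alpha=1.0),
              Lasso(alpha=0.1),                      # tends to zero out weights
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # mix of L1 and L2
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))
```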
Softmax Regression
• The idea is quite simple: when given an instance 𝒙, the Softmax Regression model first computes a score $s_k(\boldsymbol{x})$ for each class $k$, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.
• The equation to compute $s_k(\boldsymbol{x})$ should look familiar, as it is just like the equation for a Linear Regression prediction: $s_k(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{\theta}^{(k)}$, with one parameter vector $\boldsymbol{\theta}^{(k)}$ per class. A sketch of the softmax step follows.
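A minimal sketch of the score-then-softmax computation; the parameter matrix `Theta` (one row per class) and input `x` are illustrative.

```python
# Softmax step: per-class linear scores -> class probabilities.
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())  # shift scores for numerical stability
    return exps / exps.sum()

Theta = np.array([[0.5, -1.0],            # theta^(k) for each of 3 classes
                  [1.0,  0.2],
                  [-0.3, 0.8]])
x = np.array([1.0, 2.0])

scores = Theta @ x                         # s_k(x) = x^T theta^(k)
print(softmax(scores))                     # class probabilities, sum to 1
```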
Cross Entropy
The cross-entropy between two probability distributions $p$ and $q$ is defined as
$$H(p, q) = -\sum_{k} p_k \log q_k$$
When $p$ is fixed, minimizing the cross-entropy is equivalent to minimizing the Kullback–Leibler divergence $D_{\mathrm{KL}}(p \,\|\, q)$, since $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$. A small computation follows.
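A minimal sketch of computing the cross-entropy defined above; the one-hot target and predicted probabilities are illustrative.

```python
# Cross-entropy between a target distribution p and a prediction q.
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])   # true (one-hot) distribution
q = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
print(cross_entropy(p, q))      # -log(0.7) ≈ 0.357
```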
Figure 4-25. Softmax Regression decision boundaries