
CHAPTER 04

Training Models

Instructor:
Dr. Furqan Shoukat
Gradient Descent
• Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.

• The general idea of Gradient Descent is to tweak parameters


iteratively to minimize a cost function.

• What are a gradient and a cost function?


Calculus Recap: Functions and their Optima
(Here 𝑓 is the objective function of the ML problem we are solving, e.g., squared loss for regression; assume it is unconstrained for now, i.e., 𝑥 is just a real-valued number/vector.)

▪ Many ML problems require us to optimize a function 𝑓 of some variable(s) 𝑥


▪ For simplicity, assume 𝑓 is a scalar-valued function of a scalar 𝑥 (𝑓: ℝ → ℝ)
[Figure: a function 𝑓(𝑥) with several local maxima/minima, a global maximum, and a global minimum.]

Usually we are interested in global optima, but we often want to find local optima, too. For deep learning models, the local optima are often what we can find (and they usually suffice) – more later.

▪ Any function has one or more optima (maxima, minima), and maybe saddle points

▪ Finding the optima or saddles requires derivatives/gradients of the function


Derivatives
(We will sometimes use 𝑓′(𝑥) to denote the derivative.)

▪ The magnitude of the derivative at a point is the rate of change of the function at that point

    𝑑𝑓(𝑥)/𝑑𝑥 = lim_{∆𝑥→0} ∆𝑓(𝑥)/∆𝑥

▪ The sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing
▪ Understanding how 𝑓 changes its value as we change 𝑥 helps us understand optimization (minimization/maximization) algorithms
▪ The derivative becomes zero at stationary points (optima or saddle points)
▪ The function becomes “flat” (∆𝑓(𝑥) ≈ 0 if we change 𝑥 by a very small amount at such points)
▪ These are the points where the function has its maxima/minima (unless they are saddles)
Rules of Derivatives
Some basic rules of taking derivatives:

▪ Sum Rule: (𝑓(𝑥) + 𝑔(𝑥))′ = 𝑓′(𝑥) + 𝑔′(𝑥)
▪ Scaling Rule: (𝑎 ⋅ 𝑓(𝑥))′ = 𝑎 ⋅ 𝑓′(𝑥) if 𝑎 is not a function of 𝑥
▪ Product Rule: (𝑓(𝑥) ⋅ 𝑔(𝑥))′ = 𝑓′(𝑥)𝑔(𝑥) + 𝑔′(𝑥)𝑓(𝑥)
▪ Quotient Rule: (𝑓(𝑥)/𝑔(𝑥))′ = [𝑓′(𝑥)𝑔(𝑥) − 𝑔′(𝑥)𝑓(𝑥)] / 𝑔(𝑥)²
▪ Chain Rule: (𝑓 ∘ 𝑔)′(𝑥) = 𝑓′(𝑔(𝑥)) ⋅ 𝑔′(𝑥)

We already used some of these (sum, scaling, and chain) when calculating the derivative for the linear regression model.
Derivatives
▪ How the derivative itself changes tells us about the function’s optima
▪ If 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ > 0 just before 𝑥, and 𝑓′ < 0 just after 𝑥, then 𝑥 is a maximum (equivalently, 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) < 0)
▪ If 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ < 0 just before 𝑥, and 𝑓′ > 0 just after 𝑥, then 𝑥 is a minimum (equivalently, 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) > 0)
▪ If 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) = 0, then 𝑥 may be a saddle point; higher derivatives may be needed
▪ The second derivative 𝑓′′(𝑥) can provide this information
Saddle Points
▪ Points where derivative is zero but are neither minima nor maxima

A saddle point is a point of inflection where the derivative is also zero.
▪ Saddle points are very common for loss functions of deep learning models
▪ Need to be handled carefully during optimization

▪ Second or higher derivative may help identify if a stationary point is a saddle


Multivariate Functions
▪ Most functions that we see in ML are multivariate functions

▪ Example: Loss fn 𝐿(𝒘) in lin-reg was a multivar function of 𝐷-dim vector 𝒘


𝐿 𝒘 : ℝ𝐷 → ℝ
▪ Here is an illustration of a function of 2 variables (4 maxima and 5 minima)

[Figure: a surface plot of the function and its two-dimensional contour plot, i.e., what it looks like from above. Plot courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html]
Derivatives of Multivariate Functions
▪ Can define the derivative for multivariate functions as well, via the gradient

▪ Gradient of a function 𝑓(𝒙): ℝ𝐷 → ℝ is a 𝐷 × 1 vector of partial derivatives


    ∇𝑓(𝒙) = [𝜕𝑓/𝜕𝑥_1, 𝜕𝑓/𝜕𝑥_2, … , 𝜕𝑓/𝜕𝑥_𝐷]

Each element in this gradient vector tells us how much 𝑓 will change if we move a little along the corresponding coordinate (akin to the one-dim case)

▪ Optima and saddle points defined similar to one-dim case


▪ Required properties that we saw for one-dim case must be satisfied along all the directions

▪ The second derivative in this case is known as the Hessian


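As a small illustration (not from the slides), a gradient can be approximated numerically, one partial derivative per coordinate; the quadratic function and step size below are toy choices:

```python
import numpy as np

# Illustrative sketch: approximating the gradient of a multivariate
# function numerically, one partial derivative per coordinate.
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2          # f: R^2 -> R

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for d in range(len(x)):
        e = np.zeros_like(x)
        e[d] = h
        grad[d] = (f(x + e) - f(x - e)) / (2 * h)   # central difference
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))    # approx [2*x0, 6*x1] = [2, 12]
```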
Convex and Non-Convex Functions
▪ A function being optimized can be either convex or non-convex
▪ Here are a couple of examples of convex functions: convex functions are bowl-shaped and have a unique optimum (minimum). The negative of a convex function is called a concave function, which also has a unique optimum (maximum).
▪ Here are a couple of examples of non-convex functions: non-convex functions have multiple minima and are usually harder to optimize than convex functions. Loss functions of most deep learning models are non-convex.
Optimization Problems in ML
▪ The general form of an optimization problem in ML will usually be

    𝒘_opt = arg min_{𝒘∈𝑪} 𝐿(𝒘)

▪ Here 𝐿(𝒘) denotes the loss function to be optimized – usually a sum of the training error + regularizer
▪ 𝑪 is the constraint set that the solution must belong to, e.g.,
  ▪ Non-negativity constraint: all entries in 𝒘_opt must be non-negative
  ▪ Sparsity constraint: 𝒘_opt is a sparse vector with at most 𝐾 non-zeros
▪ The linear and ridge regression that we saw were unconstrained (𝒘_opt was a real-valued vector); however, it is possible to have linear/ridge regression where the solution has some constraints (e.g., non-negativity, sparsity, or even both)
▪ If no 𝑪 is specified, it is an unconstrained optimization problem
▪ Constrained optimization problems can be converted into unconstrained ones (will see later)
▪ For now, assume we have an unconstrained optimization problem

Methods for Solving Optimization Problems
Method 1: Using First-Order Optimality
▪ Very simple. Already used this approach for linear and ridge regression
▪ Called “first order” since only the gradient is used, and the gradient provides the first-order information about the function being optimized
▪ The approach works only for very simple problems where the objective is convex and there are no constraints on the values 𝒘 can take
▪ First-order optimality: the gradient 𝒈 must be equal to zero at the optima

    𝒈 = ∇𝒘 𝐿(𝒘) = 0

▪ Sometimes, setting 𝒈 = 𝟎 and solving for 𝒘 gives a closed-form solution
▪ If a closed-form solution is not available, the gradient vector 𝒈 can still be used in iterative optimization algorithms, like gradient descent
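As a minimal sketch of this idea for least squares (the data and names below are illustrative): setting the gradient of 𝐿(𝒘) = ‖𝒚 − 𝑿𝒘‖² to zero gives the normal equations 𝑿ᵀ𝑿𝒘 = 𝑿ᵀ𝒚, which can be solved directly.

```python
import numpy as np

# Minimal sketch: for least squares L(w) = ||y - Xw||^2, setting the
# gradient g = -2 X^T (y - Xw) to zero gives the normal equations
# X^T X w = X^T y, whose solution is the closed-form optimum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 examples, 3 features (toy data)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)  # solves X^T X w = X^T y
print(w_closed_form)                                # close to true_w
```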
Method 2: Iterative Optimization via Gradient Descent
▪ Iterative, since it requires several steps/iterations to find the optimal solution
▪ Fact: the gradient gives the direction of steepest change in the function’s value

Gradient Descent
▪ Initialize 𝒘 as 𝒘(0)
▪ For iteration 𝑡 = 0, 1, 2, … (or until convergence)
  ▪ Calculate the gradient 𝒈(𝑡) using the current iterate 𝒘(𝑡)
  ▪ Set the learning rate 𝜂𝑡
  ▪ Move in the opposite direction of the gradient: 𝒘(𝑡+1) = 𝒘(𝑡) − 𝜂𝑡 𝒈(𝑡)

Notes: Can we use this approach for maximization problems? Yes – for maximization we can use gradient ascent, which moves in the direction of the gradient: 𝒘(𝑡+1) = 𝒘(𝑡) + 𝜂𝑡 𝒈(𝑡). For convex functions, GD will converge to the global minimum; for non-convex functions, a good initialization is needed. The learning rate is very important and should be set carefully (fixed or chosen adaptively) – will discuss some strategies later. It can sometimes be tricky to assess convergence – will see some methods later.
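To make the loop above concrete, here is a minimal Python sketch on a toy convex function; the function, learning rate, and iteration count are illustrative choices, not part of the slides:

```python
import numpy as np

# Minimal sketch of the gradient descent loop above on a toy convex
# function f(w) = (w - 3)^2, whose gradient is f'(w) = 2 (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 10.0          # initialization w^(0)
eta = 0.1         # learning rate eta_t (kept fixed here)
for t in range(100):
    g = grad(w)               # gradient at the current iterate
    w = w - eta * g           # move opposite to the gradient
print(w)          # converges close to the minimizer w* = 3
```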
Gradient Descent: An Illustration
[Figure: 𝐿(𝒘) plotted against 𝒘, showing GD iterates 𝒘(0), 𝒘(1), 𝒘(2), 𝒘(3), … approaching an optimum 𝒘∗.]
▪ Where the gradient is negative (𝛿𝐿/𝛿𝑤 < 0), we move in the positive direction; where the gradient is positive, we move in the negative direction
▪ The learning rate is very important
▪ With a good initialization, GD finds the global minimum; with a poor one, it can get stuck at a local minimum – good initialization is very important
GD: An Example
▪ Let’s apply GD to least squares linear regression

    𝒘_LS = arg min_𝒘 𝐿(𝒘) = arg min_𝒘 Σ_{𝑛=1}^{𝑁} (𝑦_𝑛 − 𝒘^⊤𝒙_𝑛)²

▪ The gradient: 𝒈 = − Σ_{𝑛=1}^{𝑁} 2(𝑦_𝑛 − 𝒘^⊤𝒙_𝑛) 𝒙_𝑛
▪ Each GD update will be of the form

    𝒘(𝑡+1) = 𝒘(𝑡) + 𝜂𝑡 Σ_{𝑛=1}^{𝑁} 2(𝑦_𝑛 − 𝒘(𝑡)^⊤𝒙_𝑛) 𝒙_𝑛

▪ Training examples on which the current model 𝒘(𝑡) has a large prediction error contribute more to the update
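A minimal numpy sketch of this update (the toy data, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Minimal sketch of the GD update above for least squares linear regression.
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)                     # initialization w^(0)
eta = 0.001                         # learning rate
for t in range(500):
    residuals = y - X @ w           # prediction errors (y_n - w^T x_n)
    g = -2.0 * X.T @ residuals      # gradient g = -sum_n 2 (y_n - w^T x_n) x_n
    w = w - eta * g                 # w^(t+1) = w^(t) - eta * g
print(w)                            # approaches the least-squares solution
```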
Gradient Descent-Main Idea
• Concretely, you start by filling θ with random values (this is called random initialization).
• Then you improve it gradually, taking one
baby step at a time, each step attempting to
decrease the cost function (e.g., the MSE),
until the algorithm converges to a
minimum (see Figure 4-3).

Figure 4-3. Gradient Descent


Small Learning Rate

• An important parameter in Gradient


Descent is the size of the steps, determined
by the learning rate hyperparameter.

• If the learning rate is too small, the


algorithm will have to go through many
iterations to converge, which will take a
long time (see Figure 4-4).

Figure 4-4 . Learning rate too small


Large Learning Rate

On the other hand, if the learning rate


is too high, you might jump across the
valley and end up on the other side,
possibly even higher up than you were
before.

With larger and larger values, the


algorithm might diverge, failing to
find a good solution
(see Figure 4-5).
Figure 4-5. Learning rate too large
Gradient Descent-Non Convex Functions
• Finally, not all cost functions look like nice regular
bowls. There may be holes, ridges, plateaus, and all
sorts of irregular terrains, making convergence to the
minimum very difficult.

Figure 4-6 shows the two main challenges with


Gradient Descent:

if the random initialization starts the algorithm on the


left, it will converge to a local minimum, which is not as
good as the global minimum.

If it starts on the right, it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.

Figure 4-6. Gradient Descent pitfalls
Gradient Descent-Feature Scaling

The cost function has the shape of a
bowl, but it can be an elongated bowl
if the features have very different
scales.

Figure 4-7 shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has much smaller values than feature 2 (on the right).

Figure 4-7. Gradient Descent with and without feature scaling

Since feature 1 is smaller, it takes a larger change in θ1 to affect


the cost function, which is why the bowl is elongated along the θ1
axis.
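One common remedy (a hedged sketch, not from the slides) is to standardize the features before gradient-descent-based training, for example with scikit-learn's StandardScaler; the pipeline below is one typical way to combine scaling with an SGD-based regressor:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor

# Sketch: scale the features so the cost "bowl" is not elongated, then
# train a gradient-descent-based linear model on the scaled features.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000))
# model.fit(X_train, y_train)   # X_train, y_train are your training data
```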
Batch Gradient Descent
• To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θj.
• In other words, you need to calculate how much the cost function will change if you
change θj just a bit. This is called a partial derivative.
• It is like asking “What is the slope of the mountain under my feet if I face east?” and
then asking the same question facing north (and so on for all other dimensions, if you
can imagine a universe with more than three dimensions).
• Equation 4-5 computes the partial derivative of the cost function with regard to parameter θj, noted ∂MSE(θ)/∂θj.
Batch Gradient Descent

• Instead of computing these partial derivatives individually, you can use Equation 4-6 to compute
them all in one go. The gradient vector noted ∇θMSE(θ), contains all the partial derivatives of the
cost function (one for each model parameter).
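A numpy sketch consistent with this description (a hedged illustration: the toy data, learning rate, and iteration count are arbitrary, and X_b is assumed to be the feature matrix with a leading column of ones for the bias term):

```python
import numpy as np

# Sketch: compute the full-batch gradient vector of the MSE cost of a
# linear model in one vectorized step, then take a Gradient Descent step.
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)
X_b = np.c_[np.ones((m, 1)), X]          # add bias term x0 = 1

theta = rng.normal(size=2)               # random initialization
eta = 0.1                                # learning rate
for iteration in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # full-batch MSE gradient
    theta = theta - eta * gradients                  # gradient descent step
print(theta)                             # close to [4, 3]
```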
Batch Gradient Descent-Learning Rates

Figure 4-8. Gradient Descent with various learning rates


Stochastic Gradient Descent

• The primary drawback of Batch Gradient Descent is that it calculates the gradients
using the entire training set at each step, which can be very slow, especially with
large datasets.
• On the other hand, Stochastic Gradient Descent (SGD) selects a random instance
from the training set at each step and computes the gradients based solely on that
single example.
• This significantly speeds up the algorithm since it processes only a small amount
of data per iteration.
Stochastic Gradient Descent

• On the other hand, due to its stochastic (i.e., random)


nature, this algorithm is much less regular than Batch
Gradient Descent: instead of gently decreasing until it
reaches the minimum, the cost function will bounce up
and down, decreasing only on average.
• Over time it will end up very close to the minimum, but
once it gets there it will continue to bounce around,
never settling down (see Figure 4-9). So once the
algorithm stops, the final parameter values are good,
but not optimal.

Figure 4-9. Stochastic Gradient Descent


Stochastic Gradient Descent (SGD)
▪ Consider a loss function of the form 𝐿(𝒘) = (1/𝑁) Σ_{𝑛=1}^{𝑁} ℓ_𝑛(𝒘) (writing it as an average instead of a sum won’t affect the minimization of 𝐿(𝒘))
▪ The (sub)gradient in this case can be written as

    𝒈 = ∇𝒘 𝐿(𝒘) = ∇𝒘 [(1/𝑁) Σ_{𝑛=1}^{𝑁} ℓ_𝑛(𝒘)] = (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝒈_𝑛

  where 𝒈_𝑛 is the (sub)gradient of the loss on the 𝑛-th training example. This is expensive to compute – it requires going over all the training examples in each iteration
▪ Stochastic Gradient Descent (SGD) approximates 𝒈 using a single training example
▪ At iteration 𝑡, pick an index 𝑖 ∈ {1, 2, … , 𝑁} uniformly at random and approximate 𝒈 as

    𝒈 ≈ 𝒈_𝑖 = ∇𝒘 ℓ_𝑖(𝒘)

  One can show that 𝒈_𝑖 is an unbiased estimate of 𝒈, i.e., 𝔼[𝒈_𝑖] = 𝒈
▪ May take more iterations than GD to converge, but each iteration is much faster ☺
▪ SGD per-iteration cost is 𝑂(𝐷) whereas GD per-iteration cost is 𝑂(𝑁𝐷)
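A minimal sketch of SGD for least squares, where each step uses the gradient on a single randomly chosen example (data sizes, learning-rate schedule, and step count are illustrative):

```python
import numpy as np

# Sketch of SGD: approximate the gradient with one random example per step.
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
for t in range(5000):
    i = rng.integers(N)                       # pick an index uniformly at random
    g_i = -2.0 * (y[i] - X[i] @ w) * X[i]     # stochastic gradient on example i
    eta_t = 0.05 / (1 + 0.01 * t)             # a simple decaying learning rate
    w = w - eta_t * g_i
print(w)                                      # noisy, but close to true_w
```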
Minibatch SGD
▪ Gradient approximation using a single training example may be noisy: the approximation may have a high variance – this may slow down convergence, make the updates unstable, and may even give sub-optimal solutions (e.g., a local minimum where GD might have given the global minimum)
▪ We can instead use 𝐵 > 1 uniformly randomly chosen training examples with indices 𝑖_1, 𝑖_2, … , 𝑖_𝐵 ∈ {1, 2, … , 𝑁}
▪ Using this “minibatch” of examples, we can compute a minibatch gradient

    𝒈 ≈ (1/𝐵) Σ_{𝑏=1}^{𝐵} 𝒈_{𝑖_𝑏}

▪ Averaging helps reduce the variance in the stochastic gradient
▪ Time complexity is 𝑂(𝐵𝐷) per iteration in this case
Mini-batch Gradient Descent
The last Gradient Descent algorithm we will
look at is called Mini-batch Gradient Descent.
It is quite simple to understand once you know
Batch and Stochastic Gradient Descent: at each
step, instead of computing the gradients based
on the full training set (as in Batch GD) or
based on just one instance (as in Stochastic
GD), Mini- batch GD computes the gradients
on small random sets of instances called mini-
batches.
The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

Figure 4-11. Gradient Descent paths in parameter space
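A minimal numpy sketch of mini-batch gradient descent for least squares (batch size, learning rate, and step count are illustrative):

```python
import numpy as np

# Sketch of mini-batch GD: each step uses a small random subset of the data.
rng = np.random.default_rng(0)
N, D, B = 200, 3, 20
X = rng.normal(size=(N, D))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
eta = 0.01
for t in range(1000):
    idx = rng.choice(N, size=B, replace=False)        # sample a mini-batch
    residuals = y[idx] - X[idx] @ w
    g = -2.0 * X[idx].T @ residuals / B               # averaged mini-batch gradient
    w = w - eta * g
print(w)
```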
CHAPTER 04
2nd Lecture
Training Models

Instructor:
Dr. Furqan Shoukat
Previous Lecture Recap – Gradient Descent

    𝒘_LS = arg min_𝒘 𝐿(𝒘) = arg min_𝒘 Σ_{𝑛=1}^{𝑁} (𝑦_𝑛 − 𝒘^⊤𝒙_𝑛)²

• Full gradient: 𝒈 = − Σ_{𝑛=1}^{𝑁} 2(𝑦_𝑛 − 𝒘^⊤𝒙_𝑛) 𝒙_𝑛
• Minibatch gradient: 𝒈 ≈ (1/𝐵) Σ_{𝑏=1}^{𝐵} 𝒈_{𝑖_𝑏}
• Stochastic gradient: 𝒈 ≈ 𝒈_𝑖 = ∇𝒘 ℓ_𝑖(𝒘)
Polynomial Regression
• What if your data is actually more complex than a simple straight line?
• Surprisingly, you can actually use a linear model to fit nonlinear data.
• A simple way to do this is to add powers of each feature as new features, then
train a linear model on this extended set of features. This technique is called
Polynomial Regression.

Figure 4-12. Generated nonlinear and noisy dataset
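A hedged sketch of this technique with scikit-learn's PolynomialFeatures, on quadratic toy data similar in spirit to Figure 4-12 (the generated data and degree are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Sketch of Polynomial Regression: add powers of each feature as new
# features, then fit a plain linear model on the extended feature set.
rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=m)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly 2, [1, 0.5]
```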


Polynomial Regression
• If you apply high-degree Polynomial Regression, you will generally achieve a much
better fit to the training data compared to plain Linear Regression.
• For instance, in Figure 4-14, a 300-degree polynomial model is fitted to the training
data and compared with both a simple linear model and a quadratic model (2nd-
degree polynomial).
• Notice how the 300-degree polynomial model oscillates to closely match the training
instances, capturing more intricate patterns.

Figure 4-14. High-degree Polynomial Regression


Learning Curves
Learning curves can be used to determine whether the model is
overfitting/underfitting.
These are plots of the model’s performance on the training set
and the validation set as a function of the training set size (or
the training iteration).

To generate the plots, simply train the model several times


on different sized subsets of the training set.
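A hedged sketch of generating such curves with scikit-learn; the helper name learning_curves is illustrative, and X, y are assumed to be available:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sketch: train on larger and larger subsets of the training set and
# record the training and validation errors.
def learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    return np.sqrt(train_errors), np.sqrt(val_errors)   # RMSE curves

# rmse_train, rmse_val = learning_curves(LinearRegression(), X, y)
```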
Learning Curves of a Plain Regression Model
If your model is
underfitting the
training data, adding
more training
examples will not
help. You need to use
a more complex
model or come up
with better features.
Learning Curves of a 10th-Degree Polynomial Model
• The error on the training data is much
lower than with the Linear Regression
model.
• There is a gap between the curves.
This means that the model performs
significantly better on the training data
than on the validation data, which is
the hallmark of an overfitting model.

• However, if you used a much larger


training set, the two curves would
continue to get closer
Generalization Errors
Three types:
• Bias
• Variance
• Irreducible error
Bias and Variance
• Bias: this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic.
• A high-bias model is most likely to underfit the training data.

• Variance: this part is due to the model’s excessive sensitivity to small variations in the training data.
• A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.
The Bias/Variance Tradeoff
• Increasing a model’s complexity will typically increase its
variance and reduce its bias.
• Conversely, reducing a model’s complexity increases its bias
and reduces its variance.
• This is why it is called a tradeoff.
The Bias Example
Variance Example
Bias-Variance Intuition
Regularized Linear Models

For a linear model, regularization is typically achieved by


constraining the model’s weights.

1. Ridge Regression
2. Lasso Regression
3. Elastic Net

Three different ways to constrain the weights.


Ridge Regression
Lasso Regression
Elastic Net
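As a hedged illustration (not from the slides), these three regularized models are available in scikit-learn; the hyperparameter values below are arbitrary examples, not recommendations:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Sketch: three ways to constrain a linear model's weights.
ridge = Ridge(alpha=1.0)                          # L2 penalty on the weights
lasso = Lasso(alpha=0.1)                          # L1 penalty (can zero out weights)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)     # mix of L1 and L2 penalties
# Each is used like any other regressor, e.g., ridge.fit(X_train, y_train)
```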
Early Stopping
• A very different way to regularize iterative learning algorithms such as Gradient
Descent is to stop training as soon as the validation error reaches a minimum. This is
called early stopping.

Figure 4-20. Early stopping regularization
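Below is a hedged sketch of early stopping with scikit-learn's SGDRegressor, in the spirit of the description above; the epoch count, learning rate, and helper name train_with_early_stopping are illustrative, and X_train, y_train, X_val, y_val are assumed to be prepared beforehand:

```python
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Sketch: train epoch by epoch and keep the model with the lowest
# validation error.
def train_with_early_stopping(X_train, y_train, X_val, y_val, n_epochs=500):
    sgd_reg = SGDRegressor(max_iter=1, warm_start=True, tol=None,
                           learning_rate="constant", eta0=0.0005)
    best_val_error, best_model = float("inf"), None
    for epoch in range(n_epochs):
        sgd_reg.fit(X_train, y_train)             # continues where it left off
        val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
        if val_error < best_val_error:
            best_val_error, best_model = val_error, deepcopy(sgd_reg)
    return best_model
```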


Logistic Regression
• Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance
belongs to a particular class (e.g., what is the probability that this email is spam?).
• If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class
(called the positive class, labeled “1”),
• or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary
classifier
Remember Linear
regression

The logistic—noted σ(·)—is a sigmoid function (i.e., S-shaped) that outputs a


number between 0 and 1. It is defined as shown in Equation 4-14
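A minimal sketch of the logistic (sigmoid) function described above, σ(t) = 1 / (1 + exp(−t)):

```python
import numpy as np

# Sketch of the logistic (sigmoid) function: S-shaped, outputs in (0, 1).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))    # 0.5 -- the decision threshold of logistic regression
print(sigmoid(4.0))    # close to 1
print(sigmoid(-4.0))   # close to 0
```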
Training and Cost Function
• The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This
idea is captured by the cost function shown in Equation 4-16 for a single training instance x.

Figure 4-21. Logistic function


Cost Function
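A small sketch of this cost function: for a single instance the cost is −log(p̂) if y = 1 and −log(1 − p̂) if y = 0; averaged over the training set this gives the log loss. The helper name and toy values below are illustrative.

```python
import numpy as np

# Sketch of the logistic regression cost (log loss) averaged over instances.
def log_loss(y, p_hat, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.4])
print(log_loss(y, p_hat))
```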
Decision Boundaries
• Let’s use the iris dataset to illustrate Logistic Regression. This is a famous
dataset that contains the sepal and petal length and width of 150 iris flowers of
three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica (see Figure
4-22).

Figure 4-22. Flowers of three iris plant species16


Figure 4-23. Estimated probabilities and decision boundary
Figure 4-24. Linear decision boundary
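A hedged sketch of the iris example, assuming scikit-learn's bundled iris dataset and using only petal width as the feature, in the spirit of Figures 4-23 and 4-24:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Sketch: binary logistic regression to detect Iris-Virginica from petal width.
iris = datasets.load_iris()
X = iris["data"][:, 3:]                    # petal width (cm)
y = (iris["target"] == 2).astype(int)      # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict([[1.7], [1.5]]))          # predicted classes
print(log_reg.predict_proba([[1.7], [1.5]]))    # estimated probabilities
```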
Softmax Regression
• The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression.

• The idea is quite simple: when given an instance x, the Softmax Regression model first computes
a score sk(x) for each class k, then estimates the probability of each class by applying the softmax
function (also called the normalized exponential) to the scores.

• The equation to compute sk(x) should look familiar, as it is just like the equation for Linear
Regression prediction
Softmax Regression
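A minimal sketch of this two-step computation (a linear score per class followed by the softmax function); the parameter matrix and instance below are toy values:

```python
import numpy as np

# Sketch of Softmax Regression scoring: s_k(x) = theta_k^T x per class k,
# turned into class probabilities with the softmax function.
def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))   # subtract max for stability
    return exp_scores / exp_scores.sum()

Theta = np.array([[0.5, -1.0],      # one row of parameters per class k
                  [1.0,  0.2],
                  [-0.3, 0.8]])
x = np.array([1.0, 2.0])            # an instance (bias omitted for simplicity)
scores = Theta @ x                  # s_k(x) for each class
print(softmax(scores))              # class probabilities summing to 1
```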
Cross Entropy
The cross-entropy between two probability distributions p and q is defined as

(A closely related concept: the Kullback–Leibler divergence.)
Figure 4-25. Softmax Regression decision boundaries
