Chapter04_Training_Models

The document discusses linear regression, including data generation, model fitting, and optimization techniques such as the Normal Equation and Gradient Descent. It highlights the computational complexities associated with these methods and introduces variations like Stochastic and Mini-batch Gradient Descent. Additionally, it provides Python code examples for implementing these techniques using libraries like NumPy and Scikit-Learn.

Training Models
Dr. Anwar M. Mirza
Linear Regression Data

import numpy as np
X = 2 * np.random.rand(100, 1)              # 100 instances, one feature, values in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)     # linear target y = 4 + 3x plus Gaussian noise

import matplotlib.pyplot as plt


plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.savefig("generated_data_plot.png")      # the slides call a save_fig() helper; plain Matplotlib used here
plt.show()
Linear Regression
x1   x2   x3    y     ŷ
 3    1    2   3.5    ?
 6    7    9   2.8    ?
 2    5    1   1.5    ?
 8    4    0   6.9    ?
The goal is to find the best-fitting line (or hyperplane) for the training data, i.e. the one that minimizes the prediction error. In essence, find the best values for θ0, θ1, ..., θn in the model ŷ = θ0 + θ1x1 + θ2x2 + ... + θnxn.
Linear Regression Equation (Vectorized Form)
ŷ = hθ(x) = θᵀx
where θ is the model's parameter vector (θ0, θ1, ..., θn) and x is the instance's feature vector (x0, x1, ..., xn), with x0 = 1.
Linear Regression Equation (Vectorized Form)
The objective is to determine the value of θ that minimizes the MSE cost function.
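For reference, assuming the standard definition used for Linear Regression, the MSE over m training instances can be written in LaTeX as:

\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^{2}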
How to Solve This Optimization Problem?
Two main approaches:
• Normal Equation
• Gradient Descent
Solution Using Normal Equation
The MSE cost function is minimized when its gradient with respect to θ is zero:

∇θ MSE(θ̂) = (2/m) Xᵀ(Xθ̂ − y) = 0
⇒ XᵀX θ̂ = Xᵀy
⇒ θ̂ = (XᵀX)⁻¹ Xᵀy

This closed-form solution for θ̂ is called the Normal Equation.
Linear Regression Data

import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.savefig("generated_data_plot.png")      # the slides call a save_fig() helper; plain Matplotlib used here
plt.show()
Linear Regression Model Fitting
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

theta_best

array([[4.21509616], [2.77011339]])
X_new = np.array([[0], [2]])
# add x0 = 1 to each instance
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
y_predict

array([[4.21509616], [9.75532293]])

import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")


plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
Computational Complexity of Normal Equation (1)
The Normal Equation computes the inverse of X⊺X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation. In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
The SVD (Singular Value Decomposition) approach used by Scikit-Learn's LinearRegression class is about O(n^2). If you double the number of features, you multiply the computation time by roughly 4.
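As a rough illustration (not code from the slides), the same least-squares solution can be obtained through the SVD in NumPy, reusing the X_b and y defined earlier; np.linalg.lstsq and np.linalg.pinv both rely on the SVD internally:

# Least-squares solution computed via the SVD
theta_best_svd, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
theta_best_svd

# Equivalent: multiply y by the Moore–Penrose pseudoinverse of X_b
np.linalg.pinv(X_b).dot(y)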
Computational Complexity of Normal Equation (2)
Both the Normal Equation and the SVD approach get very slow
when the number of features grows large (e.g., 100,000).
On the positive side, both are linear with regard to the number of
instances in the training set (they are O(m)), so they handle large
training sets efficiently, provided they can fit in memory.
Gradient Descent
Gradient Descent is a generic optimization algorithm capable of
finding optimal solutions to a wide range of problems. The general
idea of Gradient Descent is to tweak parameters iteratively in order
to minimize a cost function.
Gradient Descent
Gradient Descent – Too Small Learning Rate
Gradient Descent – Too Large Learning Rate
Gradient Descent Pitfalls
Linear Regression – Gradient Descent
Fortunately, the MSE cost function for a Linear Regression model
happens to be a convex function, which means that if you pick any
two points on the curve, the line segment joining them never crosses
the curve.
This implies that there are no local minima, just one global
minimum. It is also a continuous function with a slope that never
changes abruptly.
These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
Linear Regression – Gradient Descent
When using Gradient Descent, you should ensure that all features have a similar
scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much
longer to converge.
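A minimal sketch of such scaling with Scikit-Learn (the scaled array would then replace X when building X_b for Gradient Descent):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature rescaled to zero mean and unit variance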
Batch Gradient Descent
To implement Gradient Descent, you need to compute the gradient of the cost function with
regard to each model parameter θj.
In other words, you need to calculate how much the cost function will change if you change θj
just a little bit. This is called a partial derivative.
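For reference, assuming the MSE definition given earlier, the partial derivative with respect to θ_j is:

\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right) x_j^{(i)}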
Batch Gradient Descent (2)
Instead of computing these partial derivatives individually, you can use Equation
4-6 to compute them all in one go.
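That equation is the gradient vector of the cost function, which packs all the partial derivatives into one matrix expression:

\nabla_{\theta} \mathrm{MSE}(\theta) = \frac{2}{m} X^{T} \left( X\theta - y \right)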
Gradient Descent Step
Once you have the gradient vector, which points uphill, just go in the opposite
direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where
the learning rate η comes into play: multiply the gradient vector by η to
determine the size of the downhill step.
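Written out, the Gradient Descent update step is:

\theta^{(\text{next step})} = \theta - \eta \, \nabla_{\theta} \mathrm{MSE}(\theta)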
Gradient Descent in Python
eta = 0.1                                 # learning rate
n_iterations = 1000
m = 100                                   # number of training instances
theta = np.random.randn(2, 1)             # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)   # gradient over the full training set
    theta = theta - eta * gradients                   # take one Gradient Descent step
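With eta = 0.1 and 1000 iterations on this dataset, theta should come out essentially identical to the Normal Equation solution found earlier (roughly [[4.215], [2.770]]), since the MSE for Linear Regression is convex.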
Gradient Descent – Learning Rates
Gradient Descent – Iterations and Tolerance
You may wonder how to set the number of iterations. If it is too low,
you will still be far away from the optimal solution when the
algorithm stops; but if it is too high, you will waste time while the
model parameters do not change anymore.
A simple solution is to set a very large number of iterations but to
interrupt the algorithm when the gradient vector becomes tiny—that
is, when its norm becomes smaller than a tiny number ϵ (called the
tolerance)—because this happens when Gradient Descent has
(almost) reached the minimum.
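A minimal sketch of that stopping rule, assuming a tolerance epsilon of 1e-6 and reusing eta, m, X_b and y from before (the variable names are illustrative):

epsilon = 1e-6                                    # tolerance on the gradient norm
theta = np.random.randn(2, 1)                     # random initialization

for iteration in range(1_000_000):                # generous upper bound on iterations
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:       # gradient is tiny: (almost) at the minimum
        break
    theta = theta - eta * gradients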
Stochastic Gradient Descent
The main problem with Batch Gradient Descent is the fact that it uses the whole
training set to compute the gradients at every step, which makes it very slow
when the training set is large.

Stochastic Gradient Descent picks a random instance in the training set at every
step and computes the gradients based only on that single instance.
Stochastic Gradient Descent (2)
Because each step uses only a single instance, SGD is much faster than Batch GD, but its path toward the minimum is far more erratic: the cost function bounces up and down and decreases only on average. The randomness helps the algorithm escape local minima, yet it also prevents it from ever settling at the minimum, so the final parameter values are good but not optimal.
Stochastic Gradient Descent (3)
One solution to this dilemma is to gradually reduce the learning rate.

The function that determines the learning rate at each iteration is called the learning schedule; a sketch of SGD with such a schedule follows below.
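A minimal sketch of Stochastic Gradient Descent with a simple learning schedule, reusing X_b, y and m from before (the epoch count and schedule constants t0, t1 are illustrative choices, not values from the slides):

n_epochs = 50
t0, t1 = 5, 50                                    # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)                          # learning rate decays as training proceeds

theta = np.random.randn(2, 1)                     # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)       # pick one training instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)   # gradient estimate from that one instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients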
Linear Regression Using SGDRegressor
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1)   # stops early once the loss stops improving by more than tol
sgd_reg.fit(X, y.ravel())                                   # ravel(): SGDRegressor expects a 1-D target array
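After fitting, the learned parameters are available as sgd_reg.intercept_ and sgd_reg.coef_; they should again be close to the values found by the Normal Equation. Note that SGDRegressor applies ℓ2 regularization by default, so pass penalty=None to match plain Linear Regression exactly.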
Mini-batch Gradient Descent
Instead of computing the gradients based on the full training set (as in Batch
GD) or based on just one instance (as in Stochastic GD), Mini-batch GD
computes the gradients on small random sets of instances called mini-batches.
The main advantage of Mini-batch GD over Stochastic GD is that you can get a
performance boost from hardware optimization of matrix operations, especially
when using GPUs.
The algorithm’s progress in parameter space is less erratic than with Stochastic
GD, especially with fairly large mini-batches.
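A minimal sketch of Mini-batch Gradient Descent with a batch size of 20 and a fixed learning rate (both choices are illustrative), again reusing X_b, y and m:

n_epochs = 50
minibatch_size = 20
eta = 0.1                                         # fixed learning rate for simplicity
theta = np.random.randn(2, 1)                     # random initialization

for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)   # reshuffle the training set each epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for start in range(0, m, minibatch_size):
        xi = X_b_shuffled[start:start + minibatch_size]
        yi = y_shuffled[start:start + minibatch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)   # gradient on one mini-batch
        theta = theta - eta * gradients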
Gradient Descent Comparison
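In brief: the Normal Equation and the SVD approach are fast when the training set is large (linear in m) but slow when the number of features is large, and they need no feature scaling; Batch GD copes well with many features but is slow on large training sets; Stochastic and Mini-batch GD handle both large training sets and many features, and can even train on data that does not fit in memory, but they require feature scaling and some tuning of the learning rate and schedule.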
Linear Regression Using Scikit-Learn
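A minimal sketch of fitting the same data with Scikit-Learn's LinearRegression class, reusing X, y and X_new from earlier:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_     # should be close to the true parameters 4 and 3
lin_reg.predict(X_new)                # predictions at x1 = 0 and x1 = 2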
