Chapter 04: Training Models
DR. ANWAR M. MIRZA
Linear Regression Data
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
[Table: example instances with known feature values and an unknown target value "?" to be predicted.]
The goal is to find the best-fitting line (or hyperplane) for the training data, i.e. the one that minimizes the error. In essence, find the best values for the parameters θ0, θ1, ..., θn in the linear regression equation that follows.
Linear Regression Equation (Vectorized form)
$\hat{y} = h_{\theta}(\mathbf{x}) = \boldsymbol{\theta} \cdot \mathbf{x} = \boldsymbol{\theta}^{T} \mathbf{x}$
The objective is to determine the value of θ that minimizes the MSE cost function.
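For reference, the MSE cost function over the m training instances is
$\mathrm{MSE}(\boldsymbol{\theta}) = \dfrac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^{T} \mathbf{x}^{(i)} - y^{(i)} \right)^{2}$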
How to Solve this Optimization
Problem?
Two main approaches:
Normal Equation
Gradient Descent
Solution Using Normal Equation
Cost function is minimum when,
$\nabla_{\boldsymbol{\theta}} \, \mathrm{MSE}(\hat{\boldsymbol{\theta}}) = \frac{2}{m} \mathbf{X}^{T} (\mathbf{X} \hat{\boldsymbol{\theta}} - \mathbf{y}) = 0$
$\Rightarrow \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{\theta}} = \mathbf{X}^{T} \mathbf{y}$
$\Rightarrow \hat{\boldsymbol{\theta}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{y}$
Linear Regression Data
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
save_fig("generated_data_plot")  # figure-saving helper defined in the accompanying notebook
plt.show()
Linear Regression Model Fitting
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Normal Equation: (X^T X)^-1 X^T y
theta_best
array([[4.21509616], [2.77011339]])
X_new = np.array([[0], [2]])
# add x0 = 1 to each instance
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
array([[4.21509616], [9.75532293]])
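As a side note, the same fit can be obtained with NumPy's SVD-based routines, which are numerically more stable than an explicit matrix inverse (a sketch, not shown on the original slide):
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)  # least-squares solver
theta_pinv = np.linalg.pinv(X_b).dot(y)  # Moore-Penrose pseudoinverse
# Both should closely match theta_best computed above.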
Stochastic Gradient Descent picks a random instance in the training set at every
step and computes the gradients based only on that single instance.
Stochastic Gradient Descent (2)
Stochastic Gradient Descent (3)
One solution to this dilemma (the random steps mean SGD keeps bouncing around and never settles exactly at the minimum) is to gradually reduce the learning rate. The function that determines the learning rate at each iteration is called the learning schedule; see the sketch below.
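A minimal sketch of Stochastic Gradient Descent with a simple learning schedule, reusing X_b and y from the Normal Equation example; the epoch count and the schedule constants t0 and t1 are illustrative assumptions:
n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters (assumed for illustration)
m = len(X_b)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)            # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)   # gradient based on that single instance
        eta = learning_schedule(epoch * m + i)         # learning rate decays over time
        theta = theta - eta * gradients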
Linear Regression Using
SGDRegressor
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1)
sgd_reg.fit(X, y.ravel())
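After fitting, the learned intercept and weight can be inspected through the standard scikit-learn attributes; they should land close to the parameters found by the Normal Equation above (exact values depend on the random state):
sgd_reg.intercept_, sgd_reg.coef_  # learned bias term and feature weight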
Mini-batch Gradient Descent
Instead of computing the gradients based on the full training set (as in Batch
GD) or based on just one instance (as in Stochastic GD), Mini-batch GD
computes the gradients on small random sets of instances called mini-batches.
The main advantage of Mini-batch GD over Stochastic GD is that you can get a
performance boost from hardware optimization of matrix operations, especially
when using GPUs.
The algorithm’s progress in parameter space is less erratic than with Stochastic
GD, especially with fairly large mini-batches.
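A minimal sketch of Mini-batch Gradient Descent on the same data; the batch size, epoch count, and learning-rate schedule are illustrative assumptions:
n_epochs = 50
minibatch_size = 20       # assumed mini-batch size
m = len(X_b)
theta = np.random.randn(2, 1)

t0, t1 = 200, 1000        # learning schedule hyperparameters (assumed)
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)   # reshuffle the training set each epoch
    X_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_shuffled[i:i + minibatch_size]     # one mini-batch of instances
        yi = y_shuffled[i:i + minibatch_size]
        gradients = 2 / minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients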
Gradient Descent Comparison
Linear Regression Using Scikit-Learn