Gradient Descent Viz (CS229)
import time
import numpy as np
import matplotlib
import matplotlib.animation
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
The hope is to give you a mechanical view of what we've done in lecture.
Visualizing these concepts makes life much easier.
Get into the habit of trying things out! Machine learning is wonderful because it is so successful.
# Inside render_points(X, y, points, isocline=True): set up the plot of the loss.
# T0, T1, J are a grid of (theta_0, theta_1) values and the loss at each grid point
# (a sketch of how such a grid can be built follows below).
if isocline:
    fig, ax = plt.subplots()
    ax.set_title('Gradient Descent')
    CS = ax.contour(T0, T1, J)           # 2D view: isoclines (level sets) of the loss
else:
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.set_title('Gradient Descent')
    CS = ax.contour3D(T0, T1, J, 50)     # 3D view of the loss surface
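The fragment above uses T0, T1, and J without defining them; they form a grid of (θ0, θ1) values and the loss at each grid point. As a rough sketch (not the notebook's actual code), here is one way such a grid could be built for the squared-error loss, assuming the X and y generated just below and an illustrative parameter range of [-5, 5]:

# Hypothetical helper: grid of parameter values plus the squared-error loss
# at each grid point, for plotting with contour / contour3D.
def make_loss_grid(X, y, lo=-5.0, hi=5.0, steps=100):
    t0 = np.linspace(lo, hi, steps)
    t1 = np.linspace(lo, hi, steps)
    T0, T1 = np.meshgrid(t0, t1)
    J = np.zeros_like(T0)
    for i in range(steps):
        for j in range(steps):
            theta_ij = np.matrix([[T0[i, j]], [T1[i, j]]])
            J[i, j] = float(np.sum(np.square(X @ theta_ij - y)))
    return T0, T1, J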
To control the situation, we'll assume there is a true value θ∗ and generate the data from it.
In machine learning we don't often have a perfect linear relationship, so we'll add just a tiny bit of noise, while keeping the data clean enough that it's clear what's going on with the algorithm.
In [4]: n = 25
        theta_true = np.random.randn(2,1)              # the true parameter theta*
        X = np.matrix(np.random.randn(n,2)@np.diag([1.5,1]))
        y = X@theta_true + 1e-3*np.random.randn(n,1)   # targets with a tiny bit of noise
Let's visualize the loss function in two ways. The main point of choosing such nice data was to get nice loss surfaces.
In [6]: render_points(X,y,[],isocline=False)
[3D rendering of the loss surface]
Iteration
The iteration rule is:
θ ← θ − α ∇θ J(θ)
where α is the step size.
For a single example, the gradient is x (xᵀθ − y), writing x = x(i) and y = y(i) for reasons of nicer markdown rendering only.
In [8]: theta = np.matrix([4,4]).T     # starting point
        T = 50                         # number of steps
        alpha = 0.01                   # step size
        points = []
        for t in range(T):
            # full-batch gradient step: grad J(theta) = X^T (X theta - y)
            theta = theta - alpha*X.T@(X@theta-y)
            points.append(theta)
        ani = render_points(X,y,points, isocline=True)
[animation of the gradient descent iterates over the isoclines of the loss]
Out[10]: matrix([[-0.00012458],
[-0.0002192 ]])
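The input cell for Out[10] is not shown. Since both entries are on the order of 10^-4, it presumably displays the gap between the final iterate and θ∗, i.e. something like the following (a hypothetical reconstruction, not the original cell):

theta - theta_true    # how far the final gradient descent iterate is from theta*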
Intuitively, if your data is already noisy, we can get just as good an answer but much faster.
In both cases, we are going to take 50 steps. But in SGD, we'll only look at one point per step, not all n points!
The update rule is just what we derived on the board!
In [11]: theta = np.matrix([4,4]).T
         T = 50
         # Shuffle the examples
         n = X.shape[0]
         perm = np.arange(n)
         np.random.shuffle(perm)
         points = []
         for t in range(T):
             i = perm[t % n]                    # walk through the shuffled order
             xi,yi = X[i,:].T, y[i]
             # single-example gradient step: grad = x (x^T theta - y)
             theta = theta - 0.1*xi*(xi.T@theta-yi)
             points.append(theta)
         ani = render_points(X,y,points, isocline=True)
[animation of the SGD iterates over the isoclines of the loss]
Convergence
Some of the most interesting models used today are not bowl-shaped, and they may have spurious local minima (or not! Sometimes we don't know!).
What does gradient descent do in this situation? It's unclear!
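The demo that followed is not preserved here, but as a small illustration of the point (not the original notebook's demo), here is gradient descent on a simple non-convex one-dimensional function, f(w) = w^4 - 3w^2 + w, which has two local minima; the function and step size are chosen purely for illustration:

# Gradient descent on f(w) = w**4 - 3*w**2 + w; f'(w) = 4*w**3 - 6*w + 1.
def f_grad(w):
    return 4*w**3 - 6*w + 1

for w0 in [-2.0, 2.0]:
    w = w0
    for _ in range(200):
        w = w - 0.01*f_grad(w)
    print('start at %+.1f -> converged near w = %.3f' % (w0, w))

Starting on the left we slide into one valley, starting on the right we slide into the other; gradient descent never sees the minimum it did not start near.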
Take-away points
1. SGD can be faster than batch gradient descent: intuitively, when the dataset contains redundancy (say the same point occurs many times),
SGD could complete before batch gradient descent finishes a single iteration!
An aside for the researchers and mathematics types: almost all theory assumes that you sample with replacement, but the shuffling approach above is more widely used in practice. Shockingly, while some theory (http://proceedings.mlr.press/v23/recht12/recht12.pdf) suggests shuffling is better, only recently have folks been able to show that it is!
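For completeness, a rough sketch (not the notebook's code) of the with-replacement variant that most of the theory analyzes, reusing X, y, T, and n from above and the same step size of 0.1; theta_wr is a new name used only for this illustration:

theta_wr = np.matrix([4,4]).T
for t in range(T):
    i = np.random.randint(n)                  # fresh random index each step, drawn with replacement
    xi, yi = X[i,:].T, y[i]
    theta_wr = theta_wr - 0.1*xi*(xi.T@theta_wr - yi)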
In [15]: url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
         d=pd.read_csv(url, header=None).values    # UCI Automobile (imports-85) data as a raw array
Out[16]: matrix([[1.00870733],
[5.51742967]])
Out[17]: matrix([[0.99620855],
[5.85008478]])
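The input cells that produced Out[16] and Out[17] are not shown; presumably they fit a two-parameter linear model to a pair of numeric columns of the imports-85 data in two different ways and compared the estimates. A rough sketch under that assumption (the column choice, scaling, step size, and variable names here are purely illustrative and will not reproduce the numbers above):

# Illustrative only: bias + one standardized feature (column 21, horsepower),
# target = price (column 25) in thousands of dollars; '?' marks missing values.
df = pd.read_csv(url, header=None, na_values='?')
sub = df[[21, 25]].dropna().astype(float).values
feat = (sub[:,0] - sub[:,0].mean()) / sub[:,0].std()
Xr = np.matrix(np.column_stack([feat, np.ones(len(feat))]))
yr = np.matrix(sub[:,1]).T / 1000.0
m = Xr.shape[0]

theta_gd = np.matrix([0.0, 0.0]).T
for t in range(500):
    theta_gd = theta_gd - 0.1*Xr.T@(Xr@theta_gd - yr)/m      # batch gradient descent

theta_ls, *_ = np.linalg.lstsq(Xr, yr, rcond=None)            # closed-form least squares, for comparison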