Gradient Descent Viz (CS229)
import time
import numpy as np
import matplotlib
import matplotlib.animation
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
The hope is to give you a mechanical view of what we've done in lecture.
Visualizing these concepts makes life much easier.
Get into the habit of trying things out! Machine learning is wonderful because it is so successful.
# Inside render_points(X, y, points, isocline=True): set up the plot of the loss.
# T0, T1, J are a grid of (theta_0, theta_1) values and the loss at each grid point
# (a sketch of how such a grid can be built follows below).
if isocline:
    fig, ax = plt.subplots()
    ax.set_title('Gradient Descent')
    CS = ax.contour(T0, T1, J)           # 2D view: isoclines (level sets) of the loss
else:
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.set_title('Gradient Descent')
    CS = ax.contour3D(T0, T1, J, 50)     # 3D view of the loss surface
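The fragment above uses T0, T1, and J without defining them; they form a grid of (θ0, θ1) values and the loss at each grid point. As a rough sketch (not the notebook's actual code), here is one way such a grid could be built for the squared-error loss, assuming the X and y generated just below and an illustrative parameter range of [-5, 5]:

# Hypothetical helper: grid of parameter values plus the squared-error loss
# at each grid point, for plotting with contour / contour3D.
def make_loss_grid(X, y, lo=-5.0, hi=5.0, steps=100):
    t0 = np.linspace(lo, hi, steps)
    t1 = np.linspace(lo, hi, steps)
    T0, T1 = np.meshgrid(t0, t1)
    J = np.zeros_like(T0)
    for i in range(steps):
        for j in range(steps):
            theta_ij = np.matrix([[T0[i, j]], [T1[i, j]]])
            J[i, j] = float(np.sum(np.square(X @ theta_ij - y)))
    return T0, T1, J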
To control the situation, we'll assume there is a true value θ∗ and generate the data from it.
In machine learning we don't often have a perfect linear relationship, so we'll add just a tiny bit of noise, while keeping the data clean enough that it's clear what's going on with the algorithm.
In [4]: n = 25
        theta_true = np.random.randn(2,1)              # the true parameter theta*
        X = np.matrix(np.random.randn(n,2)@np.diag([1.5,1]))
        y = X@theta_true + 1e-3*np.random.randn(n,1)   # targets with a tiny bit of noise
Let's visualize the loss function in two ways. The main point of choosing such nice data was to get nice loss surfaces.
In [6]: render_points(X,y,[],isocline=False)
[3D rendering of the loss surface]
Iteration
The iteration rule is:
θ ← θ − α ∇θ J(θ)
where α is the step size.
For a single example, the gradient is x (xᵀθ − y), writing x = x(i) and y = y(i) for reasons of nicer markdown rendering only.
In [8]: theta = np.matrix([4,4]).T     # starting point
        T = 50                         # number of steps
        alpha = 0.01                   # step size
        points = []
        for t in range(T):
            # full-batch gradient step: grad J(theta) = X^T (X theta - y)
            theta = theta - alpha*X.T@(X@theta-y)
            points.append(theta)
        ani = render_points(X,y,points, isocline=True)
[animation of the gradient descent iterates over the isoclines of the loss]
Out[10]: matrix([[-0.00012458],
[-0.0002192 ]])
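The input cell for Out[10] is not shown. Since both entries are on the order of 10^-4, it presumably displays the gap between the final iterate and θ∗, i.e. something like the following (a hypothetical reconstruction, not the original cell):

theta - theta_true    # how far the final gradient descent iterate is from theta*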
Intuitively, if your data is already noisy, we can get just as good an answer but much faster.
In both cases, we are going to take 50 steps. But in SGD, we'll only look at one point per step, not all n points!
The update rule is just what we derived on the board!
In [11]: theta = np.matrix([4,4]).T
         T = 50
         # Shuffle the examples
         n = X.shape[0]
         perm = np.arange(n)
         np.random.shuffle(perm)
         points = []
         for t in range(T):
             i = perm[t % n]                    # walk through the shuffled order
             xi,yi = X[i,:].T, y[i]
             # single-example gradient step: grad = x (x^T theta - y)
             theta = theta - 0.1*xi*(xi.T@theta-yi)
             points.append(theta)
         ani = render_points(X,y,points, isocline=True)
[animation of the SGD iterates over the isoclines of the loss]
Convergence
Some of the most interesting models used today are not bowl-shaped, and they may have spurious local minima (or not! Sometimes we don't know!).
What does gradient descent do in this situation? It's unclear!
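The demo that followed is not preserved here, but as a small illustration of the point (not the original notebook's demo), here is gradient descent on a simple non-convex one-dimensional function, f(w) = w^4 - 3w^2 + w, which has two local minima; the function and step size are chosen purely for illustration:

# Gradient descent on f(w) = w**4 - 3*w**2 + w; f'(w) = 4*w**3 - 6*w + 1.
def f_grad(w):
    return 4*w**3 - 6*w + 1

for w0 in [-2.0, 2.0]:
    w = w0
    for _ in range(200):
        w = w - 0.01*f_grad(w)
    print('start at %+.1f -> converged near w = %.3f' % (w0, w))

Starting on the left we slide into one valley, starting on the right we slide into the other; gradient descent never sees the minimum it did not start near.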
Take-away points
1. SGD can be faster than batch gradient descent: intuitively, when the dataset contains redundancy (say the same point occurs many times),
SGD could complete before batch gradient descent finishes a single iteration!
An aside for the researchers and mathematics types: almost all theory assumes that you sample with replacement, but the shuffling approach above is more widely used in practice. Shockingly, while some theory (http://proceedings.mlr.press/v23/recht12/recht12.pdf) suggests shuffling is better, only recently have folks been able to show that it is!
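For completeness, a rough sketch (not the notebook's code) of the with-replacement variant that most of the theory analyzes, reusing X, y, T, and n from above and the same step size of 0.1; theta_wr is a new name used only for this illustration:

theta_wr = np.matrix([4,4]).T
for t in range(T):
    i = np.random.randint(n)                  # fresh random index each step, drawn with replacement
    xi, yi = X[i,:].T, y[i]
    theta_wr = theta_wr - 0.1*xi*(xi.T@theta_wr - yi)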
In [15]: url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
         d=pd.read_csv(url, header=None).values    # UCI Automobile (imports-85) data as a raw array
Out[16]: matrix([[1.00870733],
[5.51742967]])
Out[17]: matrix([[0.99620855],
[5.85008478]])
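The input cells that produced Out[16] and Out[17] are not shown; presumably they fit a two-parameter linear model to a pair of numeric columns of the imports-85 data in two different ways and compared the estimates. A rough sketch under that assumption (the column choice, scaling, step size, and variable names here are purely illustrative and will not reproduce the numbers above):

# Illustrative only: bias + one standardized feature (column 21, horsepower),
# target = price (column 25) in thousands of dollars; '?' marks missing values.
df = pd.read_csv(url, header=None, na_values='?')
sub = df[[21, 25]].dropna().astype(float).values
feat = (sub[:,0] - sub[:,0].mean()) / sub[:,0].std()
Xr = np.matrix(np.column_stack([feat, np.ones(len(feat))]))
yr = np.matrix(sub[:,1]).T / 1000.0
m = Xr.shape[0]

theta_gd = np.matrix([0.0, 0.0]).T
for t in range(500):
    theta_gd = theta_gd - 0.1*Xr.T@(Xr@theta_gd - yr)/m      # batch gradient descent

theta_ls, *_ = np.linalg.lstsq(Xr, yr, rcond=None)            # closed-form least squares, for comparison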