Linear Regression With Multiple Features
Similarly, instead of thinking of J as a function of the n+1 separate numbers θ0, θ1, ..., θn, think of J as a function of a single parameter vector
J(θ)
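For reference, the cost in the vector parameterization (standard from the univariate case, written here in LaTeX):

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x) = \theta^T x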
Gradient descent
Learning Rate α
Focus on the learning rate (α)
Topics
Update rule
Debugging
How to choose α
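A minimal Octave sketch tying these three together, assuming a design matrix X with the x0 = 1 column already prepended and a label vector y (both hypothetical here); the standard debugging check is to plot J(θ) per iteration, and a common way to choose α is to try 0.001, 0.003, 0.01, ... and keep the largest value for which J still falls every iteration:

    % X: [m x (n+1)] design matrix, y: [m x 1] labels (assumed to exist)
    alpha = 0.01;                  % learning rate - the knob to tune
    iters = 400;
    m = length(y);
    theta = zeros(size(X, 2), 1);
    J_history = zeros(iters, 1);

    for k = 1:iters
        % Simultaneous update rule: theta := theta - alpha * (1/m) * X' * (X*theta - y)
        theta = theta - (alpha / m) * (X' * (X * theta - y));
        J_history(k) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
    end

    plot(1:iters, J_history)       % debugging: J should decrease every iteration;
    xlabel('iteration')            % if it grows or oscillates, shrink alpha
    ylabel('J(theta)')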
Example
House price prediction
Two features
Frontage - width of the plot of land along road (x1)
Depth - depth away from road (x2)
You don't have to use just two features
Can create new features
Might decide that an important feature is the land area
So, create a new feature = frontage * depth (x3)
h(x) = θ0 + θ1 x3
Area is a better indicator
Often, by defining new features you may get a better model
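In Octave, assuming x1 and x2 are [m × 1] column vectors of frontage and depth over the training set, the new feature is a one-liner:

    x3 = x1 .* x2;   % element-wise product: land area = frontage * depth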
Polynomial regression
May fit the data better
h(x) = θ0 + θ1x + θ2x², e.g. here we have a quadratic function
For housing data could use a quadratic function
But a quadratic may not fit the data so well - the curve eventually turns back down, which would mean housing prices decrease when size gets really big
So instead you might use a cubic function: h(x) = θ0 + θ1x + θ2x² + θ3x³
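As a hedged Octave sketch (variable names illustrative): polynomial regression is just linear regression on engineered features, so map size x to x, x², x³, then feature-scale, since the ranges diverge badly (if x reaches 10³, x³ reaches 10⁹):

    % x: [m x 1] vector of house sizes (assumed to exist)
    Xpoly = [x, x .^ 2, x .^ 3];          % cubic feature mapping
    mu    = mean(Xpoly);                  % column means
    sigma = std(Xpoly);                   % column standard deviations
    Xpoly = (Xpoly - mu) ./ sigma;        % feature scaling (uses broadcasting)
    X     = [ones(length(x), 1), Xpoly];  % prepend the x0 = 1 column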
Normal equation
For some linear regression problems the normal equation provides a better solution
So far we've been using gradient descent
Iterative algorithm which takes steps to converge
Normal equation solves θ analytically
Solve for the optimum value of theta
Has some advantages and disadvantages
In the lecture's example: m = 4 (training examples), n = 4 (features)
To implement the normal equation
Take examples
Add an extra column (x0 feature)
Construct a matrix (X, the design matrix) which contains all the training data features in an [m × (n+1)] matrix
Do something similar for y
Construct a column vector y, an [m × 1] matrix
Use the following equation: θ = (XᵀX)⁻¹Xᵀy
If you compute this, you get the value of θ which minimizes the cost function
General case
theta = pinv(X' * X) * X' * y   % Octave: X' is the transpose, pinv is the pseudo-inverse
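Putting the steps together as a runnable Octave sketch; the feature values and prices below are made-up toy numbers, not the lecture's table:

    % Toy training set: 4 examples, 2 features (size, bedrooms) - numbers invented
    features = [1000 2; 1500 3; 2000 3; 2500 4];
    y = [200; 300; 400; 500];               % toy prices
    m = size(features, 1);

    X = [ones(m, 1), features];             % design matrix with x0 = 1, [m x (n+1)]
    theta = pinv(X' * X) * X' * y           % θ minimizing J(θ): no α, no iterations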
When should you use gradient descent and when should you use the normal equation?
Gradient descent
Need to choose a learning rate
Needs many iterations - could make it slower
Works well even when n is massive (millions)
Better suited to big data
What is a big n though?
100 or even 1000 is still (relatively) small
If n is 10,000 or more, start to consider gradient descent
Normal equation
No need to choose a learning rate
No need to iterate, check for convergence etc.
Normal equation needs to compute (XᵀX)⁻¹
This is the inverse of an n × n matrix
With most implementations, computing a matrix inverse grows as O(n³)
So not great
Slow if n is large
Can be much slower
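To make the O(n³) point concrete, a rough Octave timing sketch (sizes and timings are illustrative and machine-dependent; doubling n should multiply the time by roughly 8):

    for n = [256 512 1024 2048]
        A = rand(n);                  % stand-in for X'X, an n x n matrix
        tic; pinv(A); t = toc;        % inversion cost grows roughly as O(n^3)
        fprintf('n = %4d: %.3f s\n', n, t);
    end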