Linear Regression Program So Far
Linear Regression Program So Far
Linear Regression
Slide Type
Program so far
Slide Type
Story so far
We have tried understanding and predicting the price of a house in NY using various statistical
techniques so far.
Simple, if someone is determined to pursue a Bachelor's degree, Higher SAT scores (or GPA)
leads to more college admissions!
Slide Type
The graph below depicts Cornell's acceptance rate by SAT scores and many Universities
show similar trends
Slide Type
We also know that if we keep on drinking more and more beers, our Blood-Alcohol Content(BAC)
rises with it.
Slide Type
Think about the relationship between the Circumference of a Cirle and it's diameter. What
happens to the former whilst the latter increases?
Slide Type
The moral of the story is there are factors that influence the outcome of the variable of our
interest.
These variables are known as predictors and the variable of interest is known as the target
variable.
Slide Type
We would want to see if the price of a house is really affected by the area of the house
Intuitively, we all know the outcome but let's try to understand why we're doing this
Slide Type
In [4]:
Slide Type
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("../data/house_prices.csv")
data.head()
Out[4]:
LotAre
SalePrice
a
0 8450 208500
1 9600 181500
2 11250 223500
3 9550 140000
4 14260 250000
Slide Type
In [2]:
Slide Type
Slide Type
By seeing our data above, we can see, an upward trend in the House Prices as the Area
of the house increases
We can say that as the Area of a house increases, it's price increases too.
Now, let's say we want to predict the price of the house whose area is 14000 sq feet,
how should we go about it?
Slide Type
Intuitively, we can just draw a straight line that would "capture" the trend of area and
house price, and predict house price from that line.
Slide Type
Slide Type
In Class Activity :
Slide Type
plt.scatter(data.LotArea, data.SalePrice)
Slide Type
In Class Activity :
Plot a line fitting our 'Sales Price' data for
price=50000+12∗areaprice=50000+12∗area
In [5]:
Slide Type
plt.scatter(data.LotArea, data.SalePrice)
Slide Type
Seems like all of them are a good fit for the data. Let's plot of them in a single plot and see how
that pans out.
In [6]:
Slide Type
Slide Type
Which line to choose?
As you can see although all three seemed like a good fit, they are quite different from each other.
In result, they will result in very different predictions.
For example, for house area = 9600, the predictions for red, black and yellow lines are
In [7]:
Slide Type
# red line:
print(("red line:", 30000 + 15*9600))
# black line:
# yellow line:
Slide Type
As you can see the price predictions are varying from each other significantly. So how do we
choose the best line?
Well, we can define a function that measures how near or far the prediction is from the actual
value.
If we cosider the actual and predicted values as points in space, we can just calculate the
distance between these two points!
Slide Type
This function is defined as:
(Ypred−Yactual)2(Ypred−Yactual)2
The farther the points, more the the distance and more the cost!
It is known as the cost function and since this function captures square of distance, it is known
as the least-squares cost function.
The idea is to minimize the cost function to get the best fitting line.
Slide Type
Linear regression using least squared cost function is known as Ordinary Least Squared Linear
Regression.
This allows us to analyze the relationship between two quantitative variables and derive some
meaningful insights
Slide Type
Great! now before moving forward, let's try to put the discussion we have had so far into a more
formal context. Let's start by learning some terminologies.
Here, we're trying to Predict the Price of the House using the value of it's Area
Thus, Area is the Independent Variable
Price is the Dependent Variable, since the value of price is dependent on the value of
area
Slide Type
Since we're using only 1 predictor (Area) to predict the Price, this method is also
called Univariate Regression/ Simple Linear Regression
But more often than not, in real problems, we utilize 2 or more predictors. Such a
regression is called Muiltivariate Regression. More on this later!
Slide Type
Let us consider a dataset where we have a value of response y for every feature x:
To create our model, we must “learn” or estimate the values of regression coefficients b_0 and
b_1. And once we’ve estimated these coefficients, we can use the model to predict responses!
Slide Type
Least Squares technique
Now consider:
yi=β0+β1xi+εi=h(xi)+εi⇒εi=yi−h(xi)yi=β0+β1xi+εi=h(xi)+εi⇒εi=yi−h(xi)
Here, e_i is residual error in ith observation.
So, our aim is to minimize the total residual error.
We define the squared error or cost function, J as:
J(β0,β1)=12n∑i=1nε2iJ(β0,β1)=12n∑i=1nεi2
and our task is to find the value of b_0 and b_1 for which J(b_0,b_1) is minimum!
Slide Type
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
return(b_0, b_1)
# putting labels
plt.xlabel('x')
plt.ylabel('y')
def main():
# observations
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))
if __name__ == "__main__":
main()
Estimated coefficients:
b_0 = -0.05862068965517242
b_1 = 1.457471264367816
Slide Type
We will start to use following notations as it helps us represent the problem in a concise way.
Slide Type
Cost Function:
Ok, so now that these are out of our way, let's get started with the real stuff.
An ideal case would be when all the individual points in the scatter plot fall directly on the
line OR a straight line passes through all the points in our plot, but in reality, that rarely
happens
We can see that for a Particular Area, there is a difference between Price given by our
data point (which is the correct observation) and the line (predicted observation or Fitted
Value)
So how can we Mathematically capture such differences and represent it?
Slide Type
Cost Function
We choose θs so that predicted values are as close to the actual values as possible
We can define a mathematical function to capture the difference between the predicted and
actual values.
This function is known as the cost function and denoted by J(θ)J(θ)
Slide Type
Cost function:
12m∑i=1m(hθ(X(i))−Y(i))212m∑i=1m(hθ(X(i))−Y(i))2
θθ is the coefficient of 'x' for our linear model intuitively. It measures how much of a unit
change of 'x' will have an effect on 'y'
Here, we need to figure out the values of intercept and coefficients so that the cost
function is minimized.
We do this by a very important and widely used Algorithm: Gradient Descent
Slide Type
Anscombe's Quartet
In [8]:
Slide Type
Slide Type
Slide Type
It repeatedly performs an update on θ as shown:
Slide Type
Gradient Descent Intuition
Here α is called the learning rate. This is a very natural algorithm that repeatedly takes a
step in the direction of steepest decrease of J
Slide Type
Slide Type
Slide Type
For the training set, eq.(1) becomes
For example, predictor corresponding to θ1 and 2nd training example, x(2)1 is 24.
Slide Type
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/WnqQrPNYz5Q?
start=10&end=155" frameborder="0" allowfullscreen></iframe>')
Out[8]:
Slide Type
sklearn provides an easy api to fit a linear regression and predict values using linear regression
In [10]:
Slide Type
X = data.LotArea[:,np.newaxis] # Reshape
y = data.SalePrice
In [11]:
Slide Type
Slide Type
# Predicting the Test set results
y_pred = regressor.predict(X)
Slide Type
Slide Type
# Train Part
plt.scatter(X, y)
plt.plot(X, y_pred, "r-")
plt.title('Housing Price ')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()
Slide Type
Visually, now we now have a nice approximation of how Area affects the Price
We can also make a prediction, the easy way of course!
For example: If we want to buy a house of 14,000 sq. ft, we can simply draw a vertical
line from 14,000 up to our Approximated Trend line and continue that line towards the y-
axis
We can see that for a house whose area ~ 14,000 we need to pay ~ 2,00,000-2,25,000
Slide Type
To get the optimal value of θ , perform following algorithm known as the Batch Gradient
Descent Algorithm
Slide Type
Getting Started:
We have seen how Gradient Descent helps us to minimize the cost function. In this exercise we
will learn to write functions to implement our own batch gradient descent algorithm for univariate
linear regression!
You may want to revise and refer to the Steps of the algorithm from the slides to get a strong
intuition on what your functions should look like
You should not use any sklearn objects for the purpose of this exercise
Slide Type
Let's start with calculating the error of a given linear regression model. We will consider
univariate model so that
y=theta∗x+by=theta∗x+b
Let's write a function error_calculator() so that given b, theta and X and y, we can
calculate the error.
Slide Type
import numpy as np
Slide Type
Now that we have calculated the mean square error, we can calculate the gradient.
We can write a function, that, given current parameters, calculates the gradeint.
Slide Type
Slide Type
We can write a function, that, given current parameters, calculates the num_iteration and returns
the list of calculated errors.
Slide Type
def gradient_descent(starting_b, starting_theta, points, learning_rate, num_iterations):
b = starting_b
theta = starting_theta
b_list = []
theta_list = []
error_list = []
for i in range(num_iterations):
b, theta = gradient(b, theta, points, learning_rate)
error = error_calculator(b, theta, points)
b_list.append(b)
theta_list.append(theta)
error_list.append(error)
return b_list, theta_list, error_list
Slide Type
Validation:
In [6]:
Slide Type
import numpy as np
points = np.genfromtxt("../data/data.csv", delimiter=",")
learning_rate = 0.0001
initial_b = 2 # initial y-intercept guess
initial_m = 6 # initial slope guess
num_iterations = 15
In [7]:
Slide Type
In [ ]:
Slide Type
Slide Type
Next we need to plot 'θθ'. So we write a function, that, given current parameters, plots the value
to theta against iteration number.
Slide Type
Slide Type
plt.plot(theta)
plt.xlabel("Iterations")
plt.ylabel("theta")
plt.title("Theta")
plt.show()
Slide Type
Next we need to plot 'b'. So we write a function, that, given current parameters, plots the value to
'b' against iteration number.
Slide Type
plt.plot(b)
plt.xlabel("Iterations")
plt.ylabel("constant term")
plt.title("b");
Slide Type
Next we need to plot 'errors'. So we write a function, that, given current parameters, plots the
value to 'errors' against iteration number.
Slide Type
plt.plot(error)
plt.xlabel("Iterations")
plt.ylabel("Error")
plt.title("Error");
Slide Type
Multivariate Linear Regression
In Univariate Linear Regression we used only two variable. One as Dependent Variable
and Other as Independent variable.
Now, we will use Multiple Dependent variable instead of one and will predict the Price i.e.
Independent variable.
So, along with Area we will consider other variables as such as Pool
In [3]:
Slide Type
Slide Type
Slide Type
print(("intercept:", regressor.intercept_))
print(("coefficients of predictors:", regressor.coef_))
Slide Type
Now let's say I want to predict the price of a house with following specifications.
In [ ]:
Slide Type
my_house = X.iloc[155]
my_house
In [ ]:
Slide Type
Slide Type
Slide Type
As you can see the predicted value is not very far away from the actual value.
Now let's try to predict the price for all the houses in the dataset.
In [ ]:
Slide Type
# Predicting the results
y_pred = regressor.predict(X)
y_pred[:10]
Slide Type
Great! now, let's put the predicted values next to the actual values and see how good a job have
we done!
In [ ]:
Slide Type
prices = pd.DataFrame({"actual": y,
"predicted": y_pred})
prices.head(10)
Slide Type
Must say we have done a reasonably good job of predicting the house prices.
However, as the number of predictions increase it would be difficult to manually check the
goodness of fit. In such a case, we can use the cost function to check the goodness of fit.
Let's first start by finding the mean squared error (MSE)
In [ ]:
Slide Type
Slide Type
What do you think about the error value?
As you would notice the error value seems very high (in billions!). Why has it happened?
MSE is a relative measure of goodness of fit. We say that because the measure of goodness of
MSE depends on the unit. As it turns out, Linear regression depends on certain underlying
assumptions. Violation of these assumptions result in poor results.
There are some key assumptions that are made whilst dealing with Linear Regression
These are pretty intuitive and very essential to understand as they play an important role in
finding out some relationships in our dataset too!
Let's discuss these assumptions, their importance and mainly how we validate these
assumptions!
Slide Type
Why it is important:
Linear regression only captures the linear relationship, as it's trying to fit a linear model to the
data.
Slide Type
Slide Type
Why it is important:
It results in unstable parameter estimates which makes it very difficult to assess the effect of
independent variables.
Slide Type
Multicollinearity occurs when the features (or independent variables) are not independent from
each other.
Pair plots of features help validate.
Slide Type
3) Homoscedasticity Assumption:
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random
disturbance in the relationship between the independent variables and the dependent variable) is
the same across all values of the independent variables.
Why it is important:
How to validate:
In an ideal plot the variance around the regression line is the same for all values of the predictor
variable. In this plot we can see that the variance around our regression line is nearly the same
and hence it satisfies the condition of homoscedasticity.
Slide Type
Autocorrelation occurs when the residual errors are not independent from each other.
Slide Type
Why it is important:
The presence of correlation in error terms drastically reduces model's accuracy. This usually
occurs in time series models. If the error terms are correlated, the estimated standard errors tend
to underestimate the true standard error.
How to validate:
Residual vs time plot. look for the seasonal or correlated pattern in residual values.
Slide Type
Slide Type
Slide Type
Why it is important:
Due to the Central Limit Theorem, we may assume that there are lots of underlying facts
affecting the process and the sum of these individual errors will tend to behave like in a zero
mean normal distribution. In practice, it seems to be so.
How to validate:
Slide Type
Slide Type
Slide Type
- Thus, evaluating our model will help us know how well we're
doing with our current selection of Features from the data,
hyperparameters, etc.
There are three basic evaluation metrics for regression to check the goodness of fit.
Mean Absolute Error
Root Mean square Error
R-Square (Residual value)
Slide Type
The mean absolute error (MAE) is a quantity used to measure how close forecasts or
predictions are to the actual vales. The mean absolute error is given by:
MAE=1N∑i=1N|yi−ŷ i|MAE=1N∑i=1N|yi−y^i|
We take the Absolute of the difference because our Predicted Values can be
greater than or less than (resulting in a negative difference)
If we don't take the Mod then the result of all the differences would be a 0.
Figuring out why is a good exercise!
Slide Type
The square root of the mean/average of the square of all of the error.
RMSE is very commonly used and makes for an excellent general purpose error metric
for numerical predictions.
Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes
large errors.
RMSE=1N∑i=1N(yi−ŷ i)2⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯RMSE=1N∑i=1N(
yi−y^i)2
Slide Type
So what is the advantage of RMS when we could just take the absolute difference
instead of squaring
This severely punishes large differences in prediction. This is the reason why
RMSE is powerful as compared to Absolute Error.
Evaluating the RMSE and tuning our model to minimize it results in a more robust
model
Slide Type
Interpretation of R2R2
This is very important to understand, so read the following line multiple times if needed
R - Square is a measure of the proportion of variability in the response that
is explained by the regression model.
Earlier, when the Blood Alcohol Content Data is fitted a Linear Model, the R-
Square = 0.952
o This means that 95.2% of the the Variability in the data given to us is
explained by our Linear Model, which is good!
R - Square values are always in the range of [0,1]
Slide Type
R2R2 Intuition
Before we get into calculating R^2, let's understand, what the different Sum of Squares
for our model are:
In RMSE, we have already been introduced to Squared Residuals which is also
called Error Sum of Squares (SSE)
SSE=∑i=1N(yi−ŷ i)2SSE=∑i=1N(yi−y^i)2
Slide Type
Additionally, we have the Total Sum of Squares (SST) which is nothing but the Squared
difference between the Actual Values (yiyi) and the Mean of our dataset (y¯iy¯i)
SS(Total)=∑i=1N(yi−y¯i)2SS(Total)=∑i=1N(yi−y¯i)2
And we also have Regression Sum of Squares, which is the squared difference
between the Predicted values (ŷ iy^i) and the Mean (y¯iy¯i)
SS(Regression)=∑i=1N(ŷ i−y¯i)2SS(Regression)=∑i=1N(y^i−y¯i)2
Slide Type
R2R2 Intuition
However, in many cases, R^2 is not really a good metric and we use another metric known
as Adjusted R-Squared
Slide Type
Slide Type
Post Reads
Slide Type
Thank You
Next Session: -
Initial Exploration
Introduction to Seaborn
Univariate Analysis
Multi-variate Analysis
Slide Type