Linear Regression 

Program so far

 Introduction to Python - You are now a budding Pythonista


 Introduction to Machine Learning - You can tell a classification task from a clustering task
 Basic Probability & Descriptive Stats - You are at peace with Statistics
 Steps involved in solving an end-to-end ML problem - One Step Closer to Machine Learning

Agenda for the Day

 Understand the intuition behind Linear Regression


 Understand the Linear Regression Cost Function
 Understand Linear Regression using the Gradient Descent Algorithm
 Introduction to Linear Regression in sklearn
 Learn about the assumptions in Linear Regression Algorithm
 Evaluating Metrics for Regression


Story so far

We have tried understanding and predicting the price of a house in NY using various statistical
techniques so far.

These included both descriptive and univariate inferential statistics methodologies.


Now let's take a step forward and make our first prediction!
In order to learn to make predictions, it is important to learn what a Predictor is.

So what is a predictor? (1/4)

How could you tell whether a person went to a tier 1, 2, or 3 college in America?

Simple: for someone determined to pursue a Bachelor's degree, higher SAT scores (or GPA) lead to more college admissions!

The graph below depicts Cornell's acceptance rate by SAT scores, and many universities show similar trends.


What is a predictor? (2/4)

We also know that if we keep on drinking more and more beers, our Blood-Alcohol Content (BAC) rises with them.

The graph below depicts just that, know your limits!



What is a predictor? (3/4)

Think about the relationship between the Circumference of a Circle and its diameter. What happens to the former whilst the latter increases?

What is a predictor? (4/4)

The moral of the story is that there are factors that influence the variable of our interest.

 SAT score --> University acceptance rate


 Number of beers --> Body alcohol level
 Diameter --> Circumference

These variables are known as predictors and the variable of interest is known as the target
variable.

So, how much would a house actually cost?


Let's return to our discussion of house prices in NY.

Here, our target variable would be __

What could be the predictors for our target variable?


Let's roll with area of the house

We want to see if the price of a house is really affected by the area of the house.

Intuitively, we all know the outcome, but let's try to understand why we're doing this.

Let's have a look at the data



 Every row displays the Price and Area of each house

In [4]:


import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv("../data/house_prices.csv")
data.head()
Out[4]:
   LotArea  SalePrice
0     8450     208500
1     9600     181500
2    11250     223500
3     9550     140000
4    14260     250000


Plotting our data


 Getting some motivation from Exploratory Data Analysis, what's intriguing is how this data will look when we plot it
 Starting simple, let's just check how our data looks in a scatter plot where:
 Area is taken along the X-axis
 Price is taken along the Y-axis

In [2]:


import matplotlib.pyplot as plt


plt.scatter(data.LotArea, data.SalePrice)
plt.title('NYC House pricing')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


Thinking out loud

 Looking at our data above, we can see an upward trend in the House Prices as the Area of the house increases
 We can say that as the Area of a house increases, its price increases too.
 Now, let's say we want to predict the price of a house whose area is 14,000 sq. feet; how should we go about it?


Fitting a Line On the Scatter Plot

 Intuitively, we can just draw a straight line that would "capture" the trend of area and
house price, and predict house price from that line.

Let's try and fit a line through all these points!



What's your prediction?



Which line to choose?


As you saw, there are many lines that seem to fit reasonably well.

Consider the following lines:

price = 30000 + 15 * area
price = 10000 + 17 * area
price = 50000 + 12 * area

Let's try and plot them and see if they are a good fit
In [3]:


import matplotlib.pyplot as plt


plt.scatter(data.LotArea, data.SalePrice)
plt.plot(data.LotArea, 30000 + 15*data.LotArea, "r-")
plt.title('NYC House pricing')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


In Class Activity :

Plot a line fitting our 'Sales Price' data for


price = 10000 + 17 * area
In [4]:


# plot "price=10000 + 17 ∗ area" line

plt.scatter(data.LotArea, data.SalePrice)

# your code here

plt.title('NYC House pricing')


plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


In Class Activity :
Plot a line fitting our 'Sales Price' data for
price = 50000 + 12 * area
In [5]:


# plot "price= 50000 + 12 ∗ area" line

plt.scatter(data.LotArea, data.SalePrice)

# Your code here

plt.title('NYC House pricing')


plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


Which line to choose?

Seems like all of them are a good fit for the data. Let's plot all of them in a single plot and see how that pans out.
In [6]:


import matplotlib.pyplot as plt


plt.scatter(data.LotArea, data.SalePrice)
plt.plot(data.LotArea, 30000 + 15*data.LotArea, "r-")
plt.plot(data.LotArea, 10000 + 17*data.LotArea, "k-")
plt.plot(data.LotArea, 50000 + 12*data.LotArea, "y-")
plt.title('NYC House pricing')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()

Which line to choose?

As you can see, although all three seemed like a good fit, they are quite different from each other. As a result, they will produce very different predictions.

For example, for house area = 9600, the predictions for red, black and yellow lines are
In [7]:


# red line:
print(("red line:", 30000 + 15*9600))

# black line:
print(("black line:", 10000 + 17*9600))

# yellow line:
print(("yellow line:", 50000 + 12*9600))

('red line:', 174000)
('black line:', 173200)
('yellow line:', 165200)


Which line to choose?

As you can see, the price predictions vary from each other significantly. So how do we choose the best line?

Well, we can define a function that measures how near or far the prediction is from the actual value.

If we consider the actual and predicted values as points in space, we can just calculate the distance between these two points!
This function is defined as:

(Y_pred − Y_actual)²

The farther apart the points, the greater the distance and the greater the cost!

It is known as the cost function, and since this function captures the square of a distance, it is known as the least-squares cost function.

The idea is to minimize the cost function to get the best fitting line.
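To make this concrete, here is a minimal sketch (reusing the data DataFrame loaded earlier) that computes the mean squared cost for each of the three candidate lines; the line with the smallest cost fits best among the three:

import numpy as np

# candidate lines as (intercept, slope) pairs from the plots above
candidates = [(30000, 15), (10000, 17), (50000, 12)]

for b0, b1 in candidates:
    y_line = b0 + b1 * data.LotArea                 # predicted price for every house
    cost = np.mean((data.SalePrice - y_line) ** 2)  # mean squared cost of this line
    print((b0, b1, cost))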

Introducing Linear Regression:

Linear regression using the least-squares cost function is known as Ordinary Least Squares (OLS) Linear Regression.

This allows us to analyze the relationship between two quantitative variables and derive some meaningful insights.

KYLR (Know your Linear Regression)!

Great! Now before moving forward, let's try to put the discussion we have had so far into a more formal context. Let's start by learning some terminology.

 Here, we're trying to predict the Price of the House using the value of its Area
 Thus, Area is the Independent Variable
 Price is the Dependent Variable, since the value of price depends on the value of area


KYLR (Know your Linear Regression)!

 Since we're using only 1 predictor (Area) to predict the Price, this method is also called Univariate Regression / Simple Linear Regression
 But more often than not, in real problems, we utilize 2 or more predictors. Such a regression is called Multivariate Regression. More on this later!


Simple Linear Regression


Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).

Let us consider a dataset where we have a value of the response y for every feature x.

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, …, x_n],

y as the response vector, i.e. y = [y_1, y_2, …, y_n]

for n observations (in the example below, n = 10).


A scatter plot of the above dataset looks like this:

Now, the task is to find a line which fits best in the above scatter plot, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).

This line is called the regression line.

The equation of the regression line is represented as:

h(x_i) = β_0 + β_1 * x_i
Here,

 h(x_i) represents the predicted response value for the ith observation.
 β_0 and β_1 are regression coefficients and represent the y-intercept and the slope of the regression line respectively.

To create our model, we must "learn" or estimate the values of the regression coefficients β_0 and β_1. And once we've estimated these coefficients, we can use the model to predict responses!
Least Squares technique

Now consider:

y_i = β_0 + β_1 * x_i + ε_i = h(x_i) + ε_i   ⇒   ε_i = y_i − h(x_i)

Here, ε_i is the residual error in the ith observation.

So, our aim is to minimize the total residual error.

We define the squared error or cost function J as:

J(β_0, β_1) = (1/2n) Σ_{i=1..n} ε_i²

and our task is to find the values of β_0 and β_1 for which J(β_0, β_1) is minimum!

The results are:

β_1 = SS_xy / SS_xx
β_0 = ȳ − β_1 * x̄

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) = Σ_{i=1..n} y_i * x_i − n * x̄ * ȳ

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ_{i=1..n} (x_i − x̄)² = Σ_{i=1..n} x_i² − n * x̄²
In [10]:


import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697


KYLR (Know your Linear Regression)!

We will start to use the following notation, as it helps us represent the problem in a concise way.

 x^(i) denotes the predictor(s) - in our case, the Area
 y^(i) denotes the response variable (Price)


KYLR (Know your Linear Regression)!

A pair ( x^(i) , y^(i) ) is called a training example.

For example, the 2nd training example, ( x^(2) , y^(2) ), corresponds to ( ... , ... ) (Fill in the blanks by having a look at your data!)

KYLR (Know your Linear Regression)!

Let's consider that any given dataset contains "m" training examples or observations.

{ ( x^(i) , y^(i) ) ; i = 1, . . . , m } — is called a training set.

In this example, m = 1460 (the number of rows).



Cost Function:

Ok, so now that these are out of the way, let's get started with the real stuff.

 An ideal case would be when all the individual points in the scatter plot fall directly on the line, OR a straight line passes through all the points in our plot; in reality, that rarely happens
 We can see that for a particular Area, there is a difference between the Price given by our data point (the correct observation) and by the line (the predicted observation or Fitted Value)
 So how can we mathematically capture such differences and represent them?


Cost Function

We choose the θ values so that the predicted values are as close to the actual values as possible.

We can define a mathematical function to capture the difference between the predicted and actual values.

This function is known as the cost function and is denoted by J(θ).


Cost function:

J(θ) = (1/2m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )²

 θ₁, the coefficient of 'x' in our linear model, intuitively measures how much effect a unit change in 'x' has on 'y'
 Here, we need to figure out the values of the intercept and coefficients so that the cost function is minimized.
 We do this with a very important and widely used algorithm: Gradient Descent

Anscombe's Quartet

What is Anscombe's Quartet? It is a set of four datasets with nearly identical summary statistics (means, variances, correlations, and regression lines) that nevertheless look completely different when plotted, a classic reminder to always visualize your data.

In [8]:


from IPython.display import HTML


HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/BR9h47Jtqyw?start=188" frameborder="0" allowfullscreen></iframe>')
Out[8]:
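As a quick check, here is a minimal sketch (assuming seaborn is installed; it bundles Anscombe's quartet as an example dataset) that verifies the four datasets share near-identical summary statistics:

import seaborn as sns

# Anscombe's quartet ships with seaborn as an example dataset
df = sns.load_dataset("anscombe")

# near-identical means and variances across the four datasets,
# despite their very different shapes when plotted
print(df.groupby("dataset").agg(["mean", "var"]))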


Gradient Descent Intuition

 So, we want to choose θ so as to minimize J(θ)
 Gradient Descent is a search algorithm that starts with some "initial guess" for θ, and repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ)
 It repeatedly performs an update on θ as shown:

θ_j := θ_j − α * ∂J(θ)/∂θ_j    ... (1)

Gradient Descent Intuition

 Here α is called the learning rate. This is a very natural algorithm that repeatedly takes a
step in the direction of steepest decrease of J


Gradient Descent Optimization (Optional)

The gradient in the previous eq. (1) can be simplified as follows:

∂J(θ)/∂θ_j = (1/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) * x_j^(i)

Gradient Descent Optimization (Optional)

Hence, for a single training example, eq. (1) becomes

θ_j := θ_j + α * ( y^(i) − h_θ(x^(i)) ) * x_j^(i)

For the training set, eq. (1) becomes

θ_j := θ_j + (α/m) * Σ_{i=1..m} ( y^(i) − h_θ(x^(i)) ) * x_j^(i)

Here, x_j is the predictor corresponding to θ_j. For example, the predictor corresponding to θ_1 in the 2nd training example, x_1^(2), is 24.

The value of x_0^(i) is equal to 1 for all i (the intercept term).
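To see one update concretely, here is a tiny sketch with made-up toy numbers (one training example with x = (1, 2), y = 5, starting at θ = (0, 0), α = 0.1) of a single update:

# one gradient-descent update on a single training example, using toy numbers
x = [1.0, 2.0]        # x_0 = 1 (intercept term), x_1 = 2
y = 5.0
theta = [0.0, 0.0]
alpha = 0.1

h = theta[0]*x[0] + theta[1]*x[1]        # prediction h_theta(x) = 0.0
for j in range(2):
    theta[j] = theta[j] + alpha * (y - h) * x[j]

print(theta)  # [0.5, 1.0] — both parameters move so as to reduce the error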


In [8]:

from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/WnqQrPNYz5Q?start=10&end=155" frameborder="0" allowfullscreen></iframe>')
Out[8]:


Linear Regression in sklearn

sklearn provides an easy API to fit a linear regression and predict values with it.

Let's see how it works.



Linear Regression is not always useful

In [10]:


X = data.LotArea.values[:, np.newaxis]  # reshape the Series into a 2-D column for sklearn
y = data.SalePrice
In [11]:


# Fitting Simple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X, y)
Out[11]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [12]:

# Predicting on the full dataset
y_pred = regressor.predict(X)


Plotting the Best Fitting Line


In [13]:

# Plot the data and the fitted line
plt.scatter(X, y)
plt.plot(X, y_pred, "r-")
plt.title('Housing Price ')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


Prediction made Easy

 Visually, we now have a nice approximation of how Area affects the Price
 We can also make a prediction, the easy way of course!
 For example: if we want to buy a house of 14,000 sq. ft, we can simply draw a vertical line from 14,000 up to our approximated trend line and continue that line towards the y-axis
 We can see that for a house whose area is ~14,000 sq. ft, we need to pay ~200,000-225,000 (see the quick check below)
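We can confirm the graphical estimate numerically; a quick sketch, reusing the regressor fitted above:

# predict the price of a 14,000 sq. ft house with the fitted model
print(regressor.predict([[14000]]))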


Gradient Descent Algorithm & Implementation in Python

To get the optimal value of θ, we perform the following algorithm, known as the Batch Gradient Descent Algorithm: repeat the update from eq. (1) for every j, using all m training examples at once, until convergence.


Getting Started:
We have seen how Gradient Descent helps us minimize the cost function. In this exercise we will learn to write functions that implement our own batch gradient descent algorithm for univariate linear regression!

You may want to revise and refer to the steps of the algorithm from the slides to get a strong intuition of what your functions should look like.

You should not use any sklearn objects for the purpose of this exercise.

Python Implementation of Gradient Descent Algorithm

Let's start with calculating the error of a given linear regression model. We will consider a univariate model, so that

y = theta * x + b

Let's write a function error_calculator() so that, given b, theta, and the data points, we can calculate the error.

The function would look something like this:


In [14]:


import numpy as np

def error_calculator(b, theta, points):
    data = np.array(points)
    x = data[:, 0]
    y = data[:, -1]
    y_predicted = theta * x + b
    # mean squared error over all points
    error = np.sum((y - y_predicted)**2) / data.shape[0]
    return error

Now that we can calculate the mean squared error, we can compute the gradient.

We can write a function that, given the current parameters, computes the gradient and takes one descent step, returning the updated parameters.

Such a function would look like this:


In [15]:


def gradient(b_current, theta_current, points, learningRate):
    data = np.array(points)
    x = data[:, 0]
    y = data[:, 1]
    N = data.shape[0]
    # partial derivatives of the mean squared error w.r.t. b and theta
    b_gradient = -2 * np.sum(y - (theta_current * x + b_current)) / N
    theta_gradient = -2 * np.sum(x * (y - (theta_current * x + b_current))) / N
    # one step in the direction of steepest descent
    new_b = b_current - (learningRate * b_gradient)
    new_theta = theta_current - (learningRate * theta_gradient)
    return new_b, new_theta


Now that we can take a single gradient step, we need to repeat it num_iterations times.

We can write a function that, given the starting parameters, runs the loop and returns the lists of b, theta, and error values from each iteration.

Such a function would look like this:


In [16]:

def gradient_descent(starting_b, starting_theta, points, learning_rate, num_iterations):
    b = starting_b
    theta = starting_theta
    b_list = []
    theta_list = []
    error_list = []
    for i in range(num_iterations):
        # one gradient step, then record the new parameters and the error
        b, theta = gradient(b, theta, points, learning_rate)
        error = error_calculator(b, theta, points)
        b_list.append(b)
        theta_list.append(theta)
        error_list.append(error)
    return b_list, theta_list, error_list


Validation:
In [6]:


import numpy as np
points = np.genfromtxt("../data/data.csv", delimiter=",")
learning_rate = 0.0001
initial_b = 2 # initial y-intercept guess
initial_m = 6 # initial slope guess
num_iterations = 15
In [7]:


error_before = error_calculator(initial_b, initial_m, points)


b, theta, error = gradient_descent(initial_b, initial_m, points, learning_rate, num_iterations)
error_after = error_calculator(b[-1], theta[-1], points)

In [ ]:


print(("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, error_before)))


print(("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b[-1], theta[-1], error_after)))

Next we need to plot 'θ'. So we plot the value of theta against the iteration number.

This would look something like this.


In [ ]:


import matplotlib.pyplot as plt


%matplotlib inline
In [ ]:


plt.plot(theta)
plt.xlabel("Iterations")
plt.ylabel("theta")
plt.title("Theta")
plt.show()

Next we need to plot 'b'. So we plot the value of b against the iteration number.

This would look like this.


In [ ]:


plt.plot(b)
plt.xlabel("Iterations")
plt.ylabel("constant term")
plt.title("b");

Next we need to plot the errors. So we plot the error value against the iteration number.

This would look like this.


In [ ]:


plt.plot(error)
plt.xlabel("Iterations")
plt.ylabel("Error")
plt.title("Error");

Multivariate Linear Regression

 In Univariate Linear Regression we used only two variables: one as the Dependent Variable and the other as the Independent Variable.
 Now, we will use multiple Independent Variables instead of one, and will predict the Price, i.e. the Dependent Variable.
 So, along with Area we will consider other variables such as Pool

In [3]:


#Loading the data


NY_Housing = pd.read_csv("../data/house_prices_multivariate.csv")
In [ ]:


# making Independent and Dependent variables from the dataset


X = NY_Housing.iloc[:,:-1]
y = NY_Housing.SalePrice
In [ ]:


# Fitting Multiple Linear Regression


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
In [ ]:

print(("intercept:", regressor.intercept_))
print(("coefficients of predictors:", regressor.coef_))


Predicting the price

Now let's say I want to predict the price of a house with the following specifications.
In [ ]:


my_house = X.iloc[155]
my_house
In [ ]:

pred_my_house = regressor.predict(my_house.values.reshape(1, -1))


print(("predicted value:", pred_my_house[0]))
In [ ]:


print(("actual value:", y[155]))

As you can see the predicted value is not very far away from the actual value.

Now let's try to predict the price for all the houses in the dataset.
In [ ]:

# Predicting the results
y_pred = regressor.predict(X)
y_pred[:10]

Great! Now let's put the predicted values next to the actual values and see how good a job we have done!
In [ ]:


prices = pd.DataFrame({"actual": y,
"predicted": y_pred})
prices.head(10)


Measuring the goodness of fit

We must say we have done a reasonably good job of predicting the house prices.

However, as the number of predictions increases, it would be difficult to manually check the goodness of fit. In such a case, we can use the cost function to check the goodness of fit.

Let's first start by finding the mean squared error (MSE).
In [ ]:


from sklearn.metrics import mean_squared_error


mean_squared_error(y, y_pred)  # sklearn convention: (y_true, y_pred); MSE itself is symmetric

What do you think about the error value?

As you would notice, the error value seems very high (in billions!). Why has that happened?

MSE is a scale-dependent measure of goodness of fit: its magnitude depends on the unit of the target variable, and since house prices are in the hundreds of thousands, squared errors land in the billions. As it turns out, linear regression also depends on certain underlying assumptions, and violation of these assumptions results in poor fits.

Hence, it would be a good idea to understand these assumptions.
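Before that, one way to make the error value itself interpretable is to take its square root, which puts it back in the target's units (dollars); a small sketch, reusing y and y_pred from above:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE: the error expressed in the same units as SalePrice
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)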



Assumptions in Linear Regression

There are some key assumptions that are made whilst dealing with Linear Regression.

These are pretty intuitive and essential to understand, as they also play an important role in finding relationships in our dataset!

Let's discuss these assumptions, their importance, and mainly how we validate them!

Assumptions in Linear Regression

1) Linear Relationship Assumption:

The relationship between the response (Dependent Variable) and the feature variables (Independent Variables) should be linear.

 Why it is important:

Linear regression only captures linear relationships, as it's trying to fit a linear model to the data.

 How do we validate it:

The linearity assumption can be tested using scatter plots.



Assumptions in Linear Regression

2) Little or No Multicollinearity Assumption:

It is assumed that there is little or no multicollinearity in the data.

 Why it is important:

It results in unstable parameter estimates, which makes it very difficult to assess the effect of independent variables on the dependent variable.

 How to validate it:

Multicollinearity occurs when the features (or independent variables) are not independent from each other.

Pair plots and correlations of the features help validate this, as sketched below.
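A minimal sketch (assuming the predictor DataFrame X from the multivariate fit above, with numeric columns) that flags strongly correlated feature pairs:

# correlation matrix of the predictors; entries near +/-1 indicate multicollinearity
corr = X.corr()
print(corr[(corr.abs() > 0.8) & (corr.abs() < 1.0)].stack())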

Assumptions in Linear Regression

3) Homoscedasticity Assumption:

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random
disturbance in the relationship between the independent variables and the dependent variable) is
the same across all values of the independent variables.

 Why it is important:

Generally, non-constant variance arises in the presence of outliers or extreme leverage values.



 How to validate:

Make a plot of the dependent variable (or the fitted values) vs. the error (residuals), as sketched below.
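A small sketch (reusing y and y_pred from the multivariate fit above) of such a residual plot:

import matplotlib.pyplot as plt

residuals = y - y_pred
plt.scatter(y_pred, residuals, s=8)
plt.axhline(0, color="r")        # residuals should hover evenly around 0
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()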



Assumptions in Linear Regression

In an ideal plot, the variance around the regression line is the same for all values of the predictor variable. In this plot we can see that the variance around our regression line is nearly the same, and hence it satisfies the condition of homoscedasticity.

Assumptions in Linear Regression

4) Little or No autocorrelation in residuals:

There should be little or no autocorrelation in the data.

Autocorrelation occurs when the residual errors are not independent from each other.
 Why it is important:

The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.

 How to validate:

Make a residuals vs. time plot and look for seasonal or correlated patterns in the residual values, as sketched below.
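A minimal sketch (reusing y and y_pred from above, with matplotlib.pyplot imported as plt) that plots the residuals in observation order and prints their lag-1 autocorrelation:

import pandas as pd

resid = pd.Series(y - y_pred)    # residuals in observation order
resid.plot()
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()
print(resid.autocorr())          # lag-1 autocorrelation; values near 0 are good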



Assumptions in Linear Regression

5) Normal Distribution of error terms

 Why it is important:

Due to the Central Limit Theorem, we may assume that there are lots of underlying factors affecting the process, and the sum of these individual errors will tend to behave like a zero-mean normal distribution. In practice, it seems to be so.

 How to validate:

You can look at a Q-Q plot of the residuals, as sketched below.
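A quick sketch of such a Q-Q plot (assuming scipy is available, and reusing the residuals from above):

from scipy import stats
import matplotlib.pyplot as plt

# points should fall close to the straight line if the residuals are ~normal
stats.probplot(y - y_pred, dist="norm", plot=plt)
plt.show()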


Evaluation Metrics for Regression (1/3)


 In Regression (and in other Machine Learning techniques) it's essential to know how
good our model has performed.


Evaluation Metrics for Regression (2/3)

 Why? Doesn't the computer do it automatically?

 At first, it's easy to think that the computer should automatically get the best model once we tell it what to do
 This is not the case! We tell the computer what features/variables to use and what methods to use to fit the data
 As models get more and more complicated (we'll get into this later), we need to input "hyper-parameters" that essentially tell the computer how much of a given quantity or what value to use while optimizing complicated parameters and cost functions

Evaluation Metrics for Regression (3/3)

 Thus, evaluating our model will help us know how well we're doing with our current selection of features from the data, hyperparameters, etc.

 There are three basic evaluation metrics for regression to check the goodness of fit.
 Mean Absolute Error
 Root Mean Square Error
 R-Squared (Coefficient of Determination)


Mean Absolute Error

 The mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the actual values. The mean absolute error is given by:

MAE = (1/N) Σ_{i=1..N} | y_i − ŷ_i |

 We take the absolute value of the difference because our predicted values can be greater than or less than the actual values (resulting in a negative difference)
 If we don't take the absolute value, the differences would sum to 0 for an OLS fit. Figuring out why is a good exercise! (See the quick check below.)
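A quick check, reusing y and y_pred from the sklearn fit above (the cancellation holds exactly when the model includes an intercept):

import numpy as np

print(np.sum(y - y_pred))           # ≈ 0: positive and negative errors cancel out
print(np.mean(np.abs(y - y_pred)))  # the MAE avoids this cancellation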


Root Mean Square Error

 The square root of the mean/average of the squares of all of the errors.
 RMSE is very commonly used and makes for an excellent general-purpose error metric for numerical predictions.
 Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.

RMSE = sqrt( (1/N) Σ_{i=1..N} ( y_i − ŷ_i )² )


 Why are we squaring the residuals and then taking a root?

 Residuals are basically the difference between the actual values & the predicted values
 Residuals can be positive or negative, as the predicted value underestimates or overestimates the actual value
 Thus, to focus only on the magnitude of the error, we take the square of the difference, as it's always positive

 So what is the advantage of RMSE when we could just take the absolute difference instead of squaring?
 Squaring severely punishes large differences in prediction. This is the reason why RMSE is powerful compared to Absolute Error.
 Evaluating the RMSE and tuning our model to minimize it results in a more robust model

Interpretation of R²

 This is very important to understand, so read the following lines multiple times if needed
 R-Squared is a measure of the proportion of variability in the response that is explained by the regression model.
 Earlier, when the Blood Alcohol Content data was fitted with a Linear Model, the R-Squared = 0.952
 This means that 95.2% of the variability in the data given to us is explained by our Linear Model, which is good!
 R-Squared values are always in the range [0, 1]


R² Intuition

Sum of Squares Decomposition and R²:

 Before we get into calculating R², let's understand what the different Sums of Squares for our model are:
 In RMSE, we have already been introduced to the Squared Residuals, also called the Error Sum of Squares (SSE):

SSE = Σ_{i=1..N} ( y_i − ŷ_i )²

 Additionally, we have the Total Sum of Squares (SST), which is the squared difference between the actual values (y_i) and the mean of our dataset (ȳ):

SS(Total) = Σ_{i=1..N} ( y_i − ȳ )²

 And we also have the Regression Sum of Squares, which is the squared difference between the predicted values (ŷ_i) and the mean (ȳ):

SS(Regression) = Σ_{i=1..N} ( ŷ_i − ȳ )²

R² Intuition

Now, intuitively, we can see that:

SS(Total) = SS(Regression) + SSE

Thus R² is defined as:

R² = SS(Regression) / SS(Total) = 1 − SSE / SS(Total)

However, in many cases, R² is not really a good metric, and we use another metric known as Adjusted R-Squared.
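As a final sketch (reusing y and y_pred from the sklearn fit above), all three metrics are available directly in sklearn:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print(("MAE:", mean_absolute_error(y, y_pred)))
print(("RMSE:", np.sqrt(mean_squared_error(y, y_pred))))
print(("R-squared:", r2_score(y, y_pred)))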

In-session Recap Time

 Understand how to make a prediction using predictors and fitting of a line


 Understand the Linear Regression Cost Function
 Understand Linear Regression using the Gradient Descent Algorithm
 Introduction to Linear Regression in sklearn
 Learn about the assumptions in the Linear Regression Algorithm
 Evaluation Metrics for Regression


Post Reads

 Linear Regression in Python


 Linear Regression using Gradient Descent Tutorial
 CS229 - Andrew Ng, Stanford University


Thank You

Next Session:

 Initial Exploration
 Introduction to Seaborn
 Univariate Analysis
 Multi-variate Analysis

For more queries - Reach out to [email protected]

