
Regression Analysis:

Fundamentals and Practical Applications

BIDA TM - Business Intelligence & Data Analysis


Course Learning Objectives

• Learn what linear regression is and how to use it to make real-world predictions
• Learn how to perform simple regression calculations in Excel & RegressIt
• Create linear regression models in Python using both the statsmodels and sklearn modules
• Understand the implicit assumptions behind linear regression
• Be able to interpret regression coefficients, p-values and other metrics to evaluate a model
• Become familiar with more advanced regression techniques and when to use them


What is Regression?

Regression refers to a specific type of Machine Learning model, used to make predictions.

Discrete Variable
• Can only take certain values.
• For discrete predictions, we use classification.

Continuous Variable
• Can take any value along a continuous scale.
• For continuous predictions, we use regression.


Regression Definitions

• The target variable is known as the Y variable.

• Input data is known as the independent or X variable.

• We plot our sample data points with known X and Y.

• Then we plot a line of best fit to help us make predictions when X is known but the target Y is unknown.

[Chart: sample points and line of best fit; Target Variable (Y) vs Independent Variable (X)]

Regression can be applied to a wide range of problems where we want to predict a continuous value such as revenue, costs, life expectancy or film review scores.


Types of Linear Regression

Simple Linear Regression
• Target predicted using only one independent variable (X1 → Y)

Multiple Linear Regression
• Target predicted using multiple independent variables (X1, X2 → Y)

Non-Linear Regression
• Target predicted using multiple independent variables (X3, X4 → Y)


Simple Linear Regression


Simple Linear Regression

• Break linear regression down into its simplest form, including lines of best fit and errors.
• Calculate the SSE (sum of squared errors) to summarize total error in the model.
• Learn how R² can be used to explain the performance of the model.
• Be comfortable using regression terminology in conversation.
• Understand basic limitations of a linear regression model.
• Practice a basic regression scenario in Excel and Python.


Simple Linear Regression

The aim of simple linear regression is to fit a straight-line relationship between two variables, X and Y.

• We summarize the data points with the line of best fit.

• We use the line of best fit to predict y-values for any x-values.

• Some random variation cannot be explained.

[Chart: data points and line of best fit; Target Variable (Y) vs Independent Variable (X)]


House Price Prediction

• We can use building area to predict house prices.

• Independent Variable: Building Area (Size)
• Target Variable: House Price

• Each blue point is a single house (one observation) with a known area and corresponding house price.

• From the line of best fit we can predict house prices for any building area.

Additional Complexity
• House prices are determined by more than just the area of the house.
• Multiple Linear Regression can be used to include other variables.

[Chart: House Price ('00k) vs Building Area ('000 sqft)]


Linear Regression Algorithm

• The line of best fit can be described by:

    ŷ = θ1·x + θ0

• ŷ is our predicted value
• x is the independent variable
• θ0 and θ1 are the parameters (or coefficients) that define the regression line:
  • θ0 represents the intercept
  • θ1 represents the slope

[Chart: regression line with intercept θ0 and slope θ1; e.g. Umbrella Sales (Y) vs Monthly Rainfall in cm (X)]

To find the optimal line, we can use Ordinary Least Squares.


Ordinary Least Squares

• The Ordinary Least Squares method finds the best-fitting line by minimizing the sum of squared errors (SSE), also known as the sum of squared residuals.

• Errors, or residuals, are the differences between the observed and predicted values of the data (vertical lines on the chart).

• In OLS, we square each error to remove negative values. We then add all the squared errors together:

    SSE = Σ (yᵢ − ŷᵢ)²

• When the SSE is smaller, the model fits the data better overall.

[Chart: observed values yᵢ, predicted values ŷᵢ, residuals shown as vertical lines]


Ordinary Least Squares Calculation

• If the orange line has the following equation:

    ŷ = 0.7x + 0.45

• Then the errors can be calculated as:

    yᵢ − ŷᵢ = yᵢ − (0.7xᵢ + 0.45) = error

    y₁ − ŷ₁ = 0.10 − (0.7 × 0.2 + 0.45) = −0.49
    y₂ − ŷ₂ = 0.75 − (0.7 × 0.4 + 0.45) = +0.02
    y₃ − ŷ₃ = 1.50 − (0.7 × 0.7 + 0.45) = +0.56
    … and so on

• The sum of squared errors for these points is then:

    SSE = (−0.49)² + (0.02)² + (0.56)² + ⋯

• Summing over all the points gives us an SSE of 4.02.
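The error and SSE calculation above can be sketched in Python (the course's scripting language). Only the three example points shown on the slide are used here, so the total differs from the full-sample SSE of 4.02:

```python
# Sketch of the slide's OLS error calculation for the line ŷ = 0.7x + 0.45,
# using only the three (x, y) points listed on the slide.

points = [(0.2, 0.10), (0.4, 0.75), (0.7, 1.50)]  # (x_i, y_i) pairs

def predict(x):
    return 0.7 * x + 0.45  # line of best fit: ŷ = 0.7x + 0.45

errors = [y - predict(x) for x, y in points]  # residuals y_i − ŷ_i
sse = sum(e ** 2 for e in errors)             # sum of squared errors

print([round(e, 2) for e in errors])  # [-0.49, 0.02, 0.56]
print(round(sse, 4))
```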


Fitting the Parameters

• The goal is to minimize the total error (SSE) produced by the line of best fit.

• To find the optimal parameters, we can:

  1. Take the derivatives of the sum of squared errors with respect to each parameter
  2. Set these derivatives equal to zero and solve

    Slope:      θ1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

    Intercept:  θ0 = ȳ − θ1·x̄

• x̄ and ȳ are the averages of all the observed xᵢ and yᵢ values.

• Different values of θ0 and θ1 give different predictions and hence different values for the SSE.
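These closed-form estimates can be sketched in a few lines of Python; the dataset below is made up for illustration and is not from the course:

```python
# Closed-form simple OLS: slope from the covariance/variance ratio,
# intercept from θ0 = ȳ − θ1·x̄. Illustrative data only.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# θ1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
theta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)

# θ0 = ȳ − θ1·x̄
theta0 = y_bar - theta1 * x_bar

print(round(theta1, 3), round(theta0, 3))  # 1.99 0.05
```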


Caution - Extrapolation

We should be careful not to extrapolate our results beyond the sample range of the data.

• The model may perform poorly outside the observed range.

• A house of zero area still has a positive predicted value (the intercept). This prediction is not reliable.

• Looking beyond the range of observations may lead us to false conclusions.

• Conclusion: The model is only reliable within the observed sample space.

[Chart: House Price ('00k) vs Building Area ('000 sqft), fitted line extended beyond the data]


Caution – Correlation Vs Causation

A relationship or correlation does not necessarily mean causation.

• Ice cream sales are correlated with the number of sunburn cases.

• Ice cream does not cause sunburn.

• In fact, the weather (sun) causes both.

• Additionally, we can find many correlated variables with no real relationship. These are called spurious regressions.

[Chart: Sun Burn Cases vs Ice Cream Sales]

We should carefully consider whether our independent variable is truly causing the effect on the target variable.


Linear Regression in Practice

• Ordinary least squares will always produce the same results (it is an analytic algorithm).

• We must test our model thoroughly using new data before applying it to real-world decisions.

    Available Sample Data → Training Data + Testing Data (new data)

• Using code we can automate a lot of regression calculations.

• We will use StatsModels and Scikit-Learn in Python to practice our skills.

• We will also fit a regression model in Excel.
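A library-free sketch of the train/test idea above (the course itself uses StatsModels and Scikit-Learn; the data here is synthetic): fit on a training split, then measure error on held-out testing data.

```python
# Split synthetic data 75/25, fit simple OLS on the training split only,
# then evaluate the SSE on the unseen test split.
import random

random.seed(42)
data = [(x, 2.0 * x + 1.0 + random.uniform(-0.5, 0.5)) for x in range(20)]
random.shuffle(data)
train, test = data[:15], data[15:]  # 75% training, 25% held-out testing

xs = [x for x, _ in train]
ys = [y for _, y in train]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in train) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Error on data the model has never seen
test_sse = sum((y - (slope * x + intercept)) ** 2 for x, y in test)
print(round(slope, 2), round(test_sse, 2))
```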


Multiple Linear Regression


Multiple Linear Regression

• Introduce and define Multiple Linear Regression
• Compare Multiple Linear Regression to Simple Linear Regression
• Explore the parameters of Multiple Linear Regression
• Introduce multicollinearity and why it is important to Multiple Linear Regression
• Identify how the common issue of overfitting a Multiple Linear Regression model can occur
• Practice an in-depth Multiple Linear Regression scenario in Python


What is Multiple Linear Regression?

• It is rare that a single independent variable fully describes the variation in the target variable.

• Multiple linear regression allows us to predict a target variable using any number of independent variables:

    ŷ = θ0 + θ1·x1 + θ2·x2 + ⋯ + θp·xp

• When we have two independent variables, we are fitting a plane to the data instead of a straight line.


How the Algorithm Differs from Simple Linear Regression

• When we have one input variable, we have two parameters: slope and intercept.

    ŷ = θ1·x + θ0

  - θ0 and θ1 are the parameters that define the regression line
  - θ0 represents the intercept
  - θ1 represents the slope

• For p independent variables plus an intercept, we have a total of p + 1 parameters:

    ŷ = θ0 + θ1·x1 + θ2·x2 + ⋯ + θp·xp

• We use ordinary least squares again and solve p + 1 simultaneous equations using matrix algebra for the best-fit parameter values.
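To make the "p + 1 simultaneous equations" concrete, here is a hedged, library-free sketch for p = 2: build the normal equations (XᵀX)θ = Xᵀy and solve them with Gaussian elimination. The data is constructed so the true parameters (θ0 = 1, θ1 = 2, θ2 = 3) are known and recoverable:

```python
# Two independent variables on a grid; y is an exact linear combination,
# so OLS should recover the parameters exactly (up to float precision).
rows = [(x1, x2) for x1 in range(4) for x2 in range(4)]
X = [[1.0, x1, x2] for x1, x2 in rows]            # leading 1 gives the intercept
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

n, k = len(X), 3
XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]

# Gauss-Jordan elimination on the augmented system [XtX | Xty]
A = [XtX[i] + [Xty[i]] for i in range(k)]
for col in range(k):
    pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for r in range(k):
        if r != col:
            f = A[r][col] / A[col][col]
            A[r] = [a - f * b for a, b in zip(A[r], A[col])]
theta = [A[i][k] / A[i][i] for i in range(k)]

print([round(t, 6) for t in theta])  # recovers [1.0, 2.0, 3.0]
```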


Multicollinearity

• Multicollinearity occurs when our input variables are strongly correlated with each other.

  e.g. Number of Bedrooms and Building Area both drive House Price, but are also correlated with each other.

• With multicollinear variables, the algorithm is unable to separate the effects of the two variables.

• There is a high chance that multicollinear variables will produce unreliable parameter estimates.
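One common screening step can be sketched as follows (the bedroom/area figures are made up): compute the Pearson correlation between candidate inputs and drop one of any highly correlated pair before fitting.

```python
# Pearson correlation between two hypothetical input variables.
from math import sqrt

bedrooms = [1, 2, 2, 3, 3, 4, 4, 5]
area_sqft = [650, 900, 1000, 1400, 1500, 1900, 2100, 2500]

n = len(bedrooms)
mb, ma = sum(bedrooms) / n, sum(area_sqft) / n
cov = sum((b - mb) * (a - ma) for b, a in zip(bedrooms, area_sqft))
r = cov / sqrt(sum((b - mb) ** 2 for b in bedrooms)
               * sum((a - ma) ** 2 for a in area_sqft))

print(round(r, 3))            # close to 1: strongly collinear
if abs(r) > 0.9:              # a common rule-of-thumb threshold
    print("Highly correlated pair: consider keeping only one")
```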


Caution - Multiple Linear Regression

• Adding uncorrelated independent variables to the model often helps better explain the target variable, but must be done with caution.

• If a new variable is added that has no predictive value, the algorithm will detect that and assign it a parameter value close to zero:

    ŷ = θ0 + 0.001·x1 + θ2·x2 + ⋯ + θp·xp

• Overfitting can occur when the model captures detail in the training data that doesn't exist in the test data.

• This can be caused by random effects in the data reducing the SSE and hence being detected by the algorithm.

• It is best to balance the number of independent variables so that we can adequately describe the data without overfitting.


Interpreting Linear Regression



Interpreting Linear Regression

• Review the concepts of residuals and residual plots
• Analyze datasets to ensure they meet the Ordinary Least Squares assumptions
• Evaluate Linear Regression metrics to measure the error in models
• Analyze model output with Linear Regression coefficients & p-values
• Apply knowledge with an interactive exercise and interpretation scenarios
• Practice interpreting the results of a Linear Regression model in Python


Residuals Recap

• Residuals, or errors, are the differences between the observed values and the predicted values (yᵢ − ŷᵢ).

• The residual quantifies the variation in the target variable that is unexplainable by our line of best fit.

• If our model captures all the relevant independent variables, the residuals must be down to simple random errors that we cannot predict.


Plotting Residuals

• Plotting residuals is a useful tool for evaluating how trustworthy our model is at describing the data.

• A reliable model will have:
  • Residuals randomly scattered around the x-axis
  • Residuals with an average value of zero

• Note: A reliable model does not necessarily mean an overall good model – the SSE could still be large.

[Chart: Residual vs Independent Variable (X)]


Non-Random Residuals

• Non-random residual plots can be problematic for our model.

• Non-random patterns in the residuals could be caused by one or more of the following factors:
  • A linear model is not appropriate
  • Omitted variable bias
  • The data does not satisfy the OLS assumptions

• In these cases, we may be able to transform the data to make it fit the model, or otherwise use a different model.

[Charts: residual plots showing non-random patterns vs Independent Variable (X)]
Summary

• Plotting residuals offers a powerful tool to assess the model fit

• If we observe the residuals to be randomly distributed, then we can be reasonably content that our
model is unbiased
• Non-random patterns in the residuals provide the opportunity to improve the model by introducing
new independent variables or transformations
• If these methods do not resolve the problems, then potentially a non-linear model is more appropriate



OLS Assumptions

The Ordinary Least Squares method relies on 6 assumptions about the data:
Linearity · Homoscedasticity · Zero-Mean Errors
Endogeneity · Autocorrelation of Errors · Multicollinearity

[Six illustrative charts, one per assumption]


Assumption 1: Linearity

• For a linear regression model to be appropriate, the data must be linear in nature.

• The target variable should increase by a fixed amount for a unit increase in each of the independent variables.

• We can test for linearity by checking that the errors in a residual plot are randomly and symmetrically scattered about the x-axis.

• Non-linear variables can be made linear using transformations such as a log-log transformation (more on this later).

[Chart: non-linear data, Target Variable (Y) vs Independent Variable (X)]


Assumption 2 – Normality and Homoscedasticity of Errors

• Normality – plot the magnitude and direction of each error on a chart to see if the errors fit a normal distribution.

• Homoscedasticity – the spread of the errors should not change with the independent variables.

• Heteroskedasticity – errors that do not have constant variance.

• We can test for homoscedasticity by inspecting the scatter of the residual plot.

Note: Normality is not strictly required for OLS, but it is required if we want to make statistical inferences about the parameters.

[Charts: histogram of residuals; Target Variable (Y) vs Independent Variable (X) with increasing spread]
Assumption 3 – Zero Mean Errors

• For an unbiased model, the average of all the errors should be zero.

• The errors should only capture the random variation in the data that the independent variables cannot explain.

• A non-zero average error means the model has a systematic offset between the predicted and observed values, known as bias.

• When an intercept is included, the intercept parameter absorbs any systematic offset, forcing the average error to be zero.

• We can test for this by checking that the average of the errors is close to zero.

[Chart: residuals centred on zero vs Independent Variable (X)]
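A tiny sketch of that check: with an intercept in the model, OLS forces the fitted residuals to average to zero, which we can verify numerically on a made-up sample.

```python
# Fit simple OLS (closed form) and confirm the residuals average to ~0.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.3, 6.8, 9.1]   # illustrative values

xb, yb = sum(xs) / len(xs), sum(ys) / len(ys)
t1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
     / sum((x - xb) ** 2 for x in xs)
t0 = yb - t1 * xb

residuals = [y - (t0 + t1 * x) for x, y in zip(xs, ys)]
mean_error = sum(residuals) / len(residuals)
print(abs(mean_error) < 1e-9)   # True: the fitted errors average to zero
```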


Assumption 4 - Endogeneity

• Endogeneity occurs when there is correlation between the error and the independent variables; the assumption is that there is none.

• Correlation suggests that the supposedly random error can be predicted by the independent variables.

• Endogeneity leads to biased parameter estimates.

• Omitted-variable bias – an important independent variable is not included in the model and is correlated with the included independent variables. The omitted variable will be absorbed into the error and will cause endogeneity.

[Chart: residuals trending with Independent Variable (X)]


Assumption 5 – Autocorrelation of Errors

• Errors should not be correlated with previous errors.

• There is autocorrelation of the errors when an error value depends on the previous error value.

• We often observe autocorrelation in time series data.

• We can identify autocorrelation of the errors through patterns in the residual plot or using the Durbin-Watson statistic.

[Chart: residuals oscillating in a cyclic pattern vs Independent Variable (X)]
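The Durbin-Watson statistic mentioned above can be sketched directly from its definition, DW = Σ(eₜ − eₜ₋₁)² / Σeₜ². Values near 2 suggest no autocorrelation; values near 0 suggest positive and near 4 negative autocorrelation. The error series below are illustrative:

```python
# Durbin-Watson statistic computed from its definition.
def durbin_watson(errors):
    num = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    return num / sum(e ** 2 for e in errors)

trending = [1.0, 1.1, 1.2, 1.3, 1.4]        # strong positive autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]   # strong negative autocorrelation

print(round(durbin_watson(trending), 3))      # ≈ 0.005, far below 2
print(round(durbin_watson(alternating), 3))   # 3.2, well above 2
```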


Assumption 6 – Multicollinearity

• Multicollinearity – independent variables are highly correlated with each other.

• The OLS algorithm cannot separate the effects of the correlated independent variables on the target variable, and so will produce unstable parameter estimates.

• We can prevent multicollinearity by calculating the correlations between the independent variables and removing one variable from every highly correlated pair.

[Chart: Independent Variable (X2) vs Independent Variable (X1), strongly correlated]


OLS Assumptions

When all six assumptions are met, the Ordinary Least Squares method is the best linear regression method
Linearity · Homoscedasticity · Zero-Mean Errors
Endogeneity · Autocorrelation of Errors · Multicollinearity

[Six illustrative charts, one per assumption]


Linear Regression Evaluation

• We can measure model performance by how good our model is at making predictions.

Squared Metrics
• Sum of Squared Errors (SSE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)

Absolute Metrics
• Sum of Absolute Errors (SAE)
• Mean Absolute Error (MAE)

• These metrics allow us to compare the performance of different models.


Linear Regression Evaluation – Squared Metrics

• Sum of squared errors (SSE) sums the squared residuals – the squared vertical distances from the data points to the line of best fit:

    SSE = Σ (yᵢ − ŷᵢ)²

• Mean squared error (MSE) divides the sum of squared errors by the number of data points:

    MSE = SSE / N

• Root mean squared error (RMSE) takes the square root of the mean squared error:

    RMSE = √MSE
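The three squared metrics can be computed from the same residual list, making the chain SSE → MSE → RMSE explicit (the residual values are illustrative):

```python
# SSE, MSE and RMSE from a list of residuals y_i − ŷ_i.
from math import sqrt

residuals = [0.5, -0.2, 0.1, -0.4]
n = len(residuals)

sse = sum(e ** 2 for e in residuals)   # sum of squared errors
mse = sse / n                          # mean squared error
rmse = sqrt(mse)                       # back in the units of Y

print(round(sse, 2), round(mse, 3), round(rmse, 3))  # 0.46 0.115 0.339
```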


Linear Regression Evaluation – Absolute Metrics

• Sum of squared errors (SSE) sums the squared residuals.

• Mean squared error (MSE) divides the sum of squared errors by the number of data points.

• Root mean squared error (RMSE) takes the square root of the mean squared error.

• Mean absolute error (MAE) sums the absolute values of the residuals and divides by the number of data points:

    MAE = (1/N) Σ |yᵢ − ŷᵢ|
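And the mean absolute error for the same style of residual list; unlike the squared metrics, MAE does not amplify individual large errors (values illustrative):

```python
# MAE: average of the absolute residuals.
residuals = [0.5, -0.2, 0.1, -0.4]

mae = sum(abs(e) for e in residuals) / len(residuals)
print(round(mae, 2))  # 0.3
```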


Linear Regression Evaluation – R2

• The Coefficient of Determination (R²) measures to what extent the independent variables explain the changes in the target variable:

    R² = 1:    X fully explains Y
    R² = 0.86: X explains 86% of the change in Y
    R² = 0:    X cannot explain Y

• Coefficient of Determination: R² = 1 − SSE/TSS

• Sum of squared errors: SSE = Σ (yᵢ − ŷᵢ)²

• Total sum of squares: TSS = Σ (yᵢ − ȳ)²
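R² can be computed straight from these definitions; the observed and predicted values below are made up for illustration:

```python
# R² = 1 − SSE/TSS on a small sample.
ys     = [1.0, 2.0, 3.0, 4.0, 5.0]   # observed
y_hats = [1.1, 1.9, 3.2, 3.8, 5.0]   # predicted by some fitted line

y_bar = sum(ys) / len(ys)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
tss = sum((y - y_bar) ** 2 for y in ys)

r2 = 1 - sse / tss
print(round(r2, 3))  # 0.99
```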


Linear Regression Evaluation – R2

• SSE/TSS tells us how much unexplained variance remains in our model as a fraction of the total variation in the target.

• One minus this value tells us how much variation has been explained by the model:

    R² = 1 − SSE/TSS
• The independent variables in models with high R2 values explain most of the variance present in the data

• As with the SSE, the coefficient of determination will never decrease by adding in new variables; however,
adding in unnecessary variables can lead to overfitting the data
• We can see the effects of overfitting when we compare training performance with test data performance



Linear Regression Evaluation - Adjusted R2

• We can modify the R² to account for the number of independent variables – this is known as the adjusted R² (written R̄²):

    R̄² = 1 − [SSE / (n − p − 1)] / [TSS / (n − 1)]

• n is the number of observed data points
• p is the number of independent variables in the model (excluding the intercept term)

• The adjusted R2 value only increases when a new variable is added that improves the model by more than
would be expected due to random chance
• It can therefore decrease if a variable is added that does not sufficiently explain the data
• This adjusted R2 will always be lower than the unadjusted R2

• It allows comparison between models with different numbers of independent variables
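A sketch of the penalty in action: with the same SSE and TSS, a larger p lowers the adjusted value (the numbers are illustrative):

```python
# Adjusted R² from the formula above.
def adjusted_r2(sse, tss, n, p):
    return 1 - (sse / (n - p - 1)) / (tss / (n - 1))

sse, tss, n = 10.0, 100.0, 30
print(round(1 - sse / tss, 3))                 # plain R² = 0.9
print(round(adjusted_r2(sse, tss, n, 1), 3))   # 0.896 with p = 1
print(round(adjusted_r2(sse, tss, n, 5), 3))   # 0.879 with p = 5: lower, same SSE
```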


Overview of Linear Regression Evaluation

• Evaluating these metrics after building a model provides insight into the model performance

• MSE, RMSE and MAE metrics tell us how far our predictions are from their true values on average
• MSE or RMSE can be used when we want to give greater importance to individual large errors
• The coefficient of determination tells us how much of the variability in the data the independent
variables are explaining

• The adjusted coefficient of determination helps prevent overfitting by decreasing if a variable is added
that does not sufficiently explain the data
• The MSE, RMSE and MAE are absolute measures of fit, while R2 is relative to the total sum of squares



Regression Coefficients

• Regression coefficients help us understand the interaction between variables

• The following model predicts house prices (𝑃) using the building area (𝐴) and a measure of the overall
house condition (𝐶):

𝑃 = 77000 + 80𝐴 + 9200𝐶

Three Parameters
• A base price of £ 77,000
• On average for every extra sqft we add, the house price increases by £80

• Similarly for every condition point we add, the price will increase by £9,200
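This model can be evaluated directly; a later slide refers to an 800 sqft house with condition score 6, which these coefficients price as follows:

```python
# The slide's house price model: P = 77000 + 80·A + 9200·C
def price(area_sqft, condition):
    return 77000 + 80 * area_sqft + 9200 * condition

print(price(800, 6))  # 77000 + 64000 + 55200 = 196200
```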



Regression Coefficients

• Regression coefficients help us understand the interaction between variables

• The following model predicts house prices (𝑃) using the building area (𝐴) and a measure of the overall
house condition (𝐶):

𝑃 = 77000 + 80𝐴 + 9200𝐶

• In general, our parameter values or coefficients tell us how much the target variable changes on
average when we change the corresponding independent variables
• Positive coefficients mean that the target variable is positively correlated with the independent variable
while negative coefficients tell us that it is negatively correlated



Compare Coefficients

• We can scale the independent variables before fitting the regression in order to compare coefficients.

• Standardization – transforming data to have mean zero and unit variance by subtracting the mean value and dividing by the standard deviation:

    z = (xᵢ − x̄) / σ

• Our scaled house price model then becomes:

    P = 113200 + 41766·Z_A + 30047·Z_C

• We can then compare which independent variable contributes more to the target variable for a single unitless increment in the independent variable.

• This produces the same results for our 800 sqft / score 6 house, and we observe that area contributes more to the price than the condition score does.
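The standardization step can be sketched as follows; the area values are illustrative, and after scaling the variable has mean 0 and variance 1:

```python
# z-score standardization: subtract the mean, divide by the standard deviation.
from math import sqrt

def standardize(values):
    m = sum(values) / len(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / len(values))  # population SD
    return [(v - m) / sd for v in values]

areas = [600, 800, 1000, 1200, 1400]   # illustrative building areas (sqft)
z = standardize(areas)

mean_z = sum(z) / len(z)
var_z = sum(v ** 2 for v in z) / len(z)
print(round(mean_z, 9), round(var_z, 9))  # 0.0 1.0
```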


p-Values

• We can test whether coefficients are statistically significant by calculating p-values

• A p-value tells us the probability that we would get the results we observe when we assume the null
hypothesis is true
• If this p-value is below 0.05, then it is unlikely that we would observe a coefficient value so far from zero by chance, and we therefore conclude that there is a significant relationship between the independent variable and the target variable.
• We can calculate p-values for all the coefficients in our model and discard the independent variables
that are not significant

• Removing independent variables that are not statistically significant helps prevent overfitting



Calculate p-Values

Process to calculate p-values:

1. Calculate the standard error on each coefficient

2. Combine the standard error with the coefficient value to produce a t-statistic

3. Calculate the p-value as the probability of obtaining this t-value from the Student's t-distribution

• A significant p-value does not necessarily imply a high R² value, and vice-versa.


Interpreting Regression Model for House Price

• In the practical exercise, we used RegressIt to predict house price.

• A good model will have a tight grouping of residuals that is consistent across all values.

• We can see that our model works well for low prices, but the residual spread begins to widen at higher values.

• This means that if our model predicts a high house price, we cannot be as sure of its accuracy.


Outliers

• Outliers can skew the prediction of our regression model.

• We can see from below that the two models give very different predictions.
• Should we remove the outliers, leave them there, or use a different kind of model?



Correlation/Causation

• An important final point to remember is the difference between correlation and causation.

• As we can see from these charts, there can be very strong correlations between data series with no realistic connection.

• It is important to remember the distinction between correlation and causation before making recommendations and decisions.

• It is essential to first understand the data and to question whether your hypotheses are plausible.


Advanced Regression



Advanced Regression

• An introduction to more advanced regression techniques
• Know what you don't know, and get a few ideas of what to explore next
• Practice a more advanced regression in Excel

Techniques we will outline:

• Log-Log Linear Regression
• Polynomial Regression
• Logistic Regression
• Repeated Measure Regression
• Segmented & GAM Regression
• Lasso Regression
• Random Effect Models
• Bayesian Regression
• Poisson Regression


Log-Log Linear Regression

Non-linear relationships are hard to interpret. For that reason, linear models are desirable.

• Log-log regression is used when our data appears non-linear or we have heteroscedastic errors.

• When Y and at least one X are transformed with a log, this is log-log linear regression.

• Useful when we have linear percentage increases between the target and independent variables.

[Charts: Y vs X (curved); log(Y) vs log(X) (linear)]
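A sketch of why the log-log transformation works: if Y = a·Xᵇ, then log Y = log a + b·log X is linear, so an ordinary straight-line fit on the logged data recovers the exponent. The power law here is constructed, not course data:

```python
# Fit a line to (log X, log Y) and recover the power-law parameters.
from math import log, exp

xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 2 for x in xs]   # exact power law: a = 3, b = 2

lx = [log(x) for x in xs]
ly = [log(y) for y in ys]
xb, yb = sum(lx) / len(lx), sum(ly) / len(ly)
b = sum((u - xb) * (v - yb) for u, v in zip(lx, ly)) \
    / sum((u - xb) ** 2 for u in lx)
a = exp(yb - b * xb)

print(round(b, 6), round(a, 6))  # recovers b ≈ 2, a ≈ 3
```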


Polynomial Regression

• We use polynomial regression when we have a smooth, non-linear relationship between variables.

• Each input (x) variable is transformed with an exponent:

    ŷ = θ0 + θ1·x + θ2·x² + ⋯ + θn·xⁿ

• A higher exponent suggests a more complex relationship: quadratic (x²), cubic (x³), quartic (x⁴), and so on.

• Benefit: Easy to add complexity.


Polynomial Regression Continued

The polynomial exponent indicates the complexity of the regression line.

Underfitting Polynomial
• When the exponent is too low, the relationship is oversimplified.

Overfitting Polynomial
• When the exponent is too high, the relationship is too specific.

[Charts: quadratic (x²), cubic (x³) and quintic (x⁵) fits to the same data]


Logistic Regression

• Logistic regression is typically used for classification problems.

• In logistic regression we fit the logistic curve to the data:

    y = 1 / (1 + e^(−βx))

• The y-value increases from zero towards one as x increases.

• New datapoints are classified based on whether their predicted y-value is above or below a chosen classification threshold.

• Typically, we start with 0.5 (50%) as the decision threshold: predict ŷ = 0 below it and ŷ = 1 above it.

• Benefit: Easy way to apply classification decisions to a continuous outcome.
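The logistic curve and threshold rule can be sketched with an assumed coefficient β = 1 (illustrative only):

```python
# Logistic function plus a 0.5 decision threshold.
from math import exp

def logistic(x, beta=1.0):
    return 1 / (1 + exp(-beta * x))   # rises from 0 toward 1 as x increases

threshold = 0.5
for x in (-3, 0, 3):
    p = logistic(x)
    label = 1 if p >= threshold else 0
    print(x, round(p, 3), label)
```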


Repeated Measure Regression

• Repeated measure regression is used when measurements are repeated over time.

• Common in medical fields where the same patient is measured at regular intervals.

• Repeat measures are therefore not independent.

In this example each patient is measured 3 times:

Patient | Y1 | Y2 | Y3
   1    |  2 |  3 |  4
   2    |  0 |  3 |  1
   3    |  1 |  4 |  3

- ANOVA techniques are popular for repeated measure regression
- ANOVA requires the intervals (age in this case) to be regular


Segmented Regression

Basic Segmented Regression
• Applies unique regression lines for different regions of X.
• Both regions have the same form (e.g. linear + linear).
• Benefit: Low complexity, easy to understand.

Generalized Additive Models
• Applies unique regression lines for different regions of X.
• Each region may exhibit a different regression form (e.g. linear + non-linear).
• Benefit: Can help model more complex relationships.


Other Advanced Models

Lasso Regression

• Penalises less informative variables by reducing their coefficients.

• Effectively adds bias in order to reduce variance.

• Benefit: Low complexity, easy to understand.


Other Advanced Models

Bayesian Regression

• Used to provide a level of certainty with the regression output.

• Useful in aiding decision making, i.e. an output with low certainty should be treated with caution.


Other Advanced Models

Random Effect Models

• Attempt to capture underlying correlations between observations.

  e.g. nationwide student results, where students may correlate by school

[Chart: two parallel fitted lines, one per school]


Other Advanced Models

Poisson Regression

• Used to model counts of something in a given time or area (the lowest count is 0).

  e.g. vehicles passing in a minute
  e.g. number of positive cases in each town
