Course Slides - Regression Analysis
• Learn what linear regression is and how to use it to make real world predictions
• Learn how to perform simple regression calculations in Excel & RegressIt
• Create linear regression models in Python using both statsmodels and sklearn modules
• Understand the implicit assumptions behind linear regression
• Be able to interpret regression coefficients, p-values and other metrics to evaluate a model
• Become familiar with more advanced regression techniques and when to use them
Regression refers to a specific type of Machine Learning model used to make predictions.
• Discrete variables
• Continuous variables – can take any value along a continuous scale
[Figure: a continuous variable plotted against an Independent Variable (X)]
Regression can be applied to a wide range of problems where we want to predict a continuous value
such as revenue, costs, life expectancy or film review scores.
• Target predicted using only one independent variable (e.g. X1 → Y)
• Target predicted using multiple independent variables (e.g. X1, X2 → Y)
• Target predicted using multiple independent variables (e.g. X1, X2, X3, X4 → Y)
• Break linear regression down into its simplest form, including lines of best fit and errors.
• Calculate the SSE (sum of squared errors) to summarize total error in the model.
• Learn how R2 can be used to explain the performance of the model.
• Be comfortable using regression terminology in conversation.
• Understand basic limitations of a linear regression model.
• Practice a basic regression scenario in Excel and Python.
The aim of simple linear regression is to fit a straight-line relationship between two variables, X and Y.
• Use the line of best fit to predict y-values for any x-value
[Figure: line of best fit through the observed data points]
• From the line of best fit we can predict house prices for any house building area
[Figure: house price (Target Variable) against Building Area ('000 sqft, Independent Variable), with one house highlighted as a single observation]
Additional Complexity
• House prices are determined by more than just the area of the house
• Multiple Linear Regression can be used to include other variables.
ŷ = θ₁x + θ₀
• ŷ is our predicted value
• x is the independent variable
• θ₀ represents the intercept
• θ₁ represents the slope
[Figure: umbrella sales (Target Variable, Y) plotted against monthly rainfall in cm (Independent Variable, X), with θ₀ marked where the line crosses the y-axis and the slope shown as the increase in ŷ per unit increase in X]
SSE = Σ(yᵢ − ŷᵢ)²
• The goal is to minimize the total error (SSE) produced by the line of best fit.
θ₁ (slope) = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
θ₀ (intercept) = ȳ − θ₁x̄
• x̄ and ȳ are the averages of all the observed xᵢ and yᵢ values.
• Different values of θ₀ and θ₁ give different predictions and hence different values for the SSE; the formulas above give the values that minimize the SSE (see the sketch below).
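As a small illustration of how these formulas are applied, the sketch below computes the slope and intercept directly with NumPy. The rainfall/umbrella numbers are made up for demonstration and are not course data.

```python
import numpy as np

# Hypothetical data: monthly rainfall (x, cm) and umbrella sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([14.0, 21.0, 26.0, 34.0, 41.0, 45.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: y_bar - theta_1 * x_bar
theta_0 = y_bar - theta_1 * x_bar

# SSE for the resulting line of best fit
y_hat = theta_0 + theta_1 * x
sse = np.sum((y - y_hat) ** 2)

print(f"slope theta_1     = {theta_1:.3f}")
print(f"intercept theta_0 = {theta_0:.3f}")
print(f"SSE               = {sse:.3f}")
```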
We should be careful not to extrapolate our results beyond the sample range of data.
• Looking beyond the range of observations may lead us to false conclusions.
[Figure: predictions extrapolated beyond the observed range of X]
• Ice cream does not cause sunburn
• In fact, the weather (sun) causes both.
• Additionally, we can find many correlated variables with no real relationship. These are called Spurious Regressions.
[Figure: sunburn cases plotted against Ice Cream Sales]
We should carefully consider if our independent variable is truly causing the effect on the target variable.
• Ordinary least squares will always produce the same result for a given dataset (it is an analytic algorithm)
• We must test our model thoroughly using new data before applying it to real-world decisions.
• Introduce and define Multiple Linear Regression
• Compare Multiple Linear Regression to Simple Linear Regression
• Explore the parameters of Multiple Linear Regression
• Introduce Multicollinearity and why it is important to Multiple Linear Regression
• Identify how the common issue of overfitting a Multiple Linear Regression model can occur
• Practice an in-depth Multiple Linear Regression scenario in Python
• When we have one input variable, we have two parameters: slope and intercept
ŷ = θ₁x + θ₀
- θ₀ and θ₁ are the parameters that define the regression line: θ₀ represents the intercept and θ₁ the slope
• With multiple input variables the model becomes:
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₚxₚ
• We use ordinary least squares again and solve p + 1 simultaneous equations using matrix algebra for the best-fit parameter values (a small sketch of this follows)
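A minimal sketch of that matrix-algebra solution, using made-up numbers and NumPy's normal-equations solve. In practice the course uses statsmodels or sklearn, which perform this step for you.

```python
import numpy as np

# Hypothetical data: two independent variables, e.g. building area and condition
X = np.array([[1.2, 5.0],
              [2.0, 7.0],
              [1.5, 6.0],
              [2.4, 8.0],
              [3.0, 6.0]])
y = np.array([180.0, 260.0, 210.0, 300.0, 320.0])

# Prepend a column of ones so the intercept theta_0 is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solve the p + 1 normal equations (X'X) theta = X'y
theta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print("theta_0 (intercept):", theta[0])
print("theta_1, theta_2 (slopes):", theta[1:])
```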
• Multicollinearity occurs when our input variables are strongly correlated with each other
[Diagram: Number of Bedrooms and Building Area both feed into House Price and are correlated with each other]
• With multicollinear variables, the algorithm would be unable to separate the effects of these two variables
• Adding uncorrelated independent variables to the model often helps better explain the target variable
but must be done with caution
• If a new variable is added that has no predictive value, the algorithm will calculate that and assign a
parameter value close to zero
ŷ = θ₀ + 0.001x₁ + θ₂x₂ + ⋯ + θₚxₚ
• Overfitting can occur when the model captures detail in the training data that does not exist in the test data
• This can be caused by random effects in the data reducing the SSE and hence being detected by the algorithm
• It is best to balance the number of independent variables so that we can adequately describe the data without overfitting (see the sketch below)
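A hedged illustration of this, on entirely simulated data: fit a model that includes several noise-only variables, then compare training and test performance. A clear drop on the test set is the overfitting signature described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one genuinely predictive variable plus ten pure-noise columns
n = 80
signal = rng.uniform(0, 3, size=(n, 1))
noise = rng.normal(size=(n, 10))          # no predictive value at all
X = np.hstack([signal, noise])
y = 50 + 30 * signal[:, 0] + rng.normal(scale=5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# A noticeably lower R^2 on the test data than on the training data
# suggests the model has picked up random effects, i.e. overfitting
print("Train R^2:", round(r2_score(y_train, model.predict(X_train)), 3))
print("Test  R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```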
• Review the concepts of residuals and residual plots
• Analyze datasets to ensure they meet the Ordinary Least Squares assumptions
• Evaluate Linear Regression metrics to measure the error in models
• Analyze model output with Linear Regression coefficients & p-values
• Apply knowledge with an interactive exercise and interpretation scenarios
• Practice interpreting the results of a Linear Regression model in Python
In a well-fitted model we expect to see:
• Residuals randomly scattered around the x-axis
• Residuals with an average value of zero
• Non-random patterns in the residuals could be caused by one or more of the following factors:
- Linear model is not appropriate
- Omitted variable bias
- Data does not satisfy OLS assumptions
• In these cases, we may be able to transform the data to make it fit the model or otherwise use a different model
[Figures: example residual plots showing random and non-random patterns]
• If we observe the residuals to be randomly distributed, then we can be reasonably content that our
model is unbiased
• Non-random patterns in the residuals provide the opportunity to improve the model by introducing
new independent variables or transformations
• If these methods do not resolve the problems, then potentially a non-linear model is more appropriate
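A short sketch of producing a residual plot in Python. The data here is simulated with a mild curve so the straight-line fit leaves a visible pattern; the variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data with a mild curve, so a straight-line fit leaves a pattern
x = rng.uniform(0, 3, 60)
y = 20 + 15 * x + 4 * x ** 2 + rng.normal(scale=3, size=60)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Random scatter around zero suggests the model is unbiased;
# a curved band suggests the linear form is not appropriate
plt.scatter(fitted, residuals)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()
```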
The Ordinary Least Squares method relies on 6 assumptions about the data:
Linearity • Homoscedasticity • Zero-Mean Errors
[Figures: example plots illustrating Linearity, Homoscedasticity and Zero-Mean Errors against the Independent Variable (X), with corresponding residual plots]
• Count each error on a chart to see if it fits a normal distribution
• Homoscedasticity – the spread of errors should not change with the independent variables
• Heteroskedasticity – errors that do not have constant variance
• A non-zero average error means the model produces biased estimates
• We often observe autocorrelation in time series data, where successive errors form a correlated pair; this also distorts the parameter estimates
[Figures: residual histogram plus residual plots showing heteroskedasticity and autocorrelation]
When all six assumptions are met, the Ordinary Least Squares method is the best linear regression method
Absolute Metrics
• Sum Absolute Error (SAE)
• Mean Absolute Error (MAE)
Squared Metrics
• Sum of Squared Errors: SSE = Σ(yᵢ − ŷᵢ)²
[Figure: errors between the observed points and the fitted line]
• Coefficient of Determination (R2) calculates to what extent the independent variables explain the
changes in the target variable
[Figures: example fits with R2 = 1, R2 = 0.86 and R2 = 0]
• SSE/TSS tells us how much variation in the target our model leaves unexplained, as a fraction of the total variation
• One minus this value tells us how much variation has been explained by the model:
R2 = 1 − SSE/TSS
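A tiny sketch of this formula in action, using made-up observed and predicted values:

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([9.5, 13.0, 15.5, 18.0, 25.0])

sse = np.sum((y - y_hat) ** 2)        # unexplained variation
tss = np.sum((y - y.mean()) ** 2)     # total variation in the target

r_squared = 1 - sse / tss
print(round(r_squared, 3))
```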
• The independent variables in models with high R2 values explain most of the variance present in the data
• Just as the SSE never increases when new variables are added, the coefficient of determination never decreases; however, adding in unnecessary variables can lead to overfitting the data
• We can see the effects of overfitting when we compare training performance with test data performance
• We can modify the R2 to account for the number of independent variables – this is known as the adjusted R2 (written R̄2)
R̄2 = 1 − [SSE / (n − p − 1)] / [TSS / (n − 1)]
• 𝒏 is the number of observed datapoints
• 𝒑 is the number of independent variables in the model (excluding the intercept term)
• The adjusted R2 value only increases when a new variable is added that improves the model by more than
would be expected due to random chance
• It can therefore decrease if a variable is added that does not sufficiently explain the data
• This adjusted R2 will always be lower than the unadjusted R2
• Evaluating these metrics after building a model provides insight into the model performance
• MSE, RMSE and MAE metrics tell us how far our predictions are from their true values on average
• MSE or RMSE can be used when we want to give greater importance to individual large errors
• The coefficient of determination tells us how much of the variability in the data the independent
variables are explaining
• The adjusted coefficient of determination helps prevent overfitting by decreasing if a variable is added
that does not sufficiently explain the data
• The MSE, RMSE and MAE are absolute measures of fit, while R2 is relative to the total sum of squares
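As a sketch of computing these metrics in Python with sklearn (the observed and predicted values below are made up, and p is an assumed variable count used only to show the adjusted R2 formula):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed house prices and model predictions
y_true = np.array([150.0, 180.0, 210.0, 260.0, 300.0])
y_pred = np.array([145.0, 190.0, 205.0, 255.0, 310.0])

n = len(y_true)   # number of observations
p = 2             # assumed number of independent variables in the model

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R^2 = 1 - [SSE / (n - p - 1)] / [TSS / (n - 1)]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"R^2={r2:.3f}  adjusted R^2={adj_r2:.3f}")
```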
• The following model predicts house prices (P) using the building area (A) and a measure of the overall house condition (C):
P = 77,000 + 80A + 9,200C
• Three parameters:
- A base price of £77,000
- On average, for every extra sqft we add, the house price increases by £80
- Similarly, for every condition point we add, the price will increase by £9,200
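A quick worked prediction with these coefficients; the area and condition values below are hypothetical inputs, not course data:

```python
# Worked example using the coefficients quoted above:
# P = 77,000 + 80*A + 9,200*C
A = 1500   # hypothetical building area in sqft
C = 7      # hypothetical condition score

P = 77_000 + 80 * A + 9_200 * C
print(P)   # 77,000 + 120,000 + 64,400 = 261,400
```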
• In general, our parameter values or coefficients tell us how much the target variable changes on
average when we change the corresponding independent variables
• Positive coefficients mean that the target variable is positively correlated with the independent variable
while negative coefficients tell us that it is negatively correlated
• We can scale the independent variables before fitting the regression in order to compare coefficients
• Standardization – transforming data to have mean zero and unit variance by subtracting the mean value and dividing by the standard deviation:
z = (xᵢ − x̄) / σ
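A small sketch of standardizing inputs before fitting, using sklearn's StandardScaler on made-up house data (the numbers do not reproduce the course's scaled model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical house data: building area ('000 sqft) and condition score
X = np.array([[1.2, 5.0],
              [2.0, 7.0],
              [1.5, 6.0],
              [2.4, 8.0],
              [3.0, 6.0],
              [1.8, 4.0]])
y = np.array([180.0, 260.0, 210.0, 300.0, 320.0, 200.0])

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, y)

# Each coefficient is now "change in price per one standard deviation of the
# input", so the coefficients can be compared with each other directly
print(model.coef_, model.intercept_)
```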
• Our scaled house price model then becomes:
• A p-value tells us the probability that we would get the results we observe when we assume the null
hypothesis is true
• If this p-value is below 0.05, then it is unlikely that we would observe a coefficient so far from zero by chance, and we therefore assume that there is a significant relationship between the independent variable and the target variable
• We can calculate p-values for all the coefficients in our model and discard the independent variables
that are not significant
• Removing independent variables that are not statistically significant helps prevent overfitting
• A significant p-value does not necessarily imply a high R2 value and vice-versa
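A sketch of inspecting coefficients and p-values with statsmodels. The data is simulated so that one column ("noise") genuinely has no effect; its p-value should typically come out above 0.05.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical data: price driven by area and condition; "noise" adds nothing
area = rng.uniform(1, 3, 50)
condition = rng.integers(1, 10, 50).astype(float)
noise = rng.normal(size=50)
price = 77 + 80 * area + 9.2 * condition + rng.normal(scale=10, size=50)

X = sm.add_constant(np.column_stack([area, condition, noise]))
model = sm.OLS(price, X).fit()

# The summary lists each coefficient with its p-value; variables whose
# p-values stay above 0.05 (here, the noise column) are candidates for removal
print(model.summary())
```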
• We can see from the figure below that the two models give very different predictions.
[Figure: regression lines fitted with and without the outliers]
• Should we remove the outliers, leave them in, or use a different kind of model?
• An introduction to more advanced regression techniques
• Know what you don’t know and give you a few ideas of what to explore next
• Practice a more advanced regression in Excel
• Log-Log Linear Regression
• Repeated Measures Regression
• Segmented & GAM Regression
• A higher exponent suggests a more complex relationship.
[Figures: polynomial relationships of Y against X for X² (Quadratic), X³ (Cubic), X⁴ (Quartic) and X⁵ (Quintic)]
• Logistic regression is typically used for classification problems
• In logistic regression we fit the logistic curve function to the data:
y = 1 / (1 + e^(−βx))
• The y-value increases from zero to one as x increases
[Figure: logistic curve rising from Y = 0 to Y = 1, with a classification threshold at Y = 0.5; predict ŷ = 0 below the threshold and ŷ = 1 above it]
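A brief sketch of fitting a logistic curve for classification with sklearn, on simulated data where the outcome becomes more likely as x grows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical binary outcome that becomes more likely as x increases
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = (x[:, 0] + rng.normal(scale=2, size=200) > 5).astype(int)

clf = LogisticRegression().fit(x, y)

# predict_proba gives the fitted logistic curve value (between 0 and 1);
# predict applies the default 0.5 classification threshold
print(clf.predict_proba([[3.0], [7.0]])[:, 1])
print(clf.predict([[3.0], [7.0]]))
```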
Linear
• Applies unique regression lines for different regions of X.
• Both regions have the same form.
• Benefit: Low complexity, easy to understand.
Non-Linear
• Applies unique regression lines for different regions of X.
• Each region may exhibit a different regression form.
• Benefit: Can help model more complex relationships.
[Figures: piecewise linear and piecewise non-linear fits against X]
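A rough sketch of the linear (segmented-style) idea: fit one straight line per region of X. The breakpoint at x = 5 is assumed and the data is simulated; in practice the breakpoint may itself need to be estimated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Hypothetical data whose slope changes at x = 5 (an assumed, known breakpoint)
x = rng.uniform(0, 10, 100)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rng.normal(scale=0.5, size=100)

breakpoint_x = 5.0
left, right = x < breakpoint_x, x >= breakpoint_x

# Fit one straight line per region of X
left_fit = LinearRegression().fit(x[left].reshape(-1, 1), y[left])
right_fit = LinearRegression().fit(x[right].reshape(-1, 1), y[right])

print("slope below the breakpoint:", round(left_fit.coef_[0], 2))
print("slope above the breakpoint:", round(right_fit.coef_[0], 2))
```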
Other techniques to explore:
• Lasso Regression
• Bayesian Regression
• Poisson Regression – used to model counts of something in a given time or area; the lowest possible count is 0.
[Figure: example with separate fits for School 1 and School 2]
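A minimal sketch of Poisson regression on simulated count data, using sklearn's PoissonRegressor; the rate structure below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)

# Hypothetical count data (e.g. arrivals per hour) driven by a single variable x
x = rng.uniform(0, 3, 200).reshape(-1, 1)
counts = rng.poisson(lam=np.exp(0.5 + 0.8 * x[:, 0]))

model = PoissonRegressor().fit(x, counts)

# Predictions are always non-negative, which matches count data
# (the lowest possible count is 0)
print(model.predict([[0.5], [2.5]]))
```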