Course Slides - Regression Analysis
• Learn what linear regression is and how to use it to make real world predictions
• Learn how to perform simple regression calculations in Excel & RegressIt
• Create linear regression models in Python using both statsmodels and sklearn modules
• Understand the implicit assumptions behind linear regression
• Be able to interpret regression coefficients, p-values and other metrics to evaluate a model
• Become familiar with more advanced regression techniques and when to use them
Regression refers to a specific type of Machine Learning model used to make predictions.
• Discrete variables
• Continuous variables – can take any value along a continuous scale
[Figure: a continuous variable plotted against an Independent Variable (X)]
Regression can be applied to a wide range of problems where we want to predict a continuous value
such as revenue, costs, life expectancy or film review scores.
• Target predicted using only one independent variable (e.g. X1 → Y)
• Target predicted using multiple independent variables (e.g. X1, X2 → Y)
• Target predicted using multiple independent variables (e.g. X1, X2, X3, X4 → Y)
• Break linear regression down into its simplest form, including lines of best fit and errors.
• Calculate the SSE (sum of squared errors) to summarize total error in the model.
• Learn how R2 can be used to explain the performance of the model.
• Be comfortable using regression terminology in conversation.
• Understand basic limitations of a linear regression model.
• Practice a basic regression scenario in Excel and Python.
The aim of simple linear regression is to fit a straight-line relationship between two variables, X and Y.
• Use the line of best fit to predict y-values for any x-value
[Figure: line of best fit through the observed data points]
• From the line of best fit we can predict house prices for any house building area
[Figure: house price (Target Variable) against Building Area ('000 sqft, Independent Variable), with one house highlighted as a single observation]
Additional Complexity
• House prices are determined by more than just the area of the house
• Multiple Linear Regression can be used to include other variables.
ŷ = θ₁x + θ₀
• ŷ is our predicted value
• x is the independent variable
• θ₀ represents the intercept
• θ₁ represents the slope
[Figure: umbrella sales (Target Variable, Y) plotted against monthly rainfall in cm (Independent Variable, X), with θ₀ marked where the line crosses the y-axis and the slope shown as the increase in ŷ per unit increase in X]
SSE = Σ(yᵢ − ŷᵢ)²
• The goal is to minimize the total error (SSE) produced by the line of best fit.
θ₁ (slope) = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
θ₀ (intercept) = ȳ − θ₁x̄
• x̄ and ȳ are the averages of all the observed xᵢ and yᵢ values.
• Different values of θ₀ and θ₁ give different predictions and hence different values for the SSE; the formulas above give the values that minimize the SSE (see the sketch below).
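As a small illustration of how these formulas are applied, the sketch below computes the slope and intercept directly with NumPy. The rainfall/umbrella numbers are made up for demonstration and are not course data.

```python
import numpy as np

# Hypothetical data: monthly rainfall (x, cm) and umbrella sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([14.0, 21.0, 26.0, 34.0, 41.0, 45.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: y_bar - theta_1 * x_bar
theta_0 = y_bar - theta_1 * x_bar

# SSE for the resulting line of best fit
y_hat = theta_0 + theta_1 * x
sse = np.sum((y - y_hat) ** 2)

print(f"slope theta_1     = {theta_1:.3f}")
print(f"intercept theta_0 = {theta_0:.3f}")
print(f"SSE               = {sse:.3f}")
```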
We should be careful not to extrapolate our results beyond the sample range of data.
• Looking beyond the range of observations may lead us to false conclusions.
[Figure: predictions extrapolated beyond the observed range of X]
• Ice cream does not cause sunburn
• In fact, the weather (sun) causes both.
• Additionally, we can find many correlated variables with no real relationship. These are called Spurious Regressions.
[Figure: sunburn cases plotted against Ice Cream Sales]
We should carefully consider if our independent variable is truly causing the effect on the target variable.
• Ordinary least squares will always produce the same result for a given dataset (it is an analytic algorithm)
• We must test our model thoroughly using new data before applying it to real-world decisions.
• Introduce and define Multiple Linear Regression
• Compare Multiple Linear Regression to Simple Linear Regression
• Explore the parameters of Multiple Linear Regression
• Introduce Multicollinearity and why it is important to Multiple Linear Regression
• Identify how the common issue of overfitting a Multiple Linear Regression model can occur
• Practice an in-depth Multiple Linear Regression scenario in Python
• When we have one input variable, we have two parameters: slope and intercept
ŷ = θ₁x + θ₀
- θ₀ and θ₁ are the parameters that define the regression line: θ₀ represents the intercept and θ₁ the slope
• With multiple input variables the model becomes:
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₚxₚ
• We use ordinary least squares again and solve p + 1 simultaneous equations using matrix algebra for the best-fit parameter values (a small sketch of this follows)
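A minimal sketch of that matrix-algebra solution, using made-up numbers and NumPy's normal-equations solve. In practice the course uses statsmodels or sklearn, which perform this step for you.

```python
import numpy as np

# Hypothetical data: two independent variables, e.g. building area and condition
X = np.array([[1.2, 5.0],
              [2.0, 7.0],
              [1.5, 6.0],
              [2.4, 8.0],
              [3.0, 6.0]])
y = np.array([180.0, 260.0, 210.0, 300.0, 320.0])

# Prepend a column of ones so the intercept theta_0 is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solve the p + 1 normal equations (X'X) theta = X'y
theta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print("theta_0 (intercept):", theta[0])
print("theta_1, theta_2 (slopes):", theta[1:])
```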
• Multicollinearity occurs when our input variables are strongly correlated with each other
[Diagram: Number of Bedrooms and Building Area both feed into House Price and are correlated with each other]
• With multicollinear variables, the algorithm would be unable to separate the effects of these two variables
• Adding uncorrelated independent variables to the model often helps better explain the target variable
but must be done with caution
• If a new variable is added that has no predictive value, the algorithm will calculate that and assign a
parameter value close to zero
ŷ = θ₀ + 0.001x₁ + θ₂x₂ + ⋯ + θₚxₚ
• Overfitting can occur when the model captures detail in the training data that does not exist in the test data
• This can be caused by random effects in the data reducing the SSE and hence being detected by the algorithm
• It is best to balance the number of independent variables so that we can adequately describe the data without overfitting (see the sketch below)
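A hedged illustration of this, on entirely simulated data: fit a model that includes several noise-only variables, then compare training and test performance. A clear drop on the test set is the overfitting signature described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one genuinely predictive variable plus ten pure-noise columns
n = 80
signal = rng.uniform(0, 3, size=(n, 1))
noise = rng.normal(size=(n, 10))          # no predictive value at all
X = np.hstack([signal, noise])
y = 50 + 30 * signal[:, 0] + rng.normal(scale=5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# A noticeably lower R^2 on the test data than on the training data
# suggests the model has picked up random effects, i.e. overfitting
print("Train R^2:", round(r2_score(y_train, model.predict(X_train)), 3))
print("Test  R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```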
• Review the concepts of residuals and residual plots
• Analyze datasets to ensure they meet the Ordinary Least Squares assumptions
• Evaluate Linear Regression metrics to measure the error in models
• Analyze model output with Linear Regression coefficients & p-values
• Apply knowledge with an interactive exercise and interpretation scenarios
• Practice interpreting the results of a Linear Regression model in Python
In a well-fitted model we expect to see:
• Residuals randomly scattered around the x-axis
• Residuals with an average value of zero
• Non-random patterns in the residuals could be caused by one or more of the following factors:
- Linear model is not appropriate
- Omitted variable bias
- Data does not satisfy OLS assumptions
• In these cases, we may be able to transform the data to make it fit the model or otherwise use a different model
[Figures: example residual plots showing random and non-random patterns]
• If we observe the residuals to be randomly distributed, then we can be reasonably content that our
model is unbiased
• Non-random patterns in the residuals provide the opportunity to improve the model by introducing
new independent variables or transformations
• If these methods do not resolve the problems, then potentially a non-linear model is more appropriate
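A short sketch of producing a residual plot in Python. The data here is simulated with a mild curve so the straight-line fit leaves a visible pattern; the variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data with a mild curve, so a straight-line fit leaves a pattern
x = rng.uniform(0, 3, 60)
y = 20 + 15 * x + 4 * x ** 2 + rng.normal(scale=3, size=60)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Random scatter around zero suggests the model is unbiased;
# a curved band suggests the linear form is not appropriate
plt.scatter(fitted, residuals)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()
```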
The Ordinary Least Squares method relies on 6 assumptions about the data:
Linearity • Homoscedasticity • Zero-Mean Errors
[Figures: example plots illustrating Linearity, Homoscedasticity and Zero-Mean Errors against the Independent Variable (X), with corresponding residual plots]
• Count each error on a chart to see if it fits a normal distribution
• Homoscedasticity – the spread of errors should not change with the independent variables
• Heteroskedasticity – errors that do not have constant variance
• A non-zero average error means the model produces biased estimates
• We often observe autocorrelation in time series data, where successive errors form a correlated pair; this also distorts the parameter estimates
[Figures: residual histogram plus residual plots showing heteroskedasticity and autocorrelation]
When all six assumptions are met, the Ordinary Least Squares method is the best linear regression method
Absolute Metrics
• Sum Absolute Error (SAE)
• Mean Absolute Error (MAE)
Squared Metrics
• Sum of Squared Errors: SSE = Σ(yᵢ − ŷᵢ)²
[Figure: errors between the observed points and the fitted line]
• Coefficient of Determination (R2) calculates to what extent the independent variables explain the
changes in the target variable
[Figures: example fits with R2 = 1, R2 = 0.86 and R2 = 0]
• SSE/TSS tells us how much variation in the target our model leaves unexplained, as a fraction of the total variation
• One minus this value tells us how much variation has been explained by the model:
R2 = 1 − SSE/TSS
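A tiny sketch of this formula in action, using made-up observed and predicted values:

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([9.5, 13.0, 15.5, 18.0, 25.0])

sse = np.sum((y - y_hat) ** 2)        # unexplained variation
tss = np.sum((y - y.mean()) ** 2)     # total variation in the target

r_squared = 1 - sse / tss
print(round(r_squared, 3))
```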
• The independent variables in models with high R2 values explain most of the variance present in the data
• Just as the SSE never increases when new variables are added, the coefficient of determination never decreases; however, adding in unnecessary variables can lead to overfitting the data
• We can see the effects of overfitting when we compare training performance with test data performance
• We can modify the R2 to account for the number of independent variables – this is known as the adjusted R2 (written R̄2)
R̄2 = 1 − [SSE / (n − p − 1)] / [TSS / (n − 1)]
• 𝒏 is the number of observed datapoints
• 𝒑 is the number of independent variables in the model (excluding the intercept term)
• The adjusted R2 value only increases when a new variable is added that improves the model by more than
would be expected due to random chance
• It can therefore decrease if a variable is added that does not sufficiently explain the data
• This adjusted R2 will always be lower than the unadjusted R2
• Evaluating these metrics after building a model provides insight into the model performance
• MSE, RMSE and MAE metrics tell us how far our predictions are from their true values on average
• MSE or RMSE can be used when we want to give greater importance to individual large errors
• The coefficient of determination tells us how much of the variability in the data the independent
variables are explaining
• The adjusted coefficient of determination helps prevent overfitting by decreasing if a variable is added
that does not sufficiently explain the data
• The MSE, RMSE and MAE are absolute measures of fit, while R2 is relative to the total sum of squares
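As a sketch of computing these metrics in Python with sklearn (the observed and predicted values below are made up, and p is an assumed variable count used only to show the adjusted R2 formula):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed house prices and model predictions
y_true = np.array([150.0, 180.0, 210.0, 260.0, 300.0])
y_pred = np.array([145.0, 190.0, 205.0, 255.0, 310.0])

n = len(y_true)   # number of observations
p = 2             # assumed number of independent variables in the model

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R^2 = 1 - [SSE / (n - p - 1)] / [TSS / (n - 1)]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"R^2={r2:.3f}  adjusted R^2={adj_r2:.3f}")
```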
• The following model predicts house prices (P) using the building area (A) and a measure of the overall house condition (C):
P = 77,000 + 80A + 9,200C
• Three parameters:
- A base price of £77,000
- On average, for every extra sqft we add, the house price increases by £80
- Similarly, for every condition point we add, the price will increase by £9,200
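A quick worked prediction with these coefficients; the area and condition values below are hypothetical inputs, not course data:

```python
# Worked example using the coefficients quoted above:
# P = 77,000 + 80*A + 9,200*C
A = 1500   # hypothetical building area in sqft
C = 7      # hypothetical condition score

P = 77_000 + 80 * A + 9_200 * C
print(P)   # 77,000 + 120,000 + 64,400 = 261,400
```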
• In general, our parameter values or coefficients tell us how much the target variable changes on
average when we change the corresponding independent variables
• Positive coefficients mean that the target variable is positively correlated with the independent variable
while negative coefficients tell us that it is negatively correlated
• We can scale the independent variables before fitting the regression in order to compare coefficients
• Standardization – transforming data to have mean zero and unit variance by subtracting the mean value and dividing by the standard deviation:
z = (xᵢ − x̄) / σ
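A small sketch of standardizing inputs before fitting, using sklearn's StandardScaler on made-up house data (the numbers do not reproduce the course's scaled model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical house data: building area ('000 sqft) and condition score
X = np.array([[1.2, 5.0],
              [2.0, 7.0],
              [1.5, 6.0],
              [2.4, 8.0],
              [3.0, 6.0],
              [1.8, 4.0]])
y = np.array([180.0, 260.0, 210.0, 300.0, 320.0, 200.0])

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, y)

# Each coefficient is now "change in price per one standard deviation of the
# input", so the coefficients can be compared with each other directly
print(model.coef_, model.intercept_)
```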
• Our scaled house price model then becomes:
• A p-value tells us the probability that we would get the results we observe when we assume the null
hypothesis is true
• If this p-value is below 0.05, then it is unlikely that we would observe a coefficient so far from zero by chance, and we therefore assume that there is a significant relationship between the independent variable and the target variable
• We can calculate p-values for all the coefficients in our model and discard the independent variables
that are not significant
• Removing independent variables that are not statistically significant helps prevent overfitting
• A significant p-value does not necessarily imply a high R2 value and vice-versa
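A sketch of inspecting coefficients and p-values with statsmodels. The data is simulated so that one column ("noise") genuinely has no effect; its p-value should typically come out above 0.05.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical data: price driven by area and condition; "noise" adds nothing
area = rng.uniform(1, 3, 50)
condition = rng.integers(1, 10, 50).astype(float)
noise = rng.normal(size=50)
price = 77 + 80 * area + 9.2 * condition + rng.normal(scale=10, size=50)

X = sm.add_constant(np.column_stack([area, condition, noise]))
model = sm.OLS(price, X).fit()

# The summary lists each coefficient with its p-value; variables whose
# p-values stay above 0.05 (here, the noise column) are candidates for removal
print(model.summary())
```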
• We can see from the figure below that the two models give very different predictions.
[Figure: regression lines fitted with and without the outliers]
• Should we remove the outliers, leave them in, or use a different kind of model?
• An introduction to more advanced regression techniques
• Know what you don’t know and give you a few ideas of what to explore next
• Practice a more advanced regression in Excel
• Log-Log Linear Regression
• Repeated Measures Regression
• Segmented & GAM Regression
• A higher exponent suggests a more complex relationship.
[Figures: polynomial relationships of Y against X for X² (Quadratic), X³ (Cubic), X⁴ (Quartic) and X⁵ (Quintic)]
• Logistic regression is typically used for classification problems
• In logistic regression we fit the logistic curve function to the data:
y = 1 / (1 + e^(−βx))
• The y-value increases from zero to one as x increases
[Figure: logistic curve rising from Y = 0 to Y = 1, with a classification threshold at Y = 0.5; predict ŷ = 0 below the threshold and ŷ = 1 above it]
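A brief sketch of fitting a logistic curve for classification with sklearn, on simulated data where the outcome becomes more likely as x grows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical binary outcome that becomes more likely as x increases
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = (x[:, 0] + rng.normal(scale=2, size=200) > 5).astype(int)

clf = LogisticRegression().fit(x, y)

# predict_proba gives the fitted logistic curve value (between 0 and 1);
# predict applies the default 0.5 classification threshold
print(clf.predict_proba([[3.0], [7.0]])[:, 1])
print(clf.predict([[3.0], [7.0]]))
```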
Linear
• Applies unique regression lines for different regions of X.
• Both regions have the same form.
• Benefit: Low complexity, easy to understand.
Non-Linear
• Applies unique regression lines for different regions of X.
• Each region may exhibit a different regression form.
• Benefit: Can help model more complex relationships.
[Figures: piecewise linear and piecewise non-linear fits against X]
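A rough sketch of the linear (segmented-style) idea: fit one straight line per region of X. The breakpoint at x = 5 is assumed and the data is simulated; in practice the breakpoint may itself need to be estimated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Hypothetical data whose slope changes at x = 5 (an assumed, known breakpoint)
x = rng.uniform(0, 10, 100)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rng.normal(scale=0.5, size=100)

breakpoint_x = 5.0
left, right = x < breakpoint_x, x >= breakpoint_x

# Fit one straight line per region of X
left_fit = LinearRegression().fit(x[left].reshape(-1, 1), y[left])
right_fit = LinearRegression().fit(x[right].reshape(-1, 1), y[right])

print("slope below the breakpoint:", round(left_fit.coef_[0], 2))
print("slope above the breakpoint:", round(right_fit.coef_[0], 2))
```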
Other techniques to explore:
• Lasso Regression
• Bayesian Regression
• Poisson Regression – used to model counts of something in a given time or area; the lowest possible count is 0.
[Figure: example with separate fits for School 1 and School 2]
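A minimal sketch of Poisson regression on simulated count data, using sklearn's PoissonRegressor; the rate structure below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)

# Hypothetical count data (e.g. arrivals per hour) driven by a single variable x
x = rng.uniform(0, 3, 200).reshape(-1, 1)
counts = rng.poisson(lam=np.exp(0.5 + 0.8 * x[:, 0]))

model = PoissonRegressor().fit(x, counts)

# Predictions are always non-negative, which matches count data
# (the lowest possible count is 0)
print(model.predict([[0.5], [2.5]]))
```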