03 Regression Analysis
(LV 0000001540)
Session 3
21 November 2022
Regression Analysis
Rolf Moeckel | Professor of Travel Behavior | Department of Mobility Systems Engineering | Technical University of Munich
Statistical learning
Shown are Sales vs. TV, Radio, and Newspaper ads, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three? Perhaps we can do better using a model
Sales = f (TV, Radio, Newspaper)
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
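As a minimal sketch, such a model could be fit by ordinary least squares. The file name and column names below are assumptions for illustration, not the actual course data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed file and column names (hypothetical) for the advertising data
ads = pd.read_csv("Advertising.csv")
X = ads[["TV", "Radio", "Newspaper"]]  # advertising budgets
y = ads["Sales"]                       # response we wish to predict

# Fit Sales = b0 + b1*TV + b2*Radio + b3*Newspaper
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```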
Notation (1)
Here, Sales is a dependent variable that we wish to predict. We generically refer to the response as $Y$.
The ideal or optimal predictor of $Y$ with regard to mean-squared prediction error, $f(x) = E(Y \mid X = x)$, is the function that minimizes $E\big[(Y - f(X))^2 \mid X = x\big]$ over all functions $f$ at all points $X = x$.
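A small simulation (not part of the original slides) illustrates this optimality: among constant predictions $c$ for $Y$ at a given $x$, the average squared error is smallest at $c = E(Y \mid X = x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate draws of Y | X = x with E(Y | X = x) = 2.0 and noise variance 1.0
y = 2.0 + rng.normal(scale=1.0, size=100_000)

# Average squared error of candidate constant predictions c
for c in [1.0, 1.5, 2.0, 2.5]:
    print(c, np.mean((y - c) ** 2))
# The minimum (about 1.0 = Var(eps)) occurs at c = 2.0 = E(Y | X = x)
```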
Squaring the error makes it positive and emphasizes large deviations. For any estimate $\hat{f}$, the expected squared prediction error splits into a reducible and an irreducible part:

$E\big[(Y - \hat{f}(X))^2 \mid X = x\big] = \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}$
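This decomposition can be checked numerically. In the sketch below, the true function, the estimate, and the noise level are all made-up assumptions; the test error of the imperfect estimate splits into its reducible part plus $\mathrm{Var}(\epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200_000)
f = np.sin(2 * x)                          # assumed true regression function
eps = rng.normal(scale=0.5, size=x.size)   # irreducible noise, Var(eps) = 0.25
y = f + eps

fhat = 0.5 * x                             # a deliberately imperfect estimate
mse = np.mean((y - fhat) ** 2)             # total expected squared error
reducible = np.mean((f - fhat) ** 2)       # [f(x) - fhat(x)]^2, averaged over x
print(mse, reducible + 0.25)               # the two agree up to sampling noise
```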
Nearest neighbor methods can be lousy when $p$ is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
• We need to average a reasonable fraction of the $N$ values of $y_i$, e.g. 10%, in order to bring the variance down.
• A 10% neighborhood in high dimensions (i.e., when the number of independent variables $p$ is large) need no longer be local, so we lose the spirit of estimating $E(Y \mid X = x)$ by local averaging.
[Figure: a 10% neighborhood in one dimension ($x_1$) vs. a 10% neighborhood in two dimensions ($x_1$ and $x_2$)]
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
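The geometry behind this can be made concrete with a one-line computation (an illustration, not from the slides): for data uniform on $[0,1]^p$, a sub-cube containing 10% of the observations must have edge length $0.1^{1/p}$.

```python
# Edge length of a hypercube that captures 10% of uniformly
# distributed data in p dimensions: 0.1 ** (1 / p)
for p in [1, 2, 5, 10, 100]:
    print(p, round(0.1 ** (1 / p), 2))
# 1 -> 0.10, 2 -> 0.32, 5 -> 0.63, 10 -> 0.79, 100 -> 0.98:
# in high dimensions the "10% neighborhood" spans nearly the whole
# range of every predictor and is no longer local
```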
Linear regression model
A quadratic model
$\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 X^2$
may fit slightly better.
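A minimal sketch of such a quadratic fit, with simulated data (the function and coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(scale=1.0, size=100)

# Least-squares fit of f_hat(X) = b0 + b1*X + b2*X^2
b2, b1, b0 = np.polyfit(x, y, deg=2)   # polyfit returns the highest degree first
print(b0, b1, b2)                      # approximately 1.0, 0.5, 0.2
```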
The mean squared error computed on the training data may be biased towards models that are overfit. Instead, we should – when possible – compute the error using fresh test data $Te$:

$MSE_{Te} = \mathrm{Ave}_{i \in Te}\,\big[y_i - \hat{f}(x_i)\big]^2$
To create test data, the records are randomly split: for example, 80% of all records are used for model estimation, and the remaining records, which were not used for estimation, are used for model testing.
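A minimal sketch of this 80/20 split and of computing $MSE_{Te}$, here with scikit-learn on simulated data (names and sizes are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 + 0.7 * X[:, 0] + rng.normal(scale=1.0, size=500)

# Randomly sample 80% of the records for estimation, hold out 20% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)              # estimate on training data
mse_te = mean_squared_error(y_te, model.predict(X_te))  # MSE_Te on fresh test data
print(mse_te)
```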
Danger of overfitting a model (Example 1)
[Figure: test MSE $MSE_{Te}$ plotted against model flexibility, from simpler to more complex models]
Black curve is “truth.” Orange, blue and green curves/squares correspond to fits of different flexibility.
Danger of overfitting a model (Example 2)
[Figure: training MSE $MSE_{Tr}$ and test MSE $MSE_{Te}$ plotted against model flexibility, from simpler to more complex models]
Here, the truth is smoother (or simpler). The linear model does really well. Simple models are generally preferred over complex models.
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
Danger of overfitting a model (Example 3)
[Figure: training and test MSE plotted against model flexibility, from simpler to more complex models]
Here, the truth is wiggly and the noise is low, so the more flexible fits do the best job.
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
Bias-variance trade-off
Suppose we have fit a model $\hat{f}(x)$ to some training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population. If the true model is $Y = f(X) + \epsilon$ with $f(x) = E(Y \mid X = x)$, then

$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\epsilon)$
Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem by a much simpler model.

[Figure: observations, the true relationship, and the estimated curve, illustrating bias, variance, and the noise $\epsilon$]
There is a bias-variance trade-off. Typically, as $\hat{f}$ becomes more complex, its variance increases and its bias decreases.
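The trade-off can be checked by simulation (a sketch with assumed data, not from the slides): draw many training sets, fit polynomials of different flexibility, and estimate the variance and squared bias of $\hat{f}(x_0)$ at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)        # assumed true regression function
x0, n, reps = 1.0, 50, 2000        # test point, training size, repetitions

for deg in [1, 3, 9]:              # polynomial degree = model flexibility
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-2, 2, size=n)
        y = f(x) + rng.normal(scale=0.3, size=n)   # Var(eps) = 0.09
        coefs = np.polyfit(x, y, deg)              # fit on this training set
        preds[r] = np.polyval(coefs, x0)           # prediction at x0
    var = preds.var()                              # variance of f_hat(x0)
    bias2 = (preds.mean() - f(x0)) ** 2            # squared bias at x0
    # Expected test error at x0 = variance + bias^2 + Var(eps)
    print(deg, var, bias2, var + bias2 + 0.09)
```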
Bias-variance trade-offs for the three examples
[Figure: bias-variance trade-off for Examples 1, 2, and 3: squared bias, variance, the irreducible error $\mathrm{Var}(\epsilon)$, and test MSE plotted against model flexibility, from simpler to more complex models]