Lecture 1
Jeanine Houwing-Duistermaat
2022-23
[Scatter plot of galton.sons$Height versus galton.sons$Father]
cor(galton.sons$Father,galton.sons$Height)
[1] 0.3913174
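A minimal R sketch of how this scatter plot and correlation could be produced (assuming the galton.txt file that is read in the R code later in this lecture):

galton <- read.table("galton.txt", header = TRUE)   # family heights data
galton.sons <- galton[galton$Gender == "M", ]       # keep the sons
plot(galton.sons$Father, galton.sons$Height)        # son's height versus father's height
cor(galton.sons$Father, galton.sons$Height)         # approx. 0.39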
Q2: Given a value for the height of the father (X), what would be a prediction for the height of the son (Y)?
Mathematics
Model:
Y = β0 + β1 X + ϵ,
E_X(Y) = E(Y|X) = β0 + β1 X = µ(X) is a function of X
(sloppy notation, since X is not a random variable)
E(Y|X = 0) = β0
E(Y|X = x1) = β0 + β1 x1 = µ(x1)
Var_X(Y) = Var(ϵ) = σ² (given X)
The probability density function is
f(y) = (1/√(2πσ²)) exp{−(y − µ(X))² / (2σ²)}
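As an illustration, a small R sketch (with assumed values β0 = 2, β1 = 1, σ = 1 and a fixed x) checking that draws from this model have mean β0 + β1·x and variance σ²:

set.seed(1)
beta0 <- 2; beta1 <- 1; sigma <- 1               # assumed parameter values
x <- 5                                           # a fixed value of X
y <- beta0 + beta1 * x + rnorm(10000, 0, sigma)  # draws of Y given X = x
mean(y)                                          # close to beta0 + beta1 * x = 7
var(y)                                           # close to sigma^2 = 1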
Graphical Interpretation
Graphical representation of linear model
[Figure: the regression line y = β0 + β1 x, with the conditional means µ_{Y|x1} = β0 + β1 x1 and µ_{Y|x2} = β0 + β1 x2 marked at two values x1 and x2 of X]
x1, x2 are two values for X.
For example, if x = height and y = weight, then µ_{Y|x=60} is the average weight of subjects with height 60.
Probabilities, an example
P_{X=20}(Y > 22) = 1 − Φ((22 − 17.5)/3) = 1 − Φ(1.5) = 0.067
with Φ the standard normal distribution function
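In R this probability can be verified with pnorm (a sketch; 17.5 and 3 are the conditional mean and standard deviation from the example):

1 - pnorm((22 - 17.5) / 3)           # 1 - Phi(1.5), approx. 0.067
1 - pnorm(22, mean = 17.5, sd = 3)   # same result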
Data Galton
[Scatter plot of galton.sons$Height versus galton.sons$Father]
OLS estimator
Let β = (β0, β1)′ be a column vector of length 2 and x̃i be a row vector of length 2, namely (1, xi).
Let y be the vector of the n observations and X̃ be the n × 2 matrix containing all row vectors x̃i.
Suppose we have an estimate of β, say β̇. Then we can write yi = x̃i β̇ + ei.
To find the 'best' estimate of β, we minimize the sum of squared errors
Σ_{i=1}^{n} (yi − x̃i β)²
OLS estimator
Estimators
β̂ = (X̃′X̃)⁻¹ X̃′ y
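A minimal R sketch of this closed form, using simulated illustrative data (not the Galton data) and comparing with lm:

set.seed(1)
x <- rnorm(100, 68, 3)                      # illustrative predictor
y <- 38 + 0.45 * x + rnorm(100, 0, 2.4)     # illustrative response
X.tilde <- cbind(1, x)                      # n x 2 matrix with rows (1, x_i)
beta.hat <- solve(t(X.tilde) %*% X.tilde, t(X.tilde) %*% y)  # (X'X)^{-1} X'y
beta.hat
coef(lm(y ~ x))                             # the same estimates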
Predicted values and residuals
Predicted, or fitted, values are the values of y predicted by the least-squares regression line, obtained by plugging x1, x2, …, xn into the estimated regression line.
Suppose we observe y1, …, yn for x1, …, xn. The predicted values ŷ1, …, ŷn are
ŷi = β̂0 + β̂1 xi
The residuals e1, …, en are the deviations of the observed from the predicted values:
ei = yi − ŷi
[Figure: observed values yi, fitted values ŷi and residuals ei = yi − ŷi shown at x1, x2, x3]
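A small R sketch (with illustrative simulated data) showing that fitted() and resid() return exactly these quantities:

set.seed(1)
x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)   # illustrative data
fit <- lm(y ~ x)
y.hat <- fitted(fit)                         # predicted values
e <- resid(fit)                              # residuals
max(abs(e - (y - y.hat)))                    # essentially zero: ei = yi - yi-hat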
β̂0 = 38.25
β̂1 = 0.45
σ̂ = 5.876
[Scatter plot of galton.sons$Height versus galton.sons$Father]
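A sketch of the corresponding R call for the simple model (assuming galton.sons is constructed as in the R code later in this lecture):

fit <- lm(Height ~ Father, data = galton.sons)
coef(fit)            # beta0.hat and beta1.hat, approx. 38.25 and 0.45
summary(fit)$sigma   # sigma.hat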
Interpretation of β̂1
Statistical Inference
How close is β̂ to β?
An estimate of the covariance matrix of β̂ is Ĉov(β̂) = σ̂²(X̃′X̃)⁻¹.
The variances of β̂0 and β̂1 are on the diagonal, while the off-diagonal elements contain the covariance of β̂0 and β̂1.
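In R this estimated covariance matrix is returned by vcov; continuing the fit from the sketch above:

vcov(fit)              # sigma.hat^2 * (X'X)^{-1}; variances on the diagonal
sqrt(diag(vcov(fit)))  # standard errors of beta0.hat and beta1.hat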
Statistical inference
Q3: Is the relationship between Y and X statistically significant?
H0: β1 = 0 vs HA: β1 ≠ 0
We can use a Wald test statistic T:
T = β̂1 / s.e.(β̂1)
Under H0, T follows a Student t distribution with n − 2 degrees of freedom.
We can also obtain a 95% confidence interval: β̂1 ± t_{n−2, 0.975} · s.e.(β̂1).
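A sketch of how these quantities can be obtained in R for the fit above:

summary(fit)$coefficients   # estimates, standard errors, t values and p values
confint(fit, level = 0.95)  # 95% confidence intervals for beta0 and beta1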
Assumptions of model
Assumptions
linear relationship
no outliers
normal distribution of residuals
constant variance of residuals
independence of residuals
Check by plotting:
y versus x (no outliers?)
qq plot of the residuals
residuals versus fitted values
You can do this easily in R.
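For example, a sketch of these checks for the Galton sons fit from above:

plot(galton.sons$Father, galton.sons$Height)   # linear relationship? outliers?
qqnorm(resid(fit)); qqline(resid(fit))         # normality of the residuals
plot(fitted(fit), resid(fit))                  # constant variance, no pattern?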
The explained variance is
EV(Y) = β̂1² Var(X) / Var(Y)
For the Galton data, EV(Y) = 0.15.
How to explain more of the variance of Y? How to obtain a better prediction ŷ?
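A sketch in R of this explained variance for the Galton sons fit from above; it coincides with R² of the simple regression:

b1 <- coef(fit)["Father"]
b1^2 * var(galton.sons$Father) / var(galton.sons$Height)   # approx. 0.15
summary(fit)$r.squared                                      # the same value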
Multiple linear regression
Y = β0 + β1 X1 + β2 X2 + ϵ,
with ϵ ∼ N(0, σ²).
Here, Y and ϵ are random variables, X1 and X2 are
typically fixed numbers.
OLS estimator
Ĉov(β̂) = σ̂²(X̃′X̃)⁻¹

                   single        multiple
father (β̂1 (se))   0.45 (0.05)   0.41 (0.05)
mother (β̂2 (se))   -             0.33 (0.05)
R-code
galton<-read.table("galton.txt", header=T)
galton.sons<-galton[(galton$Gender=="M"),]
summary(lm(Height~Father+Mother, data=galton.sons))
Call:
lm(formula = Height ~ Father + Mother, data = galton.sons)
Residuals:
Min 1Q Median 3Q Max
-9.5305 -1.5683 0.2141 1.5183 9.0968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.39988 4.13310 4.694 3.54e-06 ***
Father 0.41175 0.04668 8.820 < 2e-16 ***
Mother 0.33355 0.04600 7.252 1.74e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model fit
plot(lm(Height~Father+Mother, data=galton.sons))
[Diagnostic plots: Residuals vs Fitted and Normal Q-Q for lm(Height ~ Father + Mother); the most extreme observations (289, 126, 673, 479) are labelled]
Another example
Why simulation
Simulation code
set.seed(2)
output <- numeric()
for (i in 1:1000) {                     # 1000 simulated samples
  y <- rnorm(100, 2, 1)                 # sample of size 100 from N(2, 1)
  output <- rbind(output, cbind(mean(y), var(y), sqrt(var(y))/10))  # mean, variance, estimated s.e. of the mean
}
mean((output[,1] - 2)^2)                # MSE of the sample mean
sqrt(var(output[,1]))                   # empirical s.d. of the sample mean over replicates
mean(output[,3])                        # average estimated s.e. of the mean
Simulation of a model
set.seed(2)
x <- c(rep(0, 50), rep(1, 50))        # covariate: 50 zeros and 50 ones
beta0 <- 2
beta1 <- 1
output <- numeric()
for (i in 1:100) {                    # 100 replicates
  e <- rnorm(100, 0, 1)               # errors
  y <- beta0 + beta1 * x + e          # generate y from the model (x, not X)
  .....                               # analyse the simulated data
  output <- rbind(output, ....)       # store the results of interest
}
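One possible (hypothetical) way to complete the elided lines is to fit the model in every replicate and store the estimated coefficients, for example:

set.seed(2)
x <- c(rep(0, 50), rep(1, 50))
beta0 <- 2
beta1 <- 1
output <- numeric()
for (i in 1:100) {
  e <- rnorm(100, 0, 1)                       # new errors in each replicate
  y <- beta0 + beta1 * x + e                  # data generated from the model
  output <- rbind(output, coef(lm(y ~ x)))    # store beta0.hat and beta1.hat
}
colMeans(output)        # close to (beta0, beta1) = (2, 1)
apply(output, 2, sd)    # empirical standard errors of the estimates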
Final remarks