

Biostatistics, Lecture 1, Linear regression

Jeanine Houwing-Duistermaat

2022-23

This lecture is about

Relationship between two or more variables


A linear model (probabilistic framework)
Method for parameter estimation using data
Method for testing for statistical significance of
parameters
Analysis in R
We will see that linear regression is used for various
questions, which require slightly different models,
interpretations of the models, and samples.

Linear regression, one X variable


Suppose Y and X are two random variables, e.g. consider
a pair of a son and a father and Y and X represent the
heights of the son and his father, respectively
We know that fathers and sons share half of their
genes, so Y and X are probably correlated.
What is the correlation ρ between Y and X? (Q1).
Let (Yi , Xi ) be n random copies of (Y, X) with observed
values (yi , xi ).
Note I use capital letters for random variables and small
letters for observations.
Assume Y and X follow a bivariate normal distribution.
To answer Q1, we can calculate Pearson’s correlation
coefficient ρ.
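As a sketch (assuming the galton.sons data frame used later in these slides), Q1 can be answered in R with cor(), and cor.test() additionally tests H0: ρ = 0:

cor(galton.sons$Father, galton.sons$Height)        # Pearson's correlation coefficient
cor.test(galton.sons$Father, galton.sons$Height)   # test of H0: rho = 0, with a 95% CI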

Scatter plot Galton data


[Scatter plot: galton.sons$Father (x-axis) versus galton.sons$Height (y-axis), both axes ranging from 60 to 80.]

cor(galton.sons$Father,galton.sons$Height)
[1] 0.3913174
Q2: Given a value for the height of the father (X) what
would be a prediction for the height of the son (Y )?

Linear regression model

To obtain a prediction for Y, the following model might be proposed:

Y = β0 + β1 X + ϵ,

with ϵ ∼ N(0, σ²).
Here, Y and ϵ are random variables, X is a fixed number, and β0, β1 and σ² are the parameters of the model.
Y is also called the response, outcome or dependent variable.
X is called the independent variable (there are more names, which we will introduce later).

Mathematics

Model:
Y = β0 + β1 X + ϵ,
E_X(Y) = E(Y | X) = β0 + β1 X = µ(X) is a function of X
(sloppy notation, since X is not a random variable)
E(Y | X = 0) = β0
E(Y | X = x1) = β0 + β1 x1 = µ(x1)
Var_X(Y) = Var(ϵ) = σ² (given X)
The probability density function is
f(y) = 1/√(2πσ²) · exp{−(y − µ(X))²/(2σ²)}

Graphical Interpretation
Graphical representation of linear model

[Figure: the regression line y = β0 + β1 x, with the conditional means µ_{Y|x1} = β0 + β1 x1 and µ_{Y|x2} = β0 + β1 x2 marked at two values x1, x2 of X.]
For example, if x = height and y = weight, then µ_{Y|x=60} is the mean weight of all individuals 60 inches tall in the population.



Probabilities, an example

Y = 7.5 + 0.5X + ϵ and σ = 3


If X = 20 what is the expectation of Y ?

µ(20) = 7.5 + 0.5 · 20 = 17.5


If X = 20 what is the probability that Y is larger than 22?

P_{X=20}(Y > 22) = 1 − Φ((22 − 17.5)/3) = 1 − Φ(1.5) = 0.067

with Φ the cumulative distribution function of the standard normal distribution.
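This probability can be checked in R with pnorm (a minimal sketch of the calculation above):

# P(Y > 22 | X = 20) for Y ~ N(7.5 + 0.5*20, 3^2)
1 - pnorm(22, mean = 7.5 + 0.5*20, sd = 3)   # = 1 - pnorm(1.5), about 0.0668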

Data Galton

[Scatter plot of galton.sons$Father versus galton.sons$Height, as above.]

How to estimate the parameters from a sample of observations (yi, xi)?

OLS estimator
Let β = (β0 , β1 ) be a column vector of length 2 and x̃i be
a row vector of length 2 namely (1, xi )
Let y be the vector of the n observations and X̃ be the
matrix containing all row vectors x̃i , i.e n × 2
Assume we have an estimate of β namely β̇. Thus we
have yi = x̃i β̇ + ei
To find the ’best’ estimate of the parameter β, we minimize the sum of squared errors

∑_{i=1}^{n} (yi − x̃i β)²

This gives the estimator β̂ = (X̃′X̃)⁻¹X̃′Y
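A sketch of this formula in R, assuming numeric vectors x and y of observations (hypothetical names); the result agrees with lm():

X_tilde <- cbind(1, x)                                            # n x 2 matrix with rows (1, x_i)
beta_hat <- solve(t(X_tilde) %*% X_tilde) %*% t(X_tilde) %*% y    # (X'X)^(-1) X'y
beta_hat
coef(lm(y ~ x))                                                   # same estimates via lm()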



OLS estimator

OLS estimator is unbiased:

E[β̂] = E[(X̃′X̃)⁻¹X̃′Y] = E[(X̃′X̃)⁻¹X̃′(X̃β + ϵ)] = β


An estimator for σ² is s² = (1/(n − 2)) ∑_{i=1}^{n} (Yi − X̃i β̂)²

This estimator is also unbiased.


You can show that the OLS estimator for β is equivalent
to the maximum likelihood (ML) estimator.
Note that s² is not the ML estimator (the ML estimator divides by n instead of n − 2)
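A sketch of s² in R, assuming the same hypothetical vectors x and y as above:

fit <- lm(y ~ x)
n <- length(y)
sum(residuals(fit)^2) / (n - 2)   # s^2, dividing by n - 2
sigma(fit)^2                      # lm's residual standard error squared; the same value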

Estimators

We use (X̃′X̃)⁻¹(X̃′Y) as an estimator for the population parameters β.
An estimate of the parameter vector β is often denoted with β̂.
We use s² as an estimator for the population conditional variance σ² of Y given X.
An estimate of σ² is often denoted with σ̂².




Predicted values and residuals
Suppose we observe y1 · · · yn for x1 · · · xn.
Predicted, or fitted, values are the values of y predicted by the least-squares regression line, obtained by plugging x1, x2, …, xn into the estimated regression line.
The predicted values ŷ1 · · · ŷn are

ŷi = β̂0 + β̂1 xi

The residuals e1 · · · en are the deviations of observed from predicted values:

ei = yi − ŷi

[Figure: observed values yi, fitted values ŷi and residuals e1, e2, e3 at x1, x2, x3.]
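In R, fitted values and residuals can be extracted directly (a sketch, assuming a fitted model on the hypothetical vectors x and y):

fit <- lm(y ~ x)
y_hat <- fitted(fit)       # predicted values beta0^ + beta1^ * x_i
e <- residuals(fit)        # residuals y_i - y_hat_i
head(cbind(y_hat, e))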

Results Galton dataset


β̂0 = 38.25
β̂1 = 0.45
σ̂² = 5.876
[Scatter plot of galton.sons$Father versus galton.sons$Height.]
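A sketch of how these estimates can be obtained in R (assuming the galton.sons data frame from the scatter plot; the numbers above are those reported on the slide):

fit1 <- lm(Height ~ Father, data = galton.sons)
coef(fit1)       # estimates of beta0 and beta1
sigma(fit1)^2    # estimate of sigma^2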

How to interpret β̂1 ?



Interpretation of β̂1

Suppose the heights of two fathers differ by one unit; then the predicted difference in the heights of their sons is β̂1.
Of course we do not expect this to be the observed difference.
Note that this is not causation. Causation means that if we change the height of a father, then the height of the son will change by β̂1 units.
In observational studies, causation is difficult to prove.
Why? We will discuss this again.

Statistical Inference

How close is β̂ to β?

The covariance matrix of β̂ is σ²(X̃′X̃)⁻¹.
Note that the variances of β̂0 and β̂1 are on the diagonal, while the off-diagonal element is the covariance of β̂0 and β̂1.
An estimate is Ĉov(β̂) = σ̂²(X̃′X̃)⁻¹.
Note that a larger σ² means larger standard errors for β̂ (more noise makes it harder to obtain an accurate estimate).
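A sketch in R of the estimated covariance matrix and the standard errors on its diagonal (again assuming the hypothetical fit object):

fit <- lm(y ~ x)
vcov(fit)               # sigma^2-hat * (X'X)^(-1)
sqrt(diag(vcov(fit)))   # standard errors of beta0^ and beta1^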

Statistical inference
Q3: Is the relationship between Y and X statistically significant?
H0 : β1 = 0 vs HA : β1 ≠ 0
We can use the Wald test statistic T:

T = β̂1 / s.e.(β̂1)

Under H0, T follows a Student t distribution with n − 2 degrees of freedom.
We can also obtain a 95% confidence interval:

β̂1 ± t_{n−2} · s.e.(β̂1)

with t_{n−2} the critical value of the t distribution with n − 2 degrees of freedom.
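In R the Wald test and the confidence interval are part of the standard output (a sketch, assuming the fitted model fit from the earlier sketches):

summary(fit)$coefficients    # estimates, standard errors, t values and p-values
confint(fit, level = 0.95)   # 95% confidence intervals for beta0 and beta1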



Assumptions of model

Assumptions
linear relationship
no outliers
normal distribution of residuals
constant variance of residuals
independence of residuals
Check by plotting:
y versus x (no outliers? linear relationship?)
qq plot of the residuals
residuals versus fitted values
You can do this easily in R; a sketch follows below.
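A sketch of these checks, assuming observations x, y and the fitted model fit:

fit <- lm(y ~ x)
plot(x, y)                                         # linear relationship? outliers?
qqnorm(residuals(fit)); qqline(residuals(fit))     # normality of the residuals
plot(fitted(fit), residuals(fit)); abline(h = 0)   # constant variance? patterns?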

Non random sample

We do not have to take a random sample with respect to X.
Of course, if X is selected, Y is selected as well.
The requirement is that Y given X, i.e. ϵ, is random and follows a normal distribution.
It might be efficient to oversample certain values for X.
Why?

Assume both random again, random sample

Q4: How much of the variation of Y is explained by the


random variable X?

Var(Y) = β̂1² Var(X) + σ̂²

EV(Y) = β̂1² Var(X) / Var(Y)
For Galton data, EV (Y ) = 0.15
How to explain more of the variance of Y ? How to obtain
a better prediction ŷ?
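A sketch of this calculation for the Galton data, assuming the single-variable fit fit1 <- lm(Height ~ Father, data = galton.sons) from before; the result equals the squared correlation:

b1 <- coef(fit1)["Father"]
b1^2 * var(galton.sons$Father) / var(galton.sons$Height)    # about 0.15
cor(galton.sons$Father, galton.sons$Height)^2                # the same quantity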

More than one X

If we also know the height of the mother, we can probably predict the height of the son better.
We will also explain more variation of Y by adding
relevant X variables.
To obtain a prediction for Y the following model for the
population can be proposed

Y = β0 + β1 X1 + β2 X2 + ϵ,

with ϵ ∼ N (0, σ 2 ).
Here, Y and ϵ are random variables, X1 and X2 are
typically fixed numbers.

OLS estimator

Let β = (β0 , β1 , β2 ) be a column vector of length 3 and


X̃ be a matrix with observed values (1′ s, X1 , X2 )
β̂ = (X̃′X̃)⁻¹X̃′y
An estimator for σ² is s² = (1/(n − 3)) ∑_{i=1}^{n} (Yi − X̃i β̂)²

Standard errors for β̂ can be obtained from the diagonal


of the covariance matrix of β̂

Ĉov(β̂) = σ̂²(X̃′X̃)⁻¹

by taking the square root.
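A sketch of the matrix formula with two covariates in R, assuming the galton.sons data; it reproduces what lm() computes on the next slides:

X_tilde <- model.matrix(~ Father + Mother, data = galton.sons)   # columns (1, X1, X2)
y <- galton.sons$Height
solve(t(X_tilde) %*% X_tilde) %*% t(X_tilde) %*% y               # beta^ = (X'X)^(-1) X'y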



Results linear regression model

single multiple
father (β̂1 (se)) 0.45 ( 0.05) 0.41 (0.05)
mother (β̂2 (se)) - 0.33 (0.05)

Note β1 is smaller in the multiple regression model than


in the single variable model. Why?
This is called confounding.

R-code
galton<-read.table("galton.txt", header=T)
galton.sons<-galton[(galton$Gender=="M"),]
summary(lm(Height~Father+Mother, data=galton.sons))

Call:
lm(formula = Height ~ Father + Mother, data = galton.sons)

Residuals:
Min 1Q Median 3Q Max
-9.5305 -1.5683 0.2141 1.5183 9.0968

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.39988 4.13310 4.694 3.54e-06 ***
Father 0.41175 0.04668 8.820 < 2e-16 ***
Mother 0.33355 0.04600 7.252 1.74e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.3 on 462 degrees of freedom


Multiple R-squared: 0.2397, Adjusted R-squared: 0.2364
F-statistic: 72.82 on 2 and 462 DF, p-value: < 2.2e-16

model fit

plot(lm(Height~Father+Mother, data=galton.sons))

[Diagnostic plots for lm(Height ~ Father + Mother): Residuals vs Fitted values and Normal Q-Q plot of the standardized residuals; the most extreme observations (126, 289, 479, 673) are labelled.]

Another example

Insulin levels and sugar intake are correlated.
Question: for patients with a high insulin level, does a reduction in sugar intake result in a lower insulin level?
Observational study:
Suppose we have data on patients with a high insulin level and their sugar intake, and we propose a linear regression model to be fitted to these data, with insulin level as dependent variable and sugar intake as independent variable.
Any concerns about the sample used?
Any confounder?
Interpretation of the β coefficient?
Suggestion for other study design?

Arguments for multiple linear regression

We may be able to better predict the outcome, i.e. ŷi


closer to yi
We need to adjust for confounding, i.e. a variable
influencing both the dependent and the independent
variable.
We may want to estimate the effect of the independent
variable more accurately (here: smaller s.e.) by reducing
the variance of the residuals
We are interested in all variables explaining Y

Names for independent variable

In epidemiology when the interest is in the relationship


between a specific variable X and the dependent variable
Y , the variable X is often called exposure.
X is called an explanatory variable or covariate if its role is to reduce noise or to explain the variance of Y.
X is called a confounder when including X in the model changes the β coefficient of the exposure variable.
X is called a predictor variable when we are interested in predicting an outcome.

Why simulation

Models make assumptions, and we may want to evaluate what happens if there are small deviations from these assumptions.
We might want to compare various estimators, e.g. OLS
and ML.
We might want to understand the model better.
We may want to check our code.

Simulation of univariate normally distributed variables
Assume a distribution, e.g. normal with mean 2 and
variance 1.

We generate a set of values according to the distribution

We calculate estimates, e.g. µ̂ using the data from the


sample.

When we sample a second set, do we get the same µ̂?

We repeat the sampling a number of times N

This gives us a distribution of the estimator for our


parameter. We may compare this with the theoretical
distribution.

Simulation code

set.seed(2)                   # fix the random number generator
output<-numeric()
for (i in 1:1000) {           # 1000 replicates
  y<-rnorm(100,2,1)           # sample of size 100 from N(2,1)
  # store mean, variance and estimated s.e. of the mean (sd/sqrt(100))
  output<-rbind(output, cbind(mean(y), var(y), sqrt(var(y))/10))
}

The function set.seed sets your random number


generation to ensure that you get the same results if you
repeat your simulation
The number of replicates is 1000
The size of the dataset is 100
Instead of a for loop you can use the sapply function in R (a sketch follows below)
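A sketch of the same simulation with sapply instead of a for loop (the helper name sim_once is ours):

set.seed(2)
sim_once <- function(i) {
  y <- rnorm(100, 2, 1)
  c(mean(y), var(y), sqrt(var(y))/10)
}
output <- t(sapply(1:1000, sim_once))   # 1000 x 3 matrix of replicates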

Summarizing your simulation

mean((output[,1]-2)^2)
sqrt(var(output[,1]))
mean(output[,3])

The first number is the mean squared error of the estimator for the mean (its bias would be mean(output[,1]) - 2).

The second number is the empirical standard error of the estimator for the mean, using the empirical distribution of the estimator obtained by simulation.
Note that sqrt(var(y))/10 is the estimated (model-based) standard error of the mean within each sample, i.e. sd/sqrt(n) with n = 100. Therefore the third number is the mean of this standard error of the estimator for the mean over the 1000 replicates.
The third and second number should be about the same
for large datasets and a large number of replicates
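Equivalently, a sketch with named quantities (assuming the output matrix from the simulation code above):

bias <- mean(output[,1]) - 2    # average estimate minus the true mean
emp_se <- sd(output[,1])        # empirical standard error over the replicates
mod_se <- mean(output[,3])      # mean of the per-sample standard errors
c(bias = bias, empirical_se = emp_se, model_se = mod_se)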

Simulation of a model

set.seed(2)
x<-c(rep(0,50),rep(1,50))   # fixed design: 50 zeros and 50 ones
beta0<-2
beta1<-1
output<-numeric()
for (i in 1:100) {
  e<-rnorm(100,0,1)
  y<-beta0+beta1*x+e        # generate y from the model
  .....
  output<-rbind(output,....)
}

Note X is fixed here.
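As an illustration only (the elided lines are left as an exercise; this is one possible, hypothetical completion), the loop could fit the model and store the estimates:

for (i in 1:100) {
  e <- rnorm(100, 0, 1)
  y <- beta0 + beta1*x + e
  fit <- lm(y ~ x)                      # one possible completion: fit the model
  output <- rbind(output, coef(fit))    # and store the estimated coefficients
}
colMeans(output)                        # should be close to (2, 1)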



Final remarks

Linear regression can be used to answer questions about


two or more variables.
Population, question, model; sample from the population; estimation of parameters using data; inference based on data.
Exercises: pen and paper, dataset in R, simulation study
Next lecture: Epidemiology
