9_2_MultipleRegression: interpretation
Lecture 17
Empirical Methods 2 & Theory of Science
02-03-2020 2
Last time
Regression
[Cartoon: VIAS Science Cartoons - Regression Analysis]
Today
2. Linear regression
• What’s “least squares”?
• Coefficient of determination (R2)
3. Multiple regression
• Different types of predictors
• Checking residuals
• Interpreting coefficients
Stepping back
Remember probability?
• What are the chances of rolling a 6 with a six-sided die?
• Is it my birthday today? What is the likelihood?
You want to study the rate of shedding in a home
[Image: Fun Facts About Tuxedo Cats ...]
Outcomes of a hypothesis test (reject the null? vs. null hypothesis true/false):
• Do not reject, null hypothesis true: True Negative
• Do not reject, null hypothesis false: False Negative – a TYPE II ERROR (its probability is reduced with a larger sample size)
• Reject, null hypothesis true: False Positive – a TYPE I ERROR (typically restricted to 5%)
• Reject, null hypothesis false: True Positive (the power of the study)
You KNOW there is a difference between these two
[Image: Fun Facts About Tuxedo Cats ...]
P(making an error) = α
P(not making an error) = 1 − α
P(not making an error in m tests) = (1 − α)^m
P(making at least one error in m tests) = 1 − (1 − α)^m
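The last formula is easy to check numerically; a minimal sketch, assuming the m tests are independent:

```python
# Family-wise error rate: the probability of making at least one
# Type I error in m independent tests, each run at level alpha.
def fwer(alpha, m):
    return 1 - (1 - alpha) ** m

print(fwer(0.05, 1))    # a single test: just alpha
print(fwer(0.05, 20))   # ~0.64: at least one error is more likely than not
```

Note how quickly the rate grows with m: at 20 tests an α of 0.05 already gives better-than-even odds of a false positive.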
Counting errors
Genome-wide association:
• For gene expression studies with DNA chips: 500,000 SNPs.
• At significance level 0.01 we can expect up to 5,000 false associations.
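The arithmetic behind the 5,000 figure, plus the Bonferroni-corrected per-test level that would cap the family-wise error rate (Bonferroni is a standard correction, not named on the slide), as a quick sketch:

```python
# Expected number of false associations when all null hypotheses are true,
# for the 500,000-SNP example at significance level 0.01.
m, alpha = 500_000, 0.01
expected_false = m * alpha      # expected false positives across all tests
bonferroni = alpha / m          # Bonferroni-corrected per-test significance level
print(expected_false, bonferroni)
```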
Remember
• It is not fair to hunt around through the data for a big contrast and then
pretend that you’ve only done one comparison. This is called data snooping.
The significance of such findings is often overstated.
Linear regression
SPECIFIED DIRECTION
OF EFFECT!
• Residual – the difference between the actual value of Y and the predicted
value Ŷ: e = Y − Ŷ. This is your error.
Regression equation: y = ax + b
Equation for heart weight: Hw = 4.31*Bw − 1.18
... essentially: how far are the data points from our fitted line (and thus from
the predicted values)?
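Fitting the line and computing the residuals takes a few lines of numpy; the body-weight/heart-weight numbers below are made up for illustration (not the data behind the slide's equation):

```python
import numpy as np

# Hypothetical body weight (kg) and heart weight (g), for illustration only.
bw = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
hw = np.array([7.2, 9.6, 11.9, 14.0, 16.3])

a, b = np.polyfit(bw, hw, 1)     # least-squares slope and intercept of y = a*x + b
predicted = a * bw + b
residuals = hw - predicted       # e = Y - Y_hat, the errors
print(round(a, 2), round(b, 2))
```

With an intercept in the model, the residuals always average to zero; it is their spread that measures how far the points sit from the fitted line.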
[Plot: fitted line y = mx + b with residuals e1, e2 marked]
SE = Σ (y − (mx + b))²
• Total variation in y is the sum of squared distances of each y from the mean of all y’s – your squared error in y:
SE_mean = Σ (y − ȳ)²
• The variation left around the fitted line is the squared error of the residuals:
SE_line = Σ (y − (mx + b))²
[Plot: fitted line y = mx + b with the mean of Y and residuals e1, e2 marked]
Coefficient of determination – R²
• The proportion of variability in a data set
that is accounted for by the statistical
model:
R² = 1 − SE_line / SE_mean
where SE_mean = Σ (y − ȳ)² is the total sum of squares.
[Plot: fitted line with R² = 0.62]
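R² can be computed directly from the two sums of squares; a minimal numpy sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # hypothetical data

m, b = np.polyfit(x, y, 1)
se_line = np.sum((y - (m * x + b)) ** 2)     # squared error around the fitted line
se_mean = np.sum((y - y.mean()) ** 2)        # total sum of squares
r_squared = 1 - se_line / se_mean
print(round(r_squared, 4))
```

An R² near 1 means the line accounts for almost all of the variation around the mean; an R² near 0 means it adds almost nothing over simply predicting ȳ.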
Height → Speed
Length → Speed
Rollercoaster example
Heteroskedasticity: where the variance of the residuals is unequal over the range of measured values.
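A quick way to spot heteroskedasticity without a plot is to compare the residual spread across the range of fitted values; a sketch on simulated data where the noise deliberately grows with x:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Simulated data whose noise grows with x -> heteroskedastic residuals.
y = 2 * x + rng.normal(0, 0.5 * x)

m, b = np.polyfit(x, y, 1)
resid = y - (m * x + b)

# Compare residual spread in the lower vs upper half of the x range.
low, high = resid[x < 5.5], resid[x >= 5.5]
print(low.std(), high.std())   # the upper half is noticeably more spread out
```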
Rollercoaster example
• But we can standardize them to compare them using the ratio of the
standard deviations
• standardized m = β = m · (σx / σy) = m · SDx / SDy
• Alternatively, you can standardize your data BEFORE running the regression – convert
the data to z-scores
• This gives the rate of change in y in SDs in relation to the rate of change
(in SDs) of x.
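Both routes give the same standardized slope; a small numpy check with made-up data:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # hypothetical predictor
y = np.array([55.0, 61.0, 70.0, 74.0, 82.0])   # hypothetical outcome

m, b = np.polyfit(x, y, 1)
beta = m * x.std() / y.std()                   # standardize the raw slope

# Same thing via z-scoring the data before the regression:
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
beta_z, _ = np.polyfit(zx, zy, 1)
print(round(beta, 4), round(beta_z, 4))        # the two agree
```

For simple regression the standardized slope also equals the correlation coefficient r, which is why it is scale-free.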
[Annotated regression output:]
• t-statistic → significance test for each coefficient
• Coefficient of determination (R²) → model fit
• t-test to check how unlikely the observed coefficient would be under the null
hypothesis:
tN−p = (b_obs − b_exp(H0)) / SE(b) = b_obs / SE(b)   (since b_exp(H0) = 0)
• The t-statistic is a standardized test – not sensitive to different scales of the IVs
• Another way to compare predictors, to see which has the strongest linear association
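For simple regression the t-statistic can be computed by hand from the residuals; a numpy sketch with made-up data, using the usual standard-error formula for a single predictor:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.6, 2.9, 4.4, 4.9, 6.3])   # hypothetical data

n, p = len(x), 2                 # p = estimated parameters (slope + intercept)
m, c = np.polyfit(x, y, 1)
resid = y - (m * x + c)

# Standard error of the slope in simple regression:
s2 = np.sum(resid ** 2) / (n - p)               # residual variance
se_m = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

t = (m - 0) / se_m                              # t with n - p degrees of freedom
print(round(t, 2))
```

A large |t| means the observed slope sits many standard errors away from the null value of zero.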
Proportion of Predictable & Unpredictable Variation
• R² = the predictable (explained) proportion of the variation in Y; the remainder (1 − R²) is the unpredictable variation.
[Venn diagram: overlap of the predictors with Y]
Various Significance Tests
• Testing R²
• Test R² through an F test
• Test of competing models through an F test of the difference between the R²s
• Testing b
• Test of each partial regression coefficient (b) by t-tests
• Comparison of partial regression coefficients with each other – a t-test of the difference
between standardized partial regression coefficients (β)
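The F test of competing nested models, built from the difference between their R²s, can be sketched on simulated data (all variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)   # simulated outcome

def r2(X, y):
    """R-squared from an ordinary least-squares fit (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_red = r2(np.column_stack([ones, x1]), y)        # reduced model: x1 only
r2_full = r2(np.column_stack([ones, x1, x2]), y)   # full model: x1 and x2
k_red, k_full = 1, 2                               # numbers of predictors

# F statistic for the improvement of the full model over the reduced one:
F = ((r2_full - r2_red) / (k_full - k_red)) / ((1 - r2_full) / (n - k_full - 1))
print(round(F, 2))
```

A large F says the extra predictor explains more variation than chance alone would.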
Variance explained
• When the independent variables are highly correlated (say, |r | > .7), it is
not possible to determine the separate effect of any particular
independent variable on the dependent variable.
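With two predictors, the variance inflation factor 1/(1 − r²) (a standard collinearity diagnostic, not named on the slide) shows how quickly |r| > .7 inflates the uncertainty of the separate coefficients; a small simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # x2 strongly correlated with x1

r = np.corrcoef(x1, x2)[0, 1]
vif = 1 / (1 - r ** 2)    # variance inflation factor in the two-predictor case
print(round(r, 2), round(vif, 1))
```

At r ≈ .95 the VIF is around 10: the variance of each coefficient estimate is roughly ten times what it would be with uncorrelated predictors, which is why their separate effects cannot be pinned down.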
• Variable selection
• Choose wisely from your list of X variables
• Too many: waste the information in the data
• Too few: risk ignoring useful predictive information
• Model misspecification
• Could the model you are testing be wrong or incomplete?
• Nonlinearity
• Interactions
• Hypothesis tests and CIs are meaningful only when the model fits the data
well.
• Remember, when the sample size is large enough, you will probably reject
any null hypothesis of β=0.
• When the sample size is small, you may not have enough evidence to
reject a null hypothesis of β=0.
• When you fail to reject a null hypothesis, don’t be too hasty to say that a
predictor has no linear association with the outcome. It is likely that there
is some association; it just isn’t a very strong one.
Checking assumptions
• Plot the residuals versus the predicted values from the regression line.
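Numerically, a least-squares fit with an intercept has residuals that average to zero and are uncorrelated with the predicted values, so any visible trend or funnel shape in the residuals-vs-predicted plot signals a model problem; a quick check with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9, 12.2, 13.8, 16.1])  # hypothetical data

m, b = np.polyfit(x, y, 1)
predicted = m * x + b
residuals = y - predicted

# For a well-specified model, residuals scatter around zero with no linear
# trend against the predicted values (this is what the plot checks visually).
print(residuals.mean())
print(np.corrcoef(predicted, residuals)[0, 1])
```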
Next week
• Assignment 2 released on
SATURDAY!
• We will discuss it on Tuesday