
Multiple regression and interpretation

Lecture 17
Empirical Methods 2 & Theory of Science

Last time

Regression
• Fitting linear models
• Least Squares Regression
• Residuals
• Parameter estimation
• Fitting a linear model

Today

1. The problem of multiple comparisons (Bonferroni correction)

2. Linear regression
• What’s “least squares”?
• Coefficient of determination (R²)

3. Multiple regression
• Different types of predictors
• Checking residuals
• Interpreting coefficients

Stepping back

Remember probability?
• What are the chances of rolling a 6 with a six-sided die?
• Is it my birthday today? What is the likelihood?

Statistical tests are exercises in probability

Imagine you have two groups of cats:
• Group 1 – orange and white cats
• Group 2 – tuxedo black & white cats

You want to study the rate of shedding in a home environment. Does color matter? How might you test this?

2x2 – discuss and come up with hypotheses and appropriate statistical tests for this.

We can hypothesize a difference

What’s your hypothesis for cat hair loss?
• An educated guess about what is true in the world
• An assumption (a testable conjecture) – subject to verification/validation
Two types of hypotheses

The Null Hypothesis: H0
• There is no difference between groups.
• There is no relationship between variables.
H0: μ(class 1) = μ(class 2)

vs.

The Alternative Hypothesis: Ha
• There is a difference between groups.
• There is a relationship between the variables.
Ha: μ(class 1) ≠ μ(class 2) (for a two-tailed test) – what are those tails?
    μ(class 1) < μ(class 2) (for a one-tailed test)
Errors in hypothesis testing

                    NULL HYPOTHESIS TRUE         NULL HYPOTHESIS FALSE
Reject null? NO     True Negative                False Negative
                                                 (Type II error – reduced
                                                 with larger sample size)
Reject null? YES    False Positive               True Positive
                    (Type I error – typically    (the power of the study)
                    restricted to 5%)

Statistical tests are exercises in probability

Imagine you have two groups of cats:
• Group 1 – orange and white cats
• Group 2 – tuxedo black & white cats

You KNOW there is a difference between these two groups of cats! Maybe color doesn’t matter, but what about whisker length? Tail length? Sharpness of claws?

2x2 – discuss and come up with hypotheses and appropriate statistical tests for this.

Probability can get tricky

In 1950 Joseph Rhine experimented with finding people with paranormal powers. In his experiment, 1000 people each tried to guess the sequence of 10 cards – red or black?

Twelve persons guessed 9 of the 10 cards correctly; two of them guessed all 10.

What actually happened?

Probability of guessing all 10 cards = (1/2)^10 ≈ 0.00098
Probability of guessing exactly 9 cards = 10 · (1/2)^10 ≈ 0.0098
Probability of guessing 9 or all 10 cards = 11 · (1/2)^10 ≈ 0.0107

Chance of finding a “psychic” among 100 persons = 1 – (1 – 0.0107)^100 ≈ 0.66
Chance of finding a “psychic” among 1000 persons = 1 – (1 – 0.0107)^1000 ≈ 0.99998
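We can sanity-check these numbers with a few lines of Python (a minimal sketch; scipy’s binomial distribution does the counting for us):

```python
from scipy.stats import binom

# P(exactly k correct out of 10 red/black guesses, chance level p = 1/2)
p_hit = binom.pmf(9, 10, 0.5) + binom.pmf(10, 10, 0.5)   # ≈ 0.0107

# P(at least one "psychic" among n independent guessers)
for n in (100, 1000):
    print(n, 1 - (1 - p_hit) ** n)   # ≈ 0.66 for 100, ≈ 0.99998 for 1000
```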

Avoiding false discovery in statistical analysis (think cats!)

Generally, if we perform m hypothesis tests, what is the probability of at least 1 false positive?

P(making an error) = α
P(not making an error) = 1 – α
P(not making an error in m tests) = (1 – α)^m
P(making at least one error in m tests) = 1 – (1 – α)^m
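A quick illustration of how fast this grows (a minimal sketch; the values of m are arbitrary):

```python
alpha = 0.05

# P(at least one false positive in m independent tests) = 1 - (1 - alpha)^m
for m in (1, 5, 20, 100):
    print(m, round(1 - (1 - alpha) ** m, 3))
# 1 → 0.05, 5 → 0.226, 20 → 0.642, 100 → 0.994
```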

Counting errors

Assume we are testing H1, H2, ..., Hm
m = # of hypotheses tested; R = # of rejected hypotheses

                  Null hypothesis TRUE   Alternative TRUE   Total
Not significant   U                      T                  m – R
Significant       V                      S                  R
Total             m0                     m – m0             m

V = # of Type I errors → the family-wise error rate is P(V ≥ 1)

Avoiding false discovery in statistical analysis (think cats!)

During m independent statistical tests at significance level α, the probability of at least one false discovery should satisfy:

1 – (1 – α)^m < 0.05

α = 1 – (1 – 0.05)^(1/m) ≈ 0.05/m

Carlo Bonferroni (1935) proposed the Bonferroni correction: during m independent statistical tests, only those results are significant for which p < 0.05/m
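In code, the correction is just a stricter per-test threshold (a minimal sketch; the function name and the example p-values are made up for illustration):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Keep only the results with p < alpha / m (Bonferroni)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With m = 5 tests the per-test threshold shrinks from 0.05 to 0.01
print(bonferroni_significant([0.003, 0.02, 0.04, 0.008, 0.3]))
# [True, False, False, True, False]
```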

Problems with multiple comparisons

Genome-wide association:
• Gene expression studies with DNA chips test ~500,000 SNPs.
• At a significance level of 0.01 we can expect up to 5,000 false associations.

Meta-studies: joining and comparing different results obtained by different authors can lead to false associations.

BUT the Bonferroni correction increases the Type II error rate!

Look up the concept of multiple comparisons and their problems in Chapter 10 of the Field book.

Remember

• It is not fair to hunt around through the data for a big contrast and then pretend that you’ve only done one comparison. This is called data snooping. The significance of such a finding is often overstated.

• To account for multiple comparisons you need to make your confidence intervals wider and your significance threshold stricter (a smaller critical p-value).

• Use a correction when doing multiple comparisons of means, or when interested in a small number of planned tests or pairwise comparisons.

• Keep in mind – Bonferroni is a conservative correction (there are many others).

Linear regression

Considers the relation between a single explanatory variable and a response variable.

What’s the difference from correlation?

Regression also considers the relation between a single explanatory variable and a response variable – but with a SPECIFIED DIRECTION OF EFFECT!

Regression & Ordinary Least Squares (OLS)

• Ordinary Least Squares (OLS) – a method of fitting the regression line that best summarizes the linear relationship between two variables

• Residual – the difference between the actual value of Y and the predicted value Ŷ: e = Y – Ŷ – this is your error

• OLS means that the sum of the squared errors Σe² is minimized – the smallest sum of squared errors, or “least squares”
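As a minimal numpy sketch (the body/heart weights are invented numbers, loosely echoing the cat example later), np.polyfit finds the slope and intercept that minimize the sum of squared errors:

```python
import numpy as np

# Invented body weights (kg) and heart weights (g)
bwt = np.array([2.0, 2.4, 2.9, 3.0, 3.4, 3.9])
hwt = np.array([7.4, 9.2, 11.3, 11.8, 13.5, 15.6])

# np.polyfit with deg=1 is OLS: it minimizes the sum of squared residuals
b1, b0 = np.polyfit(bwt, hwt, deg=1)     # slope, intercept
residuals = hwt - (b0 + b1 * bwt)
print(b1, b0, np.sum(residuals ** 2))    # slope, intercept, sum of e²
```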

Simple Regression Model

Regression coefficients are estimated by minimizing Σresiduals² (i.e., the sum of the squared residuals) to derive this model:

ŷ = a·x + b

The standard error of the regression (sY|x) is based on the squared residuals:

sY|x = √( Σresiduals² / (n – 2) )

Simple Regression Model

Regression coefficients are related to the Pearson correlation!

BUT... b1 is not a standardized measure – it depends on the units of measurement of the predictor variable.

The value of b represents the change in the outcome resulting from a unit change in the predictor.

Basic regression assumptions

• The Dependent Variable must be normally distributed (no skewness or outliers)
• Independent Variables
  • do not need to be normally distributed, but if they are it makes for a stronger interpretation
  • have a linear relationship with the DV
• No outliers among the IVs predicting the DV
• Correlation assumptions apply

Core modules in a cat

[Figure slides]

Regression equation: y = ax + b
Equation for heart weight: Hwt = 4.31 · Bwt – 1.18

Bwt = 3.0 kg → Hwt = 11.75 g
Bwt = 2.0 kg → Hwt = 7.44 g

Use of regression (examples)

• The heaviest cat in the trial was 3.9 kg.

• What is the (extrapolated) heart weight for a seriously overweight Maine Coon cat at 11 kg?

• Hwt = 4.31 · 11 – 1.18 = 46.23 g
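The prediction itself is just plugging into the fitted line (a sketch; predict_hwt is a made-up helper name):

```python
def predict_hwt(bwt_kg):
    """Predicted heart weight (g) from the fitted line Hwt = 4.31*Bwt - 1.18."""
    return 4.31 * bwt_kg - 1.18

print(predict_hwt(3.0))    # 11.75 g – inside the observed range
print(predict_hwt(11.0))   # 46.23 g – extrapolating far beyond the heaviest cat (3.9 kg)!
```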

How well does the model fit the data?

Test for “goodness of fit”? What does “goodness of fit” mean?

Estimates of a ‘good fit’ start from inspection of, and calculations on, the residuals.

... essentially: how far are the data points from our fitted line (and thus from the predicted values)?

[Figure: scatter plot with fitted line y = mx + b; residuals e1, e2 marked]

SE = Σ(y – (mx + b))²

How well does the model fit the data?

• How much of the variance in y is described by the fitted line?
• How much variance in y is described by the variation in x?
• The total variation in y is the sum of the distances of each y from the mean of all y’s

[Figure: the same scatter plot, now also marking the mean of Y and the distance y(mean) – y1]

SE = Σ(y – (mx + b))²

How well does the model fit the data?

• How much of the variance in y is described by the fitted line?
• How much variance in y is described by the variation in x?
• The total variation in y is the sum of the distances of each y from the mean of all y’s – your squared error in y

• What percentage is not described by the variation in X?

SEy = Σ(ymean – yn)²

[Figure: scatter plot contrasting SEy (distances from the mean of Y) with SEline (residuals e1, e2 from the fitted line y = mx + b)]

SEline = Σ(y – (mx + b))²

How well does the model fit the data?

• How much of the variance in y is described by the fitted line?
• How much variance in y is described by the variation in x?
• The total variation in y is the sum of the distances of each y from the mean of all y’s – your squared error in y

• What percentage is not described by the variation in X?

(SEline / SEy) · 100

[Figure: fitted regression line with R² = 0.62]

Coefficient of determination – R²
• The proportion of variability in a data set that is accounted for by the statistical model:

R² = SSexplained / SStotal

where
Total sum of squares: SStotal = Σ(yi – ȳ)²
Explained sum of squares: SSexplained = Σ(ŷi – ȳ)²

The explained sum of squares adds up the squared differences between the predicted values and the mean. It measures how much of the total sum of squares is explained by the regression line.
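The definition translates directly into code (a minimal sketch with invented observed and predicted values):

```python
import numpy as np

def r_squared(y, y_hat):
    """R² = 1 - SEline/SEy: the explained share of the total sum of squares."""
    se_line = np.sum((y - y_hat) ** 2)        # residual (unexplained) SS
    se_y = np.sum((y - np.mean(y)) ** 2)      # total SS around the mean
    return 1 - se_line / se_y

y = np.array([7.4, 9.2, 11.3, 11.8, 13.5, 15.6])
y_hat = np.array([7.2, 9.4, 11.0, 12.0, 13.4, 15.8])
print(r_squared(y, y_hat))
```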

Check these videos for detailed explanations of the concepts

For a great explanation of R-squared:
• https://www.youtube.com/watch?v=lng4ZgConCM
For a worked example:
• https://www.youtube.com/watch?v=ww_yT9ckPWw (fitting a regression line)
• https://www.youtube.com/watch?v=Fc5t_5r_7IU (calculating R-squared)

The General Idea

Simple regression considers the relation between a single explanatory variable and a response variable.

If we have to predict rollercoaster speed, what might be a good predictor?

Height → Speed
Length → Speed

The General Idea

Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.

The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders.

The General Idea

Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y:

Height, Length, Duration → Speed

The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders.

Multiple Regression Model

Again, estimates for the multiple slope coefficients are derived by minimizing Σresiduals² to derive this multiple regression model:

ŷ = a + b1·x1 + b2·x2 + ... + bk·xk

Again, the standard error of the regression is based on the Σresiduals²:

s = √( Σresiduals² / (n – k – 1) )
Regression Modeling

• A simple regression model (one independent variable) fits a regression line in 2-dimensional space

• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space

Multiple Regression Model

• The intercept α predicts where the regression plane crosses the Y axis

• The slope for variable X1 (β1) predicts the change in Y per unit X1, holding X2 constant

• The slope for variable X2 (β2) predicts the change in Y per unit X2, holding X1 constant

Multiple linear regression

• Multiple predictors (IVs) for a single outcome (DV)

Y = a + b1X1 + b2X2 + b3X3 + … + bnXn + e

• IV – independent variable; DV – dependent variable
• A multiple regression model with k independent variables fits a regression “surface” in k + 1 dimensional space (which cannot be visualized for k > 2)
• The equation describes how the mean value of y is related to x1, x2, … xn
• The formulas for the regression coefficients b0, b1, b2, ..., bn involve the use of matrix algebra.

Rollercoaster example

• Which variables might be good predictors? How about height and duration?

Speed = a + b1·height + b2·duration + e
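A minimal sketch of fitting such a model with numpy (the rollercoaster numbers are invented; np.linalg.lstsq solves the least-squares problem):

```python
import numpy as np

# Invented rollercoaster data: height (ft), duration (s), speed (km/h)
height = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
duration = np.array([90.0, 120.0, 150.0, 110.0, 180.0])
speed = np.array([45.0, 58.0, 67.0, 78.0, 90.0])

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(height), height, duration])
coef, *_ = np.linalg.lstsq(X, speed, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)   # intercept, slope for height, slope for duration
```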

Rollercoaster example

• Which variables might be good predictors? How about height and duration?

Speed = 23.1 + 0.22·height + 0.04·duration + e

Rollercoaster example

• Which variables might be good predictors? How about height and duration?

Speed = 23.1 + 0.22·height + 0.04·duration + e

Heteroskedasticity: where the variance of the residuals is unequal over the range of measured values.


We found the coefficients – then what?

• The coefficients tell us how y changes when xi changes
• How does speed change when either height or duration changes?
• First – is the relationship positive or negative?
• Second – the coefficient value tells you how much the mean of the outcome (Speed) changes with a one-unit change in a predictor (Height OR Duration), holding the other predictors constant

We found the coefficients – then what?

• Estimated coefficient of height = 0.2199
  • For each additional foot of height, speed increases by 0.22 km/h

• Estimated coefficient of duration = 0.0415
  • For each additional second of duration, speed increases by 0.04 km/h

We found the coefficients – then what?

• These are both interesting (and both go up), but they cannot be compared precisely

• But we can standardize them to compare them, using the ratio of the standard deviations:

  standardized m = β = m · (σx / σy) = m · SDx / SDy

• Alternatively, you can standardize your data BEFORE running the regression – convert the data to z-scores

• This gives the rate of change in y in SDs in relation to the rate of change (in SDs) of x.
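In code, the conversion is one line per coefficient (a sketch; the data arrays are assumed to come from whatever regression you just ran):

```python
import numpy as np

def standardized_beta(b, x, y):
    """Standardized coefficient: beta = b * SDx / SDy."""
    return b * np.std(x, ddof=1) / np.std(y, ddof=1)

# e.g. beta_height = standardized_beta(0.22, height, speed)
#      beta_duration = standardized_beta(0.04, duration, speed)
```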

We found the coefficients – then what?

• This gives the rate of change in y in SDs in relation to the rate of change (in SDs) of x.
• This allows us to compare the relative magnitude of the influence of height and duration on speed:

Speed = 23.1 + 0.22·height + 0.04·duration + e

What else does the regression output tell you?

• t-statistic
• Significance test
• Model fit
• Coefficient of determination

t-statistic & significance test

• A t-test checks how unlikely the observed coefficient would be if the null hypothesis were true:

  t(N–p) = (b_obs – b_expected(H0)) / SE_b = b_obs / SE_b

• The t-statistic is a standardized test – not sensitive to different scales in the IVs
  • Another way to compare predictors, to see which has the strongest linear association

• Check against a t-value table – significance test – p < 0.05

• This tells us how likely it is that our coefficient tells us something (against the assumption that it tells us nothing)
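A sketch of the corresponding computation (scipy’s t distribution gives the two-tailed p-value; the example standard error is invented):

```python
from scipy import stats

def coef_t_test(b_obs, se_b, df):
    """t statistic and two-tailed p-value for H0: b = 0."""
    t = b_obs / se_b
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

# e.g. the height coefficient with an invented standard error, df = N - p = 47
print(coef_t_test(0.2199, 0.05, df=47))
```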
Simple vs. Multiple Regression

Simple regression:
• One dependent variable Y predicted from one independent variable X
• One regression coefficient
• r²: proportion of variation in dependent variable Y predictable from X

Multiple regression:
• One dependent variable Y predicted from a set of independent variables (X1, X2 … Xk)
• One regression coefficient for each independent variable
• R²: proportion of variation in dependent variable Y predictable by the set of independent variables (X’s)
Multiple Correlation Coefficient (R) and Coefficient of Multiple Determination (R²)

• R = the magnitude of the relationship between the dependent variable and the best linear combination of the predictor variables

• R² = the proportion of variation in Y accounted for by the set of independent variables (X’s)
Explaining Variation: How much?

[Diagram: the total variation in Y split into variation predictable by the combination of independent variables and unpredictable variation]

Proportion of Predictable & Unpredictable Variation

Where Y = Speed, X1 = Height, X2 = Duration:
• R² = predictable (explained) variation in Y
• (1 – R²) = unpredictable (unexplained) variation in Y

[Venn diagram of Y, X1 and X2]
Various Significance Tests

• Testing R²
  • Test R² through an F test
  • Test competing models (difference between R²s) through an F test of the difference of R²s

• Testing b
  • Test each partial regression coefficient (b) by t-test
  • Compare partial regression coefficients with each other – a t-test of the difference between standardized partial regression coefficients (β)

Variance explained

• What proportion of variance in speed can be predicted from the height and duration of the rollercoaster ride?
  • Multiple R-squared: 0.85 – 85% of the variance can be accounted for by the composite of the two variables
• Is an R² of 0.85 statistically significantly different from 0?
  • Conduct an F-test

The F-test – test for overall significance

• Developed by eugenicist Ronald Fisher
• The test statistic has an F-distribution under the null hypothesis
• For regression, this statistic tests whether the proposed regression model fits the data well (the alternative hypothesis)

Regression – tests of significance

• The F test shows the overall significance of the regression model
  • Hypotheses: H0: β1 = β2 = . . . = βp = 0; Ha: one or more of the parameters ≠ 0
  • F = MSR / MSE
  • Reject H0 if the p-value ≤ α, or if F > Fα, where Fα is the critical value from the F distribution

• The t test is used to determine whether each of the individual independent variables is significant
  • Hypotheses: H0: βi = 0; Ha: βi ≠ 0
  • t = bi / s(bi)
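A sketch of the overall F test built from the sums of squares (scipy supplies the F distribution; the helper and argument names are mine):

```python
from scipy import stats

def overall_f_test(ss_regression, ss_residual, p, n):
    """F = MSR/MSE for H0: beta_1 = ... = beta_p = 0, with p predictors, n cases."""
    msr = ss_regression / p             # mean square due to regression
    mse = ss_residual / (n - p - 1)     # mean square error
    F = msr / mse
    p_value = stats.f.sf(F, p, n - p - 1)
    return F, p_value
```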

Potential problems with multiple linear regression

Multicollinearity
• When some or all of the X variables are too similar to one another (high correlations)
• Investigate why the variables are correlated (conceptually)
• When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
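A quick way to screen for this is the correlation matrix of the predictors (a sketch with simulated data in which two predictors are deliberately collinear):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.uniform(50, 300, size=40)
length = 10 * height + rng.normal(0, 50, size=40)   # built to correlate with height
duration = rng.uniform(60, 240, size=40)

X = np.column_stack([height, length, duration])
print(np.round(np.corrcoef(X, rowvar=False), 2))    # flag any |r| > .7 off the diagonal
```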

Potential problems with multiple linear regression

• Variable selection
  • Choose wisely from your list of X variables
  • Too many: you waste the information in the data
  • Too few: you risk ignoring useful predictive information
• Model misspecification
  • Could the model you are testing be wrong or incomplete?
  • Nonlinearity
  • Interactions

Multiple regression: interpreting results

• F Test: significant or not significant
  • Tests whether the X variables, as a group, can predict Y better than chance
• Coefficient of Determination: R²
  • Percentage of variability in Y explained by the X variables as a group (for lower N, pay attention to the adjusted R²)
• Intercept: a
  • Predicted value of Y when every X is 0
• Regression Coefficients: b1, b2, … bn
  • Effect of each X on Y, holding all other X variables constant
Warning about tests and CIs

• Hypothesis tests and CIs are meaningful only when the data fit the model well.

• Remember: when the sample size is large enough, you will probably reject any null hypothesis of β = 0.
• When the sample size is small, you may not have enough evidence to reject a null hypothesis of β = 0.

• When you fail to reject a null hypothesis, don’t be too hasty to say that a predictor has no linear association with the outcome. It is likely that there is some association; it just isn’t a very strong one.
Checking assumptions

• Plot the residuals versus the predicted values from the regression line.
• Also plot the residuals versus each of the predictors.
• If these plots show non-random patterns, the assumptions might be violated. A quick sketch of the first plot follows.
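A minimal matplotlib sketch of the residuals-versus-predictions plot (the observed and predicted values are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([45.0, 58.0, 67.0, 78.0, 90.0])       # observed values
y_hat = np.array([46.1, 56.8, 68.0, 77.2, 89.9])   # predicted by the model
residuals = y - y_hat

plt.scatter(y_hat, residuals)    # look for funnels, curves, or other patterns
plt.axhline(0, color="grey")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()
```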


General warnings for multiple regression

• Be even more wary of extrapolation. Because there are several predictors, you can extrapolate in many ways.

• Multiple regression shows association. It does not prove causality. Only a carefully designed observational study or randomized experiment can show causality.

Next week

• Making an argument with quantitative data
• Debates about the logic of statistics

• Assignment 2 released on SATURDAY!
  • We will discuss it on Tuesday
