
CHAPTER 37

Multiple Regression

An approximate answer to the right problem is worth a good
deal more than an exact answer to an approximate problem.
JOHN TUKEY

In laboratory experiments, you can generally control all the variables.
You change one variable, measure another, and then analyze the data
with one of the standard statistical tests. But in some kinds of experi-
ments, and in many observational studies, you must analyze how one
variable is influenced by several variables. Before reading this chapter,
which introduces the idea of multiple regression, first read Chapters 33
(simple linear regression) and 34 (fitting models to data). This chapter
explains multiple regression, in which the outcome is a continuous vari-
able. The next chapter explains logistic regression (in which the outcome
is binary) and proportional hazards regression (in which the outcome is
survival time).

GOALS OF MULTIVARIABLE REGRESSION


Multiple regression extends simple linear regression to allow for multiple inde-
pendent (X) variables. This is useful in several contexts:
• To assess the impact of one variable after accounting for others. Does a
drug work after accounting for age differences between the patients who
received the drug and those who received a placebo? Does an environmen-
tal exposure increase the risk of a disease after taking into account other dif-
ferences between people who were and were not exposed to that risk factor?
• To create an equation for making useful predictions. Given the data we
know now, what is the chance that this particular man with chest pain is
having a myocardial infarction (heart attack)? Given several variables that
can be measured easily, what is the predicted cardiac output of this patient?
• To understand scientifically how much changes in each of several variables
contribute to explaining an outcome of interest. For example, how do the
concentrations of high-density lipoproteins (HDL, good cholesterol), low-
density lipoproteins (LDL, bad cholesterol), triglycerides, C-reactive pro-
tein, and homocysteine predict the risk of heart disease? One goal might be


to generate an equation that can predict the risk for individual patients (as
explained in the previous point). But another goal might be to understand
the contributions of each risk factor to aid public health efforts and help
prioritize future research projects.
Different regression methods are available for different kinds of data. This
chapter explains multiple linear regression, in which the outcome variable is con-
tinuous. The next chapter explains logistic regression (binary outcome) and pro-
portional hazards regression (survival times).

LINGO
Variables
A regression model predicts one variable Y from one or more other variables X. The
Y variable is called the dependent variable, the response variable, or the outcome
variable. The X variables are called independent variables, explanatory variables,
or predictor variables. In some cases, the X variables may encode variables that the
experimenter manipulated or treatments that the experimenter selected or assigned.
Each independent variable can be:
• Continuous (e.g., age, blood pressure, weight).
• Binary. An independent variable might, for example, code for gender by
defining zero as male and one as female. These codes, of course, are arbi-
trary. When there are only two possible values for a variable, it is called a
dummy variable.
• Categorical, with three or more categories (e.g., four medical school classes
or three different countries). Consult more advanced books if you need to
do this, because it is not straightforward, and it is easy to get confused.
Several dummy variables are needed.
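
For instance, here is a minimal sketch (not taken from the example study) of how a three-category variable might be encoded with two dummy variables; the category labels and data are made up for illustration.

```python
import numpy as np

# Hypothetical three-category variable (e.g., country A, B, or C for each participant)
country = np.array(["A", "B", "C", "B", "A", "C"])

# Two dummy variables encode three categories; "A" is the reference category
# and corresponds to both dummies being zero.
is_B = (country == "B").astype(float)
is_C = (country == "C").astype(float)

X_dummies = np.column_stack([is_B, is_C])
print(X_dummies)
```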

Parameters
The multiple regression model defines the dependent variable as a function of the
independent variables and a set of parameters, or regression coefficients. Regression
methods find the values of each parameter that make the model predictions come as
close as possible to the data. This approach is analogous to linear regression, which
determines the values of the slope and intercept (the two parameters or regression
coefficients of the model) to make the model predict Y from X as closely as possible.

Simple regression versus multiple regression


Simple regression refers to models with a single X variable, as explained in
Chapter 33. Multiple regression, also called multivariable regression, refers to
models with two or more X variables.

Univariate versus multivariate regression


Although they are beyond the scope of this book, methods do exist that can si-
multaneously analyze several outcomes (Y variables) at once. These are called

multivariate methods, and they include factor analysis, cluster analysis, principal
components analysis, and multivariate ANOVA (MANOVA). These methods contrast
with univariate methods, which deal with only a single Y variable.
Note that the terms multivariate and univariate are sometimes used inconsis-
tently. Sometimes multivariate is used to refer to multivariable methods for which
there is one outcome and several independent variables (i.e., multiple and logistic
regression), and sometimes univariate is used to refer to simple regression with
only one independent variable.

AN EXAMPLE OF MULTIPLE LINEAR REGRESSION


As you learned in Chapter 33, simple linear regression determines the best linear
equation to predict Y from a single variable X. Multiple linear regression extends
this approach to find the linear equation that best predicts Y from multiple inde-
pendent (X) variables.

Study design and questions


Staessen and colleagues (1992) investigated the relationship between lead expo-
sure and kidney function. Heavy exposure to lead can damage kidneys. Kidney
function decreases with age, and most people accumulate small amounts of lead
as they get older. These investigators wanted to know whether accumulation of
lead could explain some of the decrease in kidney function with aging.
The researchers studied 965 men and measured the concentration of lead in
blood, as well as creatinine clearance to quantify kidney function. The people with
more lead tended to have lower creatinine clearance, but this is not a useful find-
ing. Lead concentration increases with age, and creatinine clearance decreases
with age. Differences in age are said to confound any investigation between cre-
atinine clearance and lead concentration. To adjust for the confounding effect of
age, the investigators used multiple regression and included age as an indepen-
dent variable. They also included three other independent variables: body mass,
logarithm of γ-glutamyl transpeptidase (a measure of liver function), and previous
exposure to diuretics (coded as zero or 1).
The X variable about which they cared most was the logarithm of lead con-
centration. They used the logarithm of concentration rather than the lead concen-
tration itself because they expected the effect of lead to be multiplicative rather
than additive—that is, they expected that a doubling of lead concentration (from
any starting value) would have an equal effect on creatinine clearance. So why
logarithms? The regression model is intrinsically additive. Note that the sum of
the two logarithms is the same as the logarithm of the product: log(A) + log(B) =
log(A · B). Therefore, transforming a variable to its logarithm converts a multi-
plicative effect to an additive one (Appendix E reviews logarithms).
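
A quick numeric check (a sketch with a made-up lead value) shows why the logarithm accomplishes this: multiplying the concentration by any factor adds the same amount to its logarithm, regardless of the starting value.

```python
import numpy as np

lead = 50.0  # hypothetical lead concentration, micrograms per liter
print(np.log10(10 * lead) - np.log10(lead))  # 1.0: any tenfold increase adds one log unit
print(np.log10(2 * lead) - np.log10(lead))   # ~0.301: any doubling adds the same amount
```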

Mathematical model
This book is mostly nonmathematical. I avoid explaining the math of how methods
work and only provide such explanation when it is necessary to understand the
questions the statistical methods answer. But you really can't understand multiple
regression at all without understanding the model that is being fit to the data.

Creatinine clearance = β0
                       + β1 × log(serum lead)
                       + β2 × Age
                       + β3 × Body mass index
                       + β4 × log(GGT)
                       + β5 × Diuretics?
                       + ε (Gaussian random)

Table 37.1. The multiple regression model for the example.
β0 through β5 are the six parameters fit by the multiple regression program. Each of
these parameters has different units (listed in Table 37.2). The product of each
parameter times the corresponding variable is expressed in the units of the Y variable
(creatinine clearance), which are milliliters per minute (ml/min). The goal of multiple
regression is to find the values for the six parameters of the model that make the
predicted creatinine clearance values come as close as possible to the actual values.
The multiple regression model is shown in Table 37.1. The dependent (Y)
variable is creatinine clearance. The model predicts its value from a baseline value
plus the effects of five independent (X) variables, each multiplied by a regression
coefficient, also called a regression parameter.
The X variables were the logarithm of serum lead, age, body mass, logarithm
of γ-glutamyl transpeptidase (a measure of liver function), and previous exposure to
diuretics (coded as zero or 1). This last variable is called a dummy variable
(or indicator variable) because those two particular values were chosen arbitrarily
to designate two groups (people who have not taken diuretics and those who have).
The final component of the model, ε, represents random variability (error).
Like ordinary linear regression, multiple regression assumes that the random
scatter (individual variation unrelated to the independent variables) follows a
Gaussian distribution.
The model of Table 37.1 can be written as an equation:

Yi = β0 + β1·Xi,1 + β2·Xi,2 + β3·Xi,3 + β4·Xi,4 + β5·Xi,5 + εi    (37.1)

Notes on this equation:


• Yi. The subscript i refers to the particular patient. So Y3 is the creatinine
clearance of the third patient.
• Xi,j. The first subscript (i) refers to the particular participant/patient, and the
second subscript enumerates a particular independent variable. For example,
X3,5 encodes the use of diuretics (the fifth variable in Table 37.2) for the third
participant. These values are data that you collect and enter into the program.

VARIABLE   MEANING                 UNITS
X1         log(serum lead)         Logarithms are unitless. Untransformed serum lead
                                   concentration was in micrograms per liter.
X2         Age                     Years
X3         Body mass index         Kilograms per square meter
X4         log(GGT)                Logarithms are unitless. Untransformed serum GGT
                                   level was in units per liter.
X5         Diuretics?              Unitless. 0 = never took diuretics. 1 = took diuretics.
Y          Creatinine clearance    Milliliters per minute

Table 37.2. Units of the variables used in the multiple regression examples.

• The regression coefficients (β1 to β5). These will be fit by the regression
program. One of the goals of regression is to find the best-fit value of
each regression coefficient, along with its CI. Each regression coefficient
represents the average change in Y when you change the corresponding X
value by 1.0. For example, β5 is the average difference in creatinine clearance
between those who have taken diuretics (Xi,5 = 1) and those who have not
(Xi,5 = 0).
• The intercept, β0. This is the predicted average value of Y when all the
X values are zero. In this example, the intercept has only a mathematical
meaning, but no practical meaning because setting all the X values to zero
means that you are looking at someone who is zero years old with zero
weight! β0 is fit by the regression program.
• ε. This is a random variable that is assumed to follow a Gaussian distri-
bution. Predicting Y from all the X variables is not a perfect prediction
because there also is a random component, designated by ε.
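
To make the model concrete, here is a minimal sketch of fitting an equation of this form with Python's statsmodels package. The data are simulated (the study's raw data are not available), and the variable distributions and all simulated coefficients other than the –9.5 and –8.8 reported in this chapter are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 965

# Simulated independent variables (distributions are made up)
log_lead = rng.normal(2.0, 0.3, n)       # hypothetical log10(serum lead)
age      = rng.uniform(20, 75, n)        # years
bmi      = rng.normal(26, 3, n)          # kg per square meter
log_ggt  = rng.normal(1.3, 0.25, n)      # hypothetical log10(GGT)
diuretic = rng.integers(0, 2, n)         # dummy variable: 0 or 1

# Simulated outcome built from Equation 37.1 with Gaussian error
y = 160 - 9.5*log_lead - 0.6*age - 0.5*bmi - 4.0*log_ggt - 8.8*diuretic \
    + rng.normal(0, 20, n)

X = sm.add_constant(np.column_stack([log_lead, age, bmi, log_ggt, diuretic]))
fit = sm.OLS(y, X).fit()
print(fit.params)       # best-fit values of beta0 through beta5
print(fit.conf_int())   # 95% CI for each parameter
```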

Goals of regression
Multiple regression fits the model to the data to find the values for the parameters
that will make the predictions of the model come as close as possible to the actual
data. Like simple linear regression, it does so by finding the values of the param-
eters (regression coefficients) in the model that minimize the sum of the squares
of the discrepancies between the actual and predicted Y values. Like simple linear
regression, multiple regression is a least-squares method.
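
A bare-bones sketch (made-up data, plain NumPy) of what "least squares" means in code: the fitted coefficients are simply the values that minimize the sum of squared differences between the actual and predicted Y values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + two X variables
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(0, 0.5, 50)   # made-up data

# beta_hat minimizes sum((y - X @ beta)**2); np.linalg.lstsq solves this directly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)  # the quantity multiple regression minimizes
print(beta_hat, sse)
```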

Best-fit values of the parameters


The investigator’s goal was to answer the question, After adjusting for effects of
the other variables, is there a substantial linear relationship between the logarithm
of lead concentration and creatinine clearance?
The best-fit value of β1 (the parameter for the logarithm of lead concentration)
was –9.5 ml/min. This means that, on average, and after accounting for
differences in the other variables, an increase in log(lead) of one unit (so that the
lead concentration increases tenfold, since they used common base 10 logarithms)
is associated with a decrease in creatinine clearance of 9.5 ml/min. The 95%
CI ranges from –18.1 to –0.9 ml/min.
Understanding these values requires some context. The study participants'
average creatinine clearance was 99 ml/min. So a tenfold increase in lead con-
centrations is associated with a reduction of creatinine clearance (reduced renal
function) of about 10%, with a 95% CI ranging from about 1% to 20%. Figure 37.1
illustrates this model.

[Figure 37.1 plots the predicted change in creatinine clearance (ml/min) against the fold change in lead concentration (baseline, 10X, 100X, 1000X).]

Figure 37.1. The prediction of multiple regression.
One of the variables entered into the multiple regression model was the logarithm of lead
concentration. The best-fit value for its coefficient was –9.5. For every one-unit increase in
the logarithm of lead concentration, the creatinine clearance is predicted to go down by
9.5 ml/min. When the logarithm of lead concentration increases by one unit, the lead con-
centration itself increases tenfold. The solid line shows the best-fit value of that slope. The
two dashed lines show the range of the 95% CI.
The authors also report the values for all the other parameters in the model. For
example, the β5 coefficient for the X variable “previous diuretic therapy” was –8.8
ml/min. That variable is coded as zero if the patient had never taken diuretics and
1 if the patient had taken diuretics. It is easy to interpret the best-fit value. On av-
erage, after taking into account differences in the other variables, participants who
had taken diuretics previously had a mean creatinine clearance that was 8.8 ml/min
lower than that of participants who had not taken diuretics.

Statistical significance
Multiple regression programs can compute a P value for each parameter in the
model testing the null hypothesis that the true value of that parameter is zero. Why
zero? When a regression coefficient (parameter) equals zero, then the correspond-
ing independent variable has no effect in the model (because the product of the
independent variable times the coefficient always equals zero).

The authors of this paper did not include P values and reported only the
best-fit values with their 95% CIs. We can figure out which ones are less than
0.05 by looking at the CIs.
The CI of β1 runs from a negative number to another negative number and
does not include zero. Therefore, you can be 95% confident that increasing lead
concentration is associated with a drop in creatinine clearance (poorer kidney
function). Since the 95% CI does not include the value defining the null hypoth-
esis (zero), the P value must be less than 0.05. The authors don’t quantify it more
accurately than that, but most regression programs would report the exact P value.

R2: How well does the model fit the data?


R2 equals 0.27. This means that only 27% of the variability in creatinine clearance
is explained by the model. The remaining 73% of the variability is explained by
variables not included in this study, variables included in this study but not in the
forms entered into the model, and random variation.
With simple linear regression, you can see the best-fit line superimposed
on the data and visualize goodness of fit. This is not possible with multiple re-
gression. With two independent variables, you could visualize the fit on a three-
dimensional graph, but most multiple regression models have more than two
independent variables.
Figure 37.2 shows a way to visualize how well a multiple regression model
fits the data and to understand the meaning of R2. Each point represents one par-
ticipant. The horizontal axis plots each participant’s measured creatinine clear-
ance (the Y variable in multiple regression). The vertical axis plots the creatinine
clearance value predicted by the model. This prediction is computed from the
other variables for that participant and the best-fit parameter values computed
by multiple regression, but this calculation does not use the measured value of
creatinine clearance. So the graph shows how well the model predicts the actual
creatinine clearance. If the prediction were perfect, all the points would align on a
45-degree line with the predicted creatinine clearance matching the actual creati-
nine clearance. You can see that the predictions for the example data are far from
perfect. The predicted and actual values are correlated, with R2 equal to 0.27. By
definition, this is identical to the overall R2 computed by multiple regression.
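
A sketch of how such a graph might be produced (assuming a fitted statsmodels model and outcome array like those in the earlier simulated sketch): plot predicted against actual Y, and square their correlation, which matches the overall R2.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predicted_vs_actual(fit, y):
    y_hat = fit.fittedvalues                 # Y values predicted by the model
    r = np.corrcoef(y, y_hat)[0, 1]
    print("squared correlation:", r**2)      # equals fit.rsquared
    plt.scatter(y, y_hat, s=5)
    plt.xlabel("Actual Y")
    plt.ylabel("Predicted Y")
    plt.show()
```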

R2 versus adjusted R2
R2 is commonly used as a measure of goodness of fit in multiple linear regression,
but it can be misleading. Even if the independent variables are completely unable to
predict the dependent variable, R2 will be greater than zero. The expected value of R2
increases as more independent variables are added to the model. This limits the use-
fulness of R2 as a way to quantify goodness of fit, especially with small sample sizes.
In addition to reporting R2, which quantifies how well the model fits the
data being analyzed, most programs also report an adjusted R2, which estimates
how well the model is expected to fit new data. This measure accounts for the
number of independent variables and is always smaller than R2. How much
smaller depends on the relative numbers of participants and variables. This study
has far more participants (965) than independent variables (5), so the adjusted
R2 is only a tiny bit smaller than the unadjusted R2, and the two are equal to two
decimal places (0.27).

[Figure 37.2 plots the predicted creatinine clearance against the actual creatinine clearance; R2 = 0.27, P < 0.0001.]

Figure 37.2. The meaning of R2 in multiple regression.
The authors did not post the raw data, so this graph does not accurately represent the data
in the example. Instead, I simulated some data that look very much like what the actual
data would have looked like. Each of the 965 points represents one man in the study. The
horizontal axis shows the actual creatinine clearance for each participant. The vertical
axis shows the creatinine clearance computed by the multiple regression model from
that participant's lead level, age, body mass, log γ-glutamyl transpeptidase, and previous
exposure to diuretics. The prediction is somewhat useful, because generally people who
actually have higher creatinine clearance levels are predicted to have higher levels. However,
there is a huge amount of scatter. If the model were perfect, each predicted value would be
the same as the actual value, all the points would line up on a 45-degree line, and R2 would
equal 1.00. Here the predictions are less accurate, and R2 is only 0.27.

ASSUMPTIONS
Assumption: Sampling from a population
This is a familiar assumption of all statistical analyses. The goal in all forms of
multiple regression is to analyze a sample of data to make more general conclu-
sions (or predictions) about the population from which the data were sampled.

Assumption: Linear effects only


The multiple regression model assumes that increasing an X variable by one unit
increases (or decreases) the value of Y (multiple regression) by the same amount,
regardless of the values of the X variables.
In the multiple regression (lead exposure) example, the model predicts that
(on average) creatinine clearance will decrease by a certain amount when the loga-
rithm of lead concentration increases by 1.0 (which means the lead concentration

increases tenfold, since the common or base 10 logarithm of 10 is 1). The assump-
tion of linear effects means that increasing the logarithm of lead concentration by
2.0 will have twice the impact on creatinine clearance as increasing it by 1.0 and
that the decrease in creatinine clearance by a certain concentration of lead does
not depend on the values of the other variables.

Assumption: No interaction beyond what is specified in the model


Two of the independent variables in the multiple linear regression model are the
logarithm of lead concentration and age. What if lead has a bigger impact on
kidney function in older people than it does in younger people? If this were the
case, the relationship between the effects of lead concentration and the effects of
age would be called an interaction. Standard multiple regression models assume
there is no interaction. It is possible to include interaction terms in the model, and
this process is explained later in the chapter.

Assumption: Independent observations


The assumption that data for each participant provide independent information
about the relationships among the variables should be familiar by now. This as-
sumption could be violated, for example, if some of the participants were identical
twins (or even siblings).

Assumption: The random component of the model is Gaussian


For any set of X values, multiple linear regression assumes that the random com-
ponent of the model follows a Gaussian distribution, at least approximately. In
other words, it assumes that the residuals (the differences between the actual Y
values and Y values predicted by the model) are sampled from a Gaussian popula-
tion. Furthermore, it assumes that the SD of that scatter is always the same and is
unrelated to any of the variables.

AUTOMATIC VARIABLE SELECTION


How variable selection works
The authors of our example stated that they collected data for more variables for
each participant and that the fit of the model was not improved when the model
also accounted for smoking habits, mean blood pressure, serum ferritin level (a
measure of iron storage), residence in urban versus rural areas, or urinary cad-
mium levels. Consequently, they omitted these variables from the model whose fit
they reported. In other words, they computed a P value for each independent vari-
able in the model, removed variables for which P values were greater than 0.05,
and then reran the model without those variables.

What is automatic variable selection?


Many multiple regression programs can choose variables automatically. One ap-
proach (called all-subsets regression) is to fit the data to every possible model
(recall that each model includes some X variables and may exclude others) and

then find the one that is the best. With many variables and large data sets, the
computer time required for this approach is prohibitive. To conserve computer
time when working with huge data sets, other algorithms use a stepwise approach.
One approach (called forward-stepwise selection or a step-up procedure) is to
start with a very simple model and add new X variables one at a time, always
adding the X variable that most improves the model’s ability to predict Y. Another
approach (backward-stepwise selection or a step-down procedure) is to start with
the full model (including all X variables) and then sequentially eliminate those X
variables that contribute the least to the model.
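
To make the algorithm concrete, here is a rough sketch of forward-stepwise selection, shown only for illustration (as the next section argues, this procedure is easy to misuse). Real programs typically add variables based on an F-to-enter or P value criterion; this sketch simply adds whichever candidate raises R2 the most.

```python
import statsmodels.api as sm

def forward_stepwise(X, y, max_vars):
    """X: 2-D NumPy array of candidate variables (columns); returns indices of chosen columns."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining and len(chosen) < max_vars:
        # Try adding each remaining candidate and keep the one that most improves R^2
        r2_and_index = []
        for j in remaining:
            cols = chosen + [j]
            r2 = sm.OLS(y, sm.add_constant(X[:, cols])).fit().rsquared
            r2_and_index.append((r2, j))
        best_r2, best_j = max(r2_and_index)
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen
```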

The problems of automatic variable selection


The appeal of automatic variable selection is clear. You just put all the data into the
program, and it makes all the decisions for you. The problem is multiple compari-
sons. How many models does a multiple regression program compare when given
data with k independent variables and instructed to use the all-subsets method
to compare the fit of every possible model? Each variable can be included or
excluded from the final model, so the program will compare 2^k models. For example,
if the investigator starts with 20 variables, then automatic variable selection
compares 2^20 models (more than a million), even before considering interactions.
When you read a paper presenting results of multiple regression, you may not
even know the number of variables with which the investigator started. Peter Flom
(2007) explains why this ignorance makes it impossible to interpret the results of
multiple regression with stepwise variable selection:
If you toss a coin ten times and get ten heads, then you are pretty sure that some-
thing weird is going on. You can quantify exactly how unlikely such an event is,
given that the probability of heads on any one toss is 0.5. If you have 10 people each
toss a coin ten times, and one of them gets 10 heads, you are less suspicious, but you
can still quantify the likelihood. But if you have a bunch of friends (you don’t count
them) toss coins some number of times (they don’t tell you how many) and someone
gets 10 heads in a row, you don’t even know how suspicious to be. That’s stepwise.

The consequences of automatic variable selection are pervasive and serious
(Harrell 2015; Flom 2007):
• The final model fits too well. R2 is too high.
• The best-fit parameter values are too far from zero. This makes sense. Since
variables with low absolute values have been eliminated, the remaining
variables tend to have absolute values that are higher than they should be.
• The CIs are too narrow, so you think you know the parameter values with
more precision than is warranted.
• When you test whether the parameters are statistically significant, the
P values are too low, so they cannot be interpreted.

Simulated example of variable selection


Chapter 23 already referred to a simulated study by Freedman (1983) that
demonstrated the problem with this approach (his paper was also reprinted as

an appendix in a text by Good and Hardin, 2006). He simulated a study with
100 participants and recorded data from 50 independent variables for each. All
values were simulated, so it is clear that the outcome is not associated with any
of the simulated independent variables. As expected, the overall P value from
multiple regression was high, as were most of the individual P values (one for
each independent variable).
He then chose the 15 independent variables that had the lowest P values (less
than 0.25) and reran the multiple regression program using only those variables. The
resulting overall P value from multiple regression was tiny (0.0005). The contribu-
tions of 6 of the 15 independent variables were statistically significant (P < 0.05).
If you didn’t know these were all simulated data with no associations, the
results might seem impressive. The tiny P values beg you to reject the null hypothe-
ses and conclude that the independent variables can predict the dependent variable.
The problem is essentially one of multiple comparisons, an issue already dis-
cussed in Chapter 23. With lots of variables, it is way too easy to be fooled. You
can be impressed with high R2 values and low P values, even though there are no
real underlying relationships in the population.
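
A rough re-creation of Freedman's simulation (an illustrative sketch, not his actual code; exact numbers depend on the random seed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, k = 100, 50
X = rng.normal(size=(n, k))        # 50 independent variables of pure noise
y = rng.normal(size=n)             # outcome unrelated to any of them

full = sm.OLS(y, sm.add_constant(X)).fit()
keep = [j for j in range(k) if full.pvalues[j + 1] < 0.25]   # screen on P < 0.25, skipping the intercept

second = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
print(second.f_pvalue)                    # overall P value: often misleadingly small
print(np.sum(second.pvalues[1:] < 0.05))  # count of "significant" survivors
```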

Should you mistrust all analyses with variable selection?


It may make sense to use statistical methods to decide whether to include or ex-
clude one or a few carefully selected independent variables. But it does not make
sense to let a statistics program test dozens or hundreds (or thousands) of possible
models in hope that the computer can work magic.
When reading the results of research that depends on multiple regression,
read the paper carefully to see how many variables were collected but not used
in the final model. If a lot of variables were collected but not ultimately used, be
skeptical of the results. In some cases, there is no problem, as the extra variables
were collected for other purposes and the variables used in the model were se-
lected based on a plan created according to the goal of the study. But watch out if
the investigators first included many variables in the model but report results with
variables that were selected because they worked the best.
In some cases, the goal of the study is exploration. The investigators are not
testing a hypothesis but rather are looking for a hypothesis to test. Variable selec-
tion can be part of the exploration. But any model that emerges from an explor-
atory study must be considered a hypothesis to be tested with new data. Deciding
how to construct models is a difficult problem. Says Gelman (2012), “This is a big
open problem in statistics: how to make one’s model general enough to let the data
speak, but structured enough to hear what the data have to say.”

SAMPLE SIZE FOR MULTIPLE REGRESSION


How many participants are needed to perform a useful multiple regression analy-
sis? It depends on the goals of the study and your assumptions about the distribu-
tions of the variables. Chapter 26 explained the general principles of sample size
determination, but applying these principles to a multiple regression analysis is
not straightforward.

One approach to determining sample size is to follow a rule of thumb. The
only firm rule is that you need more cases than variables, a lot more. Beyond that,
rules of thumb specify that the number of participants (n) should be somewhere
between 10 and 20 (or even 40) times the number of variables. You must count the
number of variables you have at the start of the analysis. Even if you plan to use
automatic variable selection to reduce the number of variables in the final model,
you can’t use that lower number as the basis to calculate a smaller sample size.
The rules of thumb are inconsistent because they don’t require you to state the
goal of the study. More sophisticated approaches to calculating sample size for mul-
tiple regression require you to specify your goal (Kelley & Maxwell, 2008). Is your
goal to test the null hypothesis that the true R2 is zero? Is it to test the null hypothesis
that one particular regression coefficient equals zero? Is it to determine particular
regression coefficients with a margin of error smaller than a specified quantity? Is
it to predict future points within a specified margin of error? Once you have articu-
lated your goal, along with corresponding α and power as explained in Chapter 26,
you can then use specialized software to compute the required sample size.
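
The rule of thumb described above amounts to a trivial calculation (the multiplier of 10, 20, or even 40 participants per variable is a judgment call, and the variable count must include every variable you start with):

```python
def rule_of_thumb_sample_size(num_variables, participants_per_variable=20):
    # Count all variables you start with, not the number left after selection
    return num_variables * participants_per_variable

print(rule_of_thumb_sample_size(5, 10), "to", rule_of_thumb_sample_size(5, 20))  # 50 to 100
```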

MORE ADVANCED ISSUES WITH MULTIPLE REGRESSION


Multicollinearity
When two X variables are highly correlated, they both convey essentially the same
information. For example, one of the variables included in the multiple regression
example of this chapter is body mass index (BMI), which is computed from an
individual’s weight and height. If the investigators had entered weight and height
separately into the model, they probably would have encountered collinearity,
because people who are taller also tend to be heavier.
The problem is that neither variable adds much to the fit of the model after
the other one is included. If you removed either height or weight from the model,
the fit probably would not change much. But if you removed both height and
weight from the model, the fit would be much worse.
The multiple regression calculations assess the additional contribution of
each variable after accounting for all the other independent variables. When vari-
ables are collinear, each variable makes little individual contribution. Therefore,
the CIs for the corresponding regression coefficients are wider and the P values
for those parameters are larger.
One way to reduce collinearity (or multicollinearity, when three or more vari-
ables are entangled) is to avoid entering related or correlated independent vari-
ables into your model. An alternative is the approach used in the example study.
In this case, the researchers combined height and weight in a biologically sensible
way into a single variable (BMI).
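
A simulated sketch of why collinearity matters (all numbers made up): height and weight are strongly correlated, so the standard error (and hence the CI) of the height coefficient is wider once weight is also in the model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
height = rng.normal(1.75, 0.08, n)               # meters
weight = 25 * height**2 + rng.normal(0, 6, n)    # kilograms, closely tied to height
bmi = weight / height**2                         # kg per square meter
y = 50 + 0.8 * bmi + rng.normal(0, 5, n)         # made-up outcome

print(np.corrcoef(height, weight)[0, 1])         # the two predictors are highly correlated

alone = sm.OLS(y, sm.add_constant(height)).fit()
both  = sm.OLS(y, sm.add_constant(np.column_stack([height, weight]))).fit()
print(alone.bse[1], both.bse[1])                 # SE of the height coefficient inflates when weight is added
```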

Interactions among independent variables


Two of the independent variables in the multiple linear regression model are the
logarithm of lead concentration and age. What if the effects of lead concentration
matter more with older people? This kind of relationship, as mentioned earlier, is
called an interaction.

NAME                                EXAMPLE
Longitudinal or repeated measures   Multiple observations of the same participant at
                                    different times.
Crossover                           Each participant first gets one treatment, then another.
Bilateral                           Measuring from both knees in a study of arthritis or
                                    both ears in a study of tinnitus, and entering the two
                                    measurements into the regression separately.
Cluster                             Study pools results from three cities. Patients from the
                                    same city are more similar to each other than they are
                                    to patients from another city.
Hierarchical                        A clinical study of a surgical procedure uses patients
                                    from three different medical centers. Within each
                                    center, several different surgeons may do the procedure.
                                    For each patient, results might be collected at several
                                    time points.

Table 37.3. Study designs that violate the assumption of independence.

To include interaction between age and the logarithm of serum lead concen-
tration, add a new term to the model equation with a new parameter multiplied by
the product of age (X2) times the logarithm of lead (X1):

Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5 + β1,2·X1·X2 + ε

If the CI for the new parameter (β1,2) does not include zero, then you will con-
clude that there is a significant interaction between age and log(lead). This means
that the effects of lead change with age. Equivalently, the effect of age depends on
the lead concentrations.
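
In code, adding the interaction term simply means adding a new column equal to the product X1 × X2 before fitting (a sketch that assumes the variables are NumPy arrays, as in the earlier simulated example):

```python
import numpy as np
import statsmodels.api as sm

def fit_with_interaction(log_lead, age, other_vars, y):
    interaction = log_lead * age                       # the X1 * X2 product term
    X = sm.add_constant(np.column_stack([log_lead, age, other_vars, interaction]))
    fit = sm.OLS(y, X).fit()
    return fit.conf_int()[-1]   # 95% CI for the interaction coefficient: does it include zero?
```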

Correlated observations
One of the assumptions of multiple regression is that each observation is inde-
pendent. In other words, the deviation from the prediction of the model is entirely
random. Table 37.3 (adapted from Katz, 2006) is a partial list of study designs that
violate this assumption and so require specialized analysis methods.

COMMON MISTAKES: MULTIPLE REGRESSION


Mistake: Comparing too many model variations
In addition to choosing which variables to include or exclude from the model (ex-
plained earlier in the chapter), multiple regression analyses also let investigators
choose among many variations of the same model:
• With and without interaction terms
• With and without transforming some independent variables
• With and without transforming the dependent variable
• Including and excluding outliers or influential points
• Pooling all the data or analyzing some subsets separately
• Rerunning the model after defining a different variable to be the dependent
(outcome) variable

When you fit the data many different ways and then report only the model
that fits the data best, you are likely to come up with conclusions that are not valid.
This is essentially the same problem as choosing which variables to include in the
model, previously discussed.

Mistake: Preselecting variables to include in the model


To decide which variables to enter into a multiple regression program, some inves-
tigators first look at the correlation between each possible independent variable
and the outcome variable and then only enter the variables that are strongly corre-
lated with the outcome. Selecting variables this way is not so different from letting
a program select variables. The results will be affected the same way, with an R2
that is too high, CIs that are too narrow, and P values that are too low.

Mistake: Too many independent variables


The goal of regression, as in all of statistics, is to analyze data from a sample and
make valid inferences about the overall population. That goal cannot always be
met using multiple regression techniques. It is too easy to reach conclusions that
apply to the fit of the sample data but are not really true in the population. When
the study is repeated, the conclusions will not be reproducible.
This problem is called overfitting (Babyak, 2004). It happens when you ask
more questions than the data can answer—when you have too many independent
variables in the model compared to the number of participants. The problem with
overfitting is that the modeling process ends up fitting aspects of the data caused
by random scatter or experimental quirks. Such a model won’t be reproduced
when tested on a new set of data.
How many independent variables is too many? For multiple regression, a rule
of thumb is to have at least 10 to 20 participants per independent variable. Fitting a
model with five independent variables thus requires at least 50 to 100 participants.

Mistake: Including redundant or highly correlated independent variables
The problem of multicollinearity was discussed earlier in the chapter. This problem
is most severe when two variables in a model are redundant or highly correlated.
Say your study includes both men and women so you have one independent variable
“Woman” that equals 1 for females and zero for males, and another variable “Man”
that equals zero for females and 1 for males. You’ve introduced collinearity be-
cause the two variables encode the same information. Only one variable is needed.
An example of correlated variables would be including both weight and height in a
model, as people who are taller also tend to weigh more. One way around this issue
would be to compute the BMI from height and weight and only include that single
variable in the model, rather than including both height and weight.

Mistake: Dichotomizing without reason


A common mistake is to convert a continuous variable to a binary one without
good reason. For example, if you want to include blood pressure in the model,
you have two choices: include blood pressure itself in the model or dichotomize

blood pressure to a binary variable that encodes whether the patient has hyperten-
sion (high blood pressure) or not. One problem with the latter approach is that
it requires deciding on a somewhat arbitrary definition of whether someone has
hypertension. Another problem is that it treats patients with mild and severe hy-
pertension as the same, and so information is lost.

Mistake: Assuming that regression proves causation


In the example from the beginning of this chapter, the investigators chose to place
creatinine clearance (a measure of kidney function) on the left side of the model
(Y) and the concentration of lead on the right (X) and therefore concluded that
lead affects kidney function. It is conceivable that the relationship goes the op-
posite direction, and that damaged kidneys somehow accumulate more lead. This
is a fundamental problem of observational studies. The best way to overcome any
doubts about cause and effect is to do an experiment. While it wouldn’t be ethical
to expose people to lead to see what would happen to their renal function, such
experiments can be done with animals.
Also beware of confounding variables. An absurd example mentioned by
Katz (2006) makes this concept clear. A multiple regression model to find risk
factors associated with lung cancer might identify carrying matches as a signifi-
cant risk factor. This would prove that people who carry matches in their pocket
are more likely to get lung cancer than people who don’t. But, of course, carrying
matches doesn’t cause cancer. Rather, people who carry matches are also likely to
smoke, and that does cause cancer.

Mistake: Not thinking about the model


It is too easy to run a multiple regression program without thinking about the model.
Does the assumption of linearity make sense? Does it make sense to assume that the
variables don’t interact? You probably know a lot about the scientific context of the
work. Don’t forget all that when you pick a standard multiple regression model.

Q&A

Do you always have to decide which variable is the outcome (dependent variable)
and which variables are the predictors (independent variables) at the time of data
collection?
No. In some cases, the independent and dependent variables may not be distinct
at the time of data collection. The decision is sometimes made only at the time of
data analysis. But beware of these analyses. The more ways you analyze the data,
the more likely you are to be fooled by overfitting and multiple comparisons.
Does it make sense to compare the value of one best-fit parameter with another?
No. The units of each parameter are different, so they can't be directly compared.
If you want to compare, read about standardized parameters in a more
advanced book. Standardizing rescales the parameters so they become unitless
and can then be compared. A variable with a larger standardized parameter has
a more important impact on the dependent variable.

Do all the independent variables have to be expressed in the same units?


No. Usually they are in different units.
What possible values can R2 and adjusted R2 have?
Like in linear and nonlinear regression, R2 is the fraction of variation explained by
the model, so it must have a value between 0.0 and 1.0. The adjusted R2, despite
its name, is not the square of anything. It is the R2 minus an adjustment factor
based on the number of variables and sample size. In rare circumstances (poor
fit with small sample size), this adjustment can be larger than R2, so the adjusted
R2 can be negative.
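
One common form of that adjustment (a sketch; programs may differ in details) is:

```python
def adjusted_r2(r2, n_participants, n_variables):
    # Shrinks R^2 more when there are many variables relative to participants;
    # can dip below zero when the fit is poor and the sample is small.
    return 1 - (1 - r2) * (n_participants - 1) / (n_participants - n_variables - 1)

print(adjusted_r2(0.27, 965, 5))   # barely below 0.27, as in the lead example
```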
In what units are the parameters expressed?
Each parameter is expressed in the Y units divided by the X units of the variable
associated with that parameter.
How is regression related to other statistical methods?
Chapter 35 pointed out that an unpaired t test can be recast as linear regression.
Similarly, ANOVA can be recast as multiple regression. It is also possible to use
logistic or proportional hazards regression to compare two groups, replacing
the methods explained in Chapters 27 through 29 (the results won’t be exactly
the same, but they should be close). Essentially all of statistics can be recast as a
form of fitting some kind of model using an appropriate kind of regression.
I fit a model and the P value associated with one of the variables is small enough that the
variable is considered to have a statistically significant effect. Now I include one more
variable in the model. Is it possible that that variable will no longer have a statistically
significant effect?
Yes. Each P value considers the effect of one variable, accounting for all the
others. Adding a new variable will change the results for all the variables.
What if a variable has a nonsignificant effect in the first model, and then I add another
variable to the model. Is it possible that the first variable will now have a statistically
significant effect?
Yes.
How does ANCOVA fit in?
This book does not discuss ANCOVA. It is a model that is equivalent to multiple
linear regression when at least one independent variable is categorical and at
least one is continuous.
How are the CIs calculated?
Some programs report the standard error of each parameter instead of its CI.
Computing the CI of the best-fit value of a model parameter works just like
computing the CI of the mean if you know the SEM (see Chapter 12). Compute
the margin of error by multiplying the reported standard error by a value
obtained from the t distribution (see Appendix D). This value depends only on
the level of confidence desired (95% is standard) and the number of df (equal
to the number of participants in the study minus the number of parameters fit
by the model). For 95% confidence and with plenty of df (common in multiple
regression), the multiplier approximates 2.0. Add and subtract this margin of
error from the best-fit value to obtain the CI.
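
That calculation might look like this in code (a sketch; the example numbers are made up to roughly match the lead coefficient reported earlier):

```python
from scipy import stats

def ci_from_se(estimate, se, n_participants, n_parameters, conf=0.95):
    df = n_participants - n_parameters
    t_crit = stats.t.ppf(0.5 + conf / 2, df)     # about 2.0 for a 95% CI with plenty of df
    margin = t_crit * se
    return estimate - margin, estimate + margin

# Hypothetical numbers: estimate -9.5 ml/min, standard error 4.4, 965 participants, 6 parameters
print(ci_from_se(-9.5, 4.4, 965, 6))             # roughly (-18.1, -0.9)
```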
Where can I learn more about multiple regression and related analyses?
The texts by Katz (2006) and Campbell (2006) are concise, clear, practical,
and nonmathematical. The books by Glantz, Slinker, and Neilands (2016) and
Vittinghoff and colleagues (2007) have more depth and more math, but they
remain clear and practical.

CHAPTER SUMMARY
• Multiple variable regression is used when the outcome you measure is
affected by several other variables.
• This approach is used when you want to assess the impact of one variable
after correcting for the influences of others, to predict outcomes from
several variables, or to try to tease apart complicated relationships among
variables.
• Multiple linear regression is used when the outcome (Y) variable is con-
tinuous. Chapter 38 explains methods used when the outcome is binary.
• Beware of the term multivariate, which is used inconsistently.
• Automatic variable selection is appealing but the results can be misleading.
It is a form of multiple comparisons.

TERMS INTRODUCED IN THIS CHAPTER


• Adjusted R2
• All-subsets regression
• Automatic variable selection
• Backward-stepwise selection (or step-down procedure)
• Collinearity
• Confounding variable
• Dichotomize
• Dummy variable
• Forward-stepwise selection (or step-up procedure)
• Interaction
• Multicollinearity
• Multiple linear regression
• Multivariable/multiple regression
• Multivariate methods
• Overfitting
• Regression coefficients
• Simple linear regression
• Univariate methods
