Multiple Regression
An approximate answer to the right problem is worth a good
deal more than an exact answer to an approximate problem.
JOHN TUKEY
One goal might be to generate an equation that can predict the risk for individual patients (as
explained in the previous point). But another goal might be to understand
the contributions of each risk factor to aid public health efforts and help
prioritize future research projects.
Different regression methods are available for different kinds of data. This
chapter explains multiple linear regression, in which the outcome variable is con-
tinuous. The next chapter explains logistic regression (binary outcome) and pro-
portional hazards regression (survival times).
LINGO
Variables
A regression model predicts one variable Y from one or more other variables X. The
Y variable is called the dependent variable, the response variable, or the outcome
variable. The X variables are called independent variables, explanatory variables,
or predictor variables. In some cases, the X variables may encode variables that the
experimenter manipulated or treatments that the experimenter selected or assigned.
Each independent variable can be:
• Continuous (e.g., age, blood pressure, weight).
• Binary. An independent variable might, for example, code for gender by
defining zero as male and one as female. These codes, of course, are arbi-
trary. When there are only two possible values for a variable, it is called a
dummy variable.
• Categorical, with three or more categories (e.g., four medical school classes
or three different countries). Consult more advanced books if you need to
do this, because it is not straightforward, and it is easy to get confused.
Several dummy variables are needed, as sketched below.
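To make the dummy-variable idea concrete, here is a minimal sketch in Python (not from the book; the pandas library, the country names, and the data values are all illustrative assumptions) showing how a three-category variable is expanded into two dummy variables:

```python
# A minimal sketch (not from the book) of dummy coding a categorical predictor.
# The pandas library, the country names, and the data values are all
# illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 47, 29],
    "country": ["Belgium", "France", "Belgium", "Italy"],
})

# With three categories, two dummy variables suffice. The dropped category
# ("Belgium" here) becomes the reference level, encoded as 0, 0.
dummies = pd.get_dummies(df["country"], prefix="country", drop_first=True)
X = pd.concat([df[["age"]], dummies], axis=1)
print(X)  # columns: age, country_France, country_Italy
```

Each dummy coefficient is then interpreted as the difference from the reference category.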
Parameters
The multiple regression model defines the dependent variable as a function of the
independent variables and a set of parameters, or regression coefficients. Regression
methods find the values of each parameter that make the model predictions come as
close as possible to the data. This approach is analogous to linear regression, which
determines the values of the slope and intercept (the two parameters or regression
coefficients of the model) to make the model predict Y from X as closely as possible.
Methods that simultaneously analyze more than one outcome (Y) variable are called
multivariate methods, and they include factor analysis, cluster analysis, principal
components analysis, and multivariate ANOVA (MANOVA). These methods contrast
with univariate methods, which deal with only a single Y variable.
Note that the terms multivariate and univariate are sometimes used inconsis-
tently. Sometimes multivariate is used to refer to multivariable methods for which
there is one outcome and several independent variables (i.e., multiple and logistic
regression), and sometimes univariate is used to refer to simple regression with
only one independent variable.
Mathematical model
This book is mostly nonmathematical. I avoid explaining the math of how methods work and only provide such explanation when it is necessary to understand the questions the statistical methods answer. But you really can't understand multiple regression at all without understanding the model that is being fit to the data.

Table 37.1. The multiple regression model.
Creatinine clearance =
    β0
    + β1 × log(serum lead)
    + β2 × Age
    + β3 × Body mass index
    + β4 × log(GGT)
    + β5 × Diuretics? (coded 0 or 1)
    + ε (Gaussian random)
The multiple regression model is shown in Table 37.1. The dependent (Y)
variable is creatinine clearance. The model predicts its value from a baseline value
plus the effects of five independent (X) variables, each multiplied by a regression
coefficient, also called a regression parameter.
The X variables were the logarithm of serum lead, age, body mass, logarithm
of γ-glutamyl transpeptidase (a measure of liver function), and previous exposure to
diuretics (coded as zero or 1). This last variable is a dummy variable (or indicator
variable) because those two particular values were chosen arbitrarily to encode two
groups (people who have not taken diuretics and those who have).
The final component of the model, ε, represents random variability (error).
Like ordinary linear regression, multiple regression assumes that the random
scatter (individual variation unrelated to the independent variables) follows a
Gaussian distribution.
The model of Table 37.1 can be written as an equation:

Y = β0 + β1⋅X1 + β2⋅X2 + β3⋅X3 + β4⋅X4 + β5⋅X5 + ε

Table 37.2. Units of the variables used in the multiple regression examples.

The components of this model are:
• The regression coefficients (β1 to β5). These will be fit by the regression
program. One of the goals of regression is to find the best-fit value of
each regression coefficient, along with its CI. Each regression coefficient
represents the average change in Y when you change the corresponding X
value by 1.0. For example, β5 is the average difference in creatinine clearance
between those who have taken diuretics (Xi,5 = 1) and those who have not
(Xi,5 = 0).
• The intercept, β0. This is the predicted average value of Y when all the
X values are zero. In this example, the intercept has only a mathematical
meaning, but no practical meaning because setting all the X values to zero
means that you are looking at someone who is zero years old with zero
weight! β0 is fit by the regression program.
• ε. This is a random variable that is assumed to follow a Gaussian distri-
bution. Predicting Y from all the X variables is not a perfect prediction
because there also is a random component, designated by ε.
Goals of regression
Multiple regression fits the model to the data to find the values for the parameters
that will make the predictions of the model come as close as possible to the actual
data. Like simple linear regression, it does so by finding the values of the param-
eters (regression coefficients) in the model that minimize the sum of the squares
of the discrepancies between the actual and predicted Y values. Like simple linear
regression, multiple regression is a least-squares method.
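As a rough illustration of the least-squares idea, here is a short sketch with simulated data (not the study's data; NumPy, the variable names, and the coefficient values are all my assumptions) that finds the coefficients minimizing the sum of squared discrepancies:

```python
# A minimal sketch, with simulated data (not the study's), of the least-squares
# idea: find the coefficients that minimize the sum of squared discrepancies
# between actual and predicted Y. NumPy is assumed; names and values are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
log_lead = rng.normal(1.5, 0.3, n)                 # hypothetical predictor
age = rng.uniform(20, 75, n)                       # hypothetical predictor
y = 110.0 - 9.0 * log_lead - 0.4 * age + rng.normal(0, 15, n)  # made-up outcome

X = np.column_stack([np.ones(n), log_lead, age])   # column of 1s fits the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares coefficients
residuals = y - X @ beta
print("best-fit coefficients (intercept, log_lead, age):", beta)
print("sum of squared residuals:", np.sum(residuals ** 2))
```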
After taking into account differences in the other variables, an increase in log(lead) of one unit (so that the
lead concentration increases tenfold, since they used common base 10 logarithms)
is associated with a decrease in creatinine clearance of 9.5 ml/min. The 95%
CI ranges from –18.1 to –0.9 ml/min.
Understanding these values requires some context. The study participants’
average creatinine clearance was 99 ml/min. So a tenfold increase in lead con-
centrations is associated with a reduction of creatinine clearance (reduced renal
function) of about 10%, with a 95% CI ranging from about 1% to 20%. Figure 37.1
illustrates this model.
The authors also report the values for all the other parameters in the model. For
example, the β5 coefficient for the X variable “previous diuretic therapy” was –8.8
ml/min. That variable is coded as zero if the patient had never taken diuretics and
1 if the patient had taken diuretics. It is easy to interpret the best-fit value. On av-
erage, after taking into account differences in the other variables, participants who
had taken diuretics previously had a mean creatinine clearance that was 8.8 ml/min
lower than that of participants who had not taken diuretics.
Statistical significance
Multiple regression programs can compute a P value for each parameter in the
model testing the null hypothesis that the true value of that parameter is zero. Why
zero? When a regression coefficient (parameter) equals zero, then the correspond-
ing independent variable has no effect in the model (because the product of the
independent variable times the coefficient always equals zero).
The authors of this paper did not include P values and reported only the
best-fit values with their 95% CIs. We can figure out which ones are less than
0.05 by looking at the CIs.
The CI of β1 runs from a negative number to another negative number and
does not include zero. Therefore, you can be 95% confident that increasing lead
concentration is associated with a drop in creatinine clearance (poorer kidney
function). Since the 95% CI does not include the value defining the null hypoth-
esis (zero), the P value must be less than 0.05. The authors don’t quantify it more
accurately than that, but most regression programs would report the exact P value.
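As an illustration of how a typical program reports these quantities, here is a sketch using simulated data and the statsmodels library (my assumption; the book does not specify any software):

```python
# A sketch of the output a regression program provides: for each coefficient, a
# 95% CI and a P value testing the null hypothesis that the true value is zero.
# The data are simulated and the statsmodels library is an assumption.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
log_lead = rng.normal(1.5, 0.3, n)
age = rng.uniform(20, 75, n)
y = 110.0 - 9.0 * log_lead - 0.4 * age + rng.normal(0, 15, n)

X = sm.add_constant(np.column_stack([log_lead, age]))
fit = sm.OLS(y, X).fit()
print(fit.params)      # best-fit coefficients
print(fit.conf_int())  # 95% CIs; a CI that excludes zero implies P < 0.05
print(fit.pvalues)     # exact P value for each coefficient
```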
R2 versus adjusted R2
R2 is commonly used as a measure of goodness of fit in multiple linear regression,
but it can be misleading. Even if the independent variables are completely unable to
predict the dependent variable, R2 will be greater than zero. The expected value of R2
increases as more independent variables are added to the model. This limits the use-
fulness of R2 as a way to quantify goodness of fit, especially with small sample sizes.
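To see this concretely, here is a short simulated demonstration (the statsmodels library and all data values are my assumptions; the adjusted R2 it also prints is discussed next):

```python
# A simulated demonstration: with predictors that are pure noise, R2 is still
# above zero and creeps upward as more useless predictors are added, while the
# adjusted R2 stays near zero. The statsmodels library is assumed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30                          # small sample, where the problem is worst
y = rng.normal(size=n)

for k in (1, 5, 10):            # number of noise predictors
    X = sm.add_constant(rng.normal(size=(n, k)))
    fit = sm.OLS(y, X).fit()
    print(f"{k:2d} noise predictors: R2 = {fit.rsquared:.2f}, "
          f"adjusted R2 = {fit.rsquared_adj:.2f}")
```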
In addition to reporting R2, which quantifies how well the model fits the
data being analyzed, most programs also report an adjusted R2, which estimates
how well the model is expected to fit new data. This measure accounts for the
number of independent variables and is always smaller than R2. How much
smaller depends on the relative numbers of participants and variables. This study has far more participants (965) than independent variables (5), so the adjusted R2 is only a tiny bit smaller than the unadjusted R2, and the two are equal to two decimal places (0.27).

Figure: actual versus predicted creatinine clearance (R2 = 0.27, P < 0.0001). The authors did not post the raw data, so this graph does not accurately represent the data in the example. Instead, I simulated some data that look very much like what the actual data would have looked like. Each of the 965 points represents one man in the study. The horizontal axis shows the actual creatinine clearance for each participant. The vertical axis shows the creatinine clearance computed by the multiple regression model from that participant's lead level, age, body mass, log γ-glutamyl transpeptidase, and previous exposure to diuretics. The prediction is somewhat useful, because generally people who actually have higher creatinine clearance levels are predicted to have higher levels. However, there is a huge amount of scatter. If the model were perfect, each predicted value would be the same as the actual value, all the points would line up on a 45-degree line, and R2 would equal 1.00. Here the predictions are less accurate, and R2 is only 0.27.
ASSUMPTIONS
Assumption: Sampling from a population
This is a familiar assumption of all statistical analyses. The goal in all forms of
multiple regression is to analyze a sample of data to make more general conclu-
sions (or predictions) about the population from which the data were sampled.
Assumption: Linear effects
The model assumes that each independent variable contributes linearly to the outcome. In this example, an increase of 1.0 in log(lead) means that the lead concentration increases tenfold (since the common, or base 10, logarithm of 10 is 1). The assumption of linear effects means that increasing the logarithm of lead concentration by 2.0 will have twice the impact on creatinine clearance as increasing it by 1.0 and that the decrease in creatinine clearance associated with a certain concentration of lead does not depend on the values of the other variables.
Automatic variable selection
The most straightforward approach is to fit the model using every possible combination of independent variables and then find the one that is the best. With many variables and large data sets, the computer time required for this approach is prohibitive. To conserve computer time when working with huge data sets, other algorithms use a stepwise approach. One approach (called forward-stepwise selection or a step-up procedure) is to start with a very simple model and add new X variables one at a time, always adding the X variable that most improves the model's ability to predict Y. Another approach (backward-stepwise selection or a step-down procedure) is to start with the full model (including all X variables) and then sequentially eliminate those X variables that contribute the least to the model.
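The following is a minimal sketch of forward-stepwise (step-up) selection, not the exact algorithm of any particular program. The simulated data, the variable names, and the use of adjusted R2 as the improvement criterion are all illustrative assumptions:

```python
# A minimal sketch of forward-stepwise (step-up) selection. Candidate variables
# are added one at a time, each time picking the one that most improves the fit
# (judged here by adjusted R2). Data and names are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
data = {name: rng.normal(size=n) for name in ["x1", "x2", "x3", "x4"]}
y = 2.0 * data["x1"] - 1.0 * data["x3"] + rng.normal(size=n)   # made-up truth

chosen, remaining, best_score = [], set(data), -np.inf
while remaining:
    # Score every candidate model that adds one more variable.
    scores = {}
    for name in remaining:
        X = sm.add_constant(np.column_stack([data[v] for v in chosen + [name]]))
        scores[name] = sm.OLS(y, X).fit().rsquared_adj
    best = max(scores, key=scores.get)
    if scores[best] <= best_score:      # no candidate improves the fit; stop
        break
    best_score = scores[best]
    chosen.append(best)
    remaining.remove(best)

print("variables added, in order:", chosen)
```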
Table 37.3. Study designs that violate the assumption of independent observations (adapted from Katz, 2006).

Longitudinal or repeated measures: Multiple observations of the same participant at different times.
Crossover: Each participant first gets one treatment, then another.
Bilateral: Measuring from both knees in a study of arthritis, or both ears in a study of tinnitus, and entering the two measurements into the regression separately.
Cluster: A study pools results from three cities. Patients from the same city are more similar to each other than they are to patients from another city.
Hierarchical: A clinical study of a surgical procedure uses patients from three different medical centers. Within each center, several different surgeons may do the procedure. For each patient, results might be collected at several time points.
To include interaction between age and the logarithm of serum lead concen-
tration, add a new term to the model equation with a new parameter multiplied by
the product of age (X2) times the logarithm of lead (X1):
Y = β0 + β1⋅X1 + β2⋅X2 + β3⋅X3 + β4⋅X4 + β5⋅X5 + β1,2⋅X1⋅X2 + ε
If the CI for the new parameter (β1,2) does not include zero, then you will con-
clude that there is a significant interaction between age and log(lead). This means
that the effects of lead change with age. Equivalently, the effect of age depends on
the lead concentrations.
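As a sketch of how such an interaction term is added in practice, the product of the two variables is simply entered as one more column; the simulated data, the coefficient values, and the statsmodels library are all assumptions for illustration:

```python
# A sketch of adding an interaction term: the product of the two variables is
# entered as an extra column, and its coefficient plays the role of beta_1,2
# above. Data, coefficient values, and statsmodels are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
log_lead = rng.normal(1.5, 0.3, n)
age = rng.uniform(20, 75, n)
# Made-up "truth" in which the effect of lead grows with age.
y = 120.0 - 2.0 * log_lead - 0.3 * age - 0.15 * log_lead * age + rng.normal(0, 10, n)

X = sm.add_constant(np.column_stack([log_lead, age, log_lead * age]))
fit = sm.OLS(y, X).fit()
print(fit.params)      # last coefficient estimates the interaction
print(fit.conf_int())  # if its CI excludes zero, the interaction is significant (P < 0.05)
```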
Correlated observations
One of the assumptions of multiple regression is that each observation is inde-
pendent. In other words, the deviation from the prediction of the model is entirely
random. Table 37.3 (adapted from Katz, 2006) is a partial list of study designs that
violate this assumption and so require specialized analysis methods.
When you fit the data many different ways and then report only the model
that fits the data best, you are likely to come up with conclusions that are not valid.
This is essentially the same problem as choosing which variables to include in the
model, previously discussed.
When a variable such as blood pressure is a candidate for the model, you can enter the measured value itself, or you can convert blood pressure to a binary variable that encodes whether the patient has hypertension (high blood pressure) or not. One problem with the latter approach is that it requires deciding on a somewhat arbitrary definition of whether someone has hypertension. Another problem is that it treats patients with mild and severe hypertension as the same, and so information is lost.
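A small simulated example of that information loss follows; the 140 mmHg cutoff, the data values, and the statsmodels library are all assumptions for illustration:

```python
# A simulated example of the information lost by dichotomizing: the same outcome
# is regressed on blood pressure itself and on an arbitrary hypertension cutoff.
# The 140 mmHg cutoff, the data, and statsmodels are all assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
bp = rng.normal(130, 15, n)                    # systolic blood pressure, mmHg
y = 0.5 * bp + rng.normal(0, 10, n)            # made-up outcome related to bp

continuous = sm.OLS(y, sm.add_constant(bp)).fit()
binary = sm.OLS(y, sm.add_constant((bp >= 140).astype(float))).fit()
print("R2 with blood pressure entered as measured:", round(continuous.rsquared, 2))
print("R2 with blood pressure dichotomized at 140:", round(binary.rsquared, 2))
```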
Q&A
Do you always have to decide which variable is the outcome (dependent variable)
and which variables are the predictors (independent variables) at the time of data
collection?
No. In some cases, the independent and dependent variables may not be distinct
at the time of data collection. The decision is sometimes made only at the time of
data analysis. But beware of these analyses. The more ways you analyze the data,
the more likely you are to be fooled by overfitting and multiple comparisons.
Does it make sense to compare the value of one best-fit parameter with another?
No. The units of each parameter are different, so they can’t be directly compared.
If you want to compare, read about standardized parameters in a more
advanced book. Standardizing rescales the parameters so they become unitless
and can then be compared. A variable with a larger standardized parameter has
a more important impact on the dependent variable.
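For readers who want to see what standardizing looks like in practice, here is a minimal sketch (simulated data, made-up variable names, and the statsmodels library are all assumptions) in which every variable is converted to z-scores before fitting, so the coefficients become unitless:

```python
# A minimal sketch of standardized coefficients: every variable is converted to
# z-scores before fitting, so the coefficients become unitless and comparable.
# Simulated data, made-up variable names, and statsmodels are all assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 400
age = rng.uniform(20, 75, n)                   # years
weight = rng.normal(75, 12, n)                 # kilograms
y = 100.0 - 0.5 * age - 0.2 * weight + rng.normal(0, 8, n)

def zscore(v):
    return (v - v.mean()) / v.std()

X = sm.add_constant(np.column_stack([zscore(age), zscore(weight)]))
fit = sm.OLS(zscore(y), X).fit()
print(fit.params[1:])   # standardized coefficients for age and weight, unitless
```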
CHAPTER SUMMARY
• Multiple variable regression is used when the outcome you measure is
affected by several other variables.
• This approach is used when you want to assess the impact of one variable
after correcting for the influences of others, to predict outcomes from
several variables, or to try to tease apart complicated relationships among
variables.
• Multiple linear regression is used when the outcome (Y) variable is con-
tinuous. Chapter 38 explains methods used when the outcome is binary.
• Beware of the term multivariate, which is used inconsistently.
• Automatic variable selection is appealing, but the results can be misleading.
It is a form of multiple comparisons.