Regression Analysis and Linear Models: Concepts, Applications, and Implementation
Richard B. Darlington
Andrew F. Hayes
Preface
only once, noting that a model with only a single regressor is just a special
case of the more general theory and mathematics of statistical inference in
regression analysis.
We return to the uses and theory of multiple regression in Chapter
5, first by showing that a dichotomous regressor can be used in a model
and that, when used alone, the result is a model equivalent to the inde-
pendent groups t-test with which readers are likely familiar. But unlike
the independent groups t-test, additional variables are easily added to a
regression model when the goal is to compare groups when holding one or
more covariates constant (variables that can be dichotomous or numerical
in any combination). We also discuss the phenomenon of regression to the
mean, how regression analysis handles it, and the advantages of regression
analysis using pretest measurements rather than difference scores when a
variable is measured more than once and interest is in change over time.
Also addressed in this chapter are measures of and inference about partial
association for sets of variables. This topic is particularly important later in
the book, where an understanding of variable sets is critical to understand-
ing how to form inferences about the effect of multicategorical variables on
a dependent variable as well as testing interaction between regressors.
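Chapter 5's equivalence claim is easy to demonstrate. The following minimal sketch (ours, not the book's, written in Python rather than the SPSS or SAS the book uses; the data are simulated) shows that regressing an outcome on a single 0/1 dummy regressor reproduces the pooled-variance independent groups t-test exactly:

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
g0 = rng.normal(50, 10, 40)    # hypothetical control group scores
g1 = rng.normal(55, 10, 40)    # hypothetical treatment group scores

# Classical independent groups t-test (pooled variances)
t_classic, p_classic = stats.ttest_ind(g1, g0)

# The same test as a regression: Y = b0 + b1*D, with D = 0 or 1
y = np.concatenate([g0, g1])
d = np.concatenate([np.zeros(40), np.ones(40)])
fit = sm.OLS(y, sm.add_constant(d)).fit()

print(t_classic, fit.tvalues[1])    # identical t statistics
print(p_classic, fit.pvalues[1])    # identical p-values

Adding covariates to the comparison is then just a matter of appending columns to the design matrix.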
In Chapter 6 we take a step away from the mechanics of regression
analysis to address the general topic of cause and effect. Experimentation
is seen by most researchers as the gold-standard design for research moti-
vated by a desire to establish cause–effect relationships. But fans of experi-
mentation don’t always appreciate the limitations of the randomized exper-
iment or the strengths of statistical control as an alternative. Ultimately,
experimentation and statistical control have their own sets of strengths
and weaknesses. We take the position in this chapter that statistical control
through regression analysis and randomized experimentation complement
each other rather than compete. Although data analysis can only go so far
in establishing cause–effect, statistical control through regression analysis
and the randomized experiment can be used in tandem to strengthen the
claims that one can make about cause–effect from a data analysis. But when
random assignment is not possible or the data are already collected using
a different design, regression analysis gives a means for the researcher to
entertain and rule out at least some explanations for an association that
compete with a cause–effect interpretation.
Emphasis in the first six chapters is on the regression coefficient and
its derivatives. Chapter 7 is dedicated to the use of regression analysis as
a prediction system, where focus is less on the regression coefficients and
more on the multiple correlation R and how accurately a model generates
estimates of the dependent variable in currently available or future data.
Though no doubt this use of regression analysis is less common, an under-
standing of the subtle and sometimes complex issues that come up when
This changes in Chapters 13 and 14, where we discuss interaction, also called
moderation. Chapter 13 introduces the fundamentals by illustrating the flex-
ibility that can be added to a regression model by including a cross-product
of two variables in a model. Doing so allows one variable’s effect—the focal
predictor—to be a linear function of a second variable—the moderator. We
show how this approach can be used with focal predictors and moderators
that are numerical, dichotomous, or multicategorical in any combination.
In Chapter 14 we formalize the linear nature of the relationship between
focal predictor and moderator and how a function can be constructed,
allowing you to estimate one variable’s effect on the dependent variable,
knowing the value of the moderator. We also address the exercise of probing
an interaction and discuss a variety of approaches, including the appealing
but less widely known Johnson–Neyman technique. We end this section by
discussing various complications and myths in the study and analysis of
interactions, including how nonlinearity and interaction can masquerade
as each other, and why a valid test for interaction does not require that
variables be centered before a cross-product term is computed, although
centering may improve the interpretation of the coefficients of the linear
terms in the cross-product.
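To make the cross-product idea concrete, here is a minimal sketch (ours, in Python with simulated data; the book itself works through SPSS, SAS, and the RLM macro) in which the effect of a focal predictor X on Y is a linear function of a moderator W:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)                    # focal predictor
w = rng.normal(size=n)                    # moderator
y = 1 + 0.5*x + 0.3*w + 0.4*x*w + rng.normal(size=n)

# Y = b0 + b1*X + b2*W + b3*XW: the cross-product XW carries the interaction
X = sm.add_constant(np.column_stack([x, w, x * w]))
fit = sm.OLS(y, X).fit()
b0, b1, b2, b3 = fit.params

# Conditional effect of X at a chosen value of W: theta_X = b1 + b3*W
for w_val in (-1.0, 0.0, 1.0):
    print(w_val, b1 + b3 * w_val)

Note that b1 here is X's effect specifically when W = 0, which is why centering W changes the interpretation of b1 but not the test of b3.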
Moderation is easily confused with mediation, the topic of Chapter 15.
Whereas moderation focuses on estimating and understanding the bound-
ary conditions or contingencies of an effect—when an effect exists and
when it is large versus small—mediation addresses the question of how an
effect operates. Using regression analysis, we illustrate how one variable’s
effect in a regression model can be partitioned into direct and indirect com-
ponents. The indirect effect of a variable quantifies the result of a causal
chain of events in which an independent variable is presumed to affect an
intermediate mediator variable, which in turn affects the dependent vari-
able. We describe the regression algebra of path analysis first in a simple
model with only a single mediator before extending it to more complex
models involving more than one mediator. After discussing inference
about direct and indirect effects, we dedicate considerable space to various
controversies and extensions of mediation analysis, including cause–effect,
models with multicategorical independent variables, nonlinear effects, and
combining moderation and mediation analysis.
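A minimal sketch of the single-mediator case (ours; Python with simulated data and hypothetical variable names): the indirect effect of X on Y through M is the product of the X-to-M path a and the M-to-Y path b with X controlled, and the direct effect is c′:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # mediator model: M = a*X + error
y = 0.4 * m + 0.2 * x + rng.normal(size=n)  # outcome model: Y = b*M + c'*X + error

a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                  # path a
outcome = sm.OLS(y, sm.add_constant(np.column_stack([m, x]))).fit()
b, c_prime = outcome.params[1], outcome.params[2]                  # paths b and c'

print("indirect effect a*b:  ", a * b)
print("direct effect c':     ", c_prime)
print("total effect a*b + c':", a * b + c_prime)   # equals c from regressing Y on X alone

Inference about a*b is the subtle part; resampling methods such as bootstrapping are commonly recommended for it.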
Under the topic of “irregularities,” Chapter 16 is dedicated to regres-
sion diagnostics and testing regression assumptions. Some may feel
these important topics are placed later in the sequence of chapters than
they should be, but our decision was deliberate. We feel it is important to
focus on the general concepts, uses, and remarkable flexibility of regres-
sion analysis before worrying about the things that can go wrong. In this
chapter we describe various diagnostic statistics—measures of leverage, dis-
tance, and influence—that analysts can use to find problems in their data
or analysis (such as clerical errors in data entry) and identify cases that
might be causing distortions or other difficulties in the analysis, whether
they take the form of violating assumptions or producing results that are
markedly different than they would be if the case were excluded from the
analysis entirely. We also describe the assumptions of regression analysis
more formally than we have elsewhere and offer some approaches to test-
ing the assumptions, as well as alternative methods one can employ if one
is worried about the effects of assumption violations.
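For concreteness, here is a sketch (ours; Python with simulated data) that computes the three families of diagnostics named above using statsmodels' influence tools. Leverage corresponds to hi in the book's notation, the externally studentized residual to tri, and Cook's distance serves as an influence measure:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)
y[0] += 8                            # plant one aberrant case (e.g., a clerical error)

infl = sm.OLS(y, X).fit().get_influence()
leverage = infl.hat_matrix_diag                   # h_i, leverage
studentized = infl.resid_studentized_external     # tr_i, distance
cooks_d, _ = infl.cooks_distance                  # influence

print(np.argmax(np.abs(studentized)))             # flags the planted case 0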
Chapters 17 and 18 close the book by addressing various additional
complexities and problems not addressed in Chapter 16, as well as numer-
ous extensions of linear regression analysis. Chapter 17 focuses on power
and precision of estimation. Though we do not dedicate space to how to
conduct a power analysis (whole books on this topic exist, as does software
to do the computations), we do dissect the formula for the standard error of
a regression coefficient and describe the factors that influence its size. This
shows the reader how to increase power when necessary. Also in Chapter
17 is the topic of measurement error and the effects it has on power and the
validity of a hypothesis test, as well as a discussion of other miscellaneous
problems such as missing data, collinearity and singularity, and rounding
error. Chapter 18 closes the book with an introduction to logistic regression,
which is the natural next step in one’s learning about linear models. After
this brief introduction to modeling dichotomous dependent variables, we
point the reader to resources where one can learn about other extensions to
the linear model, such as models of ordinal or count dependent variables,
time series and survival analysis, structural equation modeling, and mul-
tilevel modeling.
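For readers who want the formula in advance: one common textbook form of the standard error dissected in Chapter 17, written in the notation of this book's symbol list (the chapter's exact expression may differ), is

SE(bj) = sY.X / ( sXj √( (N − 1) Tolj ) )

where sY.X is the standard error of estimate, sXj the standard deviation of regressor j, and Tolj its tolerance. Each factor suggests a route to more power: reduce residual variation, increase the sample size or the spread of Xj, or reduce collinearity among the regressors.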
Appendices aren’t usually much worth discussing in the precis of a
book such as this, but other than Appendix C, which contains various
obligatory statistical tables, a few of ours are worthy of mention. Although
all the analyses described in this book can be conducted with any regression
program and, in a few cases, perhaps a bit of hand computation, Appendix A
describes and documents the RLM macro for SPSS and SAS, written for this
book and referenced in a few places elsewhere, which makes some of the
analyses considerably easier. RLM is not intended to replace your preferred
program’s regression routine, though it can do many ordinary regression
functions. But RLM has some features not found in off-the-shelf software
that facilitate some of the computations required for estimating and prob-
ing interactions, implementing the Johnson–Neyman technique, domi-
nance analysis, linear spline regression, and the Bonferroni correction to
the largest t-residual for testing regression assumptions, among a few other
things. RLM can be downloaded from this book's web page at www.afhayes.com.
Appendix B is for more advanced readers who are interested in the
matrix algebra behind basic regression computations. Finally, Appendix D
To the Instructor
Instructors will find that our precis above combined with the Contents pro-
vides a thorough overview of the topics we cover in this book. But we high-
light some of its strengths and unique features below:
Acknowledgments
Writing a book is a team effort, and many have contributed in one way
or another to this one, including various reviewers, students, colleagues,
and family members. C. Deborah Laughton, Seymour Weingarten, Judith
Grauman, Katherine Sommer, Jeannie Tang, Martin Coleman, and others
at The Guilford Press have been professional and supportive at various
phases while also cheering us on. They make book writing enjoyable and
worth doing often. Amanda Montoya and Cindy Gunthrie provided edit-
ing and readability advice and offered a reader’s perspective that helped to
improve the book. Todd Little, the editor of Guilford’s Methodology in the
Social Sciences series, was an enthusiastic supporter of this book from the
very beginning. Scott C. Roesch and Chris Oshima reviewed the manu-
script prior to publication and made various suggestions, most of which we
incorporated into the final draft. And our families, and in particular our
wives, Betsy and Carole, deserve much credit for their support and also
tolerating the divided attention that often comes with writing a book of any
kind, but especially one of this size and scope.
RICHARD B. DARLINGTON
Ithaca, New York
ANDREW F. HAYES
Columbus, Ohio
List of Symbols and Abbreviations
Symbol Meaning
b0           regression constant
bj           partial regression coefficient for regressor j
b̃j           standardized partial regression coefficient for regressor j
B            number of hypothesis tests conducted
cj           contrast coefficient for group j
Cov          covariance
D1, D2, ...  codes used in the representation of a multicategorical regressor
DB(bj)       dfbeta for regressor j
df           degrees of freedom
E            expected value
e            residual
ei           residual for case i
dei          case i's residual when it is excluded from the model
F            F-ratio used in hypothesis testing
g            number of groups
hi           leverage for case i
J1, J2, ...  artificial variables created in spline regression
k            number of regressors
LL           log likelihood
ln           natural logarithm
MD           Mahalanobis distance
MS           mean square
N            sample size
nj           sample size of group j
p            observed significance or p-value
PEi          probability of an event for case i
PR           partial multiple correlation
PR(B.A)      partial correlation for set B controlling for set A
prj          partial correlation for regressor j
R            multiple correlation
R(A)         R with regressors in set A
R(AB)        R with regressors in sets A and B
RS           shrunken R
rXY          Pearson correlation coefficient
relj         reliability of regressor j
sX           standard deviation of X
sY           standard deviation of Y
sY.X         standard error of estimate
SE           standard error
SR           semipartial correlation for a set
SR(B.A)      semipartial correlation for set B controlling for set A
srj          semipartial correlation for regressor j
stri         standardized residual for case i
SS           sum of squares
T            as a prefix, the true or population value of the quantity
t            t statistic used in hypothesis testing
tri          studentized residual for case i
tj           t statistic for regressor j
Tolj         tolerance for regressor j
Var          variance
Var(Y.X)     variance of the residuals
VIFj         variance inflation factor for regressor j
X            a regressor
X̄            mean of X
Xj           regressor j
X1.2         portion of X1 independent of X2
x            deviation from the mean of X
Y            usually the dependent variable
Ȳ            mean of Y
y            deviation from the mean of Y
Y.1          portion of Y independent of X1
Zf           Fisher's Z
ZX           standardized value of X
ZY           standardized value of Y
Ŷ            estimate or fitted value of Y from a model
α            chosen significance level for a hypothesis test
αFW          familywise Type I error rate
ΔR2          change in R2
ˆ            estimated value
Π            multiplication
Σ            summation
θX           conditional effect of X
.            "controlling for"; for example, rXY.C is rXY controlling for C
Contents
3.1.3 Models / 47
3.1.4 Representing a Model Geometrically / 49
3.1.5 Model Errors / 50
3.1.6 An Alternative View of the Model / 52
3.2 The Best-Fitting Model / 55
3.2.1 Model Estimation with Computer Software / 55
3.2.2 Partial Regression Coefficients / 58
3.2.3 The Regression Constant / 63
3.2.4 Problems with Three or More Regressors / 64
3.2.5 The Multiple Correlation R / 68
3.3 Scale-Free Measures of Partial Association / 70
3.3.1 Semipartial Correlation / 70
3.3.2 Partial Correlation / 71
3.3.3 The Standardized Regression Coefficient / 73
3.4 Some Relations among Statistics / 75
3.4.1 Relations among Simple, Multiple, Partial, and Semipartial Correlations / 75
3.4.2 Venn Diagrams / 78
3.4.3 Partial Relationships and Simple Relationships May Have Different Signs / 80
3.4.4 How Covariates Affect Regression Coefficients / 81
3.4.5 Formulas for bj, prj, srj, and R / 82
3.5 Chapter Summary / 83
Appendices
References 627
Data files for the examples used in the book and files
containing the SPSS and SAS versions of RLM are available
on the companion web page at www.afhayes.com.
1
Statistical Control and Linear Models
who do not take the course. If that thing they differ on is related to test
performance, then any differences in test performance may be due to that
thing rather than the training course itself. This needs to be accounted for
or “controlled” in some fashion in order to determine whether the course
helps students pass the test. Or perhaps in a particular town, some testers
may be easier than others. The driving schools may know which testers
are easiest and encourage their students to take their tests when they know
those testers are on duty. So the standards being used to evaluate a student
driver during the test may be systematically different for students who take
the driver training course relative to those who do not. This also needs to
be controlled in some fashion.
You might control the problem caused by preexisting differences between
those who do and do not take the course by using a list of applicants for
driving courses, randomly choosing which of the applicants is allowed to
take the course, and using the rejected applicants as the control group. That
way you know that students are likely to be equal on all things that might be
related to performance on the test before the course begins. This is random
assignment on the independent variable. Or, if you find that more women take
the course than men, you might construct a sample that is half female and
half male for both the trained and untrained groups by discarding some of
the women in the available data. This is control by exclusion of cases.
You might control the problem of differential testing standards by train-
ing testers to make them apply uniform evaluation standards; that would
be manipulation of covariates. Or you might control that problem by ran-
domly altering the schedule different testers work, so that nobody would
know which testers are on duty at a particular moment. That would not
be random assignment on the independent variable, since you have not
determined which applicants take the course; rather, it would be other types
of randomization. This includes randomly assigning which of two or more
forms of the dependent variable you use, choosing stimuli from a pop-
ulation of stimuli (e.g., in a psycholinguistics study, all common English
adjectives), and manipulating the order of presentation of stimuli.
All these methods except exclusion of cases are types of experimental
control since they all require you to manipulate the situation in some way
rather than merely observe it. But these methods are often impractical
or impossible. For instance, you might not be allowed to decide which
students take the driving course or to train testers or alter their schedules.
Or, if a covariate is worker seniority, as in one of our earlier examples,
you cannot manipulate the covariate by telling workers how long to keep
their jobs. In the same example, the independent variable is sex, and you
cannot randomly decide that a particular worker will be male or female
the way you can decide whether the worker will be in the experimental
or control condition of an experiment. Even when experimental control is
possible, the very exertion of control often intrudes the investigator into
the situation in a way that disturbs participants or alters results; ethologists
and anthropologists are especially sensitive to such issues. Experimental
control may be difficult even in laboratory studies on animals. Researchers
may not be able to control how long a rat looks at a stimulus, but they are
able to measure looking time.
Control by exclusion of cases avoids these difficulties, because you are
manipulating data rather than participants. But this method lowers sample
size, and thus lowers the precision of estimates and the power of hypothesis
tests.
A fifth method of controlling covariates—statistical control—is one of
the main topics of this book. It avoids the disadvantages of the previous
four methods. No manipulation of participants or conditions is required,
and no data are excluded. Several terms mean the same thing: to control a
covariate statistically means the same as to adjust for it or to correct for it, or
to hold constant or to partial out the covariate.
Statistical control has limitations. Scientists may disagree on what vari-
ables need to be controlled—an investigator who has controlled age, in-
come, and ethnicity may be criticized for failing to control education and
family size. And because covariates must be measured to be controlled,
they will be controlled inaccurately if they are measured inaccurately. We
return to these and other problems in Chapters 6 and 17. But because con-
trol of some covariates is almost always needed, and because the other four
methods of control are so limited, statistical control is widely recognized
as one of the most important statistical tools in the empiricist’s toolbox.
TABLE 1.1. Test Scores, Socioeconomic Status, and Preschool Attendance in Holly City

Raw frequencies
               Middle class        Working class         Total
               A     B   Total     A     B   Total     A     B   Total
Preschool     30    10     40     30    60     90     60    70    130
Other         60    30     90     10    30     40     70    60    130

TABLE 1.2. Percentage of Each Group Scoring above the Test Median in Holly City

               Middle class   Working class   Total
Preschool           75              33          46
Other               67              25          54
These overall results appear in the "Total" section of Table 1.1; A and B
refer to scoring above and below the test median, respectively.
But when the children are divided into “middle-class” and “working-
class,” the results are as shown on the left and center of Table 1.1. We see
that of the 40 middle-class children attending preschool, 30, or 75%, scored
above the median. There were 90 middle-class children not attending
preschool, and 60, or 67%, of them scored above the median. These values
of 75 and 67% are shown on the left in Table 1.2. Similar calculations
based on the working-class and total tables yield the other figures in Table
1.2. This table shows clearly that within each level of socioeconomic status
(SES), the preschool children outperform the other children, even though
they appear to do worse when you ignore SES. We
have held constant or controlled or partialed out the covariate of SES.
When we perform a similar analysis for nearby Ivy City, we find the
results in Table 1.3. When we inspect the total percentages, preschool
appears to have a positive effect. But when we look within each SES
group, no effect is found. Thus, the “total” tables overstate the effect of
6 Regression Analysis and Linear Models
preschool in Ivy City and understate it in Holly City. In these examples the
independent variable is preschool attendance and the dependent variable is
test score. In Holly City, we found a negative simple relationship between
these two variables (those attending preschool scored lower on the test) but
a positive partial relationship (a term more formally defined later) when SES
was controlled. In Ivy City, we found a positive simple relationship but no
partial relationship.
By examining the data more carefully, we can see what caused these
paradoxical results, known as Simpson’s paradox (for a discussion of this and
related phenomena, see Tu, Gunnell, & Gilthorpe, 2008). In Holly City, the
130 children attending preschool included 90 working-class children and
40 middle-class children, so 69% of the preschool attenders were working-
class. But the 130 nonpreschool children included 90 middle-class children
and 40 working-class children, so this group was only 31% working-class.
Thus, the test scores of the preschool group were lowered by the dispropor-
tionate number of working-class children in that group. This might have
occurred if city-subsidized preschool programs had been established pri-
marily in poorer neighborhoods. But in Ivy City this difference was in the
opposite direction: The preschool group was 75% middle-class, while the
nonpreschool group was only 25% middle-class; thus, the test scores of the
preschool group were raised by the disproportionate number of middle-
class children. This might have occurred if parents had to pay for their
children to attend preschool. In both cities the effects of preschool were
seen more clearly by controlling for or holding constant SES.
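The arithmetic behind these statements is easy to verify. A quick check (ours, in Python) of the Holly City figures, using the raw frequencies from Table 1.1:

counts = {                          # (above median, below median)
    ("preschool", "middle"):  (30, 10),
    ("preschool", "working"): (30, 60),
    ("other", "middle"):      (60, 30),
    ("other", "working"):     (10, 30),
}

def pct_above(a, b):
    return round(100 * a / (a + b))

for group in ("preschool", "other"):
    mid = counts[(group, "middle")]
    wrk = counts[(group, "working")]
    tot = (mid[0] + wrk[0], mid[1] + wrk[1])
    print(group, pct_above(*mid), pct_above(*wrk), pct_above(*tot))

# preschool: 75 33 46
# other:     67 25 54
# Within each SES group preschool is ahead, yet the totals reverse.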
All three variables in this example were dichotomous—they had just
two levels each. The independent variable of preschool attendance had
two levels we called “preschool” and “other.” The dependent variable of
test score was dichotomized into those above and below the median. The
covariate of SES was also dichotomized. Such dichotomization is rarely
if ever something you would want to do in practice (as discussed later in
section 5.1.6). Fortunately, with the methods described in this book, such
categorization is not necessary. Any or all of the variables in this problem
could have been numerically scaled. Test scores might have ranged from 0
to 100, and SES might have been measured on a scale with very many points
on a continuum. Even preschool attendance might have been numerical,
such as if we measured the exact number of days each child had attended
preschool. Changing some or all variables from dichotomous to numerical
would change the details of the analysis, but in its underlying logic the
problem would remain the same.
TABLE 1.3. Percentage of Each Group Scoring above the Test Median in Ivy City

               Middle class   Working class   Total
Preschool           75              25          62
Other               75              25          38
[Table fragment: cases cross-classified by Ph.D. completed (Yes / No / Total); the remainder of this table was not preserved.]
educational level. Again, the partial relationship differs from the simple re-
lationship, though this time both the simple and partial relationships have
the same sign, meaning that men make more than women, with or without
controlling for education.
Y = b0 + b1 X1 + b2 X2 + · · · + bk Xk + e (1.1)
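As a minimal sketch of what estimating equation 1.1 involves (ours; Python with simulated data and k = 3 regressors, since the surrounding text is program-neutral), ordinary least squares chooses the b's to minimize the sum of squared residuals:

import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 3
X = rng.normal(size=(n, k))
b_true = np.array([2.0, 0.5, -1.0, 0.25])        # b0, b1, b2, b3
y = b_true[0] + X @ b_true[1:] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])        # prepend the constant
b_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
e = y - design @ b_hat                           # the residuals in equation 1.1
print(b_hat)                                     # close to [2.0, 0.5, -1.0, 0.25]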
2. You may choose to conduct a series of analyses from the same rect-
angular data matrix, and the same variable might be a dependent
variable in one analysis and an independent variable or covariate in
another. For instance, if the matrix includes the variables age, sex,
years of education, and salary, one analysis may examine years of ed-
ucation as a function of age and sex, while another analysis examines
salary as a function of age, sex, and education.

4. Each analysis must have just one dependent variable, though it may
have several independent variables and several covariates.
There are many statistical methods that are just linear models in disguise, or
closely related to linear regression analysis. For example, ANOVA, which
you may already be familiar with, can be thought of as a particular subset
of linear models designed early in the 20th century, well before comput-
ers were around. Mostly this meant using only categorical independent
variables, no covariates, and equal cell frequencies if there were two or
more independent variables. When a problem does meet the narrow re-
quirements of ANOVA, linear models and analysis of variance give the
same answers. Thus, ANOVA is just a special subset of the linear model
method. As shown in various locations throughout this book, ANOVA,
t-tests on differences between means, tests on Pearson correlations—things
you likely have already been exposed to—can all be thought of as special
simple cases of the general linear model, and can all be executed with a
program that can estimate a linear model.
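For instance, the following sketch (ours; Python with simulated data) shows the F-ratio from a one-way ANOVA on three groups matching the F from a regression of Y on g − 1 = 2 dummy codes:

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
groups = [rng.normal(mu, 1, 30) for mu in (0.0, 0.3, 0.8)]

f_anova, p_anova = stats.f_oneway(*groups)

y = np.concatenate(groups)
d1 = np.repeat([1, 0, 0], 30)       # dummy codes D1, D2 for three groups
d2 = np.repeat([0, 1, 0], 30)
fit = sm.OLS(y, sm.add_constant(np.column_stack([d1, d2]))).fit()

print(f_anova, fit.fvalue)          # identical F ratios
print(p_anova, fit.f_pvalue)        # identical p-values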
Logistic regression, probit regression, and multilevel modeling are close
relatives of linear regression analysis. In logistic and probit regression, the
dependent variable can be dichotomous or ordinal, such as whether a
person succeeds or fails at a task, acts or does not act in a particular way
in some situation, or dislikes, feels neutral, or likes a stimulus. Multilevel
modeling is used when the data exhibit a “nested” structure, such as when
different subsets of the participants in a study share something such as the
neighborhood or housing development they live in or the building in a city
they work in. But you cannot fruitfully study these methods until you have
mastered linear models, since a great many concepts used in these methods
are introduced in connection with linear models.
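As a small taste of the logistic regression introduced in Chapter 18, here is a minimal sketch (ours; Python with simulated data) modeling a dichotomous dependent variable on the log-odds scale:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))    # true event probabilities
y = rng.binomial(1, p)                      # observed 0/1 outcomes

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params)                           # close to [-0.5, 1.2], in log-odds units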