An Introduction to Regression Analysis
Alan O. Sykes
Coase-Sandor Institute for Law & Economics Working Paper No. 20 (1993)
The Inaugural Coase Lecture
* Professor of Law, University of Chicago, The Law School. I thank Donna Cote for helpful research assistance.
See, e.g., Bazemore v. Friday, 478 U.S. 385 (1986).
See, e.g., McCleskey v. Kemp, 481 U.S. 279 (1987).
See, e.g., Cotton Brothers Baking Co. v. Industrial Risk Insurers, 941 F.2d 380 (5th Cir. 1991).
See, e.g., Thornburg v. Gingles, 478 U.S. 30 (1986).
See, e.g., Spray-Rite Service Corp. v. Monsanto Co., 684 F.2d 1226 (7th Cir. 1982).
and how they may go awry when key assumptions do not hold. To
make the discussion concrete, I will employ a series of illustrations
involving a hypothetical analysis of the factors that determine indi-
vidual earnings in the labor market. The illustrations will have a
legal flavor in the latter part of the lecture, where they will
incorporate the possibility that earnings are impermissibly influenced
by gender in violation of the federal civil rights laws. I wish to
emphasize that this lecture is not a comprehensive treatment of the
statistical issues that arise in Title VII litigation, and that the
discussion of gender discrimination is simply a vehicle for expositing
certain aspects of regression technique. Also, of necessity, there are
many important topics that I omit, including simultaneous equation
models and generalized least squares. The lecture is limited to the
assumptions, mechanics, and common difficulties with single-
equation, ordinary least squares regression.
What is Regression?
For purposes of illustration, suppose that we wish to identify and
quantify the factors that determine earnings in the labor market. A
moment’s reflection suggests a myriad of factors that are associated
with variations in earnings across individuals—occupation, age, ex-
perience, educational attainment, motivation, and innate ability
come to mind, perhaps along with factors such as race and gender
that can be of particular concern to lawyers. For the time being, let
us restrict attention to a single factor—call it education. Regression
analysis with a single explanatory variable is termed “simple regres-
sion.”
See 42 U.S.C. § 2000e-2, as amended.
Readers with a particular interest in the use of regression analysis under Title VII may wish to consult the following references: Campbell, “Regression Analysis in Title VII Cases—Minimum Standards, Comparable Worth, and Other Issues Where Law and Statistics Meet,” 36 Stan. L. Rev. 1299 (1984); Connolly, “The Use of Multiple Regression Analysis in Employment Discrimination Cases,” Population Res. and Pol. Rev.; Finkelstein, “The Judicial Reception of Multiple Regression Studies in Race and Sex Discrimination Cases,” 80 Colum. L. Rev. 737 (1980); and Fisher, “Multiple Regression in Legal Proceedings,” 80 Colum. L. Rev. 702 (1980).
Simple Regression
In reality, any effort to quantify the effects of education upon
earnings without careful attention to the other factors that affect
earnings could create serious statistical difficulties (termed “omitted
variables bias”), which I will discuss later. But for now let us assume
away this problem. We also assume, again quite unrealistically, that
“education” can be measured by a single attribute—years of school-
ing. We thus suppress the fact that a given number of years in school
may represent widely varying academic programs.
At the outset of any regression study, one formulates some hy-
pothesis about the relationship between the variables of interest,
here, education and earnings. Common experience suggests that
better educated people tend to make more money. It further suggests
that the causal relation likely runs from education to earnings rather
than the other way around. Thus, the tentative hypothesis is that
higher levels of education cause higher levels of earnings, other
things being equal.
To investigate this hypothesis, imagine that we gather data on
education and earnings for various individuals. Let E denote educa-
tion in years of schooling for each individual, and let I denote that
individual’s earnings in dollars per year. We can plot this informa-
tion for all of the individuals in the sample using a two-dimensional
diagram, conventionally termed a “scatter” diagram. Each point in
the diagram represents an individual in the sample.
[Figure: scatter diagram of earnings (I) against education (E); each point represents one individual in the sample.]
More accurately, what one can infer from the diagram is that if knowledge of
E suffices to predict I perfectly, then the relationship between them is a complex,
nonlinear one. Because we have no reason to suspect that the true relationship
between education and earnings is of that form, we are more likely to conclude
that knowledge of E is not sufficient to predict I perfectly.
The alternative possibility—that the relationship between two variables is unstable—is termed the problem of “random” or “time varying” coefficients and raises somewhat different statistical problems. See, e.g., H. Theil, Principles of Econometrics (1971); G. Chow, Econometrics (1983).
When nonlinear relationships are thought to be present, investigators typically seek to model them in a manner that permits them to be transformed into linear relationships. For example, the relationship y = cx^α can be transformed into the linear relationship log y = log c + α·log x. The reason for modeling nonlinear relationships in this fashion is that the estimation of linear regressions is much simpler and their statistical properties are better known. Where this approach is infeasible, however, techniques for the estimation of nonlinear regressions have been developed. See, e.g., G. Chow, supra.
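To make the footnote's log transformation concrete, here is a minimal sketch in Python (the parameter values and noise level are invented for illustration; only the technique comes from the text): fit a straight line to the logged data, recovering α as the slope and c from the intercept.

    import numpy as np

    rng = np.random.default_rng(0)
    c_true, alpha_true = 2.0, 0.7                  # invented "true" parameters
    x = rng.uniform(1.0, 10.0, size=200)
    y = c_true * x**alpha_true * np.exp(rng.normal(0.0, 0.1, size=200))

    # Fit log y = log c + alpha*log x by ordinary least squares.
    alpha_hat, log_c_hat = np.polyfit(np.log(x), np.log(y), deg=1)
    print(alpha_hat, np.exp(log_c_hat))            # close to 0.7 and 2.0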
upon the information contained in the data set and, as shall be seen,
upon some assumptions about the characteristics of ε.
To understand how the parameter estimates are generated, note
that if we ignore the noise term ε, the equation above for the rela-
tionship between I and E is the equation for a line—a line with an
“intercept” of α on the vertical axis and a “slope” of β. Returning to
the scatter diagram, the hypothesized relationship thus implies that
somewhere on the diagram may be found a line with the equation I
= α + βE. The task of estimating α and β is equivalent to the task of
estimating where this line is located.
What is the best estimate regarding the location of this line? The
answer depends in part upon what we think about the nature of the
noise term ε. If we believed that ε was usually a large negative num-
ber, for example, we would want to pick a line lying above most or
all of our data points—the logic is that if ε is negative, the true value
of I (which we observe), given by I = α + βE + ε, will be less than the
value of I on the line I = α + βE. Likewise, if we believed that ε was
systematically positive, a line lying below the majority of data points
would be appropriate. Regression analysis assumes, however, that
the noise term has no such systematic property, but is on average
equal to zero—I will make the assumptions about the noise term
more precise in a moment. The assumption that the noise term is
usually zero suggests an estimate of the line that lies roughly in the
midst of the data, some observations below and some observations
above.
But there are many such lines, and it remains to pick one line in
particular. Regression analysis does so by embracing a criterion that
relates to the estimated noise term or “error” for each observation. To
be precise, define the “estimated error” for each observation as the
vertical distance between the value of I along the estimated line I = α
+ βE (generated by plugging the actual value of E into this equation)
and the true value of I for the same observation. Superimposing a
candidate line on the scatter diagram, the estimated errors for each
observation may be seen as follows:
[Figure: scatter diagram with a candidate line I = α + βE superimposed; the vertical distance from each point to the line is that observation's estimated error.]
With each possible line that might be superimposed upon the data, a
different set of estimated errors will result. Regression analysis then
chooses among all possible lines by selecting the one for which the
sum of the squares of the estimated errors is at a minimum. This is
termed the minimum sum of squared errors (minimum SSE) criterion. The intercept of the line chosen by this criterion provides the
estimate of α, and its slope provides the estimate of β.
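Computationally, the criterion is nothing exotic. A minimal sketch (with invented data) of the quantity that regression minimizes:

    import numpy as np

    # Hypothetical observations: years of schooling and annual earnings.
    E = np.array([8, 10, 12, 12, 14, 16, 16, 18])
    I = np.array([18, 24, 28, 25, 30, 35, 33, 40]) * 1000.0

    def sse(alpha, beta):
        # Sum of squared estimated errors for the candidate line I = alpha + beta*E.
        estimated_errors = I - (alpha + beta * E)
        return np.sum(estimated_errors ** 2)

    print(sse(5000.0, 1500.0))   # one candidate line
    print(sse(2000.0, 2000.0))   # another; the minimum SSE criterion picks
                                 # the line for which this sum is smallest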
It is hardly obvious why we should choose our line using the
minimum SSE criterion. We can readily imagine other criteria that
might be utilized (minimizing the sum of errors in absolute value,
for example). One virtue of the SSE criterion is that it is very easy to
employ computationally. When one expresses the sum of squared
errors mathematically and employs calculus techniques to ascertain
the values of α and β that minimize it, one obtains expressions for α
and β that are easy to evaluate with a computer using only the observed values of the variables.
It should be obvious why simply minimizing the sum of errors is not an at-
tractive criterion—large negative errors and large positive errors would cancel out,
so that this sum could be at a minimum even though the line selected fitted the
data very poorly.
The derivation is so simple in the case of one explanatory variable that it is worth including here. Continuing with the example in the text, we imagine that we have data on education and earnings for a number of individuals, indexed by j. The actual value of earnings for the jth individual is Ij, and its estimated value for any line with intercept α and slope β will be α + βEj. The estimated error is thus Ij – α – βEj. The sum of squared errors is then ∑j(Ij – α – βEj)². Minimizing this sum with respect to α requires that its derivative with respect to α be set to zero, or –2∑j(Ij – α – βEj) = 0. Minimizing with respect to β likewise requires –2∑jEj(Ij – α – βEj) = 0. We now have two equations in two unknowns that can be solved for α and β.
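Solving these two equations—the step the footnote leaves to the reader—yields the familiar closed-form estimates, where Ē and Ī denote the sample means of Ej and Ij:

    β = ∑j(Ej – Ē)(Ij – Ī) / ∑j(Ej – Ē)²    and    α = Ī – βĒ.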
The derivation may be found in any standard econometrics text. See, e.g., E. Hanushek and J. Jackson, Statistical Methods for Social Scientists (1977); J. Johnston, Econometric Methods.
An accessible and more extensive discussion of the key assumptions of regression may be found in Fisher, supra.
“Variance” is a measure of the dispersion of the probability distribution of a
random variable. Consider two random variables with the same mean (same aver-
age value). If one of them has a distribution with greater variance, then, roughly
speaking, the probability that the variable will take on a value far from the mean is
greater.
Lower variance by itself is not necessarily an attractive property for an estimator. For example, we could employ an estimator for β of the form “β = c” for some arbitrary constant c, chosen irrespective of the information in the data set. This estimator has zero variance.
See, e.g., P. Kennedy, A Guide to Econometrics.
If the expected value of the noise term is always zero irrespective of the val-
ues of the explanatory variables for the observation with which the noise term is
associated, then by definition the noise term cannot be correlated with any ex-
planatory variable.
E.g., id.; J. Johnston, supra.
See, e.g., id.
See, e.g., Miller v. Kansas Electric Power Cooperative, Inc. (D. Kan.) (available on Westlaw).
schooling and aptitude variables, for reasons that will become clear
later. I then employed a random number generator to produce a
noise term drawn from a normal distribution with a mean of zero and a fixed standard deviation (the square root of the variance). This standard deviation was chosen more or less arbitrarily to
introduce a considerable but not overwhelming amount of noise in
proportion to the total variation in earnings. The right-hand-side
variables were then used to generate the “actual value” of earnings for
each of the fifty “individuals.”
The effect of gender on earnings in this hypothetical firm enters
through the variable Gendum. Gendum is a “dummy” variable in
econometric parlance because its numerical value is arbitrary, and it
simply captures some nonnumerical attribute of the sample popula-
tion. By construction here, men and women both earn the same re-
turns to education, experience, and aptitude, but holding these fac-
tors constant the earnings of women are lower by a fixed dollar amount. In effect, the
constant term (baseline earnings) is lower for women, but otherwise
women are treated equally. In reality, of course, gender discrimina-
tion could arise in other ways (such as lower returns to education and
experience for women, for example), and I assume that it takes this
form only for purposes of illustration.
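A sketch of this data-generating process in Python may be helpful. Every numeric value below—the coefficients, the size of the gender penalty, the noise scale—is a stand-in of my own, since the lecture's actual figures are not reproduced in this text; only the structure of the simulation follows the description above.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 50

    aptitude = rng.normal(100.0, 15.0, size=n)     # aptitude score
    gendum = rng.integers(0, 2, size=n)            # dummy: 1 = female, 0 = male
    # Schooling is made positively correlated with aptitude and with being
    # male, mirroring the fictitious data set described in the text.
    school = np.clip(np.round(12 + 0.10 * (aptitude - 100.0)
                              - gendum + rng.normal(0.0, 1.5, size=n)), 8, 20)
    experience = rng.integers(0, 31, size=n)       # years of experience

    # Stand-in "true" parameters: constant term, returns to schooling,
    # aptitude, and experience, and the fixed gender penalty.
    const, b_school, b_apt, b_exp, b_gen = 5000.0, 1000.0, 50.0, 300.0, -2000.0

    noise = rng.normal(0.0, 3000.0, size=n)        # mean zero, common variance, independent
    earnings = (const + b_school * school + b_apt * aptitude
                + b_exp * experience + b_gen * gendum + noise)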
Note that the random number generator that I employed here
generates noise terms with an expected value of zero, each drawn
from a distribution with the same variance. Further, the noise terms
for the various observations are statistically independent (the realized
value of the noise term for each observation has no influence on the
noise term drawn for any other observation). Hence, the noise terms
satisfy the assumptions necessary to ensure that the minimum SSE
criterion yields unbiased, consistent, and efficient estimates. The ex-
pected value of the estimate for each parameter is equal to the true
value, therefore, and no other linear estimator will do a better job at
recovering the true parameters than the minimum SSE criterion. It
is nevertheless interesting to see just how well regression analysis
performs. I used a standard computer package to estimate the con-
stant term and the coefficients of the four independent variables
from the “observed” values of Earnings, School, Aptitude,
Experience, and Gendum for each of the fifty hypothetical individu-
als. The results are reproduced in table 1, under the column labeled “Estimated Value.” (We will discuss the last three columns and the
R² statistic in the next section.)
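For readers who want to reproduce this kind of exercise, a sketch of the estimation step using Python's statsmodels package (my choice of tool, not the lecture's; it continues the simulation sketched above):

    import statsmodels.api as sm

    X = sm.add_constant(np.column_stack([school, aptitude, experience, gendum]))
    results = sm.OLS(earnings, X).fit()
    print(results.params)      # estimated constant term and four coefficients
    print(results.summary())   # standard errors, t-statistics, p-values, R-squared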
Note that all of the estimated parameters have the right sign.
Just by chance, it turns out that the regression overestimates the
returns to schooling and underestimates the other parameters. The
estimated coefficient for Aptitude is off by a great deal in proportion
to its true value, and in a later section I will offer an hypothesis as to
what the problem is. The other parameter estimates, though obvi-
ously different from the true value of the underlying parameter, are
much closer to the mark. With particular reference to the coefficient
of Gendum, the regression results correctly suggest the presence of
gender discrimination, though its magnitude is underestimated (remember that an overestimate of the same magni-
tude was just as likely ex ante, that is, before the actual values for the
noise terms were generated).
The source of the error in the coefficient estimates, of course, is
the presence of noise. If the noise term were equal to zero for every
observation, the true values of the underlying parameters could be
recovered in this illustration with perfect accuracy from the data for
only five hypothetical individuals—it would be a simple matter of
solving five equations in five unknowns. And, if noise is the source
of error in the parameter estimates, intuition suggests that the
magnitude of the noise will affect the accuracy of the regression
estimates, with more noise leading to less accuracy on average. We
will make this intuition precise in the next section, but before
See, e.g., E. Hanushek and J. Jackson, supra; J. Johnston, supra. The supposition that the noise terms are normally distributed is often intuitively plausible and may be loosely justified by appeal to “central limit theorems,” which hold that the average of a large number of random variables tends toward a normal distribution even if the individual random variables that enter into the average are not normally distributed. See, e.g., R. Hogg and A. Craig, Introduction to Mathematical Statistics (4th ed. 1978); W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1 (3d ed. 1968). Thus, if we think of the noise term as the sum of a large number of independent, small disturbances, theory affords considerable basis for the supposition that its distribution is approximately normal.
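That appeal to the central limit theorem is easy to illustrate numerically (a sketch of my own, not the lecture's): averages of many draws from even a markedly skewed distribution cluster symmetrically around the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    # 10,000 averages, each of 100 draws from a skewed (exponential) distribution.
    averages = rng.exponential(scale=1.0, size=(10_000, 100)).mean(axis=1)
    print(averages.mean(), averages.std())   # near 1.0 and 0.1 (= 1/sqrt(100))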
See sources cited supra.
I limit the discussion here to hypothesis testing regarding the value of a particular parameter. In fact, other sorts of hypotheses may readily be tested, such as the hypothesis that all parameters in the model are zero, the hypothesis that some subset of the parameters are zero, and so on.
See sources cited supra.
If the regression package does not report these probabilities, they can readily
be found elsewhere. It has become common practice to include in statistics and
econometrics books tables of probabilities for t-distributions with varying degrees
of freedom. Knowing the degrees of freedom associated with a t-statistic, there-
fore, one can consult such a table to ascertain the probability of obtaining a t-
statistic as far from zero or farther as the one generated by the regression (the con-
cept “far from zero” again defined by either a one- or two-tailed test). As a point
of reference, when the degrees of freedom are large, the .05 significance level for a two-tailed test requires a t-statistic approximately equal to 2.
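Where no printed table is at hand, the same probabilities are simple to compute directly; a sketch using Python's scipy (the t-statistic and degrees of freedom below are arbitrary examples):

    from scipy import stats

    t_stat, df = 2.0, 45
    p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)   # prob. of a t as far from zero or farther
    p_one_tailed = p_two_tailed / 2                  # a one-tailed test halves it
    print(p_two_tailed, p_one_tailed)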
The result in this illustration is general—for any t-statistic, the probability
of rejecting the null hypothesis erroneously under a one-tailed test will be exactly
half that probability under a two-tailed test.
See, e.g., E. Hanushek and J. Jackson, supra.
and will simply illustrate two of the many complications that may
arise, chosen because they are both common and quite important.
Omitted Variables
As noted, the omission from a regression of some variables that
affect the dependent variable may cause an “omitted variables bias.”
The problem arises because any omitted variable becomes part of the
noise term, and the result may be a violation of the assumption
necessary for the minimum SSE criterion to be an unbiased estima-
tor.
Recall that assumption—that each noise term is drawn from a
distribution with a mean of zero. We noted that this assumption
logically implies the absence of correlation between the explanatory
variables included in the regression and the expected value of the
noise term (because whatever the value of any explanatory variable,
the expected value of the noise term is always zero). Thus, suppose
we start with a properly specified model in which the noise term for
every observation has an expected value of zero. Now, omit one of
the independent variables. If the effect of this variable upon the de-
pendent variable is not zero for each observation, the new noise
terms now come from distributions with nonzero means. One con-
sequence is that the estimate of the constant term will be biased
(part of the estimated value for the constant term is actually the
mean effect of the omitted variable). Further, unless the omitted
variable is uncorrelated with the included ones, the coefficients of
the included ones will be biased because they now reflect not only an
estimate of the effect of the variable with which they are associated,
but also partly the effects of the omitted variable.
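The direction and size of this bias can be stated precisely in the simplest case. With one included regressor x and one omitted regressor z (textbook notation, not the lecture's), the expected value of the estimated coefficient on x is

    E[estimated βx] = βx + βz·δzx,    where δzx = Cov(x, z)/Var(x),

that is, the true effect plus the omitted variable's true effect multiplied by the regression coefficient of z on x.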
To illustrate the omitted variables problem, I took the data on
which the estimates reported in table 1 are based, and reran the regression after omitting the schooling variable. The results are shown in table 2.
See J. Johnston, supra; E. Hanushek and J. Jackson, supra. The bias is a function of two things—the true coefficients of the
excluded variables, and the correlation within the data set between the included
and the excluded variables.
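The same experiment is easy to replicate on the simulated data sketched earlier (again my construction, not the lecture's actual data set): refit the model without School and watch the Aptitude coefficient absorb part of schooling's effect.

    import statsmodels.api as sm

    X_full = sm.add_constant(np.column_stack([school, aptitude, experience, gendum]))
    X_omit = sm.add_constant(np.column_stack([aptitude, experience, gendum]))

    full = sm.OLS(earnings, X_full).fit()
    omit = sm.OLS(earnings, X_omit).fit()

    # With school and aptitude positively correlated, the aptitude
    # coefficient in the short regression is biased upward.
    print(full.params[2], omit.params[1])   # aptitude coefficient, full vs. omitted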
You will note that the omission of the schooling variable lowers
the R² of the regression, which is not surprising given the original
importance of the variable. It also alters the coefficient estimates.
The estimate for the constant term rises considerably, because the
mean effect of schooling on income is positive. It is not surprising
that the constant term is thus estimated to be greater than its true
value. An even more significant effect of the omission of schooling is
on the coefficient estimate for the aptitude variable, which increases
dramatically from below its true value to well above its true value and
becomes highly significant. The reason is that the schooling variable is highly correlated (positively) with aptitude in the data set, and schooling has a positive effect on earnings. Hence, with the schooling variable omitted, the aptitude
coefficient is erroneously capturing some of the (positive) returns to
education as well as the returns to “aptitude.” The consequence is
that the minimum SSE criterion yields an upward biased estimate of
the coefficient for aptitude, and in this case the actual estimate is in-
deed above the true value of that coefficient.
The effect on the other coefficients is more modest, though non-
trivial. Notice, for example, that the coefficient of Gendum increases
(in absolute value) significantly. This is because schooling happens
to be positively correlated with being male in my fictitious data
set—without controlling for schooling, the apparent effect of gender
is exaggerated because females are somewhat less well educated on
average.
Econometricians have developed some more sophisticated regression techniques to deal with the problem of unobservable variables, but these are not always satisfactory because of certain restrictive assumptions that an investigator must make in using them. See, e.g., Griliches, “Errors in Variables and Other Unobservables,” 42 Econometrica 971 (1974). An accessible discussion of the omitted variables problem and related issues may be found in P. Kennedy, supra.
Id.
One standard technique for addressing this problem is termed “instrumental
variables,” which replaces the tainted variable with another variable that is thought
to be closely associated with it but also thought uncorrelated with the disturbance
term. For a variety of reasons, however, the instrumental variables technique is not
satisfactory in many cases, and the errors in variables problem is consequently one
of the most serious difficulties in the use of regression techniques. A discussion of
the instrumental variables technique and other possible responses to the errors in
variables problem may be found in P. Kennedy, supra; J. Johnston, supra.
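A bare-bones sketch of the two-stage idea described in this footnote (all values invented; z is an instrument for a regressor x observed with error):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    z = rng.normal(size=n)                   # instrument: correlated with x, not with the noise
    x_true = 2.0 * z + rng.normal(size=n)
    x_obs = x_true + rng.normal(size=n)      # regressor observed with measurement error
    y = 1.0 + 3.0 * x_true + rng.normal(size=n)

    # Stage 1: replace the tainted regressor with its projection on the instrument.
    X1 = np.column_stack([np.ones(n), z])
    x_hat = X1 @ np.linalg.lstsq(X1, x_obs, rcond=None)[0]

    # Stage 2: ordinary least squares of y on the stage-1 fitted values.
    X2 = np.column_stack([np.ones(n), x_hat])
    print(np.linalg.lstsq(X2, y, rcond=None)[0])   # slope near 3; OLS on x_obs
                                                   # is biased toward zero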
It is important to recollect that this approach raises the problem of omitted
variables bias for the other variables as well.
See P. Kennedy, supra.
give it. I hope that the illustrations in this lecture afford some basis
for optimism that such studies can be helpful, while also suggesting
considerable basis for caution in their use.
I return now to an issue deferred earlier in the discussion of hy-
pothesis testing—the relationship between the statistical significance
test and the burden of proof. Suppose, for example, that to establish
liability for wage discrimination on the basis of gender under Title
VII, a plaintiff need simply show by a preponderance of the evidence
that women employed by the defendant suffer some measure of dis-
crimination. With reference to our first illustration, we might say
that the required showing on liability is that, by a preponderance of
the evidence, the coefficient of the gender dummy is negative.
Unfortunately, there is no simple relationship between this bur-
den of proof and the statistical significance test. At one extreme, if
we imagine that the parameter estimate in the regression study is the
only information we have about the presence or absence of discrimi-
nation, one might argue that liability is established by a preponder-
ance of the evidence if the estimated coefficient for the gender
dummy is negative regardless of its statistical significance or standard
error. The rationale would be that the negative estimate, however
subject to uncertainty, is unbiased and is the best evidence we have.
But this is much too simplistic. Very rarely is the regression es-
timate the only information available, and when the standard errors
are high the estimate may be among the least reliable information
available. Further, regression analysis is subject to considerable ma-
nipulation. It is not obvious precisely which variables should be in-
cluded in a model, or what proxies to use for included variables that
cannot be measured precisely. There is considerable room for exper-
imentation, and this experimentation can become “data mining,”
whereby an investigator tries numerous regression specifications
until the desired result appears. An advocate quite naturally may
have a tendency to present only those estimates that support the
client’s position. Hence, if the best result that an advocate can
present contains high standard errors and low statistical significance,
it is often plausible to suppose that numerous even less impressive
See, e.g., Texas Department of Community Affairs v. Burdine, 450 U.S. 248 (1981).
I will not digress on the rules of discovery here. In practice, the raw data
may be discoverable, for example, while the expert’s undisclosed analysis of the
data may not be.
See the discussion in Fisher, “Statisticians, Econometricians and Adversary Proceedings,” 81 J. Am. Stat. Assn. 277 (1986).