
Chapter 2

The Simple Regression Model

The simple regression model can be used to study the relationship between two variables. For
reasons we will see, the simple regression model has limitations as a general tool for empirical
analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to inter-
pret the simple regression model is good practice for studying multiple regression, which we will do
in subsequent chapters.

2-1 Definition of the Simple Regression Model


Much of applied econometric analysis begins with the following premise: y and x are two variables,
representing some population, and we are interested in “explaining y in terms of x,” or in “studying
how y varies with changes in x.” We discussed some examples in Chapter 1, including: y is soybean
crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; and y is a com-
munity crime rate and x is number of police officers.
In writing down a model that will “explain y in terms of x,” we must confront three issues. First,
because there is never an exact relationship between two variables, how do we allow for other factors
to affect y? Second, what is the functional relationship between y and x? And third, how can we be
sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?
We can resolve these ambiguities by writing down an equation relating y to x. A simple
equation is
y = β₀ + β₁x + u.   [2.1]
Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y.

Table 2.1 Terminology for Simple Regression

y                       x
Dependent variable      Independent variable
Explained variable      Explanatory variable
Response variable       Control variable
Predicted variable      Predictor variable
Regressand              Regressor

We now discuss the meaning of each of the quantities in equation (2.1). [Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler (1986) for an engaging history of regression analysis.]
When related by equation (2.1), the variables y and x have several different names used inter-
changeably, as follows: y is called the dependent variable, the explained variable, the response
variable, the predicted variable, or the regressand; x is called the independent variable, the
explanatory variable, the control variable, the predictor variable, or the regressor. (The term
covariate is also used for x.) The terms “dependent variable” and “independent variable” are fre-
quently used in econometrics. But be aware that the label “independent” here does not refer to the
statistical notion of independence between random variables (see Math Refresher B).
The terms “explained” and “explanatory” variables are probably the most descriptive. “Response”
and “control” are used mostly in the experimental sciences, where the variable x is under the experi-
menter’s control. We will not use the terms “predicted variable” and “predictor,” although you some-
times see these in applications that are purely about prediction and not causality. Our terminology for
simple regression is summarized in Table 2.1.
The variable u, called the error term or disturbance in the relationship, represents factors other
than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as
being unobserved. You can usefully think of u as standing for “unobserved.”
Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, Δu = 0, then x has a linear effect on y:

Δy = β₁Δx if Δu = 0.   [2.2]

Thus, the change in y is simply β₁ multiplied by the change in x. This means that β₁ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter β₀, sometimes called the constant term, also has its uses, although it is rarely central to an analysis.

Example 2.1 Soybean Yield and Fertilizer


Suppose that soybean yield is determined by the model

yield = β₀ + β₁fertilizer + u,   [2.3]

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by β₁. The error term u contains factors such as land quality, rainfall, and so on. The coefficient β₁ measures the effect of fertilizer on yield, holding other factors fixed: Δyield = β₁Δfertilizer.


Example 2.2 A Simple Wage Equation


A model relating a person's wage to observed education and other unobserved factors is

wage = β₀ + β₁educ + u.   [2.4]

If wage is measured in dollars per hour and educ is years of education, then β₁ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and numerous other things.

The linearity of equation (2.1) implies that a one-unit change in x has the same effect on y,
regardless of the initial value of x. This is unrealistic for many economic applications. For example, in
the wage-education example, we might want to allow for increasing returns: the next year of educa-
tion has a larger effect on wages than did the previous year. We will see how to allow for such pos-
sibilities in Section 2-4.
The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that β₁ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?
Section 2-5 will show that we are only able to get reliable estimators of β₀ and β₁ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, β₁. Because u and x are random variables, we need a concept grounded in probability.
Before we state the key assumption about how x and u are related, we can always make one assumption about u. As long as the intercept β₀ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

E(u) = 0.   [2.5]
Assumption (2.5) says nothing about the relationship between u and x, but simply makes a state-
ment about the distribution of the unobserved factors in the population. Using the previous exam-
ples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we
lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to
have an average of zero in the population of all cultivated plots. The same is true of the unobserved
factors in Example 2.2. Without loss of generality, we can assume that things such as average
ability are zero in the population of all working people. If you are not convinced, you should work
through Problem 2 to see that we can always redefine the intercept in equation (2.1) to make equa-
tion (2.5) true.
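The idea behind the normalization takes one line of algebra (essentially what Problem 2 asks you to verify): if E(u) = α₀ is not zero, we can absorb α₀ into the intercept,

\[
y = \beta_0 + \beta_1 x + u = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0), \qquad \alpha_0 \equiv E(u),
\]

so the redefined error u − α₀ has mean zero by construction, and the slope β₁ is untouched.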
We now turn to the crucial assumption regarding how u and x are related. A natural measure of
the association between two random variables is the correlation coefficient. (See Math Refresher B
for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not
linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in
which u and x should be unrelated in equation (2.1). But it does not go far enough, because correla-
tion measures only linear dependence between u and x. Correlation has a somewhat counterintuitive
feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as x². (See Section B-4 in Math Refresher B for further discussion.) This possibility is not acceptable
for most regression purposes, as it causes problems for interpreting the model and for deriving statis-
tical properties. A better assumption involves the expected value of u given x.
Because u and x are random variables, we can define the conditional distribution of u given any
value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of

the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this assumption as

E(u|x) = E(u).   [2.6]

Equation (2.6) says that the average value of the unobservables is the same across all slices of the population determined by the value of x and that the common average is necessarily equal to the average of u over the entire population. When assumption (2.6) holds, we say that u is mean independent of x. (Of course, mean independence is implied by full independence between u and x, an assumption often used in basic probability and statistics.) When we combine mean independence with assumption (2.5), we obtain the zero conditional mean assumption, E(u|x) = 0. It is critical to remember that equation (2.6) is the assumption with impact; assumption (2.5) essentially defines the intercept, β₀.
Let us see what equation (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then equation (2.6) requires that the average level of ability is the same, regardless of years of education. For example, if E(abil|8) denotes the average ability for the group of all people with eight years of education, and E(abil|16) denotes the average ability among people in the population with sixteen years of education, then equation (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then equation (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the same for all education levels. But this is an issue that we must address before relying on simple regression analysis.
In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then equation (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher-quality plots of land, then the expected value of u changes with the level of fertilizer, and equation (2.6) fails.

Going Further 2.1
Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability). Then

score = β₀ + β₁attend + u.   [2.7]

When would you expect this model to satisfy equation (2.6)?
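The fertilizer example is easy to mimic with a small simulation (hypothetical numbers, not a data set from the text): treat land quality as the unobservable u and compare its average across fertilizer levels under the two assignment schemes just described.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# u = land quality, the unobservable, with E(u) = 0
quality = rng.normal(0.0, 1.0, n)

# Case 1: fertilizer amounts (1..4) assigned independently of quality
fert_indep = rng.integers(1, 5, n)

# Case 2: higher-quality plots tend to get more fertilizer
fert_dep = 1 + 2 * (quality > 0) + rng.integers(0, 2, n)

for label, fert in [("independent", fert_indep), ("dependent", fert_dep)]:
    group_means = [quality[fert == a].mean() for a in np.unique(fert)]
    print(label, np.round(group_means, 3))

# Independent assignment: every group mean is near 0, so E(u|x) = E(u), as in (2.6).
# Dependent assignment: average quality rises with fertilizer, so (2.6) fails.
```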
The zero conditional mean assumption gives β₁ another interpretation that is often useful. Taking the expected value of equation (2.1) conditional on x and using E(u|x) = 0 gives

E(y|x) = β₀ + β₁x.   [2.8]

Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount β₁. For any given value of x, the distribution of y is centered about E(y|x), as illustrated in Figure 2.1.
It is important to understand that equation (2.8) tells us how the average value of y changes with x; it does not say that y equals β₀ + β₁x for all units in the population. For example, suppose that x is the high school grade point average and y is the college GPA, and we happen to know that E(colGPA|hsGPA) = 1.5 + 0.5 hsGPA. [Of course, in practice, we never know the population intercept and slope, but it is useful to pretend momentarily that we do to understand the nature of equation (2.8).] This GPA equation tells us the average college GPA among all students who have a given high school GPA. So suppose that hsGPA = 3.6. Then the average colGPA for all high school graduates who attend college with hsGPA = 3.6 is 1.5 + 0.5(3.6) = 3.3. We are certainly not saying that every student with hsGPA = 3.6 will have a 3.3 college GPA; this is clearly false. The PRF gives us a relationship between the average level of y at different levels of x. Some students with hsGPA = 3.6 will have a college GPA higher than 3.3, and some will have a lower college GPA. Whether the actual colGPA is above or below 3.3 depends on the unobserved factors in u, and those differ among students even within the slice of the population with hsGPA = 3.6.


Figure 2.1  E(y|x) as a linear function of x. [The figure plots the line E(y|x) = β₀ + β₁x, with the conditional distribution of y centered about the line at each of the values x₁, x₂, x₃.]

Given the zero conditional mean assumption E(u|x) = 0, it is useful to view equation (2.1) as breaking y into two components. The piece β₀ + β₁x, which represents E(y|x), is called the systematic part of y—that is, the part of y explained by x—and u is called the unsystematic part, or the part of y not explained by x. In Chapter 3, when we introduce more than one explanatory variable, we will discuss how to determine how large the systematic part is relative to the unsystematic part.
In the next section, we will use assumptions (2.5) and (2.6) to motivate estimators of β₀ and β₁ given a random sample of data. The zero conditional mean assumption also plays a crucial role in the statistical analysis in Section 2-5.

2-2 Deriving the Ordinary Least Squares Estimates


Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters β₀ and β₁ in equation (2.1). To do this, we need a sample from the population. Let {(xᵢ, yᵢ): i = 1, . . . , n} denote a random sample of size n from the population. Because these data come from equation (2.1), we can write

yᵢ = β₀ + β₁xᵢ + uᵢ   [2.9]

for each i. Here, uᵢ is the error term for observation i because it contains all factors affecting yᵢ other than xᵢ.
As an example, xᵢ might be the annual income and yᵢ the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatterplot of such a data set is
given in Figure 2.2, along with the (necessarily fictitious) population regression function.
We must decide how to use these data to obtain estimates of the intercept and slope in the popula-
tion regression of savings on income.


Figure 2.2  Scatterplot of savings and income for 15 families, and the population regression E(savings|income) = β₀ + β₁income.

There are several ways to motivate the following estimation procedure. We will use equation (2.5) and an important implication of assumption (2.6): in the population, u is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

E(u) = 0   [2.10]

and

Cov(x, u) = E(xu) = 0,   [2.11]

where the first equality in equation (2.11) follows from (2.10). (See Section B-4 in Math Refresher B for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters β₀ and β₁, equations (2.10) and (2.11) can be written as

E(y − β₀ − β₁x) = 0   [2.12]

and

E[x(y − β₀ − β₁x)] = 0,   [2.13]

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population. Because there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of β₀ and β₁. In fact, they can be. Given a sample of data, we choose estimates β̂₀ and β̂₁ to solve the sample counterparts of equations (2.12) and (2.13):

n⁻¹ Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ) = 0   [2.14]

and

n⁻¹ Σᵢ₌₁ⁿ xᵢ(yᵢ − β̂₀ − β̂₁xᵢ) = 0.   [2.15]

This is an example of the method of moments approach to estimation. (See Section C-4 in Math Refresher C for a discussion of different estimation approaches.) These equations can be solved for β̂₀ and β̂₁.


Using the basic properties of the summation operator from Math Refresher A, equation (2.14) can be rewritten as

ȳ = β̂₀ + β̂₁x̄,   [2.16]

where ȳ = n⁻¹ Σᵢ₌₁ⁿ yᵢ is the sample average of the yᵢ, and likewise for x̄. This equation allows us to write β̂₀ in terms of β̂₁, ȳ, and x̄:

β̂₀ = ȳ − β̂₁x̄.   [2.17]

Therefore, once we have the slope estimate β̂₁, it is straightforward to obtain the intercept estimate β̂₀, given ȳ and x̄.
Dropping the n⁻¹ in (2.15) (because it does not affect the solution) and plugging (2.17) into (2.15) yields

Σᵢ₌₁ⁿ xᵢ[yᵢ − (ȳ − β̂₁x̄) − β̂₁xᵢ] = 0,

which, upon rearrangement, gives

Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) = β̂₁ Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄).

From basic properties of the summation operator [see (A-7) and (A-8) in Math Refresher A],

Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄) = Σᵢ₌₁ⁿ (xᵢ − x̄)²  and  Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ).

Therefore, provided that

Σᵢ₌₁ⁿ (xᵢ − x̄)² > 0,   [2.18]

the estimated slope is

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)².   [2.19]

Equation (2.19) is simply the sample covariance between xᵢ and yᵢ divided by the sample variance of xᵢ. Using simple algebra we can also write β̂₁ as

β̂₁ = ρ̂_xy · (σ̂_y/σ̂_x),

where ρ̂_xy is the sample correlation between xᵢ and yᵢ and σ̂_x, σ̂_y denote the sample standard deviations. (See Math Refresher C for definitions of correlation and standard deviation. Dividing all sums by n − 1 does not affect the formulas.) An immediate implication is that if xᵢ and yᵢ are positively correlated in the sample then β̂₁ > 0; if xᵢ and yᵢ are negatively correlated then β̂₁ < 0.
Not surprisingly, the formula for β̂₁ in terms of the sample correlation and sample standard deviations is the sample analog of the population relationship

β₁ = ρ_xy · (σ_y/σ_x),

where all quantities are defined for the entire population. Recognition that β₁ is just a scaled version of ρ_xy highlights an important limitation of simple regression when we do not have experimental data: in effect, simple regression is an analysis of correlation between two variables, and so one must be careful in inferring causality.
Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true, provided the xᵢ in the sample are not all equal to the same value.


If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then (2.18) holds, and the estimates can be computed.

Figure 2.3  A scatterplot of wage against education when educᵢ = 12 for all i. [All observations lie on the vertical line educ = 12, so no slope can be computed.]
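Formulas (2.17) and (2.19) are easy to compute directly. Here is a minimal Python sketch (the function name simple_ols and the numbers are ours, purely for illustration, not from any data set in the text):

```python
import numpy as np

def simple_ols(x, y):
    """OLS intercept and slope for the regression of y on x,
    using equations (2.17) and (2.19)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    if not (dx ** 2).sum() > 0:        # condition (2.18)
        raise ValueError("x does not vary in the sample; slope is undefined")
    b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()   # equation (2.19)
    b0 = y.mean() - b1 * x.mean()                        # equation (2.17)
    return b0, b1

# Hypothetical savings (y) and income (x) data, in dollars
income = [25_000, 32_000, 47_000, 51_000, 64_000]
savings = [1_200, 2_100, 3_000, 2_800, 5_400]
b0, b1 = simple_ols(income, savings)
print(f"intercept = {b0:.2f}, slope = {b1:.4f}")
```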
The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of β₀ and β₁. To justify this name, for any β̂₀ and β̂₁ define a fitted value for y when x = xᵢ as

ŷᵢ = β̂₀ + β̂₁xᵢ.   [2.20]

This is the value we predict for y when x = xᵢ for the given intercept and slope. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual yᵢ and its fitted value:

ûᵢ = yᵢ − ŷᵢ = yᵢ − β̂₀ − β̂₁xᵢ.   [2.21]

Again, there are n such residuals. [These are not the same as the errors in (2.9), a point we return to in Section 2-5.] The fitted values and residuals are indicated in Figure 2.4.
Now, suppose we choose β̂₀ and β̂₁ to make the sum of squared residuals,

Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²,   [2.22]

as small as possible. The appendix to this chapter shows that the conditions necessary for (β̂₀, β̂₁) to minimize (2.22) are given exactly by equations (2.14) and (2.15), without n⁻¹. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Math Refresher A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals.
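Continuing in the same spirit (same hypothetical numbers as above, with the two OLS formulas repeated inline so the snippet stands alone), the fitted values, residuals, and first order conditions can be checked in a few lines:

```python
import numpy as np

x = np.array([25_000, 32_000, 47_000, 51_000, 64_000], dtype=float)
y = np.array([1_200, 2_100, 3_000, 2_800, 5_400], dtype=float)

# OLS estimates via equations (2.19) and (2.17)
dx = x - x.mean()
b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x            # fitted values, equation (2.20)
u_hat = y - y_hat              # residuals, equation (2.21)
ssr = (u_hat ** 2).sum()       # sum of squared residuals, equation (2.22)

# First order conditions (2.14) and (2.15): both sums are zero,
# up to floating-point rounding
print(u_hat.sum(), (x * u_hat).sum(), ssr)
```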


Figure 2.4  Fitted values and residuals. [For each xᵢ, the fitted value ŷᵢ lies on the OLS line ŷ = β̂₀ + β̂₁x, and the residual ûᵢ is the vertical distance between the actual yᵢ and ŷᵢ.]

When we view ordinary least squares as minimizing the sum of squared residuals, it is natural
to ask: why not minimize some other function of the residuals, such as the absolute values of the
residuals? In fact, as we will discuss in the more advanced Section 9-6, minimizing the sum of the
absolute values of the residuals is sometimes very useful. But it does have some drawbacks. First, we
cannot obtain formulas for the resulting estimators; given a data set, the estimates must be obtained
by numerical optimization routines. As a consequence, the statistical theory for estimators that mini-
mize the sum of the absolute residuals is very complicated. Minimizing other functions of the residu-
als, say, the sum of the residuals each raised to the fourth power, has similar drawbacks. (We would
never choose our estimates to minimize, say, the sum of the residuals themselves, as residuals large
in magnitude but with opposite signs would tend to cancel out.) With OLS, we will be able to derive
unbiasedness, consistency, and other important statistical properties relatively easily. Plus, as the
motivation in equations (2.12) and (2.13) suggests, and as we will see in Section 2-5, OLS is suited
for estimating the parameters appearing in the conditional mean function (2.8).
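To illustrate the contrast drawn in this paragraph, the sketch below fits a small hypothetical data set two ways: OLS through its closed-form formulas, and least absolute deviations through a general-purpose numerical optimizer (scipy.optimize.minimize), since no closed form exists for the latter:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data; any (x, y) arrays of equal length will do
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 7.5])

# OLS has a closed form: equations (2.17) and (2.19)
dx = x - x.mean()
b1_ols = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
b0_ols = y.mean() - b1_ols * x.mean()

# Minimizing the sum of absolute residuals must be done numerically
def sum_abs_resid(b):
    return np.abs(y - b[0] - b[1] * x).sum()

res = minimize(sum_abs_resid, x0=[b0_ols, b1_ols], method="Nelder-Mead")
print("OLS: ", b0_ols, b1_ols)
print("LAD: ", res.x)
```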
Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

ŷ = β̂₀ + β̂₁x,   [2.23]

where it is understood that β̂₀ and β̂₁ have been obtained using equations (2.17) and (2.19). The notation ŷ, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, β̂₀, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, β̂₀ is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function E(y|x) = β₀ + β₁x. It is important to remember that the PRF is something fixed, but unknown, in the population. Because the SRF is obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23).
In most cases, the slope estimate, which we can write as

β̂₁ = Δŷ/Δx,   [2.24]

is of primary interest. It tells us the amount by which ŷ changes when x increases by one unit. Equivalently,

Δŷ = β̂₁Δx,   [2.25]

so that given any change in x (whether positive or negative), we can compute the predicted change in y.
We now present several examples of simple regression obtained by using real data. In other
words, we find the intercept and slope estimates with equations (2.17) and (2.19). Because these
examples involve many observations, the calculations were done using an econometrics software
package. At this point, you should be careful not to read too much into these regressions; they are not
necessarily uncovering a causal relationship. We have said nothing so far about the statistical proper-
ties of OLS. In Section 2-5, we consider statistical properties after we explicitly impose assumptions
on the population model equation (2.1).

Example 2.3 CEO Salary and Return on Equity


For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10%.
To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

salary = β₀ + β₁roe + u.

The slope parameter β₁ measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think β₁ > 0.
The data set CEOSAL1 contains information on 209 CEOs for the year 1990; these data were obtained
from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and
largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988,
1989, and 1990 is 17.18%, with the smallest and largest values being 0.5% and 56.3%, respectively.
Using the data in CEOSAL1, the OLS regression line relating salary to roe is

salaryhat = 963.191 + 18.501 roe   [2.26]
n = 209,

where the intercept and slope estimates have been rounded to three decimal places; we use "salaryhat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 because salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: Δsalaryhat = 18.501(Δroe). This means that if the return on equity increases by one percentage point, Δroe = 1, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.
We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then salaryhat = 963.191 + 18.501(30) = 1,518.221, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had a roe = 30 earns $1,518,221. Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe).
We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data
will give a different regression line, which may or may not be closer to the population regression line.


Figure 2.5  The OLS regression line salaryhat = 963.191 + 18.501 roe and the (unknown) population regression function E(salary|roe) = β₀ + β₁roe.

Example 2.4 Wage and Education


For the population of people in the workforce in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. The average wage in the sample is $5.90, which the Consumer Price Index indicates is equivalent to $24.90 in 2016 dollars.
Using the data in WAGE1, where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

wagehat = −0.90 + 0.54 educ   [2.27]
n = 526.

We must interpret this equation with caution. The intercept of −0.90 literally means that a person with no education has a predicted hourly wage of −90¢ an hour. This, of course, is silly. It turns out that only 18 people in the sample of 526 have less than eight years of education. Consequently, it is not surprising that the regression line does poorly at very low levels of education. For a person with eight years of education, the predicted wage is wagehat = −0.90 + 0.54(8) = 3.42, or $3.42 per hour (in 1976 dollars).
The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54¢ an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects.

Going Further 2.2
The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 2016 dollars? (Hint: You have enough information in Example 2.4 to answer this question.)


Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2-4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.

Example 2.5 Voting Outcomes and Campaign Expenditures


The file VOTE1 contains data on election outcomes and campaign expenditures for 173 two-party
races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and
B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of
total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the
election outcome (including the quality of the candidates and possibly the dollar amounts spent by A
and B). Nevertheless, we can estimate a simple regression model to find out whether spending more
relative to one’s challenger implies a higher percentage of the vote.
The estimated equation using the 173 observations is

voteAhat = 26.81 + 0.464 shareA   [2.28]
n = 173.

This means that if Candidate A's share of spending increases by one percentage point, Candidate A receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable. If shareA = 50, voteA is predicted to be about 50, or half the vote.

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Computer Exercise C3, where you are asked to use data from Biddle and Hamermesh (1990) on time spent sleeping and working to investigate the tradeoff between these two factors.

Going Further 2.3
In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60%)? Does this answer seem reasonable?

2-2a A Note on Terminology


In most cases, we will indicate the estimation of a relationship through OLS by writing an equation
such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS
regression has been run without actually writing out the equation. We will often indicate that equation
(2.23) has been obtained by OLS in saying that we run the regression of

y on x, [2.29]

or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent
variable and which is the independent variable: We always regress the dependent variable on the
independent variable. For specific applications, we replace y and x with their names. Thus, to obtain
(2.26), we regress salary on roe, or to obtain (2.28), we regress voteA on shareA.
When we use such terminology in (2.29), we will always mean that we plan to estimate the intercept, β̂₀, along with the slope, β̂₁. This case is appropriate for the vast majority of applications.


Occasionally, we may want to estimate the relationship between y and x assuming that the intercept is zero (so that x = 0 implies that ŷ = 0); we cover this case briefly in Section 2-6. Unless explicitly
stated otherwise, we always estimate an intercept along with a slope.

2-3 Properties of OLS on Any Sample of Data


In the previous section, we went through the algebra of deriving the formulas for the OLS intercept
and slope estimates. In this section, we cover some further algebraic properties of the fitted OLS
regression line. The best way to think about these properties is to remember that they hold, by con-
struction, for any sample of data. The harder task—considering the properties of OLS across all pos-
sible random samples of data—is postponed until Section 2-5.
Several of the algebraic properties we are going to derive will appear mundane. Nevertheless,
having a grasp of these properties helps us to figure out what happens to the OLS estimates and
related statistics when the data are manipulated in certain ways, such as when the measurement units
of the dependent and independent variables change.

2-3a Fitted Values and Residuals


We assume that the intercept and slope estimates, β̂₀ and β̂₁, have been obtained for the given sample of data. Given β̂₀ and β̂₁, we can obtain the fitted value ŷᵢ for each observation. [This is given by equation (2.20).] By definition, each fitted value ŷᵢ is on the OLS regression line. The OLS residual associated with observation i, ûᵢ, is the difference between yᵢ and its fitted value, as given in equation (2.21). If ûᵢ is positive, the line underpredicts yᵢ; if ûᵢ is negative, the line overpredicts yᵢ. The ideal case for observation i is ûᵢ = 0, but in most cases every residual differs from zero; in other words, none of the data points need actually lie on the OLS line.

Example 2.6 CEO Salary and Return on Equity


Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted
values, called salaryhat, and the residuals, called uhat.
The first four CEOs have lower salaries than what we predicted from the OLS regression
line (2.26); in other words, given only the firm’s roe, these CEOs make less than what we
predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from
the OLS regression line.

2-3b Algebraic Properties of OLS Statistics


There are several useful algebraic properties of OLS estimates and their associated statistics. We now
cover the three most important of these.
(1) The sum, and therefore the sample average, of the OLS residuals is zero. Mathematically,

Σᵢ₌₁ⁿ ûᵢ = 0.   [2.30]

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by ûᵢ = yᵢ − β̂₀ − β̂₁xᵢ.


Table 2.2 Fitted Values and Residuals for the First 15 CEOs
obsno   roe    salary   salaryhat   uhat
1 14.1 1095 1224.058 −129.0581
2 10.9 1001 1164.854 −163.8542
3 23.5 1122 1397.969 −275.9692
4 5.9 578 1072.348 −494.3484
5 13.8 1368 1218.508 149.4923
6 20.0 1145 1333.215 −188.2151
7 16.4 1078 1266.611 −188.6108
8 16.3 1094 1264.761 −170.7606
9 10.5 1237 1157.454 79.54626
10 26.3 833 1449.773 −616.7726
11 25.9 567 1442.372 −875.3721
12 26.8 933 1459.023 −526.0231
13 14.8 1339 1237.009 101.9911
14 22.3 937 1375.768 −438.7678
15 56.3 2011 2004.808 6.191895

In other words, the OLS estimates β̂₀ and β̂₁ are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation i.
(2) The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

Σᵢ₌₁ⁿ xᵢûᵢ = 0.   [2.31]

The sample average of the OLS residuals is zero, so the left-hand side of (2.31) is proportional to the sample covariance between xᵢ and ûᵢ.
(3) The point (x̄, ȳ) is always on the OLS regression line. In other words, if we take equation (2.23) and plug in x̄ for x, then the predicted value is ȳ. This is exactly what equation (2.16) showed us.
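All three properties are easy to confirm numerically, and, by construction, they hold for any data set; a brief sketch with hypothetical numbers:

```python
import numpy as np

x = np.array([25_000, 32_000, 47_000, 51_000, 64_000], dtype=float)
y = np.array([1_200, 2_100, 3_000, 2_800, 5_400], dtype=float)

dx = x - x.mean()
b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()   # equation (2.19)
b0 = y.mean() - b1 * x.mean()                        # equation (2.17)
u_hat = y - (b0 + b1 * x)

print(u_hat.sum())                       # property (1): zero
print((x * u_hat).sum())                 # property (2): zero
print(b0 + b1 * x.mean(), y.mean())      # property (3): equal
# All equalities hold up to floating-point rounding error.
```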

Example 2.7 Wage and Education


For the data in WAGE1, the average hourly wage in the sample is 5.90, rounded to two decimal
places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get wagehat = −0.90 + 0.54(12.56) = 5.8824, which equals 5.9 when rounded to the first
decimal place. These figures do not exactly agree because we have rounded the average wage and
education, as well as the intercept and slope estimates. If we did not initially round any of the values,
we would get the answers to agree more closely, but to little useful effect.

Writing each yᵢ as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each i, write

yᵢ = ŷᵢ + ûᵢ.   [2.32]


From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, ŷᵢ, is the same as the sample average of the yᵢ (the average of the ŷᵢ equals ȳ). Further, properties (1) and (2) can be used to show that the sample covariance between ŷᵢ and ûᵢ is zero. Thus, we can view OLS as decomposing each yᵢ into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) (also known as the sum of squared residuals), as follows:

SST ≡ Σᵢ₌₁ⁿ (yᵢ − ȳ)².   [2.33]

SSE ≡ Σᵢ₌₁ⁿ (ŷᵢ − ȳ)².   [2.34]

SSR ≡ Σᵢ₌₁ⁿ ûᵢ².   [2.35]

SST is a measure of the total sample variation in the yᵢ; that is, it measures how spread out the yᵢ are in the sample. If we divide SST by n − 1, we obtain the sample variance of y, as discussed in Math Refresher C. Similarly, SSE measures the sample variation in the ŷᵢ (where we use the fact that the average of the ŷᵢ is ȳ), and SSR measures the sample variation in the ûᵢ. The total variation in y can always be expressed as the sum of the explained variation SSE and the unexplained variation SSR. Thus,

SST = SSE + SSR.   [2.36]
Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered in Math Refresher A. Write

Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ [(yᵢ − ŷᵢ) + (ŷᵢ − ȳ)]²
              = Σᵢ₌₁ⁿ [ûᵢ + (ŷᵢ − ȳ)]²
              = Σᵢ₌₁ⁿ ûᵢ² + 2 Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) + Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
              = SSR + 2 Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) + SSE.

Now, (2.36) holds if we show that

Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) = 0.   [2.37]

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by n − 1. Thus, we have established (2.36).
Some words of caution about SST, SSE, and SSR are in order. There is no uniform agree-
ment on the names or abbreviations for the three quantities defined in equations (2.33), (2.34),
and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here.
Unfortunately, the explained sum of squares is sometimes called the “regression sum of squares.”
If this term is given its natural abbreviation, it can easily be confused with the term “residual sum
of squares.” Some regression packages refer to the explained sum of squares as the “model sum of
squares.”
To make matters even worse, the residual sum of squares is often called the “error sum of
squares.” This is especially unfortunate because, as we will see in Section 2-5, the errors and the
residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the
sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared residu-
als, because it is more common in econometric packages.


2-3c Goodness-of-Fit
So far, we have no way of measuring how well the explanatory or independent variable, x, explains
the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS
regression line fits the data. In the following discussion, be sure to remember that we assume that an
intercept is estimated along with the slope.
Assuming that the total sum of squares, SST, is not equal to zero—which is true except in the very unlikely event that all the yᵢ equal the same value—we can divide (2.36) by SST to get 1 = SSE/SST + SSR/SST. The R-squared of the regression, sometimes called the coefficient of determination, is defined as

R² ≡ SSE/SST = 1 − SSR/SST.   [2.38]

R² is the ratio of the explained variation to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way of computing R².
From (2.36), the value of R² is always between zero and one, because SSE can be no greater than SST. When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x.
If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, R² = 1. A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the yᵢ is captured by the variation in the ŷᵢ (which all lie on the OLS regression line). In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between yᵢ and ŷᵢ. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)
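The sums of squares and R² follow directly from definitions (2.33)-(2.35) and (2.38); a short sketch with hypothetical numbers:

```python
import numpy as np

x = np.array([25_000, 32_000, 47_000, 51_000, 64_000], dtype=float)
y = np.array([1_200, 2_100, 3_000, 2_800, 5_400], dtype=float)

dx = x - x.mean()
b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()   # equation (2.19)
b0 = y.mean() - b1 * x.mean()                        # equation (2.17)
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = ((y - y.mean()) ** 2).sum()          # equation (2.33)
sse = ((y_hat - y.mean()) ** 2).sum()      # equation (2.34)
ssr = (u_hat ** 2).sum()                   # equation (2.35)

print(sst, sse + ssr)                      # identity (2.36)
print(sse / sst, 1 - ssr / sst)            # both forms of R-squared in (2.38)
print(np.corrcoef(y, y_hat)[0, 1] ** 2)    # square of corr(y, y_hat): same number
```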

Example 2.8 CEO Salary and Return on Equity


In the CEO salary regression, we obtain the following:
salaryhat = 963.191 + 18.501 roe   [2.39]
n = 209, R² = 0.0132.

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3% of the variation in salaries for this sample of 209 CEOs. That means that 98.7% of the salary variation for these CEOs is left unexplained! This lack of explanatory power may not be too surprising because many other characteristics of both the firm and the individual CEO should influence salary; these factors are necessarily included in the errors in a simple regression analysis.

In the social sciences, low R-squareds in regression equations are not uncommon, especially
for cross-sectional analysis. We will discuss this issue more generally under multiple regres-
sion analysis, but it is worth emphasizing now that a seemingly low R-squared does not neces-
sarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good
estimate of the ceteris paribus relationship between salary and roe; whether or not this is true
does not depend directly on the size of R-squared. Students who are first learning econometrics
tend to put too much weight on the size of the R-squared in evaluating regression equations. For
now, be aware that using R-squared as the main gauge of success for an econometric analysis
can lead to trouble.


Sometimes, the explanatory variable explains a substantial part of the sample variation in the
dependent variable.

Example 2.9 Voting Outcomes and Campaign Expenditures


In the voting outcome equation in (2.28), R² = 0.856. Thus, the share of campaign expenditures explains over 85% of the variation in the election outcomes for this sample. This is a sizable portion.

2-4 Units of Measurement and Functional Form


Two important issues in applied economics are (1) understanding how changing the units of measure-
ment of the dependent and/or independent variables affects OLS estimates and (2) knowing how to
incorporate popular functional forms used in economics into regression analysis. The mathematics
needed for a full understanding of functional form issues is reviewed in Math Refresher A.

2-4a The Effects of Changing Units of Measurement on OLS Statistics
In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity
was measured as a percentage (rather than as a decimal). It is crucial to know how salary and roe are
measured in this example in order to make sense of the estimates in equation (2.39).
We must also know that OLS estimates change in entirely expected ways when the units of measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is:

salardolhat = 963,191 + 18,501 roe.   [2.40]

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then salardolhat = 963,191, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).
Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c—which means each value in the sample is multiplied by c—then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

Going Further 2.4
Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say, salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23%.


To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

salaryhat = 963.191 + 1,850.1 roedec.   [2.41]

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to Δroedec = 0.01. From (2.41), if Δroedec = 0.01, then Δsalaryhat = 1,850.1(0.01) = 18.501, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively.
The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.
In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to R² when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness-of-fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars or on whether return on equity is a percentage or a decimal. This intuition can be verified mathematically: using the definition of R², it can be shown that R² is, in fact, invariant to changes in the units of y or x.
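These scaling rules and the invariance of R² are easy to verify numerically; a sketch (hypothetical numbers, with small helper functions of our own):

```python
import numpy as np

def ols(x, y):
    """OLS estimates via equations (2.17) and (2.19)."""
    dx = x - x.mean()
    b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    return y.mean() - b1 * x.mean(), b1

def r2(x, y):
    b0, b1 = ols(x, y)
    u = y - (b0 + b1 * x)
    return 1 - (u ** 2).sum() / ((y - y.mean()) ** 2).sum()

x = np.array([10.0, 12.0, 15.0, 18.0, 22.0])            # e.g., roe, in percent
y = np.array([900.0, 1100.0, 1000.0, 1400.0, 1600.0])   # e.g., salary, in thousands

print(ols(x, y))           # baseline estimates
print(ols(x, 1000 * y))    # y rescaled by c = 1,000: both estimates multiplied by 1,000
print(ols(x / 100, y))     # x divided by 100: slope multiplied by 100, intercept unchanged
print(r2(x, y), r2(x, 1000 * y), r2(x / 100, y))        # all three identical
```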

2-4b Incorporating Nonlinearities in Simple Regression


So far, we have focused on linear relationships between the dependent and independent variables.
As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic
applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression
analysis by appropriately defining the dependent and independent variables. Here, we will cover two
possibilities that often appear in applied work.
In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to increase hourly wage by 54 cents. Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable.
Probably a better characterization of how wage changes with education is that each year of education increases wage by a constant percentage. For example, an increase in education from 5 years to 6 years increases wage by, say, 8% (ceteris paribus), and an increase in education from 11 to 12 years also increases wage by 8%. A model that gives (approximately) a constant percentage effect is

log(wage) = β₀ + β₁educ + u,   [2.42]

where log(·) denotes the natural logarithm. (See Math Refresher A for a review of logarithms.) In particular, if Δu = 0, then

%Δwage ≈ (100·β₁)Δeduc.   [2.43]

Notice how we multiply β₁ by 100 to get the percentage change in wage given one additional year of education. Because the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write wage = exp(β₀ + β₁educ + u). This equation is graphed in Figure 2.6, with u = 0.


Figure 2.6  wage = exp(β₀ + β₁educ), with β₁ > 0. [The graph of wage against educ is increasing and convex.]

Example 2.10 A Log Wage Equation


Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the
following relationship:
log(wage)hat = 0.584 + 0.083 educ   [2.44]
n = 526, R² = 0.186.

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3% for every additional year of education. This is what economists mean when they refer to the "return to another year of education."
It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3%.
The intercept in (2.44) is not very meaningful, because it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6% of the variation in log(wage) (not
wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between
wage and schooling. If there are “diploma effects,” then the twelfth year of education—graduation
from high school—could be worth much more than the eleventh year. We will learn how to allow for
this kind of nonlinearity in Chapter 7.

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain β̂₀ and β̂₁ from the OLS regression of log(wage) on educ.
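In code, then, the log-level model simply feeds a transformed dependent variable to the usual formulas; a sketch (hypothetical arrays standing in for the WAGE1 variables):

```python
import numpy as np

# Hypothetical stand-ins for the WAGE1 variables
educ = np.array([8.0, 10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
wage = np.array([3.1, 3.9, 5.2, 4.8, 6.5, 8.9, 11.0])

y = np.log(wage)                 # the dependent variable is log(wage)
dx = educ - educ.mean()
b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()   # equation (2.19)
b0 = y.mean() - b1 * educ.mean()                     # equation (2.17)
print(f"estimated return to a year of education: about {100 * b1:.1f}%")
```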
Another important use of the natural log is in obtaining a constant elasticity model.


Example 2.11 CEO Salary and Firm Sales


We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same
one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, mea-
sured in millions of dollars. A constant elasticity model is

log(salary) = β₀ + β₁log(sales) + u,   [2.45]

where β₁ is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be y = log(salary) and the independent variable to be x = log(sales). Estimating this equation by OLS gives

log(salary)hat = 4.822 + 0.257 log(sales)   [2.46]
n = 209, R² = 0.211.
The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that
a 1% increase in firm sales increases CEO salary by about 0.257%—the usual interpretation of an
elasticity.

The two functional forms covered in this section will often arise in the remainder of this text. We
have covered models containing natural logarithms here because they appear so frequently in applied
work. The interpretation of such models will not be much different in the multiple regression case.
It is also useful to note what happens to the intercept and slope estimates if we change the units of measurement of the dependent variable when it appears in logarithmic form. Because the change to logarithmic form approximates a proportionate change, it makes sense that nothing happens to the slope. We can see this by writing the rescaled variable as $c_1 y_i$ for each observation i. The original equation is $\log(y_i) = \beta_0 + \beta_1 x_i + u_i$. If we add $\log(c_1)$ to both sides, we get $\log(c_1) + \log(y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$, or $\log(c_1 y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$. (Remember that the sum of the logs is equal to the log of their product, as shown in Math Refresher A.) Therefore, the slope is still $\beta_1$, but the intercept is now $\log(c_1) + \beta_0$. Similarly, if the independent variable is $\log(x)$, and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 9.
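A quick numerical check of these claims follows; it is our own illustration, with simulated data, not an example from the text.

```python
# Rescaling y in a log model changes only the intercept, by log(c1),
# and leaves the slope unchanged. Simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.exp(0.5 + 0.08 * x + rng.normal(0, 0.3, size=200))  # y > 0

def ols(x, logy):
    b1 = np.sum((x - x.mean()) * (logy - logy.mean())) / np.sum((x - x.mean())**2)
    return logy.mean() - b1 * x.mean(), b1

c1 = 100.0                       # e.g., rescaling dollars to cents
b0, b1 = ols(x, np.log(y))
b0c, b1c = ols(x, np.log(c1 * y))

print(b1, b1c)                   # slopes are identical
print(b0c - b0, np.log(c1))      # intercepts differ by log(c1)
```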
We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their original form. The model with y as the dependent variable and x as the independent variable is called the level-level model because each variable appears in its level form. The model with $\log(y)$ as the dependent variable and x as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see examples of this model in later chapters.
The last column in Table 2.3 gives the interpretation of $\beta_1$. In the log-level model, $100 \cdot \beta_1$ is sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the log-log model, $\beta_1$ is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.

Table 2.3 Summary of Functional Forms Involving Logarithms

Model         Dependent Variable   Independent Variable   Interpretation of $\beta_1$
Level-level   $y$                  $x$                     $\Delta y = \beta_1 \Delta x$
Level-log     $y$                  $\log(x)$               $\Delta y = (\beta_1/100)\,\%\Delta x$
Log-level     $\log(y)$            $x$                     $\%\Delta y = (100\beta_1)\Delta x$
Log-log       $\log(y)$            $\log(x)$               $\%\Delta y = \beta_1\,\%\Delta x$


2-4c The Meaning of “Linear” Regression


The simple regression model that we have studied in this chapter is also called the simple
linear regression model. Yet, as we have just seen, the general model also allows for certain
nonlinear relationships. So what does “linear” mean here? You can see by looking at equation (2.1)
that y 5 b0 1 b1x 1 u. The key is that this equation is linear in the parameters b0 and b1. There are
no restrictions on how y and x relate to the original explained and explanatory variables of interest.
As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite com-
mon in applications. But we need not stop there. For example, nothing prevents us from using simple
regression to estimate a model such as cons 5 b0 1 b1 !inc 1 u, where cons is annual consumption
and inc is annual income.
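To make the point concrete, the following sketch (simulated data, our own illustration) estimates such a model by ordinary simple regression after defining the transformed regressor $x = \sqrt{inc}$.

```python
# "Linear" means linear in parameters: cons = b0 + b1*sqrt(inc) + u is
# estimated by simple OLS once we define x = sqrt(inc). Simulated data.
import numpy as np

rng = np.random.default_rng(1)
inc = rng.uniform(10, 100, size=300)                  # annual income
cons = 2.0 + 5.0 * np.sqrt(inc) + rng.normal(0, 2, size=300)

x = np.sqrt(inc)                                      # transformed regressor
b1 = np.sum((x - x.mean()) * (cons - cons.mean())) / np.sum((x - x.mean())**2)
b0 = cons.mean() - b1 * x.mean()
print(b0, b1)    # close to the true values 2.0 and 5.0
```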
Whereas the mechanics of simple regression do not depend on how y and x are defined, the
interpretation of the coefficients does depend on their definitions. For successful empirical work, it
is much more important to become proficient at interpreting coefficients than to become efficient at
computing formulas such as (2.19). We will get much more practice with interpreting the estimates in
OLS regression lines when we study multiple regression.
Plenty of models cannot be cast as a linear regression model because they are not linear in their parameters; an example is $cons = 1/(\beta_0 + \beta_1 inc) + u$. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can be put into the linear regression framework is sufficient.

2-5 Expected Values and Variances of the OLS Estimators


In Section 2-1, we defined the population model $y = \beta_0 + \beta_1 x + u$, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2-2, 2-3, and 2-4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view $\hat{\beta}_0$ and $\hat{\beta}_1$ as estimators for the parameters $\beta_0$ and $\beta_1$ that appear in the population model. This means that we will study properties of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ over different random samples from the population. (Math Refresher C contains definitions of estimators and reviews some of their important properties.)

2-5a Unbiasedness of OLS


We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future ref-
erence, it is useful to number these assumptions using the prefix “SLR” for simple linear regression.
The first assumption defines the population model.

Assumption SLR.1 Linear in Parameters


In the population model, the dependent variable, y, is related to the independent variable, x, and the
error (or disturbance), u, as

$$y = \beta_0 + \beta_1 x + u, \qquad [2.47]$$

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.

To be realistic, y, x, and u are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2-1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing


y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity
models).
We are interested in using data on y and x to estimate the parameters $\beta_0$ and, especially, $\beta_1$. We assume that our data were obtained as a random sample. (See Math Refresher C for a review of random sampling.)

Assumption SLR.2 Random Sampling


We have a random sample of size n, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, following the population model in equation (2.47).

We will have to address failure of the random sampling assumption in later chapters that deal with
time series analysis and sample selection problems. Not all cross-sectional samples can be viewed as
outcomes of random samples, but many can be.
We can write (2.47) in terms of the random sample as
$$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n, \qquad [2.48]$$
where $u_i$ is the error or disturbance for observation i (for example, person i, firm i, city i, and so on). Thus, $u_i$ contains the unobservables for observation i that affect $y_i$. The $u_i$ should not be confused with the residuals, $\hat{u}_i$, that we defined in Section 2-3. Later on, we will explore the relationship between the errors and the residuals. For interpreting $\beta_0$ and $\beta_1$ in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations.
The relationship (2.48) can be plotted for a particular outcome of data as shown in Figure 2.7.
As we already saw in Section 2-2, the OLS slope and intercept estimates are not defined unless
we have some sample variation in the explanatory variable. We now add variation in the xi to our list
of assumptions.

Figure 2.7 Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$. [Figure: data points $(x_1, y_1)$ and $(x_i, y_i)$ plotted around the population regression function (PRF) $E(y|x) = \beta_0 + \beta_1 x$, with the errors $u_1$ and $u_i$ shown as vertical deviations from the line.]


Assumption SLR.3 Sample Variation in the Explanatory Variable


The sample outcomes on x, namely, $\{x_i: i = 1, \ldots, n\}$, are not all the same value.

This is a very weak assumption—certainly not worth emphasizing, but needed nevertheless. If x
varies in the population, random samples on x will typically contain variation, unless the population
variation is minimal or the sample size is small. Simple inspection of summary statistics on xi reveals
whether Assumption SLR.3 fails: if the sample standard deviation of xi is zero, then Assumption
SLR.3 fails; otherwise, it holds.
Finally, in order to obtain unbiased estimators of $\beta_0$ and $\beta_1$, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2-1. We now explicitly add it to our list of assumptions.

Assumption SLR.4 Zero Conditional Mean


The error u has an expected value of zero given any value of the explanatory variable. In other words,
$$E(u|x) = 0.$$

For a random sample, this assumption implies that $E(u_i|x_i) = 0$, for all $i = 1, 2, \ldots, n$.


In addition to restricting the relationship between u and x in the population, the zero conditional
mean assumption—coupled with the random sampling assumption—allows for a convenient technical
simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional
on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples, which we think of as follows. We first choose n sample values for $x_1, x_2, \ldots, x_n$. (These can be repeated.) Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the $u_i$). Next, another sample of y is obtained, using the same values for $x_1, x_2, \ldots, x_n$. Then another sample of y is obtained, again using the same $x_1, x_2, \ldots, x_n$. And so on.
The fixed-in-repeated-samples scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that $E(u|x) = 0$, and we have random sampling, nothing is lost in derivations by treating the $x_i$ as nonrandom. The danger is that the fixed-in-repeated-samples assumption always implies that $u_i$ and $x_i$ are independent. In deciding when simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.4.

Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact that $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x}) y_i$ (see Math Refresher A) to write the OLS slope estimator in equation (2.19) as
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}. \qquad [2.49]$$
Because we are now interested in the behavior of $\hat{\beta}_1$ across all possible samples, $\hat{\beta}_1$ is properly viewed as a random variable.


We can write $\hat{\beta}_1$ in terms of the population coefficient and errors by substituting the right-hand side of (2.48) into (2.49). We have
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\text{SST}_x} = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\text{SST}_x}, \qquad [2.50]$$
where we have defined the total variation in the $x_i$ as $\text{SST}_x = \sum_{i=1}^n (x_i - \bar{x})^2$ to simplify the notation. (This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra of the summation operator, write the numerator of $\hat{\beta}_1$ as
$$\sum_{i=1}^n (x_i - \bar{x})\beta_0 + \sum_{i=1}^n (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^n (x_i - \bar{x}) u_i$$
$$= \beta_0 \sum_{i=1}^n (x_i - \bar{x}) + \beta_1 \sum_{i=1}^n (x_i - \bar{x}) x_i + \sum_{i=1}^n (x_i - \bar{x}) u_i. \qquad [2.51]$$
As shown in Math Refresher A, $\sum_{i=1}^n (x_i - \bar{x}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x}) x_i = \sum_{i=1}^n (x_i - \bar{x})^2 = \text{SST}_x$. Therefore, we can write the numerator of $\hat{\beta}_1$ as $\beta_1 \text{SST}_x + \sum_{i=1}^n (x_i - \bar{x}) u_i$. Putting this over the denominator gives
$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x}) u_i}{\text{SST}_x} = \beta_1 + (1/\text{SST}_x) \sum_{i=1}^n d_i u_i, \qquad [2.52]$$
where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat{\beta}_1$ equals the population slope, $\beta_1$, plus a term that is a linear combination of the errors $\{u_1, u_2, \ldots, u_n\}$. Conditional on the values of the $x_i$, the randomness in $\hat{\beta}_1$ is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes $\hat{\beta}_1$ to differ from $\beta_1$.
Using the representation in (2.52), we can prove the first important statistical property of OLS.

Theorem 2.1 Unbiasedness of OLS:
Using Assumptions SLR.1 through SLR.4,
$$E(\hat{\beta}_0) = \beta_0 \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1, \qquad [2.53]$$
for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Because $\text{SST}_x$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning. Therefore, from (2.52), and keeping the conditioning on $\{x_1, x_2, \ldots, x_n\}$ implicit, we have
$$E(\hat{\beta}_1) = \beta_1 + E\Big[(1/\text{SST}_x) \sum_{i=1}^n d_i u_i\Big] = \beta_1 + (1/\text{SST}_x) \sum_{i=1}^n E(d_i u_i)$$
$$= \beta_1 + (1/\text{SST}_x) \sum_{i=1}^n d_i E(u_i) = \beta_1 + (1/\text{SST}_x) \sum_{i=1}^n d_i \cdot 0 = \beta_1,$$
where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \ldots, x_n\}$) is zero under Assumptions SLR.2 and SLR.4. Because unbiasedness holds for any outcome on $\{x_1, x_2, \ldots, x_n\}$, unbiasedness also holds without conditioning on $\{x_1, x_2, \ldots, x_n\}$.
The proof for $\hat{\beta}_0$ is now straightforward. Average (2.48) across i to get $\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{u}$, and plug this into the formula for $\hat{\beta}_0$:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat{\beta}_1 \bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}.$$


Then, conditional on the values of the $x_i$,
$$E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x},$$
because $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.4. But, we showed that $E(\hat{\beta}_1) = \beta_1$, which implies that $E[(\hat{\beta}_1 - \beta_1)] = 0$. Thus, $E(\hat{\beta}_0) = \beta_0$. Both of these arguments are valid for any values of $\beta_0$ and $\beta_1$, and so we have established unbiasedness.

Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow “typical,” then our estimate should be “near” the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from $\beta_1$, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Math Refresher C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.
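In the same spirit, here is a small Monte Carlo sketch (our own simulated illustration, not from the text) of Theorem 2.1: under SLR.1 through SLR.4, the OLS slope estimates average out to the true $\beta_1$ across many random samples.

```python
# Monte Carlo check of unbiasedness: average the OLS slope over many
# random samples drawn from a model satisfying SLR.1-SLR.4.
import numpy as np

rng = np.random.default_rng(42)
b0_true, b1_true, n, reps = 1.0, 0.5, 100, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(5, 2, size=n)
    u = rng.normal(0, 1, size=n)       # E(u|x) = 0 by construction
    y = b0_true + b1_true * x + u
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

print(slopes.mean())   # close to 0.5, illustrating unbiasedness
```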
Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to
think about the veracity of each assumption for a particular application. Assumption SLR.1 requires
that y and x be linearly related, with an additive disturbance. This can certainly fail. But we also know
that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of
(2.47) requires more advanced methods that are beyond the scope of this text.
Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series
analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross
section when samples are not representative of the underlying population; in fact, some data sets are
constructed by intentionally oversampling different parts of the population. We will discuss problems
of nonrandom sampling in Chapters 9 and 17.
As we have already discussed, Assumption SLR.3 almost always holds in interesting regression
applications. Without it, we cannot even obtain the OLS estimates.
The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS estimators
are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be biased. There are ways to
determine the likely direction and size of the bias, which we will study in Chapter 3.
The possibility that x is correlated with u is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2-1. Using simple regression when u contains factors affecting y that are also correlated with x can result in spurious correlation: that is, we find a relationship between y and x that is really due to other unobserved factors that affect y and also happen to be correlated with x.

Example 2.12 Student Math Performance and the School Lunch Program
Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a stan-
dardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school
lunch program on student performance. If anything, we expect the lunch program to have a positive
ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat
regular meals becomes eligible for the school lunch program, his or her performance should improve.
Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple
regression model is
$$math10 = \beta_0 + \beta_1\, lnchprg + u, \qquad [2.54]$$
where u contains school and student characteristics that affect overall school performance. Using the
data in MEAP93 on 408 Michigan high schools for the 1992–1993 school year, we obtain
$$\widehat{math10} = 32.14 - 0.319\, lnchprg$$
$$n = 408, \quad R^2 = 0.171.$$


This equation predicts that if student eligibility in the lunch program increases by 10 percentage
points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do
we really believe that higher participation in the lunch program actually causes worse performance?
Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with
lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects
student performance and is highly correlated with eligibility in the lunch program. Variables such as
school quality and resources are also contained in u, and these are likely correlated with lnchprg. It
is important to remember that the estimate –0.319 is only for this particular sample, but its sign and
magnitude make us suspect that u and x are correlated, so that simple regression is biased.

In addition to omitted variables, there are other reasons for x to be correlated with u in the simple
regression model. Because the same issues arise in multiple regression analysis, we will postpone a
systematic treatment of the problem until then.

2-5b Variances of the OLS Estimators


In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about $\beta_1$ ($\hat{\beta}_1$ is unbiased), it is important to know how far we can expect $\hat{\beta}_1$ to be away from $\beta_1$ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Math Refresher C for a more detailed discussion.)
It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1
through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an
assumption that is traditional for cross-sectional analysis. This assumption states that the variance of
the unobservable, u, conditional on x, is constant. This is known as the homoskedasticity or “con-
stant variance” assumption.

Assumption SLR.5 Homoskedasticity


The error u has the same variance given any value of the explanatory variable. In other words,
$$Var(u|x) = \sigma^2.$$

We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, $E(u|x) = 0$. Assumption SLR.4 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for $\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so $E(u|x) = E(u) = 0$ and $Var(u|x) = \sigma^2$. But independence is sometimes too strong of an assumption.
Because $Var(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, we have $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = Var(u)$, because $E(u) = 0$. In other words, $\sigma^2$ is the unconditional variance of u, and so $\sigma^2$ is often called the error variance or disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means that the distribution of the unobservables affecting y is more spread out.


Figure 2.8 The simple regression model under homoskedasticity. [Figure: conditional densities $f(y|x)$ at $x_1$, $x_2$, and $x_3$, each with the same spread, centered on the line $E(y|x) = \beta_0 + \beta_1 x$.]

It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and conditional variance of y:
$$E(y|x) = \beta_0 + \beta_1 x. \qquad [2.55]$$
$$Var(y|x) = \sigma^2. \qquad [2.56]$$
In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8, where $\beta_0 > 0$ and $\beta_1 > 0$.
When $Var(u|x)$ depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant variance). Because $Var(u|x) = Var(y|x)$, heteroskedasticity is present whenever $Var(y|x)$ is a function of x.

Example 2.13 Heteroskedasticity in a Wage Equation


In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that $E(u|educ) = 0$, and this implies $E(wage|educ) = \beta_0 + \beta_1 educ$. If we also make the homoskedasticity assumption, then $Var(u|educ) = \sigma^2$ does not depend on the level of education, which is the same as assuming $Var(wage|educ) = \sigma^2$. Thus, while average wage is allowed to increase with education level—it is this rate of increase that we are interested in estimating—the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.


Figure 2.9 $Var(wage|educ)$ increasing with educ. [Figure: conditional densities $f(wage|educ)$ at educ = 8, 12, and 16, with spread increasing in educ, centered on $E(wage|educ) = \beta_0 + \beta_1 educ$.]

With the homoskedasticity assumption in place, we are ready to prove the following:

With the homoskedasticity assumption in place, we are ready to prove the following:

Theorem 2.2 Sampling Variances of the OLS Estimators
Under Assumptions SLR.1 through SLR.5,
$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sigma^2/\text{SST}_x, \qquad [2.57]$$
and
$$Var(\hat{\beta}_0) = \frac{\sigma^2\, n^{-1} \sum_{i=1}^n x_i^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad [2.58]$$
where these are conditional on the sample values $\{x_1, \ldots, x_n\}$.

PROOF: We derive the formula for $Var(\hat{\beta}_1)$, leaving the other derivation as Problem 10. The starting point is equation (2.52): $\hat{\beta}_1 = \beta_1 + (1/\text{SST}_x) \sum_{i=1}^n d_i u_i$. Because $\beta_1$ is just a constant, and we are conditioning on the $x_i$, $\text{SST}_x$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent random variables across i (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have
$$Var(\hat{\beta}_1) = (1/\text{SST}_x)^2\, Var\Big(\sum_{i=1}^n d_i u_i\Big) = (1/\text{SST}_x)^2 \Big(\sum_{i=1}^n d_i^2\, Var(u_i)\Big)$$
$$= (1/\text{SST}_x)^2 \Big(\sum_{i=1}^n d_i^2 \sigma^2\Big) \quad [\text{because } Var(u_i) = \sigma^2 \text{ for all } i]$$
$$= \sigma^2 (1/\text{SST}_x)^2 \Big(\sum_{i=1}^n d_i^2\Big) = \sigma^2 (1/\text{SST}_x)^2\, \text{SST}_x = \sigma^2/\text{SST}_x,$$
which is what we wanted to show.

Equations (2.57) and (2.58) are the “standard” formulas for simple regression analysis, which are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.
For most purposes, we are interested in $Var(\hat{\beta}_1)$. It is easy to summarize how this variance depends on the error variance, $\sigma^2$, and the total variation in $\{x_1, x_2, \ldots, x_n\}$, $\text{SST}_x$. First, the larger the error variance, the larger is $Var(\hat{\beta}_1)$. This makes sense because more variation in the unobservables affecting y makes it more difficult to precisely estimate $\beta_1$. On the other hand, more variability in the independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat{\beta}_1$ decreases. This also makes intuitive sense because the more spread out is the sample of independent variables, the easier it is to trace out the relationship between $E(y|x)$ and x; that is, the easier it is to estimate $\beta_1$. If there is little variation in the $x_i$, then it can be hard to pinpoint how $E(y|x)$ varies with x. As the sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a smaller variance for $\hat{\beta}_1$.
This analysis shows that, if we are interested in $\beta_1$ and we have a choice, then we should choose the $x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we have this luxury in the social sciences: usually, we must take the $x_i$ that we obtain via random sampling. Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.

Going Further 2.5
Show that, when estimating $\beta_0$, it is best to have $\bar{x} = 0$. What is $Var(\hat{\beta}_0)$ in this case? [Hint: For any sample of numbers, $\sum_{i=1}^n x_i^2 \geq \sum_{i=1}^n (x_i - \bar{x})^2$, with equality only if $\bar{x} = 0$.]
For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, $sd(\hat{\beta}_1)$ and $sd(\hat{\beta}_0)$. Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, $sd(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, where $\sigma$ is the square root of $\sigma^2$, and $\sqrt{\text{SST}_x}$ is the square root of $\text{SST}_x$.
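As a quick numerical check (our own simulated illustration, not from the text), the following sketch holds the $x_i$ fixed in repeated samples and compares the simulated variance of $\hat{\beta}_1$ with $\sigma^2/\text{SST}_x$ from (2.57).

```python
# The sampling variance of the OLS slope matches sigma^2/SST_x when
# the x_i are held fixed across repeated samples. Simulated data.
import numpy as np

rng = np.random.default_rng(7)
n, sigma, b0, b1 = 50, 2.0, 1.0, 0.5
x = rng.uniform(0, 10, size=n)          # fixed in repeated samples
sst_x = np.sum((x - x.mean())**2)

slopes = []
for _ in range(20000):
    y = b0 + b1 * x + rng.normal(0, sigma, size=n)
    slopes.append(np.sum((x - x.mean()) * (y - y.mean())) / sst_x)

print(np.var(slopes), sigma**2 / sst_x)   # the two should be close
```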

2-5c Estimating the Error Variance


The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $Var(\hat{\beta}_1)$ and $Var(\hat{\beta}_0)$. But these formulas are unknown, except in the extremely rare case that $\sigma^2$ is known. Nevertheless, we can use the data to estimate $\sigma^2$, which then allows us to estimate $Var(\hat{\beta}_1)$ and $Var(\hat{\beta}_0)$.
This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, as this distinction is crucial for constructing an estimator of $\sigma^2$. Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error for observation i. We can also express $y_i$ in terms of its fitted value and residual as in equation (2.32): $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$. Comparing these two equations, we see that the error shows

up in the equation containing the population parameters, $\beta_0$ and $\beta_1$. On the other hand, the residuals show up in the estimated equation with $\hat{\beta}_0$ and $\hat{\beta}_1$. The errors are never observed, while the residuals are computed from the data.
We can use equations (2.32) and (2.48) to write the residuals as a function of the errors:
$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i = (\beta_0 + \beta_1 x_i + u_i) - \hat{\beta}_0 - \hat{\beta}_1 x_i,$$
or
$$\hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1) x_i. \qquad [2.59]$$
Although the expected value of $\hat{\beta}_0$ equals $\beta_0$, and similarly for $\hat{\beta}_1$, $\hat{u}_i$ is not the same as $u_i$. The difference between them does have an expected value of zero.

Now that we understand the difference between the errors and the residuals, we can return to estimating $\sigma^2$. First, $\sigma^2 = E(u^2)$, so an unbiased “estimator” of $\sigma^2$ is $n^{-1} \sum_{i=1}^n u_i^2$. Unfortunately, this is not a true estimator, because we do not observe the errors $u_i$. But, we do have estimates of the $u_i$, namely, the OLS residuals $\hat{u}_i$. If we replace the errors with the OLS residuals, we have $n^{-1} \sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/n$. This is a true estimator, because it gives a computable rule for any sample of data on x and y. One slight drawback to this estimator is that it turns out to be biased (although for large n the bias is small). Because it is easy to compute an unbiased estimator, we use that instead.
The estimator SSR/n is biased essentially because it does not account for two restrictions that must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order conditions:
$$\sum_{i=1}^n \hat{u}_i = 0, \quad \sum_{i=1}^n x_i \hat{u}_i = 0. \qquad [2.60]$$
One way to view these restrictions is this: if we know $n - 2$ of the residuals, we can always get the other two residuals by using the restrictions implied by the first order conditions in (2.60). Thus, there are only $n - 2$ degrees of freedom in the OLS residuals, as opposed to n degrees of freedom in the errors. It is important to understand that if we replace $\hat{u}_i$ with $u_i$ in (2.60), the restrictions would no longer hold.
The unbiased estimator of $\sigma^2$ that we will use makes a degrees of freedom adjustment:
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/(n - 2). \qquad [2.61]$$
(This estimator is sometimes denoted as $S^2$, but we continue to use the convention of putting “hats” over estimators.)

Theorem 2.3 Unbiased Estimation of $\sigma^2$
Under Assumptions SLR.1 through SLR.5,
$$E(\hat{\sigma}^2) = \sigma^2.$$

PROOF: If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero, we have $0 = \bar{u} - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)\bar{x}$; subtracting this from (2.59) gives $\hat{u}_i = (u_i - \bar{u}) - (\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Therefore, $\hat{u}_i^2 = (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 (x_i - \bar{x})^2 - 2(u_i - \bar{u})(\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Summing across all i gives $\sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\hat{\beta}_1 - \beta_1) \sum_{i=1}^n u_i (x_i - \bar{x})$. Now, the expected value of the first term is $(n - 1)\sigma^2$, something that is shown in Math Refresher C. The expected value of the second term is simply $\sigma^2$ because $E[(\hat{\beta}_1 - \beta_1)^2] = Var(\hat{\beta}_1) = \sigma^2/\text{SST}_x$. Finally, the third term can be written as $-2(\hat{\beta}_1 - \beta_1)^2 \text{SST}_x$; taking expectations gives $-2\sigma^2$. Putting these three terms together gives $E\big(\sum_{i=1}^n \hat{u}_i^2\big) = (n - 1)\sigma^2 + \sigma^2 - 2\sigma^2 = (n - 2)\sigma^2$, so that $E[\text{SSR}/(n - 2)] = \sigma^2$.

If $\hat{\sigma}^2$ is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of $Var(\hat{\beta}_1)$ and $Var(\hat{\beta}_0)$. Later on, we will need estimators of the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, and this requires estimating $\sigma$. The natural estimator of $\sigma$ is
$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad [2.62]$$
and is called the standard error of the regression (SER). (Other names for $\hat{\sigma}$ are the standard error of the estimate and the root mean squared error, but we will not use these.) Although $\hat{\sigma}$ is not an unbiased estimator of $\sigma$, we can show that it is a consistent estimator of $\sigma$ (see Math Refresher C), and it will serve our purposes well.
The estimate $\hat{\sigma}$ is interesting because it is an estimate of the standard deviation in the unobservables affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out. Most regression packages report the value of $\hat{\sigma}$ along with the R-squared, intercept, slope, and other OLS statistics (under one of the several names listed above). For now, our primary interest is in using $\hat{\sigma}$ to estimate the standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$. Because $sd(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, the natural estimator of $sd(\hat{\beta}_1)$ is
$$se(\hat{\beta}_1) = \hat{\sigma}/\sqrt{\text{SST}_x} = \hat{\sigma} \Big/ \Big(\sum_{i=1}^n (x_i - \bar{x})^2\Big)^{1/2};$$
this is called the standard error of $\hat{\beta}_1$. Note that $se(\hat{\beta}_1)$ is viewed as a random variable when we think of running OLS over different samples of y; this is true because $\hat{\sigma}$ varies with different samples. For a given sample, $se(\hat{\beta}_1)$ is a number, just as $\hat{\beta}_1$ is simply a number when we compute it from the given data.
Similarly, $se(\hat{\beta}_0)$ is obtained from $sd(\hat{\beta}_0)$ by replacing $\sigma$ with $\hat{\sigma}$. The standard error of any estimate gives us an idea of how precise the estimator is. Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.
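The following sketch (simulated data, illustrative only) puts these pieces together: it computes the OLS estimates, the residuals, $\hat{\sigma}^2$ from (2.61), the SER from (2.62), and $se(\hat{\beta}_1)$.

```python
# Computing the SER and se(b1-hat) from a fitted simple regression.
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, size=n)

sst_x = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)                 # OLS residuals u-hat_i
sigma2_hat = np.sum(resid**2) / (n - 2)   # SSR/(n-2), equation (2.61)
ser = np.sqrt(sigma2_hat)                 # standard error of the regression
se_b1 = ser / np.sqrt(sst_x)              # standard error of b1-hat

print(ser, se_b1)
```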

2-6 Regression through the Origin and Regression on a Constant


In rare cases, we wish to impose the restriction that, when $x = 0$, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are settings where a model that originally has a nonzero intercept is transformed into a model without an intercept.
Formally, we now choose a slope estimator, which we call $\tilde{\beta}_1$, and a line of the form
$$\tilde{y} = \tilde{\beta}_1 x, \qquad [2.63]$$
where the tildes over $\beta_1$ and y are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point $x = 0$, $\tilde{y} = 0$. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum of squared residuals:

$$\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2. \qquad [2.64]$$
Using one-variable calculus, it can be shown that $\tilde{\beta}_1$ must solve the first order condition:
$$\sum_{i=1}^n x_i (y_i - \tilde{\beta}_1 x_i) = 0. \qquad [2.65]$$
From this, we can solve for $\tilde{\beta}_1$:
$$\tilde{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}, \qquad [2.66]$$
provided that not all the $x_i$ are zero, a case we rule out.


Note how $\tilde{\beta}_1$ compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, $\bar{x} = 0$. [See equation (2.49) for $\hat{\beta}_1$.] Obtaining an estimate of $\beta_1$ using regression through the origin is not done very often in applied work, and for good reason: if the intercept $\beta_0 \neq 0$, then $\tilde{\beta}_1$ is a biased estimator of $\beta_1$. You will be asked to prove this in Problem 8.
In cases where regression through the origin is deemed appropriate, one must be careful in interpreting the R-squared that is typically reported with such regressions. Usually, unless stated otherwise, the R-squared is obtained without removing the sample average of $\{y_i: i = 1, \ldots, n\}$ in obtaining SST. In other words, the R-squared is computed as

$$1 - \frac{\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^n y_i^2}. \qquad [2.67]$$
The numerator here makes sense because it is the sum of squared residuals, but the denominator acts as if we know the average value of y in the population is zero. One reason this version of the R-squared is used is that if we use the usual total sum of squares, that is, we compute R-squared as
$$1 - \frac{\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}, \qquad [2.68]$$
it can actually be negative. If expression (2.68) is negative then it means that using the sample average $\bar{y}$ to predict $y_i$ provides a better fit than using $x_i$ in a regression through the origin. Therefore, (2.68) is actually more attractive than equation (2.67) because equation (2.68) tells us whether using x is better than ignoring x altogether.
This discussion about regression through the origin, and different ways to measure goodness-of-fit, prompts another question: what happens if we only regress on a constant? That is, what if we set the slope to zero (which means we need not even have an x) and estimate an intercept only? The answer is simple: the intercept is $\bar{y}$. This fact is usually shown in basic statistics: the constant that produces the smallest sum of squared deviations is always the sample average. In this light, equation (2.68) can be seen as comparing regression on x through the origin with regression only on a constant.
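A short sketch (simulated data, our own illustration) of regression through the origin and the two goodness-of-fit measures follows; because the true intercept is nonzero here, the origin-forced slope is biased, and (2.68) may even turn negative.

```python
# Regression through the origin, equation (2.66), and the two R-squared
# versions (2.67) and (2.68). Simulated data with a nonzero intercept.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=100)
y = 4.0 + 0.2 * x + rng.normal(0, 1, size=100)

b1_tilde = np.sum(x * y) / np.sum(x**2)          # slope through the origin
ssr = np.sum((y - b1_tilde * x)**2)

r2_origin = 1 - ssr / np.sum(y**2)               # (2.67): divides by sum of y_i^2
r2_usual = 1 - ssr / np.sum((y - y.mean())**2)   # (2.68): can be negative

print(b1_tilde, r2_origin, r2_usual)
```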

2-7 Regression on a Binary Explanatory Variable


Our discussion so far has centered on the case where the explanatory variable, x, has quantitative
meanings. A few examples include years of schooling, return on equity for a firm, and the percent-
age of students at a school eligible for the federal free lunch program. We know how to interpret the
slope coefficient in each case. We also discussed interpretation of the slope coefficient when we use
the logarithmic transformations of the explained variable, the explanatory variable, or both.
Simple regression can also be applied to the case where x is a binary variable, often called a
dummy variable in the context of regression analysis. As the name “binary variable” suggests, x
takes on only two values, zero and one. These two values are used to put each unit in the population into one of two groups represented by $x = 0$ and $x = 1$. For example, we can use a binary variable to
describe whether a worker participates in a job training program. In the spirit of giving our variables
descriptive names, we might use train to indicate participation: train 5 1 means a person participates;
train 5 0 means the person does not. Given a data set, we add an i subscript, as usual, so traini indi-
cates job training status for a randomly drawn person i.

58860_ch02_hr_019-065.indd 51 10/18/18 4:06 PM


52 PART 1 Regression Analysis with Cross-Sectional Data

If we have a dependent or response variable, y, what does it mean to have a simple regression equation when x is binary? Consider again the equation
$$y = \beta_0 + \beta_1 x + u,$$
but where now x is a binary variable. If we impose the zero conditional mean assumption SLR.4 then we obtain
$$E(y|x) = \beta_0 + \beta_1 x + E(u|x) = \beta_0 + \beta_1 x, \qquad [2.69]$$
just as in equation (2.8). The only difference now is that x can take on only two values. By plugging the values zero and one into (2.69), it is easily seen that
$$E(y|x = 0) = \beta_0 \qquad [2.70]$$
$$E(y|x = 1) = \beta_0 + \beta_1. \qquad [2.71]$$
It follows immediately that
$$\beta_1 = E(y|x = 1) - E(y|x = 0). \qquad [2.72]$$
In other words, $\beta_1$ is the difference in the average value of y over the subpopulations with $x = 1$ and $x = 0$. As with all simple regression analyses, this difference can be descriptive or, in a case discussed in the next subsection, $\beta_1$ can be a causal effect of an intervention or a program.
As an example, suppose that every worker in an hourly wage industry is put into one of two racial categories: white (or Caucasian) and nonwhite. (Clearly this is a very crude way to categorize race, but it has been used in some contexts.) Define the variable white = 1 if a person is classified as Caucasian and zero otherwise. Let wage denote hourly wage. Then
$$\beta_1 = E(wage|white = 1) - E(wage|white = 0)$$
is the difference in average hourly wages between white and nonwhite workers. Equivalently,
$$E(wage|white) = \beta_0 + \beta_1 white.$$
Notice that $\beta_1$ always has the interpretation that it is the difference in average wages between whites and nonwhites. However, it does not necessarily measure wage discrimination because there are many legitimate reasons wages can differ, and some of those—such as education levels—could differ, on average, by race.
The mechanics of OLS do not change just because x is binary. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ be the sample of size n. The OLS intercept and slope estimators are always given by (2.16) and (2.19), respectively. The residuals always have zero mean and are uncorrelated with the $x_i$ in the sample. The definition of R-squared is unchanged. And so on. Nevertheless, because $x_i$ is binary, the OLS estimates have a simple, sensible interpretation. Let $\bar{y}_0$ be the average of the $y_i$ with $x_i = 0$ and $\bar{y}_1$ the average when $x_i = 1$. Problem 2.13 asks you to show that
$$\hat{\beta}_0 = \bar{y}_0 \qquad [2.73]$$
$$\hat{\beta}_1 = \bar{y}_1 - \bar{y}_0. \qquad [2.74]$$
For example, in the wage/race example, if we run the regression
$$wage_i \text{ on } white_i, \quad i = 1, \ldots, n,$$
then $\hat{\beta}_0 = \overline{wage}_0$, the average hourly wage for nonwhites, and $\hat{\beta}_1 = \overline{wage}_1 - \overline{wage}_0$, the difference in average hourly wages between whites and nonwhites. Generally, equation (2.74) shows that the “slope” in the regression is the difference in means, which is a standard estimator from basic statistics when comparing two groups.
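The following sketch (simulated data, our own illustration) confirms (2.73) and (2.74) numerically: with a binary regressor, the OLS intercept is the mean of y in the $x = 0$ group and the slope is the difference in group means.

```python
# With a binary regressor, OLS reduces to a comparison of group means.
import numpy as np

rng = np.random.default_rng(9)
x = rng.integers(0, 2, size=200).astype(float)     # binary regressor
y = 4.0 + 1.5 * x + rng.normal(0, 2, size=200)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

print(b0, y[x == 0].mean())                      # intercept = mean of y when x = 0
print(b1, y[x == 1].mean() - y[x == 0].mean())   # slope = difference in means
```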
The statistical properties of OLS are also unchanged when x is binary. In fact, nowhere is this
ruled out in the statements of the assumptions. Assumption SLR.3 is satisfied provided we see some


zeros and some ones for $x_i$ in our sample. For example, in the wage/race example, we need to observe some whites and some nonwhites in order to obtain $\hat{\beta}_1$.
As with any simple regression analysis, the main concern is the zero conditional mean assumption,
SLR.4. In many cases, this condition will fail because x is systematically related to other factors that
affect y, and those other factors are necessarily part of u. We alluded to this above in discussing differences
in average hourly wage by race: education and workforce experience are two variables that affect hourly
wage that could systematically differ by race. As another example, suppose we have data on SAT scores
for students who did and did not take at least one SAT preparation course. Then x is a binary variable, say,
course, and the outcome variable is the SAT score, sat. The decision to take the preparation course could
be systematically related to other factors that are predictive of SAT scores, such as family income and
parents’ education. A comparison of average SAT scores between the two groups is unlikely to uncover
the causal effect of the preparation course. The framework covered in the next subsection allows us to
determine the special circumstances under which simple regression can uncover a causal effect.

2-7a Counterfactual Outcomes, Causality, and Policy Analysis


Having introduced the notion of a binary explanatory variable, now is a good time to provide a formal
framework for studying counterfactual or potential outcomes, as touched on briefly in Chapter 1. We
are particularly interested in defining a causal effect or treatment effect.
In the simplest case, we are interested in evaluating an intervention or policy that has only two
states of the world: a unit is subjected to the intervention or not. In other words, those not subject
to the intervention or new policy act as a control group and those subject to the intervention as the
treatment group. Using the potential outcomes framework introduced in Chapter 1, for each unit i
in the population we assume there are outcomes in both states of the world, yi (0) and yi (1). We will
never observe any unit in both states of the world but we imagine each unit in both states. For exam-
ple, in studying a job training program, a person does or does not participate. Then yi (0) is earnings if
person i does not participate and yi (1) is labor earnings if i does participate. These outcomes are well
defined before the program is even implemented.
The causal effect, somewhat more commonly called the treatment effect, of the intervention for unit i is simply
$$te_i = y_i(1) - y_i(0), \qquad [2.75]$$
the difference between the two potential outcomes. There are a couple of noteworthy items about $te_i$. First, it is not observed for any unit i because it depends on both counterfactuals. Second, it can be negative, zero, or positive. It could be that the causal effect is negative for some units and positive for others.
We cannot hope to estimate $te_i$ for each unit i. Instead, the focus is typically on the average treatment effect (ATE), also called the average causal effect (ACE). The ATE is simply the average of the treatment effects across the entire population. (Sometimes for emphasis the ATE is called the population average treatment effect.) We can write the ATE parameter as
$$\tau_{ate} = E[te_i] = E[y_i(1) - y_i(0)] = E[y_i(1)] - E[y_i(0)], \qquad [2.76]$$
where the final expression uses linearity of the expected value. Sometimes, to emphasize the population nature of $\tau_{ate}$ we write $\tau_{ate} = E[y(1) - y(0)]$, where $[y(0), y(1)]$ are the two random variables representing the counterfactual outcomes in the population.
For each unit i let $x_i$ be the program participation status—a binary variable. Then the observed outcome, $y_i$, can be written as
$$y_i = (1 - x_i) y_i(0) + x_i y_i(1), \qquad [2.77]$$
which is just shorthand for $y_i = y_i(0)$ if $x_i = 0$ and $y_i = y_i(1)$ if $x_i = 1$. This equation precisely describes why, given a random sample from the population, we observe only one of $y_i(0)$ and $y_i(1)$.


To see how to estimate the average treatment effect, it is useful to rearrange (2.77):
$$y_i = y_i(0) + [y_i(1) - y_i(0)] x_i. \qquad [2.78]$$
Now impose a simple (and, usually, unrealistic) constant treatment effect. Namely, for all i,
$$y_i(1) = \tau + y_i(0), \qquad [2.79]$$
or $\tau = y_i(1) - y_i(0)$. Plugging this into (2.78) gives
$$y_i = y_i(0) + \tau x_i.$$
Now write $y_i(0) = \alpha_0 + u_i(0)$ where, by definition, $\alpha_0 = E[y_i(0)]$ and $E[u_i(0)] = 0$. Plugging this in gives
$$y_i = \alpha_0 + \tau x_i + u_i(0). \qquad [2.80]$$
If we define $\beta_0 = \alpha_0$, $\beta_1 = \tau$, and $u_i = u_i(0)$ then the equation becomes exactly as in equation (2.48):
$$y_i = \beta_0 + \beta_1 x_i + u_i,$$
where $\beta_1 = \tau$ is the treatment (or causal) effect.
We can easily determine that the simple regression estimator, which we now know is the difference-in-means estimator, is unbiased for the treatment effect, $\tau$. If $x_i$ is independent of $u_i(0)$ then
$$E[u_i(0)|x_i] = 0,$$
so that SLR.4 holds. We have already shown that SLR.1 holds in our derivation of (2.80). As usual,
we assume random sampling (SLR.2), and SLR.3 holds provided we have some treated units and
some control units, a basic requirement. It is pretty clear we cannot learn anything about the effect of
the intervention if all sampled units are in the control group or all are in the treatment group.
The assumption that $x_i$ is independent of $u_i(0)$ is the same as assuming $x_i$ is independent of $y_i(0)$. This assumption can be guaranteed only under random assignment, whereby units are assigned to the
treatment and control groups using a randomization mechanism that ignores any features of the indi-
vidual units. For example, in evaluating a job training program, random assignment occurs if a coin
is flipped to determine whether a worker is in the control group or treatment group. (The coin can be
biased in the sense that the probability of a head need not be 0.5.) Random assignment can be com-
promised if units do not comply with their assignment.
Random assignment is the hallmark of a randomized controlled trial (RCT), which has long been considered the gold standard for determining whether medical interventions have causal effects. RCTs generate experimental data of the type discussed in Chapter 1. In recent years, RCTs have become more popular in certain fields in economics, such as development economics and behavioral economics.
Unfortunately, RCTs can be very expensive to implement, and in many cases randomizing subjects into
control and treatment groups raises ethical issues. (For example, if giving low-income families access to
free health care improves child health outcomes then randomizing some families into the control group
means those children will have, on average, worse health outcomes than they could have otherwise.)
Even though RCTs are not always feasible for answering particular questions in economics and
other fields, it is a good idea to think about the experiment one would run if random assignment were
possible. Working through the simple thought experiment typically ensures that one is asking a sen-
sible question before gathering nonexperimental data. For example, if we want to study the effects
of Internet access in rural areas on student performance, we might not have the resources (or ethical
clearance) to randomly assign Internet access to some students and not others. Nevertheless, thinking
about how such an experiment would be implemented sharpens our thinking about the potential out-
comes framework and what we mean by the treatment effect.
Our discussion of random assignment so far shows that, in the context of a constant treatment effect, the simple difference-in-means estimator, $\bar{y}_1 - \bar{y}_0$, is unbiased for $\tau$. We can easily relax the constant treatment effect assumption. In general, the individual treatment effect can be written as
$$te_i = y_i(1) - y_i(0) = \tau_{ate} + [u_i(1) - u_i(0)], \qquad [2.81]$$


where $y_i(1) = \alpha_1 + u_i(1)$ and $\tau_{ate} = \alpha_1 - \alpha_0$. It is helpful to think of $\tau_{ate}$ as the average across the entire population and $u_i(1) - u_i(0)$ as the deviation from the population average for unit i. Plugging (2.81) into (2.78) gives
$$y_i = \alpha_0 + \tau_{ate} x_i + u_i(0) + [u_i(1) - u_i(0)] x_i \equiv \alpha_0 + \tau_{ate} x_i + u_i, \qquad [2.82]$$
where the error term is now
$$u_i = u_i(0) + [u_i(1) - u_i(0)] x_i.$$
The random assignment assumption is now that $x_i$ is independent of $[u_i(0), u_i(1)]$. Even though $u_i$ depends on $x_i$, the zero conditional mean assumption holds:
$$E(u_i|x_i) = E[u_i(0)|x_i] + E[u_i(1) - u_i(0)|x_i] x_i = 0 + 0 \cdot x_i = 0.$$
We have again verified SLR.4, and so we conclude that the simple OLS estimator is unbiased for $\alpha_0$ and $\tau_{ate}$, where $\hat{\tau}_{ate}$ is the difference-in-means estimator. [The error $u_i$ is not independent of $x_i$. In particular, as shown in Problem 2.17, $Var(u_i|x_i)$ differs across $x_i = 1$ and $x_i = 0$ if the variances of the potential outcomes differ. But remember, Assumption SLR.5 is not used to show the OLS estimators are unbiased.]
The fact that the simple regression estimator produces an unbiased estimator of $\tau_{ate}$ when the treatment effects can vary arbitrarily across individual units is a very powerful result. However, it relies heavily on random assignment. Starting in Chapter 3, we will see how multiple regression analysis can be used when pure random assignment does not hold. Chapter 20, available as an online supplement, contains an accessible survey of advanced methods for estimating treatment effects.
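Before turning to the example, here is a small simulation sketch (our own illustration, with made-up parameters) of this result: under random assignment with heterogeneous treatment effects, the difference-in-means estimator averages out to the true ATE.

```python
# With random assignment, the difference-in-means estimator is unbiased
# for the ATE even when treatment effects vary across units.
import numpy as np

rng = np.random.default_rng(11)
N, reps = 500, 2000
ests = np.empty(reps)
for r in range(reps):
    y0 = rng.normal(10, 3, size=N)                 # potential outcome y(0)
    y1 = y0 + rng.normal(2.0, 1.5, size=N)         # heterogeneous effects; ATE = 2
    x = rng.integers(0, 2, size=N)                 # random assignment
    y = np.where(x == 1, y1, y0)                   # observed outcome, eq. (2.77)
    ests[r] = y[x == 1].mean() - y[x == 0].mean()  # difference in means

print(ests.mean())   # close to the true ATE of 2.0
```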

EXAMPLE 2.14 Evaluating a Job Training Program


The data in JTRAIN2 are from an old, experimental job training program, where men with poor labor
market histories were assigned to control and treatment groups. This data set has been used widely in
the program evaluation literature to compare estimates from nonexperimental programs. The training
assignment indicator is train and here we are interested in the outcome re78, which is (real) earnings in
1978 measured in thousands of dollars. Of the 445 men in the sample, 185 participated in the program
in a period prior to 1978; the other 260 men comprise the control group.
The simple regression gives
$$\widehat{re78} = 4.55 + 1.79\, train$$
$$n = 445, \quad R^2 = 0.018.$$
From the earlier discussion, we know that 1.79 is the difference in average re78 between the treated
and control groups, so men who participated in the program earned an average of $1,790 more than
the men who did not. This is an economically large effect, as the dollars are 1978 dollars. Plus, the
average earnings for men who did not participate is $4,550; in percentage terms, the gain in average
earnings is about 39.3%, which is large. (We would need to know the costs of the program to do a benefit-cost analysis, but the benefits are nontrivial.)
Remember that the fundamental issue in program evaluation is that we do not observe any of the units in both states of the world. In this example, we only observe one of the two earnings outcomes for each man. Nevertheless, random assignment into the treatment and control groups allows us to get an unbiased estimator of the average treatment effect.
Two final comments on this example. First, notice the very small R-squared: the training participation indicator explains less than two percent of the variation in re78 in the sample. We should
not be surprised: many other factors, including education, experience, intelligence, age, motivation,
and so on help determine labor market earnings. This is a good example to show how focusing on
R-squared is not only unproductive, but it can be harmful. Beginning students sometimes think a
small R-squared indicates “bias” in the OLS estimators. It does not. It simply means that the variance
in the unobservables, Var(u), is large relative to Var(y). In this example, we know that Assumptions


SLR.1 to SLR.4 hold because of random assignment. Rightfully, none of these assumptions mentions
how large R-squared must be; it is immaterial for the notion of unbiasedness.
A second comment is that, while the estimated economic effect of $1,790 is large, we do not
know whether this estimate is statistically significant. We will come to this topic in Chapter 4.
Before ending this chapter, it is important to head off possible confusion about two different ways
the word “random” has been used in this subsection. First, the notion of random sampling is the one
introduced in Assumption SLR.2 (and also discussed in Math Refresher C). Random sampling means
that the data we obtain are independent, identically distributed draws from the population distribution
represented by the random variables (x, y). It is important to understand that random sampling is a sepa-
rate concept from random assignment, which means that xi is determined independently of the counter-
factuals [yi(0), yi(1)]. In Example 2.14, we obtained a random sample from the relevant population, and
the assignment to treatment and control is randomized. But in other cases, random assignment will not
hold even though we have random sampling. For example, it is relatively easy to draw a random sample
from a large population of college-bound students and obtain outcomes on their SAT scores and whether
they participated in an SAT preparation course. That does not mean that participation in a course is independent of the counterfactual outcomes. If we wanted to ensure independence between participation and the potential outcomes, we would randomly assign the students to take a course or not (and insist that students adhere to their assignments). If instead we obtain retrospective data—that is, we simply record whether a student has taken a preparation course—then the independence assumption underlying random assignment is unlikely to hold. But this has nothing to do with whether we obtained a random sample of students from
the population. The general point is that Assumptions SLR.2 and SLR.4 are very different.

Summary
We have introduced the simple linear regression model in this chapter, and we have covered its basic properties. Given a random sample, the method of ordinary least squares is used to estimate the slope and intercept parameters in the population model. We have demonstrated the algebra of the OLS regression line, including computation of fitted values and residuals, and how to obtain predicted changes in the dependent variable for a given change in the independent variable. In Section 2-4, we discussed two issues of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement of the dependent variable or the independent variable and (2) the use of the natural log to allow for constant elasticity and constant semi-elasticity models.
In Section 2-5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estimators are unbiased. The key assumption is that the error term u has zero mean, or average, given any value of the independent variable x. Unfortunately, there are reasons to think this is false in many social science applications of simple regression, where the omitted factors in u are often correlated with x. When we add the assumption that the variance of the error given x is constant, we get simple formulas for the sampling variances of the OLS estimators. As we saw, the variance of the slope estimator $\hat{\beta}_1$ increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for $\sigma^2 = Var(u)$.
In Section 2-6, we briefly discussed regression through the origin, where the slope estimator is obtained under the assumption that the intercept is zero. Sometimes, this is useful, but it appears infrequently in applied work.
In Section 2-7 we covered the important case where x is a binary variable, and showed that
the OLS “slope” estimate is simply β̂1 = ȳ1 − ȳ0, the difference in the averages of yi between the
xi = 1 and xi = 0 subsamples. We also discussed how, in the context of causal inference, β̂1 is an
unbiased estimator of the average treatment effect under random assignment into the control and
treatment groups. In Chapter 3 and beyond, we will study the case where the intervention or treatment
is not randomized, but depends on observed and even unobserved factors.
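The algebraic fact that the OLS slope on a binary regressor equals the difference in subsample means is easy to confirm numerically. A short sketch (ours, with simulated data and an assumed treatment effect of 1.5):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    x = rng.integers(0, 2, n)                   # random assignment: x = 1 is treatment
    y = 3.0 + 1.5 * x + rng.normal(0, 1, n)     # assumed ATE of 1.5

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    diff = y[x == 1].mean() - y[x == 0].mean()  # ybar_1 minus ybar_0
    print(b1, diff)                             # identical up to floating-point error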
Much work is left to be done. For example, we still do not know how to test hypotheses about the pop-
ulation parameters, β0 and β1. Thus, although we know that OLS is unbiased for the population parameters
under Assumptions SLR.1 through SLR.4, we have no way of drawing inferences about the population.
Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted.
The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression
analysis as well. Because the way we construct confidence intervals and test statistics is very similar for multi-
ple regression—and because simple regression is a special case of multiple regression—our time is better spent
moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose
in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting.

The Gauss-Markov Assumptions for Simple Regression


For convenience, we summarize the Gauss-Markov assumptions that we used in this chapter. It is impor-
tant to remember that only SLR.1 through SLR.4 are needed to show β̂0 and β̂1 are unbiased. We added the
homoskedasticity assumption, SLR.5, to obtain the usual OLS variance formulas (2.57) and (2.58).
Assumption SLR.1 (Linear in Parameters)
In the population model, the dependent variable, y, is related to the independent variable, x, and the error
(or disturbance), u, as
y = β0 + β1x + u,
where β0 and β1 are the population intercept and slope parameters, respectively.
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(xi, yi): i = 1, 2, …, n}, following the population model in
Assumption SLR.1.
Assumption SLR.3 (Sample Variation in the Explanatory Variable)
The sample outcomes on x, namely {xi: i = 1, …, n}, are not all the same value.
Assumption SLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any value of the explanatory variable. In other words,
E(u|x) = 0.
Assumption SLR.5 (Homoskedasticity)
The error u has the same variance given any value of the explanatory variable. In other words,
Var(u|x) = σ².

Key Terms
Average Causal Effect (ACE); Average Treatment Effect (ATE); Binary (Dummy) Variable; Causal (Treatment) Effect;
Coefficient of Determination; Constant Elasticity Model; Control Group; Control Variable; Covariate; Degrees of Freedom;
Dependent Variable; Elasticity; Error Term (Disturbance); Error Variance; Explained Sum of Squares (SSE);
Explained Variable; Explanatory Variable; First Order Conditions; Fitted Value; Gauss-Markov Assumptions;
Heteroskedasticity; Homoskedasticity; Independent Variable; Intercept Parameter; Mean Independent; OLS Regression Line;
Ordinary Least Squares (OLS); Population Regression Function (PRF); Predicted Variable; Predictor Variable;
Random Assignment; Randomized Controlled Trial (RCT); Regressand; Regression through the Origin; Regressor; Residual;
Residual Sum of Squares (SSR); Response Variable; R-squared; Sample Regression Function (SRF); Semi-elasticity;
Simple Linear Regression Model; Slope Parameter; Standard Error of β̂1; Standard Error of the Regression (SER);
Sum of Squared Residuals; Total Sum of Squares (SST); Treatment Group; Zero Conditional Mean Assumption


Problems
1 Let kids denote the number of children ever born to a woman, and let educ denote years of education
for the woman. A simple model relating fertility to years of education is
kids = β0 + β1educ + u,
where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility?
Explain.
2 In the simple linear regression model y = β0 + β1x + u, suppose that E(u) ≠ 0. Letting α0 = E(u),
show that the model can always be rewritten with the same slope, but a new intercept and error, where
the new error has a zero expected value.
3 The following table contains the ACT scores and the GPA (grade point average) for eight college stu-
dents. Grade point average is based on a four-point scale and has been rounded to one digit after the
decimal.

Student   GPA   ACT
1 2.8 21
2 3.4 24
3 3.0 26
4 3.5 27
5 3.6 29
6 3.0 25
7 2.7 25
8 3.7 30

(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and
slope estimates in the equation
GPA = β̂0 + β̂1ACT.
Comment on the direction of the relationship. Does the intercept have a useful interpretation
here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by
five points?
(ii) Compute the fitted values and residuals for each observation, and verify that the residuals
(approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these eight students is explained by ACT? Explain.
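If you want to check your hand calculations for this problem, the following sketch (ours, not part of the exercise) runs the regression on the eight observations from the table:

    import numpy as np

    act = np.array([21, 24, 26, 27, 29, 25, 25, 30], dtype=float)
    gpa = np.array([2.8, 3.4, 3.0, 3.5, 3.6, 3.0, 2.7, 3.7])

    b1 = np.sum((act - act.mean()) * (gpa - gpa.mean())) / np.sum((act - act.mean()) ** 2)
    b0 = gpa.mean() - b1 * act.mean()          # intercept estimate
    resid = gpa - (b0 + b1 * act)              # residuals for part (ii)
    r2 = 1 - np.sum(resid ** 2) / np.sum((gpa - gpa.mean()) ** 2)  # part (iv)
    print(b0, b1, resid.sum(), r2)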
4 The data set BWGHT contains data on births to women in the United States. Two variables of interest
are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average
number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regres-
sion was estimated using data on n = 1,388 births:
bwght = 119.77 − 0.514 cigs
(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per
day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship between the child’s birth
weight and the mother’s smoking habits? Explain.


(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant is about .85. Does
this help reconcile your finding from part (iii)?
5 In the linear consumption function
cons = β̂0 + β̂1inc,
the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, β̂1,
while the average propensity to consume (APC) is cons/inc = β̂0/inc + β̂1. Using observations
for 100 families on annual income and consumption (both measured in dollars), the following
equation is obtained:
cons = −124.84 + 0.853 inc
n = 100, R² = 0.692.
(i) Interpret the intercept in this equation, and comment on its sign and magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.
6 Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and McClain (1995),
the following equation relates housing price (price) to the distance from a recently built garbage incin-
erator (dist):
log(price) = 9.40 + 0.312 log(dist)
n = 135, R² = 0.162.
(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus
elasticity of price with respect to dist? (Think about the city’s decision on where to put
the incinerator.)
(iii) What other factors about a house affect its price? Might these be correlated with distance from
the incinerator?
7 Consider the savings function
sav = β0 + β1inc + u, u = √inc · e,
where e is a random variable with E(e) = 0 and Var(e) = σe². Assume that e is independent
of inc.
(i) Show that E(u|inc) = 0, so that the key zero conditional mean assumption (Assumption SLR.4)
is satisfied. [Hint: If e is independent of inc, then E(e|inc) = E(e).]
(ii) Show that Var(u|inc) = σe²·inc, so that the homoskedasticity Assumption SLR.5 is violated. In
particular, the variance of sav increases with inc. [Hint: Var(e|inc) = Var(e) if e and inc are
independent.]
(iii) Provide a discussion that supports the assumption that the variance of savings increases with
family income.
8 Consider the standard simple regression model y = β0 + β1x + u under the Gauss-Markov
Assumptions SLR.1 through SLR.5. The usual OLS estimators β̂0 and β̂1 are unbiased for their respec-
tive population parameters. Let β̃1 be the estimator of β1 obtained by assuming the intercept is zero
(see Section 2-6).
(i) Find E(β̃1) in terms of the xi, β0, and β1. Verify that β̃1 is unbiased for β1 when the population
intercept (β0) is zero. Are there other cases where β̃1 is unbiased?
(ii) Find the variance of β̃1. (Hint: The variance does not depend on β0.)
(iii) Show that Var(β̃1) ≤ Var(β̂1). [Hint: For any sample of data, ∑_{i=1}^n xi² ≥ ∑_{i=1}^n (xi − x̄)², with
strict inequality unless x̄ = 0.]
(iv) Comment on the tradeoff between bias and variance when choosing between β̂1 and β̃1.
9 (i) Let β̂0 and β̂1 be the intercept and slope from the regression of yi on xi, using n observations. Let
c1 and c2, with c2 ≠ 0, be constants. Let β̃0 and β̃1 be the intercept and slope from the regression
of c1yi on c2xi. Show that β̃1 = (c1/c2)β̂1 and β̃0 = c1β̂0, thereby verifying the claims on units of
measurement in Section 2-4. [Hint: To obtain β̃1, plug the scaled versions of x and y into (2.19).
Then, use (2.17) for β̃0, being sure to plug in the scaled x and y and the correct slope.]
(ii) Now, let β̃0 and β̃1 be from the regression of (c1 + yi) on (c2 + xi) (with no restriction on c1 or c2).
Show that β̃1 = β̂1 and β̃0 = β̂0 + c1 − c2β̂1.
(iii) Now, let β̂0 and β̂1 be the OLS estimates from the regression of log(yi) on xi, where we must
assume yi > 0 for all i. For c1 > 0, let β̃0 and β̃1 be the intercept and slope from the regression
of log(c1yi) on xi. Show that β̃1 = β̂1 and β̃0 = log(c1) + β̂0.
(iv) Now, assuming that xi > 0 for all i, let β̃0 and β̃1 be the intercept and slope from the regression
of yi on log(c2xi). How do β̃0 and β̃1 compare with the intercept and slope from the regression
of yi on log(xi)?
10 Let β̂0 and β̂1 be the OLS intercept and slope estimators, respectively, and let ū be the sample average
of the errors (not the residuals!).
(i) Show that β̂1 can be written as β̂1 = β1 + ∑_{i=1}^n wiui, where wi = di/SSTx and di = xi − x̄.
(ii) Use part (i), along with ∑_{i=1}^n wi = 0, to show that β̂1 and ū are uncorrelated. [Hint: You are
being asked to show that E[(β̂1 − β1)·ū] = 0.]
(iii) Show that β̂0 can be written as β̂0 = β0 + ū − (β̂1 − β1)x̄.
(iv) Use parts (ii) and (iii) to show that Var(β̂0) = σ²/n + σ²(x̄)²/SSTx.
[Hint: SSTx/n = n⁻¹∑_{i=1}^n xi² − (x̄)².]
(v) Do the algebra to simplify the expression in part (iv) to equation (2.58).
11 Suppose you are interested in estimating the effect of hours spent in an SAT preparation course
(hours) on total SAT score (sat). The population is all college-bound high school seniors for a par-
ticular year.
(i) Suppose you are given a grant to run a controlled experiment. Explain how you would structure
the experiment in order to estimate the causal effect of hours on sat.
(ii) Consider the more realistic case where students choose how much time to spend in a prepara-
tion course, and you can only randomly sample sat and hours from the population. Write the
population model as
sat = β0 + β1hours + u
where, as usual in a model with an intercept, we can assume E(u) = 0. List at least two factors
contained in u. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of β1 if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of β0?
12 Consider the problem described at the end of Section 2-6, running a regression and only estimating an
intercept.
(i) Given a sample {yi: i = 1, 2, …, n}, let β̃0 be the solution to
min over b0 of ∑_{i=1}^n (yi − b0)².
Show that β̃0 = ȳ, that is, the sample average minimizes the sum of squared residuals. (Hint:
You may use one-variable calculus or you can show the result directly by adding and subtract-
ing ȳ inside the squared residual and then doing a little algebra.)
(ii) Define residuals ũi = yi − ȳ. Argue that these residuals always sum to zero.


13 Let y be any response variable and x a binary explanatory variable. Let {(xi, yi): i = 1, …, n} be a
sample of size n. Let n0 be the number of observations with xi = 0 and n1 the number of observations
with xi = 1. Let ȳ0 be the average of the yi with xi = 0 and ȳ1 the average of the yi with xi = 1.
(i) Explain why we can write
n0 = ∑_{i=1}^n (1 − xi), n1 = ∑_{i=1}^n xi.
Show that x̄ = n1/n and (1 − x̄) = n0/n. How do you interpret x̄?
(ii) Argue that
ȳ0 = n0⁻¹ ∑_{i=1}^n (1 − xi)yi, ȳ1 = n1⁻¹ ∑_{i=1}^n xiyi.
(iii) Show that the average of yi in the entire sample, ȳ, can be written as a weighted average:
ȳ = (1 − x̄)ȳ0 + x̄ȳ1.
[Hint: Write yi = (1 − xi)yi + xiyi.]
(iv) Show that when xi is binary,
n⁻¹ ∑_{i=1}^n xi² − (x̄)² = x̄(1 − x̄).
[Hint: When xi is binary, xi² = xi.]
(v) Show that
n⁻¹ ∑_{i=1}^n xiyi − x̄ȳ = x̄(1 − x̄)(ȳ1 − ȳ0).
(vi) Use parts (iv) and (v) to obtain (2.74).
(vii) Derive equation (2.73).
14 In the context of Problem 2.13, suppose yi is also binary. For concreteness, yi indicates whether
worker i is employed after a job training program: yi = 1 means the worker has a job, and yi = 0 means
the worker does not. Here, xi indicates participation in the job training program. Argue that β̂1 is the
difference in employment rates between those who participated in the program and those who did not.
15 Consider the potential outcomes framework from Section 2-7a, where yi(0) and yi(1) are the potential
outcomes in each treatment state.
(i) Show that if we could observe yi(0) and yi(1) for all i, then an unbiased estimator of τate would be
n⁻¹ ∑_{i=1}^n [yi(1) − yi(0)] = ȳ(1) − ȳ(0).
This is sometimes called the sample average treatment effect.
(ii) Explain why the observed sample averages, ȳ0 and ȳ1, are not the same as ȳ(0) and ȳ(1),
respectively, by writing ȳ0 and ȳ1 in terms of yi(0) and yi(1), respectively.
16 In the potential outcomes framework, suppose that program eligibility is randomly assigned but par-
ticipation cannot be enforced. To formally describe this situation, for each person i, zi is the eligibil-
ity indicator and xi is the participation indicator. Randomized eligibility means zi is independent of
[yi(0), yi(1)] but xi might not satisfy the independence assumption.
(i) Explain why the difference in means estimator is generally no longer unbiased.
(ii) In the context of a job training program, what kind of individual behavior would cause bias?
17 In the potential outcomes framework with heterogeneous (nonconstant) treatment effect, write the
error as
ui = (1 − xi)ui(0) + xiui(1).
Let σ0² = Var[ui(0)] and σ1² = Var[ui(1)]. Assume random assignment.
(i) Find Var(ui|xi).
(ii) When is Var(ui|xi) constant?
18 Let x be a binary explanatory variable and suppose P(x = 1) = ρ for 0 < ρ < 1.
(i) If you draw a random sample of size n, find the probability—call it γn—that Assumption SLR.3
fails. [Hint: Find the probability of observing all zeros or all ones for the xi.] Argue that γn → 0
as n → ∞.
(ii) If ρ = 0.5, compute the probability in part (i) for n = 10 and n = 100. Discuss.
(iii) Do the calculations from part (ii) with ρ = 0.9. How do your answers compare with part (ii)?
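Since the hint says SLR.3 fails only when the sample is all zeros or all ones, under independent draws the failure probability is γn = ρⁿ + (1 − ρ)ⁿ. A one-function sketch (ours) for checking parts (ii) and (iii):

    def gamma(rho, n):
        # P(all ones) + P(all zeros) for n independent Bernoulli(rho) draws
        return rho ** n + (1 - rho) ** n

    for rho in (0.5, 0.9):
        print(rho, gamma(rho, 10), gamma(rho, 100))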

Computer Exercises
C1 The data in 401K are a subset of data analyzed by Papke (1995) to study the relationship between
participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the per-
centage of eligible workers with an active account; this is the variable we would like to explain. The
measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm
contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50,
then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
(i) Find the average participation rate and the average match rate in the sample of plans.
(ii) Now, estimate the simple regression equation
prate = β̂0 + β̂1mrate,
and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is
happening here.
(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?
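The computer exercises can be done in any statistical package. As one possibility, here is a hedged Python sketch for C1; it assumes the 401K data have been exported to a file named 401k.csv with columns prate and mrate (the file name and format are our assumption, not part of the exercise):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("401k.csv")                    # assumed export of the 401K data
    print(df["prate"].mean(), df["mrate"].mean())   # part (i)

    X = sm.add_constant(df["mrate"])                # add the intercept column
    res = sm.OLS(df["prate"], X).fit()              # regress prate on mrate
    print(res.params, int(res.nobs), res.rsquared)  # part (ii)
    print(res.predict([1, 3.5]))                    # part (iv): prediction at mrate = 3.5

The same pattern (load the data, add a constant, fit by OLS) carries over to the remaining computer exercises.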
C2 The data set in CEOSAL2 contains information on chief executive officers for U.S. corporations. The
variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as
company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure
as a CEO?
(iii) Estimate the simple regression model
log(salary) = β0 + β1ceoten + u,
and report your results in the usual form. What is the (approximate) predicted percentage
increase in salary given one more year as a CEO?
C3 Use the data in SLEEP75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff
between the time spent sleeping per week and the time spent in paid work. We could use either variable
as the dependent variable. For concreteness, estimate the model
sleep = β0 + β1totwrk + u,
where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked dur-
ing the week.
(i) Report your results in equation form along with the number of observations and R2. What does
the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a
large effect?


C4 Use the data in WAGE2 to estimate a simple regression explaining monthly salary (wage) in terms of
IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the sample standard deviation
of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard
deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a con-
stant dollar amount. Use this model to find the predicted increase in wage for an increase in
IQ of 15 points. Does IQ explain most of the variation in wage?
(iii) Now, estimate a model where each one-point increase in IQ has the same percentage effect on
wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?
C5 For the population of firms in the chemical industry, let rd denote annual expenditures on research and
development, and let sales denote annual sales (both are in millions of dollars).
(i) Write down a model (not an estimated equation) that implies a constant elasticity between
rd and sales. Which parameter is the elasticity?
(ii) Now, estimate the model using the data in RDCHEM. Write out the estimated equation in the
usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what
this elasticity means.
C6 We used the data in MEAP93 for Example 2.12. Now we want to explore the relationship between the
math pass rate (math10) and spending per student (expend).
(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a dimin-
ishing effect seem more appropriate? Explain.
(ii) In the population model
math10 = β0 + β1log(expend) + u,
argue that β1/10 is the percentage point change in math10 given a 10% increase in expend.
(iii) Use the data in MEAP93 to estimate the model from part (ii). Report the estimated equation in
the usual way, including the sample size and R-squared.
(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the
estimated percentage point increase in math10?
(v) One might worry that regression analysis can produce fitted values for math10 that are greater
than 100. Why is this not much of a worry in this data set?
C7 Use the data in CHARITY [obtained from Franses and Paap (2001)] to answer the following questions:
(i) What is the average gift in the sample of 4,268 people (in Dutch guilders)? What percentage of
people gave no gift?
(ii) What is the average mailings per year? What are the minimum and maximum values?
(iii) Estimate the model
gift = β0 + β1mailsyear + u
by OLS and report the results in the usual way, including the sample size and R-squared.
(iv) Interpret the slope coefficient. If each mailing costs one guilder, is the charity expected to make a
net gain on each mailing? Does this mean the charity makes a net gain on every mailing? Explain.
(v) What is the smallest predicted charitable contribution in the sample? Using this simple regres-
sion analysis, can you ever predict zero for gift?
C8 To complete this exercise you need a software package that allows you to generate data from the uni-
form and normal distributions.
(i) Start by generating 500 observations on xi—the explanatory variable—from the uniform dis-
tribution with range [0,10]. (Most statistical packages have a command for the Uniform(0,1)
distribution; just multiply those observations by 10.) What are the sample mean and sample
standard deviation of the xi?


(ii) Randomly generate 500 errors, ui, from the Normal(0,36) distribution. (If you generate a
Normal(0,1), as is commonly available, simply multiply the outcomes by six.) Is the sample
average of the ui exactly zero? Why or why not? What is the sample standard deviation of the ui?
(iii) Now generate the yi as
yi = 1 + 2xi + ui ≡ β0 + β1xi + ui;
that is, the population intercept is one and the population slope is two. Use the data to run the
regression of yi on xi. What are your estimates of the intercept and slope? Are they equal to the
population values in the above equation? Explain.
(iv) Obtain the OLS residuals, ûi, and verify that equation (2.60) holds (subject to rounding error).
(v) Compute the same quantities in equation (2.60) but use the errors ui in place of the residuals.
Now what do you conclude?
(vi) Repeat parts (i), (ii), and (iii) with a new sample of data, starting with generating the xi. Now
what do you obtain for β̂0 and β̂1? Why are these different from what you obtained in part (iii)?
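A minimal sketch of parts (i)–(iii) in Python (one possible package; the seed is ours, so your numbers will differ):

    import numpy as np

    rng = np.random.default_rng(123)
    x = 10 * rng.uniform(size=500)           # part (i): Uniform[0, 10]
    u = 6 * rng.standard_normal(500)         # part (ii): Normal(0, 36)
    y = 1 + 2 * x + u                        # part (iii): beta0 = 1, beta1 = 2

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(x.mean(), x.std(), u.mean(), u.std())
    print(b0, b1)   # close to, but not exactly, 1 and 2 because of sampling error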
C9 Use the data in COUNTYMURDERS to answer these questions. Use only the data for 1996.
(i) How many counties had zero murders in 1996? How many counties had at least one execution?
What is the largest number of executions?
(ii) Estimate the equation
murders = β0 + β1execs + u
by OLS and report the results in the usual way, including sample size and R-squared.
(iii) Interpret the slope coefficient reported in part (ii). Does the estimated equation suggest a deter-
rent effect of capital punishment?
(iv) What is the smallest number of murders that can be predicted by the equation? What is the
residual for a county with zero executions and zero murders?
(v) Explain why a simple regression analysis is not well suited for determining whether capital pun-
ishment has a deterrent effect on murders.
C10 The data set in CATHOLIC includes test score information on over 7,000 students in the United States
who were in eighth grade in 1988. The variables math12 and read12 are scores on twelfth grade stan-
dardized math and reading tests, respectively.
(i) How many students are in the sample? Find the means and standard deviations of math12 and
read12.
(ii) Run the simple regression of math12 on read12 to obtain the OLS intercept and slope estimates.
Report the results in the form
math12 = β̂0 + β̂1read12
n = ?, R² = ?
where you fill in the values for β̂0 and β̂1 and also replace the question marks.
(iii) Does the intercept reported in part (ii) have a meaningful interpretation? Explain.
(iv) Are you surprised by the β̂1 that you found? What about R²?
(v) Suppose that you present your findings to a superintendent of a school district, and the
superintendent says, “Your findings show that to improve math scores we just need to
improve reading scores, so we should hire more reading tutors.” How would you respond
to this comment? (Hint: If you instead run the regression of read12 on math12, what would
you expect to find?)
C11 Use the data in GPA1 to answer these questions. It is a sample of Michigan State University undergrad-
uates from the mid-1990s, and includes current college GPA, colGPA, and a binary variable indicating
whether the student owned a personal computer (PC).
(i) How many students are in the sample? Find the average and highest college GPAs.


(ii) How many students owned their own PC?


(iii) Estimate the simple regression equation
colGPA = β0 + β1PC + u
and report your estimates of β0 and β1. Interpret these estimates, including a discussion of the
magnitudes.
(iv) What is the R-squared from the regression? What do you make of its magnitude?
(v) Does your finding in part (iii) imply that owning a PC has a causal effect on colGPA? Explain.

Appendix 2A

Minimizing the Sum of Squared Residuals


We show that the OLS estimates β̂0 and β̂1 do minimize the sum of squared residuals, as asserted
in Section 2-2. Formally, the problem is to characterize the solutions β̂0 and β̂1 to the minimization
problem
min over b0, b1 of ∑_{i=1}^n (yi − b0 − b1xi)²,
where b0 and b1 are the dummy arguments for the optimization problem; for simplicity, call this
function Q(b0, b1). By a fundamental result from multivariable calculus (see Math Refresher A), a
necessary condition for β̂0 and β̂1 to solve the minimization problem is that the partial derivatives of
Q(b0, b1) with respect to b0 and b1 must be zero when evaluated at β̂0, β̂1: ∂Q(β̂0, β̂1)/∂b0 = 0 and
∂Q(β̂0, β̂1)/∂b1 = 0. Using the chain rule from calculus, these two equations become
−2 ∑_{i=1}^n (yi − β̂0 − β̂1xi) = 0
−2 ∑_{i=1}^n xi(yi − β̂0 − β̂1xi) = 0.
These two equations are just (2.14) and (2.15) multiplied by −2n and, therefore, are solved by the
same β̂0 and β̂1.
How do we know that we have actually minimized the sum of squared residuals? The first order
conditions are necessary but not sufficient conditions. One way to verify that we have minimized the
sum of squared residuals is to write, for any b0 and b1,
Q(b0, b1) = ∑_{i=1}^n [yi − β̂0 − β̂1xi + (β̂0 − b0) + (β̂1 − b1)xi]²
          = ∑_{i=1}^n [ûi + (β̂0 − b0) + (β̂1 − b1)xi]²
          = ∑_{i=1}^n ûi² + n(β̂0 − b0)² + (β̂1 − b1)² ∑_{i=1}^n xi² + 2(β̂0 − b0)(β̂1 − b1) ∑_{i=1}^n xi,
where we have used equations (2.30) and (2.31). The first term does not depend on b0 or b1, while
the sum of the last three terms can be written as
∑_{i=1}^n [(β̂0 − b0) + (β̂1 − b1)xi]²,
as can be verified by straightforward algebra. Because this is a sum of squared terms, the
smallest it can be is zero. Therefore, it is smallest when b0 = β̂0 and b1 = β̂1.
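The conclusion can also be checked numerically: moving either argument away from the OLS solution can only increase the criterion Q. A small sketch (ours, with simulated data):

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0, 5, 100)
    y = 2 + 0.8 * x + rng.normal(0, 1, 100)

    def Q(b0, b1):
        # sum of squared residuals at candidate values (b0, b1)
        return np.sum((y - b0 - b1 * x) ** 2)

    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0_hat = y.mean() - b1_hat * x.mean()

    print(Q(b0_hat, b1_hat))                                 # the minimized SSR
    print(Q(b0_hat + 0.1, b1_hat), Q(b0_hat, b1_hat - 0.1))  # both strictly larger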
