Lecture Note 11 Panel Analysis
There are two types of panel data sets: a pooled cross section data set and a
longitudinal data set. A pooled cross section data set is a set of cross-sectional data
across time from the same population but independently sampled observations each time.
A longitudinal data set follows the same individuals, households, firms, cities, regions, or
countries over time.
Many governments conduct nationwide cross-sectional surveys every year or every few
years. A census is an example of such surveys. We can create a pooled cross section
data set by combining these cross-sectional surveys over time. Because these surveys are
often readily available from governments (though not always, since many government
officials see no benefit in making publicly funded surveys public!), it is relatively easy
to obtain pooled cross section data.
Longitudinal data, however, can provide much more detailed information, as we will see
in this lecture. Because longitudinal data follow the same samples over time, we can
analyze behavioral changes of those samples over time.
Nonetheless, pooled cross section data can provide information that a single cross
section cannot.
With pooled cross section data, we can examine changes in coefficients over time. For
instance,

yit = β0 + δ0Tit + β1xit1 + … + βkxitk + uit

for t = 1, 2 and i = 1, 2, …, N, N + 1, N + 2, …, 2N,

where Tit = 1 if t = 2 and Tit = 0 if t = 1.
The coefficient of the time dummy Tit measures a change in the constant term over time.
If we are interested in a change in a potential effect of one of the variables, then we can
use an interaction term between the time dummy and one of the variables:
yit = β0 + δ0Tit + β1xit1 + δ1(Tit × xit1) + β2xit2 + … + βkxitk + uit

δ1 measures a change in the coefficient of x1 over time.
What if there are changes in all of the coefficients over time? To examine whether there is
a structural change, we can use the Chow test. To conduct the Chow test, consider the
following model:
yit = β0 + β1xit1 + … + βkxitk + uit for t = 1, 2.

We consider this model as a restricted model because we impose the restriction that
all the coefficients remain the same over time. There are k + 1 restrictions (in this case,
with k = 4, five restrictions).
The unrestricted models estimate the two periods separately:

yi1 = δ0 + δ1xi11 + … + δkxi1k + ui1

and

yi2 = α0 + α1xi21 + … + αkxi2k + ui2.

The coefficients of the first model (t = 1) are not restricted to be the same as in the second
model (t = 2). If all of the coefficients remain the same over time, i.e., δj = αj = βj for all j,
then the sum of squared residuals from the restricted model (SSRr) should be equal to the
sum of the sums of squared residuals from the two unrestricted models (SSRur1 + SSRur2).
On the other hand, if there is a structural change, i.e., changes in the coefficients over
time, then the sum of SSRur1 and SSRur2 should be smaller than SSRr, because
unrestricted coefficients in unrestricted models should match the data more precisely than
the restricted model. Then we take the difference between SSRr and (SSRur1 + SSRur2)
and examine whether the difference is statistically significant:

F = [(SSRr − (SSRur1 + SSRur2)) / (k + 1)] / [(SSRur1 + SSRur2) / (n − 2(k + 1))]

Under the null hypothesis of no structural change, this statistic follows an F distribution
with (k + 1, n − 2(k + 1)) degrees of freedom, where n is the total number of observations.
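The steps above can be sketched in a few lines of code. This is a minimal illustration with simulated data (the numbers and variable names are invented, not from the lecture): it computes SSRr from the pooled regression, SSRur1 + SSRur2 from separate per-period regressions, and forms the F statistic.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from OLS of y on X plus an intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid)

def chow_f(X1, y1, X2, y2):
    """Chow F statistic for a structural change between two periods."""
    k = X1.shape[1]                     # number of slope coefficients
    n = len(y1) + len(y2)               # total observations across both periods
    ssr_r = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))   # restricted (pooled)
    ssr_ur = ssr(X1, y1) + ssr(X2, y2)                           # unrestricted (separate)
    return ((ssr_r - ssr_ur) / (k + 1)) / (ssr_ur / (n - 2 * (k + 1)))

# Simulated example: in the second data set the coefficients change.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(200, 2)), rng.normal(size=(200, 2))
y1 = 1.0 + X1 @ np.array([2.0, -1.0]) + rng.normal(size=200)
y2_same = 1.0 + X2 @ np.array([2.0, -1.0]) + rng.normal(size=200)
y2_break = 3.0 + X2 @ np.array([4.0, 1.0]) + rng.normal(size=200)

print(chow_f(X1, y1, X2, y2_same))    # small F: no structural change
print(chow_f(X1, y1, X2, y2_break))   # large F: coefficients differ across periods
```

The restricted model pools both periods with common coefficients; the unrestricted fit runs the two periods separately, which is why k + 1 degrees of freedom are lost twice in the denominator.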
Difference-in-Differences Estimator
[Figure: child health over time (periods 1 and 2) for communities with and without
government investments. The communities with investments (Z = 1) improve, e.g., up to
E(Hi2 : Z = 1), but remain below the communities without investments.]
The problem is that the government built health facilities in communities with poor child
health.
In the figure above, the child health in poor community i with the government
investments (Z) has improved over time, but its absolute level is still not as good as the
child health in rich communities without the government investments. Thus, an OLS
model with a dummy variable for the government investments in health facilities will
find a negative coefficient on Z:

Hit = β0 + β1Zit + uit (6)

for i = 1, …, N communities.
When we find a negative coefficient, or an effect opposite to what is expected, we call
this reverse causality.
From the figure, it is obvious that we need to measure a difference between the two
groups for each time period and measure a net change in the differences over time:
δ = [E(Hi2 : Z = 1) − E(Hi2 : Z = 0)] − [E(Hi1 : Z = 1) − E(Hi1 : Z = 0)] (7)
Although both differences are negative, the difference between the two groups in the
second period is much smaller than the difference in the first period. Thus, the net
change is positive, which measures the net impact of Z on H. We call the δ in (7) the
difference-in-differences (DID) estimator.
In a regression form, the DID estimator δ is the coefficient on the interaction term
between a time dummy T (T = 1 in period 2, T = 0 in period 1) and the program dummy Zi:

Hit = β0 + β2T + β1Zi + δ(T × Zi) + uit (8)
We can think of this example as a kind of omitted variables problem. We can rewrite
(6) as

Hit = β0 + β1Zi + vi + uit (9)

where vi is an unobserved community fixed effect.
Let’s go back to the DID estimator and rearrange it so that the first term measures a
difference in Hit of community i over time:

δ = [E(Hi2 : Z = 1) − E(Hi1 : Z = 1)] − [E(Hi2 : Z = 0) − E(Hi1 : Z = 0)] (10)

Here the first term measures the change over time for the treatment group (T) and the
second term measures the change over time for the comparison group (C).
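As a numerical illustration of this rearranged estimator, here is a minimal sketch with hypothetical group-period means (all numbers are invented, not from any data set):

```python
# Hypothetical child-health means by period (1, 2) and program status (Z = 1, Z = 0).
mean_h = {
    ("Z1", 1): 50.0,   # E(H_i1 : Z = 1), treated communities start lower
    ("Z1", 2): 58.0,   # E(H_i2 : Z = 1)
    ("Z0", 1): 70.0,   # E(H_i1 : Z = 0)
    ("Z0", 2): 73.0,   # E(H_i2 : Z = 0)
}

# Change over time within each group...
change_treated = mean_h[("Z1", 2)] - mean_h[("Z1", 1)]   # 8.0
change_control = mean_h[("Z0", 2)] - mean_h[("Z0", 1)]   # 3.0

# ...and the net change is the DID estimate.
did = change_treated - change_control
print(did)   # 5.0

# Equivalently, the group gaps are -20 in period 1 and -15 in period 2;
# the narrowing of the gap gives the same estimate: (-15) - (-20) = 5.
```

Both differences between the groups are negative, but the gap shrinks over time, so the DID estimate is positive even though the treated communities never catch up in levels.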
In a regression form, we can also rearrange (8). Let’s write the equation (8) with an
unobserved fixed effect:
Hit = β0 + β2T + β1Zi + δ(T × Zi) + vi + uit (11)

Now, the problem is that Z could be correlated with vi, which may also be correlated
with Hit. For the first period (thus T = 0), equation (11) is

Hi1 = β0 + β1Zi + vi + ui1

and for the second period (T = 1),

Hi2 = (β0 + β2) + (β1 + δ)Zi + vi + ui2.

Subtracting the first from the second gives the first-differenced equation

Hi2 − Hi1 = β2 + δZi + ui2 − ui1 (12)
Notice that the unobserved fixed effect vi has been eliminated from this model because
it is fixed over time. In the first-differenced equation (12), Z will not be correlated
with the error term.
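A minimal simulation (all parameter values assumed) makes this concrete: when the program Z is targeted at communities with low unobserved vi, a levels regression of H on Z is biased well below the true effect, while the first-differenced regression (12) recovers δ.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
beta2, delta = 1.0, 2.0              # assumed time effect and true program effect

v = rng.normal(size=n)               # unobserved community fixed effect
z = (v < -0.5).astype(float)         # program targets communities with low v
h1 = 3.0 + 1.0 * z + v + rng.normal(size=n)                      # period 1 (T = 0)
h2 = (3.0 + beta2) + (1.0 + delta) * z + v + rng.normal(size=n)  # period 2 (T = 1)

def ols(X, y):
    """OLS coefficients of y on [1, X]."""
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)
    return coef

# Levels regression of H on Z, pooling both periods: biased because Cov(Z, v) < 0.
b_levels = ols(np.concatenate([z, z]), np.concatenate([h1, h2]))[1]

# First-differenced regression (12): v drops out, so the slope recovers delta.
b_fd = ols(z, h2 - h1)
print(b_levels)   # well below the true program effect
print(b_fd)       # roughly [beta2, delta] = [1.0, 2.0]
```

The first-difference step is doing exactly what the algebra above shows: the constant picks up β2 and the coefficient on Zi picks up δ, with vi gone.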
From this point of view, it is obvious that under a nonrandom assignment of Z (a
quasi-experimental design), δ in (8) could be biased, because Z (a program indicator)
could be correlated with unobserved factors that may also be correlated with H (the
dependent variable).
Thus, we have dealt with an omitted variable problem by taking a difference over time.
Next, we study the omitted fixed effect problem in general.
The Omitted Variables Problem Revisited
Suppose the true model is

y = Xβ + u = X1β1 + X2β2 + u

but we regress y on X1 only. Then

β̂1 = (X1′X1)−1X1′y = (X1′X1)−1X1′(X1β1 + X2β2 + u)
   = β1 + (X1′X1)−1X1′X2β2 + (X1′X1)−1X1′u

and, taking expectations,

E(β̂1) = β1 + (X1′X1)−1X1′X2β2 = β1 + δ̂12β2
Note, however, that δ̂12 = (X1′X1)−1X1′X2 is the matrix of slope coefficients from least
squares regressions of the columns of X2 on the columns of X1. Thus, β̂1 is biased
unless X1 and X2 are orthogonal (X1′X2 = 0) or β2 = 0.
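This decomposition is easy to verify numerically. The sketch below (simulated data, invented parameter values) runs the short regression of y on x1 alone and checks it against β1 + δ̂12β2:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
beta1, beta2 = 1.5, 2.0              # assumed true coefficients

x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)   # x2 is correlated with x1
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Short regression of y on x1 alone (x2 omitted; no intercept for simplicity).
b1_short = (x1 @ y) / (x1 @ x1)

# Auxiliary regression of x2 on x1 gives the slope delta_12.
d12 = (x2 @ x1) / (x1 @ x1)

# Omitted-variables formula: the short coefficient is beta1 + d12 * beta2.
print(b1_short, beta1 + d12 * beta2)
```

Because x2 loads positively on x1 and β2 > 0, the short regression overstates β1, exactly as the formula predicts.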
To overcome the omitted variables problem, we can take two different approaches. The
first is to use panel data. As we will see later, by using panel (longitudinal) data, we can
eliminate unobserved variables that are specific to each sample and fixed (time-invariant,
or time-constant) over time. Note, however, that this only eliminates the correlation
between the independent variables and the fixed effect. If the independent variables are
correlated with the error term, which contains time-varying unobserved characteristics,
then the estimated coefficients will still be biased.
The second approach is to use instrumental variables that are correlated with the
independent variables suspected of being correlated with unobserved variables, but
uncorrelated with the error term. Unlike the fixed effects model, the IV method
eliminates any correlation between the independent variables and the error term. Thus,
this method is theoretically appealing. The major problem with the IV method is the
availability of plausible instrumental variables that are sufficiently correlated with the
endogenous variables and uncorrelated with the error term. Often, if not always, it is
very difficult to find plausible instrumental variables. We will discuss problems with the
IV method elsewhere in the lecture notes.
Linear Unobserved Effects
What are unobserved variables? It is impossible to collect all variables in surveys that
affect people’s economic activities. Thus, it is inevitable to have unobserved variables in
our estimation models. What, then, should we do? First, we should start by
characterizing possible unobserved variables.
The most common type of unobserved variable is a fixed effect. A fixed effect is a
time-invariant characteristic of an individual or a group (cluster). For instance, ai may
represent a fixed characteristic of individual i, and aj may represent a fixed characteristic
of a group (cluster) j; the latter could be a regional fixed effect or a cluster fixed effect.
Suppose we want to estimate the following model with a group fixed effect:

yij = β0 + β1xij1 + … + βkxijk + aj + uij (13)

In this case, as long as the unobserved variables (that are correlated with the independent
variables and the dependent variable) are fixed characteristics of groups, we can eliminate
the omitted variables problem by explicitly including group dummies:

yij = β0 + β1xij1 + … + βkxijk + Σj γjDj + uij (14)

where Dj is a dummy variable for group j (one group is omitted as the base category).
If we have multiple observations for each sample (thus we need longitudinal data, not
pooled cross-sectional data over time), then it is possible to include n − 1 dummies for
s × n observations, where s is the number of observations per sample and n the number
of samples. Thus, we estimate equation (14), which is called the Dummy Variable
Regression model. In this model, we eliminate the unobserved fixed effects by explicitly
including individual dummy variables.
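A minimal sketch of the dummy variable regression with simulated data (all values invented): even though x is correlated with the individual fixed effects, including a full set of individual dummies recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 200, 4                        # n individuals, s observations each
a = rng.normal(size=n)               # individual fixed effects
ids = np.repeat(np.arange(n), s)

x = a[ids] + rng.normal(size=n * s)  # x is correlated with the fixed effect
y = 2.0 * x + a[ids] + rng.normal(size=n * s)

# Dummy variable regression: x plus a full set of n individual dummies.
# The dummies absorb the intercept, so no separate constant is needed.
D = (ids[:, None] == np.arange(n)[None, :]).astype(float)
coef, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
print(coef[0])   # close to the true slope 2.0
```

A plain regression of y on x alone would be biased upward here, because the omitted fixed effect enters both x and y; the dummies soak it up.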
A different way of eliminating the fixed effects is to use the first difference model, as
we have seen earlier. Here let us reconsider the first difference model in a general
treatment. Suppose, again, that we have the following model for time t=1 and t=2:
yit = β0 + β1xit1 + … + βkxitk + ai + uit, t = 1, 2

or, subtracting the t = 1 equation from the t = 2 equation,

∆yi = β1∆xi1 + … + βk∆xik + ∆ui
This is called the first difference model. Notice that the individual fixed effect, ai, has
been eliminated. Thus, as long as the new error term is uncorrelated with the new
independent variables, the estimators will be unbiased.
Some notes: First, a first-differenced independent variable, ∆xik, must have some
variation across i. For instance, because a gender dummy does not change over time, the
first-differenced gender dummy is zero for all i. Thus, you cannot estimate coefficients
on time-invariant independent variables in first difference models. Second, differenced
independent variables lose variation, so the estimators often have large standard errors.
A large sample size helps to estimate the parameters precisely.
In the previous lecture, we studied the first differenced model, concerning the correlation
between a policy variable and an unobserved fixed effect. In this lecture, we generalize
the model. Consider the following model with T periods and k variables:

yit = β1xit1 + … + βkxitk + ai + uit, t = 1, …, T

The omitted unobserved fixed effect ai could be correlated with any of the k independent
variables.
To take the fixed effect away, one can subtract each individual's mean from each variable
(the within transformation):

yit − ȳi = β1(xit1 − x̄i1) + … + βk(xitk − x̄ik) + (uit − ūi)

As you can see, the unobserved fixed effect drops out of the model. This is called the
fixed effects estimation. To estimate the fixed effects model, you transform each variable
by subtracting its mean and run OLS on the transformed (time-demeaned) data. In
STATA, you don’t need to transform the data yourself. Instead you just need to use the
command “xtreg y x1 x2 … xk, fe i(id).” See the manuals under “xtreg.”
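The within transformation itself is only a few lines of code. Below is a minimal sketch with simulated data (not the JTRAIN example); Stata's xtreg, fe performs the same time-demeaning internally:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 300, 5
a = rng.normal(size=n)               # unobserved fixed effects
ids = np.repeat(np.arange(n), T)

x = a[ids] + rng.normal(size=n * T)  # x correlated with the fixed effect
y = 1.5 * x + a[ids] + rng.normal(size=n * T)

def demean(v, ids, n):
    """Subtract each individual's own time mean from v (within transformation)."""
    counts = np.bincount(ids, minlength=n)
    means = np.bincount(ids, weights=v, minlength=n) / counts
    return v - means[ids]

x_dm, y_dm = demean(x, ids, n), demean(y, ids, n)
beta_fe = float(x_dm @ y_dm) / float(x_dm @ x_dm)   # OLS on demeaned data
print(beta_fe)   # close to the true coefficient 1.5
```

Because the fixed effect ai is constant within each individual, it equals its own time mean and vanishes after demeaning, which is why no constant is needed in the transformed regression.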
One drawback of the fixed effects estimation is that time-invariant variables are also
excluded from the model. For instance, consider a typical wage model where the
dependent variable is log(wage). Some individual characteristics, such as education and
gender, are time-invariant (fixed over time). Thus, if you are interested in the effects of
such time-invariant variables, you cannot estimate their coefficients with the fixed effects
model. What you can do, however, is estimate changes in their effects over time, for
instance by interacting them with time dummies.
Example 1: OLS, Fixed Effect, First-Differenced, and LSDV models
. use c:\docs\fasid\econometrics\homework\JTRAIN.dta;
. keep if year==1988|year==1989;
(157 observations deleted)
. replace sales=sales/10000;
(254 real changes made)
. ** OLS;
. reg hrsemp grant employ sales union d89;
------------------------------------------------------------------------------
hrsemp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
grant | 824.4124 419.8616 1.964 0.051 -3.181518 1652.006
employ | -3.411366 4.232316 -0.806 0.421 -11.75373 4.931001
sales | .0431946 .3313198 0.130 0.896 -.6098735 .6962627
union | 942.0082 442.0818 2.131 0.034 70.61581 1813.401
d89 | -287.6238 338.4361 -0.850 0.396 -954.7191 379.4715
_cons | 156.7909 300.7423 0.521 0.603 -436.0056 749.5874
------------------------------------------------------------------------------
. ** Fixed effects;
. xtreg hrsemp grant employ sales union d89, fe i(fcode);
F(4,102) = 0.85
corr(u_i, Xb) = -0.0144 Prob > F = 0.4972
------------------------------------------------------------------------------
hrsemp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
grant | 846.6938 552.6861 1.532 0.129 -249.5565 1942.944
employ | 1.504937 20.69735 0.073 0.942 -39.54816 42.55803
sales | -.0847216 .9174962 -0.092 0.927 -1.904571 1.735128
union | (dropped)
d89 | -292.2915 368.5441 -0.793 0.430 -1023.297 438.714
_cons | 137.1313 917.5104 0.149 0.881 -1682.746 1957.009
------------------------------------------------------------------------------
sigma_u | 1742.3611
sigma_e | 2578.396
rho | .31349012 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(113,102) = 0.87 Prob > F = 0.7716
. ** LSDV model;
. xi: reg hrsemp grant employ sales union d89 i.fcode;
i.fcode Ifcod1-157 (Ifcod1 for fcode==410032 omitted)
------------------------------------------------------------------------------
hrsemp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
grant | 846.6938 552.6861 1.532 0.129 -249.5565 1942.944
employ | 1.504937 20.69735 0.073 0.942 -39.54816 42.55803
sales | -.0847216 .9174962 -0.092 0.927 -1.904571 1.735128
union | -838.5586 5723.478 -0.147 0.884 -12191.05 10513.93
d89 | -292.2915 368.5441 -0.793 0.430 -1023.297 438.714
Ifcod2 | -192.7619 3889.521 -0.050 0.961 -7907.609 7522.085
Ifcod3 | -203.2925 4021.587 -0.051 0.960 -8180.091 7773.506
Output omitted…
. ** First-differenced model;
. reg dhrsemp dgrant demploy dsales;
------------------------------------------------------------------------------
dhrsemp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
dgrant | 846.6938 552.6861 1.532 0.129 -249.5565 1942.944
demploy | 1.504937 20.69735 0.073 0.942 -39.54816 42.55803
dsales | -.0847216 .9174962 -0.092 0.927 -1.904571 1.735128
_cons | -292.2915 368.5441 -0.793 0.430 -1023.297 438.714
End of Example 1
In general, panel data are stacked vertically. For instance, in JTRAIN.dta, the two
observations for each firm (actually there are three years of observations, but I dropped
one year) are stacked vertically, one row per firm-year.
To construct differenced variables, you need to reshape the vertical (long) data into a
wide format, with one row per firm. Here is an example, which merges one file per year
by firm code:
. do "C:\WINDOWS\TEMP\STD0c0000.tmp"
. #delimit;
delimiter now ;
. clear;
. set more off;
. set matsize 800;
. set memory 100m;
(102400k)
. use c:\docs\tmp\jtrain89.dta;
. sort fcode;
. merge fcode using c:\docs\tmp\jtrain88.dta;
. gen demploy=employ89-employ88;
(11 missing values generated)
End of Example 2