Ecotrics (PR) Panel Data Reference
Structure
8.0 Objectives
8.1 Introduction
8.2 Panel Data Models
8.2.1 Pooled Cross Section Data
8.2.2 Panel Data
8.2.3 Advantage of Panel Data Over Pooled Data
8.3 Linear Static Panel Data Model
8.3.1 Chow Test
8.4 Fixed Effect Versus Random Effect Panel Models
8.4.1 Fixed Effect Model
8.4.2 Random Effect Model
8.4.3 Policy Relevant Inference
8.5 Let Us Sum Up
8.6 Key Words
8.7 Suggested Books for Further Reading
8.8 Answers/Hints to Check Your Progress Exercises
8.0 OBJECTIVES
After going through this unit, you will be able to:
state the distinctive features of time series data, cross-section data and panel
data;
define the terms ‘Static Panel Data Model’ and ‘Dynamic Panel Data Model’;
discuss the features of the ‘fixed effect (FE) model’ and the ‘random effect (RE)
model’ in panel regressions; and
outline, with illustration, the ‘policy relevant inference’ that could be drawn
from panel data models.
8.1 INTRODUCTION
In applications, econometricians often use either pure cross-sectional data or pure
time series data. Cross-sectional data are collected for different sample units at
the same point of time (e.g. NSSO’s 5-yearly data on manufacturing firms). Pure
time series data, on the other hand, are collected over different time points for
the same set of sample units (e.g. GDP). Such sample units, in pure time series
data, could themselves be cross-sectional in nature.
Data sets that have both cross-sectional and time series dimensions are nowadays
very common in empirical research. Such data sets, often used for policy analysis,
could be pooled to form a panel data set. Note that an independently pooled cross
section is obtained by random sampling from a large population at different
points of time (usually, but not necessarily, in different years). From a statistical
standpoint, such data sets have an important feature: they consist of independently
sampled observations. Independent sampling plays a key role in our analysis of
cross-sectional data since, among other things, it rules out correlation in the
error terms across different observations. An independently pooled cross section
differs from a single random sample in that sampling from the population at
different points of time is likely to yield observations that are not identically
distributed. For instance, the distributions of wages and education have changed
over time in most countries.
A panel (or longitudinal) data set thus consists of time series data for each cross-
sectional unit in the data set. Panel data are collected on the same individuals,
firms or geographical units over specified periods of time. The key difference
between panel data and pooled data is that, in the case of panel data, the same
cross-sectional units are followed over a given time period. In the case of pooled
data, different cross-section units are observed in each time period. Thus, the main
features of the three types of data are: cross-sectional data cover different sample
units at the same point of time; time series data cover the same sample units at
different points of time; and panel data follow the same cross-sectional units over
time.
8.2 PANEL DATA MODELS
Let us now illustrate panel data models with some examples. Suppose that the
population consists of all manufacturing firms in a country operating during a given
three-year period. A production function describing output in the population of
firms can be specified as:
…………….. (8.1)
changes across time and firm. Each firm is randomly chosen from the population
of all manufacturing firms. Thus, in a panel regression with a specification like
(8.1), ‘i’ indexes the cross-section unit and ‘t’ indexes time. In analysing a panel
data set, our aim is to capture the time-constant characteristics of each firm as a
firm-specific unobserved effect. The error term ‘u’ represents the unobserved shocks
in each time period. The time-specific parameter (indexed by t) provides a separate
intercept for each time period, allowing aggregate productivity to change over time.
The coefficients of the regressors are assumed to be constant.
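The discussion that follows refers to a job-training (programme evaluation) wage equation numbered (8.2). As a rough sketch only, a specification consistent with the symbols described there could take the form below. The log form of the wage and the coefficient symbols $\theta_t$, $\gamma$ and $\delta_1$ are assumptions borrowed from the standard textbook treatment of programme evaluation, not taken from this unit:

```latex
\log(wage_{it}) = \theta_t + z_{it}\gamma + \delta_1\, prog_{it} + c_i + u_{it} \qquad (8.2)
```

Here $prog_{it}$ would be a dummy equal to one if individual i has participated in the programme by period t, so that $\delta_1$ is the parameter of policy interest.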
where ‘i’ indicates the individual, ‘t’ indicates the time period and the t-subscripted
parameter is the time-varying intercept. $z_{it}$ is the set of observable
characteristics that affect the wage and may also be correlated with programme
participation. $c_i$ indicates the ability of the individual. Now, suppose that at t=1
no one has participated in the programme. This implies $prog_{i1} = 0$ for all i.
Then, let us say, some individuals are chosen to participate in the programme and
the subsequent performance of the two groups is observed (i.e. the group which did
not undergo training and the group which underwent the training). The sub-group
that participates in the training programme is defined as the ‘treatment group’ and
the other one as the ‘control group’. In period t=1 no one received the treatment;
in t=2 the treatment group received training while the control group did not. The
term $c_i$ is included in (8.2) because an individual ‘i’ may participate in the
programme by his/her own choice, i.e. participation can be correlated with the
inherent ability (or proactive initiative) of the individual. This is identified in
the literature as the problem of ‘self-selection’. The important issue in a panel
model like (8.2) is whether the unobserved factors relevant to productivity are
correlated with the observable factors. Another issue is whether we can assume, at
any time point t, that the unobserved effect is uncorrelated with the error terms of
other time periods. An example is the effect of job training on productivity and
thus on subsequent wages. When such correlation is present, we have the problem
known as ‘endogeneity’. The above example shows how the self-selection problem
can lead to the problem of endogeneity.
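The link between self-selection and endogeneity can be illustrated with a small simulation. The sketch below is not part of the unit: the data-generating process, the parameter values and the function names (`simulate_two_period_panel`, `ols_slope`) are hypothetical choices made for illustration. It assumes that more able individuals are more likely to enrol in period 2, and compares a period-2 cross-section regression with a regression on the changes between the two periods, which removes the time-constant ability term.

```python
import numpy as np

def simulate_two_period_panel(n=5000, effect=0.10, seed=0):
    """Hypothetical two-period wage data with self-selection into training."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=n)                      # unobserved effect c_i
    prog1 = np.zeros(n)                               # nobody is trained at t=1
    # Self-selection: abler individuals are more likely to enrol at t=2.
    prog2 = (ability + rng.normal(size=n) > 0).astype(float)
    wage1 = 1.0 + effect * prog1 + ability + rng.normal(scale=0.1, size=n)
    wage2 = 1.2 + effect * prog2 + ability + rng.normal(scale=0.1, size=n)
    return wage1, wage2, prog1, prog2

def ols_slope(y, x):
    """Slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

wage1, wage2, prog1, prog2 = simulate_two_period_panel()

# Cross-section at t=2: ability sits in the error term and is correlated
# with participation, so the slope overstates the assumed effect of 0.10.
cross_section = ols_slope(wage2, prog2)

# Differencing the two periods removes the time-constant ability term.
first_diff = ols_slope(wage2 - wage1, prog2 - prog1)

print(f"cross-section estimate: {cross_section:.3f}")
print(f"first-difference estimate: {first_diff:.3f}")
```

In this set-up the differenced regression should land close to the assumed effect, while the single cross section should not, because participation is correlated with the omitted ability term.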
8.2.1 Pooled Cross Section Data
Pooled cross-sectional data are obtained by collecting random samples from a large
population, independently of each other, at different points of time. Panel data
sets, in contrast, have both cross-sectional and time series features (they consist
of time series data for each statistical unit in the cross section). For instance,
consider two cross-sectional household surveys: one taken in 1985 and one in 1990.
In 1985, a random sample of households was surveyed on variables like income,
savings, family size, etc. In 1990, a new random sample of households was taken
using the same survey questions. To increase our sample size, we can form a pooled
cross section by combining the two years. Pooling cross sections from different
years is also an effective way of analysing the effects of a new government policy.
The idea is to collect data from the years before and after a key policy change. As
an example, we can consider data on housing prices taken in 1993 and 1995, i.e.
before and after a reduction in property taxes effected in 1994. Suppose we have
data on 250 houses for 1993 and on 270 houses for 1995. One method of arranging
such a data set is given in Table 8.1. Observations 1 through 250 correspond to the
houses sold in 1993, and observations 251 through 520 correspond to the 270 houses
sold in 1995. A pooled cross section is analysed much like a standard cross section,
except that we often need to account for secular differences in the variables across
time. In fact, in addition to increasing the sample size, the point of a pooled
cross-sectional analysis is often to see how a key relationship has changed over
time. With large N and small T, one may introduce a separate intercept for each
time period.
Table 8.1: Pooled Data on Houses Sold
In India, many surveys of individuals, households and firms are repeated at regular
intervals by the National Sample Survey Organisation (NSSO). For these surveys,
NSSO randomly samples households at five-year intervals. If a random sample is
drawn in each time period, pooling the resulting random samples gives us an
independently pooled cross section. One reason for using independently pooled
cross sections is to increase the sample size.
8.2.2 Panel Data
The unique characteristic of the panel data structure is that each cross-section
unit is followed over a certain period of time. Panel data sets are fairly easy to
collect for districts, cities, states and countries. Hence, policy analysis is
greatly enhanced by using panel data sets. For the econometric analysis of panel
data, we cannot assume that the observations are independently distributed across
time. For instance, unobserved factors (such as ability) that affect someone’s wage
in 2010 will also affect that person’s wage in 2011. Likewise, unobserved factors
that affect a city’s crime rate in 2015 will also affect that city’s crime rate in
2020. For this reason, special models and methods have been developed to analyse
panel data.
In using panel data in an econometric study, it is important to know how the data
should be stored. We must be careful to arrange the data so that the different time
periods for the same cross-sectional unit (person, firm, city, and so on) are easily
linked. For instance, let us suppose that the data set is on cities for two different
years. For most purposes, the best way to enter the data is to have two records for
each city, one for each year. The first record for each city corresponds to the early
year, and the second record is for the later year. These two records should be
adjacent. Therefore, a data set for 100 cities and two years will contain 200 records.
The first two records are for the first city in the sample, the next two records are for
the second city, and so on.
The above method of data arrangement makes it easy to obtain the differences in
the two records for each city and store them in a pooled cross-sectional manner for
an analysis of the differencing estimation. Most of the two-period panel data sets
are stored in this way. We use a direct extension of this scheme for panel data sets
with more than two time periods. A second way of organising the two periods of a
panel data set is to have only one record per cross-sectional unit. This requires two
entries for each variable, one for each time period. Creating the differences from
T1 to T2 is then easy. Placing the data in one record, however, does not allow for
a pooled analysis by using the two time periods on the original data. Also, this
method of organisation does not work for panel data sets with more than two time
periods. Table 8.2 presents a two-year panel data set on crime and related statistics
for 150 cities. Cities are numbered as 1,2,…,150. Just as in a pure cross section,
the ordering in the cross section of a panel data set does not matter. We could use
the city name in place of a number. But it is often useful to have both.
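The two storage layouts described above can be sketched in a few lines of code. The example is illustrative only: the city numbers, period labels and figures are made-up values (they are not taken from Table 8.2), and the helper names `first_differences` and `to_wide` are hypothetical.

```python
# Layout 1: two adjacent records per city, one per period (hypothetical values).
long_records = [
    # (city_id, period, crime_rate, unemployment_rate)
    (1, 1, 17.2, 8.1),
    (1, 2, 16.0, 7.5),
    (2, 1, 9.4, 5.0),
    (2, 2, 10.1, 5.6),
]

def first_differences(records):
    """Return {city_id: (change in crime, change in unemployment)}."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    diffs = {}
    # Adjacent records (positions 0-1, 2-3, ...) belong to the same city.
    for early, late in zip(ordered[0::2], ordered[1::2]):
        if early[0] != late[0]:
            raise ValueError("each city needs exactly two adjacent records")
        diffs[early[0]] = (round(late[2] - early[2], 2),
                           round(late[3] - early[3], 2))
    return diffs

def to_wide(records):
    """Layout 2: one record per city with one entry per variable and period."""
    wide = {}
    for city_id, period, crime, unemp in records:
        row = wide.setdefault(city_id, {})
        row[f"crime_{period}"] = crime
        row[f"unemp_{period}"] = unemp
    return wide

print(first_differences(long_records))
print(to_wide(long_records))
```

The first layout extends naturally to more than two periods by adding further records per city, whereas the second needs a new pair of columns for every additional period.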
Table 8.2: Panel Data on Crime and Unemployment by City
8.2.3 Advantage of Panel Data Over Pooled Data
Because panel data require replication of the same units over time, panel data sets,
especially those on individuals, households and firms, are more difficult to obtain
than pooled cross sections. Not surprisingly, observing the same units over time
leads to several advantages over cross-sectional data or even pooled cross-sectional
data. The benefit that we will focus on is that having multiple observations on the
same units allows us to control for certain unobserved characteristics of
individuals, firms, etc. As we will see, the use of more than one observation can
facilitate causal inference in situations where inferring causality would be difficult
if only a single cross section were available. A second advantage of panel data is
that it allows us to study the importance of lags in behaviour or in the result of
decision making. This information can be significant because many economic
policies can be expected to have an impact only after some time has passed. It
follows that, with panel data, we can observe the ‘before and after effects’ of a
treatment received by the same individual. Panel data also provide the possibility
of isolating the effects of the treatment from other factors affecting the outcome.
Panel data, obtained by combining cross-sectional and time series data, capture
both the differences between cross-sectional units and the dynamics within each
unit. They have several other advantages over cross-sectional and time series data.
For instance, cross-sectional data may be viewed as a panel with T=1 and time
series data may be viewed as a panel with N=1. Hence, panel data, combining both
cross-section and time series data, provide more degrees of freedom and more
sample variability than either cross-sectional data or time series data alone. This
improves the efficiency of econometric estimates.
It is also frequently argued that the real reason one finds (or does not find) certain
effects is ‘due to ignoring the effects of certain variables in a model specification
which are correlated with the included explanatory variables’. Panel data contain
information on both the inter-temporal dynamics and the individuality of the
entities. This allows one to control for the effects of missing or unobserved
variables.
By pooling random samples drawn from the same population, but at different points
in time, we can get more precise estimators and test statistics with higher power.
Pooling is helpful in this regard only in so far as the relationship between the
dependent variable and at least some of the independent variables remains constant
over time. Using pooled cross sections raises a statistical complication, viz. the
populations in the two periods could have different distributions. To reflect the
fact that the populations may have different distributions in different time
periods, we allow the intercept to differ across periods. This is accomplished by
including dummy variables for all but one year; the earliest year in the sample is
usually chosen as the base year. Sometimes, the pattern of coefficients on the year
dummy variables could itself be of interest.
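The use of year dummies for all but the base year can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: the data are invented and noise-free so that the coefficients can be checked by hand, and the function names `year_dummies` and `pooled_ols` are hypothetical.

```python
import numpy as np

def year_dummies(years, base_year=None):
    """One 0/1 column per year except the base year (default: earliest year)."""
    years = np.asarray(years)
    levels = np.unique(years)                      # sorted distinct years
    base = levels[0] if base_year is None else base_year
    kept = [lv for lv in levels if lv != base]
    if not kept:
        raise ValueError("need at least two distinct years")
    columns = [(years == lv).astype(float) for lv in kept]
    return np.column_stack(columns), kept

def pooled_ols(y, x, years):
    """OLS of y on a constant, x and the year dummies."""
    dummies, kept = year_dummies(years)
    X = np.column_stack([np.ones(len(y)), x, dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, kept

# Invented pooled sample: four observations from 1993 and four from 1995.
years = np.array([1993] * 4 + [1995] * 4)
x = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 5.0, 6.0])
y = 10.0 + 2.0 * x + 3.0 * (years == 1995)         # intercept shifts by 3 in 1995

coef, kept = pooled_ols(y, x, years)
print("intercept (base year 1993):", round(coef[0], 6))
print("slope on x:", round(coef[1], 6))
print("shift in intercept for", int(kept[0]), ":", round(coef[2], 6))
```

The coefficient on the 1995 dummy measures how far the 1995 intercept lies from that of the base year, which is why only T-1 dummies are needed when a constant is included.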
Check Your Progress 1 [answer within the space given in about 50-100 words]
3) State the two main advantages of ‘panel data’ over ‘pooled data’.
8.3 LINEAR STATIC PANEL DATA MODEL
Suppose that for each cross-section unit we collect data on the same set of
variables for T time periods. Let X be a vector of k exogenous variables which
affect Y. At any time point ‘t’, the population model is:
$$Y_{it} = X_{it}\beta + c_i + u_{it} \qquad (8.3)$$
where $\beta$ is the vector of coefficients, $c_i$ is the unobserved effect and
$u_{it}$ is the random error term. (8.3) is a panel regression with ‘i’ as an
indicator of the cross-section unit and ‘t’ as an indicator of time. The most
commonly used method for estimating the parameters is ‘ordinary least squares’
(OLS). OLS assumes that the explanatory variables are exogenous, i.e. that they
are uncorrelated with the random error term. The primary motivation for using
panel data is to solve the omitted variables problem. In panel data models, we use
the time dimension to account for the unobserved effect (like quality in the
example considered above). We assume that the ‘unobserved effects’ are random
variables. This is an instance of a linear static panel data model. It is a static
model because all explanatory variables are dated contemporaneously with the
value of Y in period t. In contrast, in a dynamic panel data model, one or more
lagged dependent variables are included among the regressors as a ‘partial
adjustment mechanism’. In this unit, we discuss only the static panel data
models. You may however note that a dynamic panel model, with one lagged
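dependent variable and a single regressor X, is defined as:
$$Y_{it} = \rho Y_{i,t-1} + \beta X_{it} + c_i + \varepsilon_{it}$$
where $\rho$ and $\beta$ are coefficients, $c_i$ is the unit-specific unobserved
effect and $\varepsilon_{it}$ is the overall random error term.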
8.3.1 Chow Test
The Chow test examines whether the parameters of one group of data are equal to
those of the other groups. Simply put, the test checks whether the data can be
pooled. If only the intercepts are found to differ across groups, the model becomes
a ‘fixed effect model’. Let us consider two groups from a model like
$y = \alpha + \beta x + \varepsilon$ as follows:
obtained as $SSR_{UR} = SSR_1 + SSR_2 + \dots + SSR_T$. If there are k explanatory
variables (excluding the intercept or the time dummies) with T time periods, then
we are imposing $(T-1)k$ restrictions on the $(T + Tk)$ parameters estimated in the
unrestricted models. Hence, if $n = n_1 + n_2 + \dots + n_T$ is the total number of
observations, then the ‘degrees of freedom’ (df) for the F test are $(T-1)k$ and
$(n - T - Tk)$. We compute the F statistic as usual, i.e.:
$$F = \frac{(SSR_R - SSR_{UR})/[(T-1)k]}{SSR_{UR}/(n - T - Tk)}$$
where $SSR_R$ is the sum of squared residuals from the restricted (pooled)
regression.
You may simply note at this stage that, as with any F test based on sums of squared
residuals, this test is not robust to heteroscedasticity.
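The computation described above can be sketched in code. The example assumes, in line with the count of restrictions given above, that the restricted model is a pooled regression with a separate intercept for each period and common slopes, while the unrestricted model is a separate regression for each period. The function names `ssr` and `chow_f_across_periods` and the simulated data are hypothetical.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def chow_f_across_periods(y, X, period):
    """F statistic for equal slopes across periods, with its two df."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)   # k regressors, no constant
    period = np.asarray(period)
    labels = np.unique(period)
    n, k = X.shape
    T = len(labels)

    # Unrestricted: a separate regression (own intercept and slopes) per period.
    ssr_ur = 0.0
    for p in labels:
        rows = period == p
        X_p = np.column_stack([np.ones(rows.sum()), X[rows]])
        ssr_ur += ssr(y[rows], X_p)

    # Restricted: period-specific intercepts but common slopes.
    dummies = np.column_stack([(period == p).astype(float) for p in labels])
    ssr_r = ssr(y, np.column_stack([dummies, X]))

    df1, df2 = (T - 1) * k, n - T - T * k
    f_stat = ((ssr_r - ssr_ur) / df1) / (ssr_ur / df2)
    return f_stat, df1, df2

# Simulated two-period example in which the slope differs across periods.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
period = np.repeat([0, 1], 50)
slopes = np.where(period == 0, 1.0, 3.0)
y = 1.0 + period + slopes * x + rng.normal(scale=0.5, size=100)

f_stat, df1, df2 = chow_f_across_periods(y, x, period)
print(f"F = {f_stat:.1f} with df = ({df1}, {df2})")
```

The statistic would then be compared with the critical value of the F distribution with the two degrees of freedom shown; a large value, as expected in this simulated case, speaks against pooling the slopes.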
8.4 FIXED EFFECT VERSUS RANDOM EFFECT PANEL MODELS
With panel data, the most commonly estimated models are the fixed effects and the
random effects models. Let us therefore focus first on the major differences
between these two types of models. Several considerations affect the choice
between them. First of all, one has to identify the nature of the variables that
have been omitted from the model. If we have reason to believe that there are no
omitted variables, or that the omitted variables are uncorrelated with the
explanatory variables in the model, then a ‘random effects’ (RE) model is probably
the best. It will produce unbiased estimates of the coefficients, use all the data
available and produce the smallest standard errors. On the other hand, if there
are omitted variables, and these variables are correlated with the explanatory
variables in the model, then ‘fixed effect’ (FE) models provide a means of
controlling for the ‘omitted variable bias’. In a fixed-effects model, ‘subjects’
serve as their own controls. The idea is that whatever effects the omitted
variables have on the subject at one time, they will have the same effect at a
later time. In this sense, their effect will be ‘constant’ or ‘fixed’. However, for
this to be true, the omitted variables must have time-invariant values with
time-invariant effects. By time-invariant values, we mean that the value of the
variable does not change across time. Gender and race are obvious instances, but
this can also include the ‘educational level’ of the respondent.
Second, one needs to consider the variability within subjects (i.e. within
cross-section units). If subjects change little across time, a fixed effects model
may not work very well. This is because there needs to be within-subject
variability if we are to use subjects as their own controls. If there is little
variability within subjects, then the standard errors from the fixed effects model
could be too large. Conversely, random effects models will often have smaller
standard errors. But the trade-off is that their coefficients are more likely to be
biased.
Third, one needs to decide whether one wants to estimate the effect of variables
whose values do not change across time. With fixed effects models, we do not
estimate the effect of variables whose values do not change across time. Rather,
we control for them or ‘partial them out’. This is similar to an experiment with
random assignment. Though RE models estimate the effect of time-invariant
variables, the estimates could be biased because we are not controlling for omitted
variables. For a clearer description, let us consider a situation where
y and $x = (x_1, x_2, \dots, x_k)$ are observable random variables with a linear
relationship like:
$$y = x\beta + c \qquad (8.8)$$
where ‘c’ is the unobservable random variable. We are interested in the partial
effect of the observable explanatory variables $x_j$ while holding ‘c’ constant.
Our interest is to estimate the vector $\beta$. If ‘c’ is uncorrelated with x, then
‘c’ is just another unobserved factor uncorrelated with the explanatory variables.
If $\mathrm{Cov}(x_j, c) \neq 0$ for
8.4.1 Fixed Effect Model
Fixed effects (FE) are thus variables that are ‘constant across individuals’.
Examples are age, sex and ethnicity, which do not change (or change at a constant
rate) over time. FE explores the relationship between the predictor variables (i.e.
explanatory or independent variables) and the outcome variable (i.e. the dependent
variable). The relationship between them is explored within an entity (country,
person, company, etc.). Each entity has its own individual characteristics that may
or may not influence the predictor variables. For instance, being male or female
could influence the opinion towards a certain issue, the political system of a
particular country could have some effect on trade or GDP, the business practices
of a company may influence its stock price, etc. When using FE, we assume that
something within the individual may impact, or bias, the predictor variables and
that we therefore need to control for it. This is the rationale behind the
assumption of correlation between the entity’s error term and the predictor
variables. FE removes the effect of those time-invariant characteristics so that we
can assess the net effect of the predictors on the outcome variable. Another
important assumption of the FE model is that the time-invariant characteristics are
unique to the individual and are not correlated with other individuals’
characteristics. In other words, each entity is different and therefore the
entity’s error term and the constant (which captures individual characteristics)
are not correlated with those of the others. If the error terms are correlated,
then the FE model is not suitable. In that case, we need to model that relationship
using the RE model.
The FE model allows the unobserved individual effects to be correlated with the
included variables. We can therefore model the differences between units as
parametric shifts of the regression function. The model should be viewed as
applying only to the cross-sectional units in the study and not to additional units
outside the sample. For instance, an inter-country comparison may include the full
set of countries for which it is reasonable to assume that the model is constant.
If the individual effects are strictly uncorrelated with the regressors, then it
might be appropriate to model the individual-specific constant terms as randomly
distributed across the cross-sectional units.
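The way FE removes time-invariant characteristics can be sketched with the ‘within’ (demeaning) transformation, one common way of computing the FE estimator. The tiny data set below is invented and noise-free so that the arithmetic can be verified by hand; the function name `demean_by_unit` and all variable names are hypothetical.

```python
import numpy as np

def demean_by_unit(values, unit):
    """Subtract each unit's own time average from its observations."""
    values = np.asarray(values, dtype=float)
    unit = np.asarray(unit)
    out = values.copy()
    for g in np.unique(unit):
        rows = unit == g
        out[rows] -= values[rows].mean(axis=0)
    return out

# Three units observed for three periods each.
unit = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
c = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 10.0, 10.0, 10.0])  # unobserved effect
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])     # correlated with c
z = c / 5.0                                  # an observed time-invariant regressor
y = 2.0 * x + c                              # true slope on x is 2

# Pooled OLS ignores c, which is correlated with x, so the slope is biased.
pooled_slope = np.polyfit(x, y, 1)[0]

# Within estimator: regress demeaned y on demeaned x (no constant needed).
x_within = demean_by_unit(x, unit)
y_within = demean_by_unit(y, unit)
fe_slope = (x_within @ y_within) / (x_within @ x_within)

# A time-invariant regressor is wiped out by the same transformation.
z_within = demean_by_unit(z, unit)

print("pooled OLS slope:", round(pooled_slope, 6))
print("within (FE) slope:", round(fe_slope, 6))
print("demeaned z:", z_within)
```

Because the demeaned time-invariant regressor is identically zero, its coefficient cannot be estimated by FE, which mirrors the point that FE controls for such variables rather than estimating their effects.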
8.4.2 Random Effect Model
The random effects (RE) model is useful when we have reason to believe that the
unobserved effect is uncorrelated with all the explanatory variables. In such a
situation, the time-constant unobserved effect is uncorrelated with the explanatory
variables and the parameters could be consistently estimated by using a single cross
section. There is therefore no need for panel data. But using a single cross section
disregards much useful information in the other time periods. We can therefore
use the data in a pooled OLS procedure, i.e. just run OLS of the dependent variable
on the explanatory variables and the time dummies. This, too, produces consistent
estimators of the parameters under the RE assumption. But it ignores the fact that,
because the unobserved effect is present in the error term in every time period,
the composite error is serially correlated across time. We can use the GLS method
to deal with this serial correlation problem.
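The serial correlation in the composite error, and the GLS remedy, can be written out explicitly. The following is a sketch under the usual textbook RE assumptions, which this unit does not spell out: the idiosyncratic errors $u_{it}$ are homoscedastic and serially uncorrelated with variance $\sigma_u^2$, the unobserved effect $c_i$ has variance $\sigma_c^2$, and the two are uncorrelated. The symbols $v_{it}$, $\sigma_c^2$, $\sigma_u^2$ and $\lambda$ are notation introduced here for illustration.

```latex
v_{it} = c_i + u_{it}, \qquad
\operatorname{Var}(v_{it}) = \sigma_c^2 + \sigma_u^2, \qquad
\operatorname{Cov}(v_{it}, v_{is}) = \sigma_c^2 \quad (t \neq s)

\operatorname{Corr}(v_{it}, v_{is}) = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_u^2}
\quad (t \neq s)

y_{it} - \lambda \bar{y}_i = (x_{it} - \lambda \bar{x}_i)\beta + (v_{it} - \lambda \bar{v}_i),
\qquad
\lambda = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_c^2}}
```

Under these assumptions, GLS amounts to OLS on the quasi-demeaned data in the last line. With $\lambda = 0$ the transformation reduces to pooled OLS, and with $\lambda = 1$ it reduces to the within (FE) transformation.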
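Because the FE approach allows arbitrary correlation between the unobserved effect
and the explanatory variables while the RE approach does not, FE is widely thought
to be a more convincing tool for estimating the ‘ceteris paribus’ effects.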
To sum up, if the key explanatory variable is constant over time, we cannot use FE
to estimate its effect on the dependent variable. In such situations, we must rely
on the RE (or pooled OLS) estimate. We can, however, use the RE approach only if we
are able to assume that the unobserved effect is uncorrelated with the explanatory
variables. Typically, when one uses random effects, many time-constant controls are
included among the explanatory variables. With the FE approach, it is not necessary
to include such controls. RE is preferred to pooled OLS due to its generally higher
efficiency.
8.4.3 Policy Relevant Inference
The choice of fixed or random effects should be based on background knowledge and
the availability of data. Let us be clear about what we mean here by the term
‘policy-relevant inference’. Ideally, policy-relevant inferences are causal
inferences about average treatment effects. Causal inferences tell us what happens
if we intervene and change the way things are being done. Within the regression
modelling framework, and in the absence of experimental or quasi-experimental data,
many issues can be overcome by making assumptions. But estimating the treatment
effect in an unbiased manner becomes difficult. A realistic goal is therefore to
produce policy-relevant estimates that may be biased, but not so much as to lead to
misleading policy recommendations. Recall that the RE approach requires the strong
assumption that the unobserved effect is uncorrelated with all of the covariates.
An important reason why the random effect assumption fails is that there is usually
non-random selection of cross-section units. For instance, if each school had drawn
its pupils at random from the pupil population, then the random effect assumption
would hold. But, in reality, a non-random selection mechanism operates through
which parents choose schools and some schools select which children to accept.
Thus, the probability of selecting a particular school varies systematically
according to a series of factors characterising the child, his/her family, the
school itself or the higher local education authority. Some of these factors will
be associated with pupil attainment, either directly or indirectly, through a
mediating mechanism.
Check Your Progress 2 [answer within the space given in about 50-100 words]
1) Distinguish between the ‘Linear Static Panel Data Model’ and the ‘Dynamic Panel
Data Model’.
2) For what purpose is the ‘Chow Test’ used? What does it basically seek to
examine?
3) In what contexts are the ‘fixed effects’ and the ‘random effects’ panel data
models used?
4) Specify the considerations that determine the choice between the FE and the
RE models.
8.5 LET US SUM UP
The unit introduces the panel data models. Panel data refers to observations on
multiple variables obtained over different time periods for the same firms or
individuals. It can be understood by the common expression ‘the two data sets are
drawn from the same panel’. In contrast, pooled cross-section data refers to a time
series of cross-sections where the observations in each cross section do not
necessarily relate to the same units. In India, the surveys of NSSO, conducted on
many subjects periodically, usually at an interval of 5 years, are based on
independent random samples. They are therefore suited to the methods and techniques
of ‘pooled data’ analysis. The unit has introduced you to the concepts and
application of the two leading approaches, viz. the FE approach and the RE
approach. The contexts in which the choice between the two could be made are
outlined. Generally, the choice needs to be based on background knowledge of the
variables and the nature and availability of data.
8.6 KEY WORDS
Pooled Cross Section Data : Refers to data collected from independent random
samples of the same population at two (or more) different points of time and
pooled for the purpose of analysis. By combining the samples, we get increased
degrees of freedom or a higher ‘n’. Data are often pooled to assess the impact of a
new government policy. In other words, pooled cross-section data help us in
assessing the before/after effects.
Panel Data : Refers to data collected at two (or more) time points on the same
sample units. In other words, unlike in a ‘pooled cross section’, no new random
samples are used in the successive surveys. Data are collected on the same
variables. This is particularly useful for assessing the effect of ‘lags’, which
are usually present when government policies are introduced.
Self-Selection
Endogeneity
8.8 ANSWERS/HINTS TO CHECK YOUR PROGRESS EXERCISES
Check Your Progress 1
1) Time series data are collected over different time points for the same set of
sample units (e.g. GDP for states). Cross-section data are collected over different
sample units at the same point of time (e.g. NSSO’s surveys in India on a 5-yearly
basis).
2) In the case of panel data, the same cross-sectional units are followed over
different time periods. In the case of pooled data, different cross-section units
(i.e. independently selected random samples) are observed in each time period.
3) One, it allows for causal inference. Two, it allows us to study the effect of
lags in behaviour or in the result of decision making.
4) The complication is that the two samples might have come from populations
with different distributions. It is dealt with by allowing for different
intercept terms, i.e. by using a ‘dummy variable’ for each period other than the
base period.
Check Your Progress 2
2) It is used to test for ‘poolability’ across data collected in groups.
In other words, it seeks to examine whether the parameters in the models for
the two or more groups are equal.
3) In general, a panel data model is used to address the problem of ‘omitted
variables’. If we have reason to believe that no variable is omitted, or that the
omitted variables are uncorrelated with the explanatory variables, then the
‘random effect model’ can be used. If it is not so, applying the ‘fixed effects
panel data model’ helps in controlling for the ‘omitted variable bias’.
4) If the key explanatory variable is constant over time, then the RE model is to
be applied. More generally, we can use the RE model when ‘we are able to assume
that the unobserved effect is uncorrelated with the explanatory variables’. The
FE approach allows for arbitrary correlation between the two while the RE approach
does not. Hence, the FE approach is a more convincing tool for estimating the
‘ceteris paribus’ effects. Therefore, the choice of fixed or random effects should
be based on background knowledge and the availability of data.