Panel Data Analysis for Economics and the Social Sciences
Melbourne Institute
9-11 December 2013
Steve Pudney
ISER
University of Essex
Organisation
Aims of the course
• Introduce the distinctive features of panel data.
• Review some panel data sets commonly used in social sciences.
• Present the advantages (and limitations) of panel data, and
consider what sort of questions panel data can(not) address.
• Show how to handle and describe panel data.
• Introduce the basic estimation techniques for panel data (linear
and non-linear).
• Discuss how to choose (and test for) the right technique for the
question being addressed.
• Discuss interpretation of results
• Introduce dynamic modeling of panel data
Structure of course
• 3 days:
3 hours methods lectures
2 hour lab sessions
1 hour HILDA lecture
• Lab sessions will illustrate concepts using Stata
software (“industry standard” in survey-based applied
work)
• Main data will be from Household, Income and Labour
Dynamics in Australia (HILDA)
• Focus is on understanding the concepts and applying
them.
• Full lecture slides on the web
• Technical detail kept to a minimum but available in
appendices
Melbourne: 05/12/2013 (4)
Day 1: Basics
• What are panel data?
• Why use panel data?
• Handling panel data in Stata – some basic commands.
• Patterns of observations in panel data (non-response and
attrition)
• Within and between variation, transitions & cohort
analysis
• Inference using panel data: some identification issues
(unobservables; age, time and cohort effects)
• Regression analysis: within and between group
regression
Day 2:
Panel regression analysis and its generalisations
Day 3: Static & dynamic binary response models
Basics
• Types of discrete variables
• Why not linear regression?
Discrete response models
• Latent linear regression
• Conditional (fixed-effects) logit
• Static random effects logit and probit
Dynamic binary response
Stata exercises
Ex 0: Assembles a working panel data file from the raw HILDA data
files
Ex 1: Examines the data structure and calculates simple summary
measures
Ex 2: Estimates basic between- & within-group panel data regression
models
Ex 3: Uses random effects regression and tests the underlying zero
correlation assumption
Ex 4: Explores endogeneity and instrumental variable estimation
Ex 5: Compares the linear probability model, random effects probit &
logit and conditional logit
Some reading
• Econometrics texts
Arellano, M. (2003). Panel Data Econometrics, Oxford University Press.
Baltagi, B.H. (2013). Econometric Analysis of Panel Data (5th ed.), Wiley.
Hsiao, C. (2003). Analysis of Panel Data (2nd ed.), Cambridge University Press.
Wooldridge, J.M. (2002). Econometric Analysis of Cross Section and Panel Data, MIT
Press (chapters 10 and 11 and sections 14.4, 15.8 and 16.8).
• Sociology
Halaby, C.N. (2004). Panel models in sociological research: Theory into
practice, Annual Review of Sociology 30, 507-544.
• Stata-specific
Cameron, A.C. & Trivedi, P.K. (2009). Microeconometrics Using Stata, Stata Press.
Rabe-Hesketh, S. & Skrondal, A. (2008). Multilevel and Longitudinal Modeling Using
Stata (2nd ed.), Stata Press.
StataCorp (2013). Stata Statistical Software: Release 11: XT Reference Manual, Stata
Corporation.
Day 1: Basics
What are Panel Data?
• Panel data involve regularly repeated observations on the same
individuals.
• Repeat observations may be different time periods or units
within clusters (e.g. workers within firms; siblings within
twin pairs)
• Individuals may be people, households, firms, areas, etc.
• In most analysis using household panels, the individual is the
person and the repeated observations are the different time
periods (waves).
• Sometimes, e.g. to isolate household (or family) effects, the
individual is the household (or family) and the repeated
observations are different persons within the household
• Multi-level analysis may involve more than 2 dimensions of
the sample, e.g. time periods within persons within
households (but note problem of defining the household
over time)
Long-term household panels
• Individuals in their household context
• Perpetual panel survey, often with retrospective elements
(periods before 1st wave & between waves)
• Designed to maintain representativeness of the sampled
population over time. May use refreshment samples to deal
with immigration, panel fatigue/conditioning
• Examples worldwide include:
• Australian HILDA, US PSID, Dutch HP, Swedish LoLS, German
SOEP, BHPS, Canadian SLID, NZ SoFIE, pan-European ECHP,
SHARE, and several in developing countries (e.g. South Africa,
Indonesia, Ethiopia, Vietnam)
• Cross-National Equivalent File harmonises 8 household panels:
http://www.human.cornell.edu/pam/research/centers-programs/german-panel/cnef.cfm
• Useful resources for locating data are the HILDA website
(http://www.melbourneinstitute.com/hilda/links.html) and KeepingTrack
(http://www.iser.essex.ac.uk/ulsc/longitudinal-resources/keeping-track)
• Big differences in content, following rules, who is interviewed,
interview method, etc.
Limitations
BUT don’t expect too much…
• Variation between people usually far exceeds
variation over time for an individual, so
a panel with T waves doesn’t give T times the
information of a cross-section
• Variation over time may not exist for some important
variables
• Variation over time may be inflated by measurement
error
• Panel data impose a fixed timing structure;
continuous-time survival analysis may be more
informative
• We still need very strong assumptions to draw clear
inferences from panels: sequencing in time does not
necessarily reflect causation
Some terminology
A balanced panel has the same number of time observations (T)
for each of the n individuals
An unbalanced panel has different numbers of time observations
(Ti) on each individual
A compact panel covers only consecutive time periods for each
individual – there are no “gaps”
Attrition is the process of drop-out of individuals from the panel,
leading to an unbalanced and possibly non-compact panel
A short panel has a large number of individuals but few time
observations on each (e.g. HILDA has 7,400 households and 12
waves)
A long panel has a long run of time observations on each
individual, permitting separate time-series analysis for each
Handling panel data in Stata
• For our purposes, the unit of analysis or case is the
individual person
• A record for an individual case contains information
on the person’s state at different dates
• Data can be organised in two ways:
Wide form – data is sometimes supplied in this format
Long form – usually most convenient & needed for most
panel data commands in Stata
Use Stata reshape command to convert between them.
• Three important operations:
Matching/merging
Aggregating
Appending
Wide format
• One row per case
• Observations on a variable for different time periods (or dates)
held in different columns
• Variable name identifies time (via prefix)
Long format
• multiple rows per case
• observations on a variable for different time periods held in
different rows for each individual
• The dataset’s row identifier identifies time (e.g. wave)
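The conversion between the two layouts can be sketched with a small made-up dataset. The variable names (pid, wage1 to wage3) and values are hypothetical, and for simplicity the wave is encoded as a numeric suffix rather than the prefix used in the raw survey files.

```stata
* Toy wide-format data: one row per person, one wage column per wave
* (hypothetical names and values, for illustration only)
clear
input pid wage1 wage2 wage3
1  9.3 10.4 13.9
2 13.0 12.9 13.5
end

* Wide -> long: the numeric suffix of wage becomes the new variable wave
reshape long wage, i(pid) j(wave)
list, sepby(pid)

* Long -> wide reverses the operation
reshape wide wage, i(pid) j(wave)
```

After the first reshape there is one row per person-wave combination; the second reshape restores one row per person.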
Aggregation
Appending
• Combining files with no index-based matching
• E.g. combining file A with n1 rows and file B with n2 rows
to produce a new file C with n1+n2 rows.
• Stata command: append
• Used to assemble a sequence of annual cross-section
data files into a single long-format panel data file
• Rows in new combined files are specific to a person-wave
combination
• Each variable must have the same name in each of
the annual cross-section files
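A minimal sketch of appending, assuming two hypothetical annual files that are built here as temporary files so the example is self-contained. The variable names and values are invented for illustration.

```stata
* Hypothetical wave 1 cross-section
clear
input pid wage
1  9.3
2 13.0
end
generate wave = 1
tempfile w1
save `w1'

* Hypothetical wave 2 cross-section (person 3 joins the panel)
clear
input pid wage
1 10.4
2 12.9
3  8.1
end
generate wave = 2
tempfile w2
save `w2'

* Stack the files: 2 + 3 rows give 5 person-wave rows
use `w1', clear
append using `w2'
sort pid wave
list, sepby(pid)
```

Because wage has the same name in both files, its values end up in a single column of the combined file.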
Ordering the data
• We now have a dataset in long format
• It’s a good idea to order the data for easier viewing.
“Eyeballing” the data is important!
• We also have to tell Stata which variable identifies
the individual (Stata calls this the panel variable).
• We may also have to tell Stata which variable
identifies the repeated observation (Stata calls this the
time variable).
For some types of panel analysis we don’t need to know the
ordering of the repeated observations
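The steps above can be sketched as follows, using a small invented long-format dataset (pid, wave and wage are hypothetical names).

```stata
clear
input pid wave wage
2 2 12.9
1 1  9.3
2 1 13.0
1 3 13.9
1 2 10.4
end

sort pid wave          // order rows by person, then by wave
list, sepby(pid)       // "eyeball" the data

xtset pid wave         // declare the panel variable (pid) and time variable (wave)
xtdescribe             // participation patterns; xtdes is the abbreviated form
```

For analyses that do not depend on the ordering of the repeated observations, `xtset pid` on its own is enough.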
Panel and time variables
Describe patterns of panel data: xtdes
. xtdes
pid: 10002251, 10004491, ..., 1.347e+08 n = 16082
wave: 1, 2, ..., 13 T = 13
Delta(wave) = 1; (13-1)+1 = 13
(pid*wave uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
1 1 2 7 13 13 13
Freq. Percent Cum. | Pattern
---------------------------+---------------
4648 28.90 28.90 | 1111111111111
997 6.20 35.10 | 1............
646 4.02 39.12 | 11...........
376 2.34 41.46 | ............1
342 2.13 43.58 | 111..........
327 2.03 45.62 | 1111.........
261 1.62 47.24 | ...........11
254 1.58 48.82 | .1...........
251 1.56 50.38 | ..........111
7980 49.62 100.00 | (other patterns)
---------------------------+---------------
16082 100.00 | XXXXXXXXXXXXX
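$$y_{it} - \bar{y} = \underbrace{(y_{it} - \bar{y}_i)}_{\text{within}} + \underbrace{(\bar{y}_i - \bar{y})}_{\text{between}}$$
where $\bar{y} = \frac{1}{n\bar{T}} \sum_{i=1}^{n} \sum_{t=1}^{T_i} y_{it}$ and $\bar{T}$ is the average no. of periods per case
$$\sum_{i=1}^{n}\sum_{t=1}^{T_i} (y_{it} - \bar{y})^2 = \sum_{i=1}^{n}\sum_{t=1}^{T_i} (y_{it} - \bar{y}_i)^2 + \sum_{i=1}^{n}\sum_{t=1}^{T_i} (\bar{y}_i - \bar{y})^2$$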
Between- and within-group variation
• Between and within variation are the basis of linear
panel regression. Important concept to understand.
• Simple BHPS example: balanced panel (n=1119, T =
13) of workers who have reported their wages.
• From summarize, we have grand mean wage ($\bar{y}$) =
£9.84 per hour, and (overall) variance of wages =
32.63. Recall the standard formula for variance:
$$s^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (y_{it} - \bar{y})^2}{nT - 1} = \frac{T_{yy}}{nT - 1}$$
Within and between deviations in the data
                         Grand     Ind.   Within  Between    Total
     pid  wave    Wage    mean     mean      dev      dev      dev
10028005     1   9.302   9.841   10.948   -1.646    1.107   -0.539
10028005     2  10.444   9.841   10.948   -0.504    1.107    0.603
10028005     3  13.883   9.841   10.948    2.935    1.107    4.042
10028005     4   4.573   9.841   10.948   -6.375    1.107   -5.268
10028005     5  13.769   9.841   10.948    2.820    1.107    3.928
      ..    ..      ..      ..       ..       ..       ..       ..
10028005    13  12.914   9.841   10.948    1.966    1.107    3.073
10060111     1  13.046   9.841   12.953    0.094    3.112    3.205
10060111     2  12.923   9.841   12.953   -0.030    3.112    3.081
10060111     3  13.453   9.841   12.953    0.500    3.112    3.612
10060111     4  13.505   9.841   12.953    0.553    3.112    3.664
10060111     5  12.418   9.841   12.953   -0.535    3.112    2.577
. xtsum wage
Variable | Mean Std. Dev. Min Max | Obs
--------------+----------------------------------------+----------
wage overall | 9.841044 5.712089 .3813552 121.7474 | N = 14547
between | 4.969431 3.322259 46.54612 | n = 1119
within | 2.820121 -18.37394 108.5192 | T = 13
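The within and between deviations can be computed by hand and compared with xtsum. The sketch below uses a tiny invented balanced panel; the variable names (dev_w, dev_b, dev_t and the scalars) are hypothetical.

```stata
clear
input pid wave wage
1 1  9
1 2 11
1 3 13
2 1 14
2 2 15
2 3 16
end
xtset pid wave

egen double grand = mean(wage)                // grand mean
bysort pid: egen double indmean = mean(wage)  // individual mean
generate double dev_w = wage - indmean        // within deviation
generate double dev_b = indmean - grand       // between deviation
generate double dev_t = wage - grand          // total deviation = within + between

* Sums of squares: total SS = within SS + between SS
generate double w2 = dev_w^2
generate double b2 = dev_b^2
generate double t2 = dev_t^2
quietly summarize w2
scalar ssw = r(sum)
quietly summarize b2
scalar ssb = r(sum)
quietly summarize t2
scalar sst = r(sum)
display "Total SS = " sst "   Within SS = " ssw "   Between SS = " ssb

xtsum wage   // Stata's own overall / between / within summary
```

Note that xtsum reports the within variable as the deviation from the individual mean with the grand mean added back, which is why the within minimum in the output above can be negative even though wages are positive.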
Transitions
• Want to compare state in this wave with state in last wave.
Example: part-time work status (binary variable PT)
• If we have xtset the data, can easily create lagged values of
variable: generate lpt = l.pt
• Then tabulate current against lagged value: tabulate lpt pt
. tabulate lpt pt, row
| Part-time (<=30 hours
Lagged PT | total)
work | 0 1 | Total
-----------+----------------------+----------
0 | 10,619 310 | 10,929
| 97.16 2.84 | 100.00
-----------+----------------------+----------
1 | 333 2,166 | 2,499
| 13.33 86.67 | 100.00
-----------+----------------------+----------
Total | 10,952 2,476 | 13,428
| 81.56 18.44 | 100.00
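A self-contained sketch of the same steps on invented data (pid, wave and pt are hypothetical). It also shows why the table above has fewer observations than the full panel: the lag is missing for each person's first wave.

```stata
clear
input pid wave pt
1 1 0
1 2 0
1 3 1
2 1 1
2 2 1
2 3 0
end
xtset pid wave

generate lpt = L.pt       // lagged status; missing in each person's first wave
tabulate lpt pt, row      // row percentages are the transition rates
xttrans pt, freq          // the same transition table in a single command
```

The time-series operator L. only works after xtset, and it respects gaps: if a wave is missing, the lag is set to missing rather than taken from the previous available row.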
Transition matrix
. xttrans d1evec, freq
have you |
ever taken | have you ever taken cannabis
cannabis | Yes No DK DWTA | Total
-----------+--------------------------------------------+----------
Yes | 728 111 0 1 | 840
| 86.67 13.21 0.00 0.12 | 100.00
-----------+--------------------------------------------+----------
No | 251 2,189 6 7 | 2,453
| 10.23 89.24 0.24 0.29 | 100.00
-----------+--------------------------------------------+----------
DK | 2 9 1 1 | 13
| 15.38 69.23 7.69 7.69 | 100.00
-----------+--------------------------------------------+----------
DWTA | 9 5 0 1 | 15
| 60.00 33.33 0.00 6.67 | 100.00
-----------+--------------------------------------------+----------
Total | 990 2,314 7 10 | 3,321
| 29.81 69.68 0.21 0.30 | 100.00
• 13% of people who’d used cannabis before 2003 say they’ve never used before 2004!!
Age and cohort: earnings profiles
How have different generations fared in the labour market?
Cohort-specific age-real earnings profiles for employees, HILDA
[Figure: two panels plotting earnings (vertical axis, 0 to 1000) against age (horizontal axis, 20 to 70), one profile per birth cohort]
How did we do it?
gen cohort = year - hgage   // derive year of birth from age
recode cohort (-999/1940=.) (1941/1945=1) (1946/1950=2) (1951/1955=3) ///
    (1956/1960=4) (1961/1965=5) (1966/1970=6) (1971/1975=7) ///
    (1976/1980=8) (1981/1985=9) (1986/9999=.)
* use the collapse command to replace the dataset by one containing real earnings
* averages for age-cohort groups
collapse (mean) rwsce, by(cohort hgage)
keep if hgage>=16 & hgage<=65
Notation
We work with observed variables $y_{it}$, $z_i$ and $x_{it}$:
$y_{it}$ = dependent variable to be analysed
$z_i$ = time-invariant explanatory covariates
(e.g. year of birth, sex)
$x_{it}$ = time-varying explanatory covariates
(e.g. job tenure, marital status)
where $i$ denotes individuals, $t$ denotes time periods.
[Technically, $z_i$ and $x_{it}$ are row vectors, containing collections of variables]
Modelling approaches
Ways of thinking about panel data:
• A collection of cross-sections, one for each time period:
Between-group regression
The Structural Equations (SEM) approach – 1 equation for each time
period (e.g. Bollen, 1989, Structural Equations with Latent Variables)
• A collection of time-series, one for each individual. Examples:
Within-group regression
Dynamic models with individual heterogeneity
Latent growth curve analysis (e.g. Acock & Li,
http://oregonstate.edu/dept/hdfs/papers/lgcgeneral.pdf)
Trajectory analysis (e.g. Nagin & Tremblay, Child Development 1999)
• Comprehensive models try to capture both inter-individual and inter-
period variation
Why use panel data?
The disadvantages of cross-section data
Example: cross-section earnings regression (single time period, t
subscript suppressed)
$y_i = z_i\alpha + x_i\beta + \varepsilon_i$
where:
$y_i$ = log wage;
$z_i$ = observable time-invariant factors (education, etc.);
$x_i$ = observable time-varying factors (e.g. job tenure);
$\varepsilon_i$ = random error (e.g. “luck”)
Related methods: Latent growth curves
Latent growth curve analysis is widely used in sociology,
psychology, criminology, etc. but not economics
Related methods: multi-level models
Multi-level modelling is widely used throughout social
statistics. It generalises ordinary panel data applications to
multiple dimensions
Example: time periods (t) within individuals (i) within
households (h):
$y_{hit} = x_{hit}\beta + u_{hi} + w_h + \varepsilon_{hit}$
$w_h$ is the household effect, common to all individuals at all periods
within household h
$u_{hi}$ is the individual effect, common to all time periods for the ith
individual in household h
Some or all of the $\beta$-coefficients may also be allowed to vary
Specialist software is available for latent growth curve, SEM and
multi-level analysis (MLwiN, Mplus, LISREL, etc). See also xtmixed and
GLLAMM in Stata
Identification of unobservables
Example: wage models based on human capital theory:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$
where $i = 1 \dots n$, $t = 1 \dots T_i$:
$y_{it}$ = log wage
$z_i$ = observable time-invariant factors (e.g. education)
$x_{it}$ = observable time-varying factors (e.g. job tenure)
$u_i$ = unobservable “ability” (assumed not to change over time)
$\varepsilon_{it}$ = “luck”
Identification of unobservables
The identification of the effect of $u_i$ rests on assumptions about the
correlation structure of the compound residual $v_{it}$:
$v_{it} = u_i + \varepsilon_{it}$
If individuals have been sampled at random, there is no correlation
across different individuals:
$\operatorname{cov}(u_i, u_j) = 0$
$\operatorname{cov}([\varepsilon_{i1} \dots \varepsilon_{iT}], [\varepsilon_{j1} \dots \varepsilon_{jT}]) = 0$
for any two (different) sampled individuals $i$ and $j$
But there may be some correlation over time for any individual:
$\operatorname{cov}(v_{is}, v_{it}) \neq 0$ for two different periods $s \neq t$,
since:
$\operatorname{cov}(v_{is}, v_{it}) = \operatorname{cov}(u_i + \varepsilon_{is}, u_i + \varepsilon_{it}) = \operatorname{var}(u_i) + \operatorname{cov}(\varepsilon_{is}, \varepsilon_{it})$
If we assume $\operatorname{cov}(\varepsilon_{is}, \varepsilon_{it}) = 0$ then $u_i$ is the only source of correlation over
time, so its variance can be inferred (identified) from the serial
correlation of the residuals.
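A small simulation can illustrate this argument. The sketch below assumes illustrative values (2000 individuals, 5 periods, sd of u equal to 2, sd of luck equal to 1); none of these come from the course data.

```stata
clear
set seed 12345
set obs 2000                        // n = 2000 individuals
generate long pid = _n
generate double u = rnormal(0, 2)   // individual effect, var(u) = 4
expand 5                            // T = 5 periods per individual
bysort pid: generate wave = _n
generate double e = rnormal(0, 1)   // serially uncorrelated "luck", var(e) = 1
generate double v = u + e           // compound residual
xtset pid wave

* The covariance of v with its own lag should be close to var(u) = 4,
* and the correlation close to 4/(4+1) = 0.8
generate double lv = L.v
correlate v lv, covariance
correlate v lv
```

Since luck is generated independently across periods, all of the serial correlation in v comes from u, which is what allows var(u) to be recovered from it.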
Identification with time-invariant covariates:
can we distinguish $z_i$ and $u_i$?
Consider again the panel regression model:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$ (1)
Let $z_i\delta$ be any arbitrary combination of the z-variables (choose any value
for $\delta$ you like). Add it to the right-hand side and subtract it again:
$y_{it} = z_i\alpha + z_i\delta + x_{it}\beta + u_i - z_i\delta + \varepsilon_{it}$
Now re-write this as:
$y_{it} = z_i\alpha^* + x_{it}\beta + u_i^* + \varepsilon_{it}$ (2)
where $\alpha^*$ represents $(\alpha + \delta)$ and $u_i^*$ represents $(u_i - z_i\delta)$.
But (1) and (2) have exactly the same form, so we can’t tell whether we’re
estimating $\alpha$ or a completely arbitrary value $\alpha^* = (\alpha + \delta)$.
So the separate effects of $z_i$ and $u_i$ can’t be distinguished empirically
without further assumptions
Summary
In models like:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$
Another problem: age, cohort & time effects
Identity relating age ($A_{it}$), period ($t$) and birth cohort ($B_i$):
$A_{it} \equiv t - B_i$
They cannot be distinguished in principle. It would require an
ability to move a cohort forward or back in time (!) to measure
the effect of time holding age and cohort constant.
Glenn (Am. Sociol. Rev. 1976) “Cohort analysts’ futile quest –
statistical attempts to separate age, period & cohort effects”
• In a cross-section, $t$ doesn’t vary, so time effects can’t be
estimated and age and cohort are collinear – only their joint
effect can be estimated.
• In a panel, $t$ varies but $A_{it}$, $t$ and $B_i$ are collinear – only two
of the three effects can be estimated.
• So we can use ($t$, $B_i$), ($A_{it}$, $B_i$) or ($A_{it}$, $t$) as covariates, but not
all three.
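The collinearity is easy to see in a simulated panel. The sketch below uses invented birth years, survey years and an arbitrary outcome equation; all names and numbers are assumptions for illustration.

```stata
clear
set seed 2013
set obs 500
generate long pid = _n
generate cohort = 1950 + floor(30*runiform())   // hypothetical birth years 1950-1979
expand 5
bysort pid: generate year = 2000 + _n           // five survey years, 2001-2005
generate age = year - cohort                    // the identity A_it = t - B_i
generate double y = 1 + 0.03*age + rnormal()    // illustrative outcome

regress y age year cohort   // one regressor is omitted because of collinearity
regress y age cohort        // any two of the three can be estimated
```

Which of the three variables is dropped in the first regression is arbitrary, so the choice of which pair to keep has to be made on substantive grounds.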
Pooled regression for panel data
The “standard” panel data regression model is:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$
We have observations indexed by $t = 1 \dots T_i$, $i = 1 \dots n$.
• A pooled regression of y on z and x using all the data together
would assume that there is no correlation across individuals,
nor across time periods for any individual
• This would ignore the individual effect u, which generates
correlation between the values of $(u_i + \varepsilon_{i1}) \dots (u_i + \varepsilon_{iT})$ for each
individual i
• So pooled regression doesn’t make best use of the data
Under favourable conditions (if $u_i$ is uncorrelated with $z_i$ and $x_{it}$),
pooled regression gives unbiased but inefficient results, with
incorrect standard errors, t-ratios, etc.
If $u_i$ is correlated with $z_i$ and $x_{it}$, pooled regression is also biased
Shortcut calculation of the LSDV regression
A multiple regression of y on (z, x) and the individual dummies ($D_1 \dots D_n$) can be done in
two stages:
Stage 1: Eliminate the effect of ($D_1 \dots D_n$) on each of the variables
(y, z, x) using the “within-group” data transformation:
$y^*_{it} = y_{it} - \bar{y}_i$
$x^*_{it} = x_{it} - \bar{x}_i$
$z^*_i = z_i - \bar{z}_i = 0$ (so $z_i$ is eliminated completely)
Stage 2: Regress $y^*_{it}$ on $x^*_{it}$
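The two-stage shortcut can be checked against Stata's built-in within-group estimator. This is a sketch on simulated data: the sample size, the true coefficient of 2 and all variable names are assumptions made for illustration.

```stata
clear
set seed 42
set obs 200
generate long pid = _n
generate double u = rnormal()
expand 4
bysort pid: generate wave = _n
generate double x = 0.5*u + rnormal()         // x is correlated with u
generate double y = 1 + 2*x + u + rnormal()   // true beta = 2 (illustrative)

* Within-group transformation: subtract individual means
bysort pid: egen double ybar = mean(y)
bysort pid: egen double xbar = mean(x)
generate double ystar = y - ybar
generate double xstar = x - xbar

* OLS on the transformed data
regress ystar xstar, noconstant
scalar b_manual = _b[xstar]

* Built-in within-group (fixed-effects) estimator
xtset pid wave
xtreg y x, fe
scalar b_fe = _b[x]
display "manual: " b_manual "   xtreg, fe: " b_fe
```

The two coefficients coincide, but the standard errors from the manual regression are too small because it does not allow for the degrees of freedom used up by the n individual means; xtreg, fe makes that correction.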
A note on terminology
Different names are commonly used for this one estimation method:
• Least squares dummy variables (LSDV)
• Within-group regression
• Fixed-effects regression
• Covariance analysis regression
Between-group regression
Instead of eliminating ui from the regression, we can amplify
it by averaging out all the within-individual variation, leaving
only between-individual variation to analyse:
Between-group transform: $\bar{y}_i = z_i\alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
Within- & between-group estimates –
simple case
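Suppose that x (and therefore $\beta$) is a single variable
(scalar), and the panel is balanced ($T_i = T$). Want to
estimate:
Within-group: $y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i$
Between-group: $\bar{y}_i = \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
$$\hat{\beta}_W = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)(y_{it} - \bar{y}_i)}{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2} = \frac{w_{xy}}{w_{xx}}; \qquad \hat{\beta}_B = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y})}{\sum_{i=1}^{n}\sum_{t=1}^{T} (\bar{x}_i - \bar{x})^2} = \frac{b_{xy}}{b_{xx}}$$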
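$$\hat{\beta}_W = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)\left[(x_{it} - \bar{x}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i\right]}{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2} = \beta + \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)(\varepsilon_{it} - \bar{\varepsilon}_i)}{\sum_{i=1}^{n}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2} = \beta + \frac{w_{x\varepsilon}}{w_{xx}}$$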
Within- & between-group relationships:
correlated individual effects
[Figure: y plotted against x (values x1 to x4) for individuals with effects u1, u2 and u3, showing within-group (W-G) and between-group (B-G) regression lines]
BHPS example of panel data estimation
F(1,51180) = 7094.59
corr(u_i, Xb) = -0.4880 Prob > F = 0.0000
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .030061 .0003569 84.23 0.000 .0293615 .0307605
cohort | (dropped)
_cons | .8994719 .01369 65.70 0.000 .8726394 .9263045
-------------+----------------------------------------------------------------
sigma_u | .60455798
sigma_e | .28494801
rho | .81822708 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(10334, 51180) = 18.19 Prob > F = 0.0000
Stata output: between-group regression
F(2,10332) = 190.55
sd(u_i + avg(e_i.))= .5277749 Prob > F = 0.0000
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0188575 .0017201 10.96 0.000 .0154858 .0222292
cohort | .0105401 .0015325 6.88 0.000 .0075361 .0135442
_cons | -19.39964 3.065617 -6.33 0.000 -25.40885 -13.39044
------------------------------------------------------------------------------
Important points
• The “intra-class correlation” is:
$\rho = \operatorname{corr}(u_i + \varepsilon_{is}, u_i + \varepsilon_{it}) = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$
= 81.8%, for any two different periods s, t
so variation between individuals is dominant
• The within-group $R^2$ is much higher than the
between-group $R^2$ (note minor differences in Stata
calculation of them for w-g & b-g commands)
• Since $\operatorname{cov}(\bar{x}_i, u_i) < 0$ (Stata reports corr(u_i, Xb) = -0.488), there’s a negative bias in
the between-group estimate of the age effect
(w-g coefficient = .030; b-g coefficient = .019)
• BUT: evidence of bias in between-group results
doesn’t necessarily imply that within-group results
are OK!
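The first and third points can be illustrated in Stata. The first line recomputes rho from the sigma_u and sigma_e reported in the within-group output; the rest is a simulation with invented parameter values (true beta of 1, individual effect negatively correlated with x), not a re-analysis of the BHPS data.

```stata
* Intra-class correlation from the reported sigma_u and sigma_e
display .60455798^2 / (.60455798^2 + .28494801^2)

* Simulated illustration of between-group bias (all values hypothetical)
clear
set seed 101
set obs 1000
generate long pid = _n
generate double xi = rnormal()             // persistent component of x
generate double u  = -0.5*xi + rnormal()   // individual effect, negatively correlated with x
expand 6
bysort pid: generate wave = _n
generate double x = xi + rnormal()
generate double y = x + u + rnormal()      // true beta = 1
xtset pid wave

xtreg y x, fe
scalar b_w = _b[x]
xtreg y x, be
scalar b_b = _b[x]
display "within: " b_w "   between: " b_b
```

In this design the within-group estimate stays close to the true value while the between-group estimate is pulled downwards, mirroring the pattern in the age coefficients above. The within-group estimate is unbiased here only because the simulated luck term is unrelated to x, which cannot be taken for granted in real data.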
Appendix
Examples of other household panels
Specific examples - GSOEP